
Graduate Student: 俞尚毅 (Yu, Shang-Yi)
Thesis Title: 探尋影片中的承擔特質 (Affordance Detection in Videos)
Advisor: 陳煥宗 (Chen, Hwann-Tzong)
Committee Members: 邱維辰 (Chiu, Wei-Chen), 胡敏君 (Hu, Min-Chun)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Publication Year: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 27
Keywords (Chinese): 影片承擔特質偵測 (video affordance detection)
Keywords (English): Video, Affordance, Detection
    This thesis proposes a new task concerning affordance: finding the regions
    that carry an affordance and determining whether the affordance exists in
    every frame of a video. Previous studies on affordance have focused only on
    detection in single images. For this new task of detecting affordance in
    videos, we present a new affordance dataset, the Support Affordance Video
    (SAV) dataset, which collects videos of support affordance and designs a
    series of action scenarios so that the existence of the affordance changes
    with the actions and the environment in each scenario. We propose a network
    architecture that uses two different branches together with a temporal
    module to predict the affordance attention area, the affordance region, and
    the affordance existence label in a video. We examine the test results on
    the SAV dataset to verify the effectiveness of the method.
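
    The dataset design described above implies frame-level labels whose
    affordance existence status can flip within a single video. As a rough
    illustration only, the following is a minimal, hypothetical sketch of how
    such per-frame annotations could be organized; the field names (video_id,
    frame_index, affordance_exists, region_mask_path) are assumptions made for
    this sketch and not the actual SAV annotation format.

    from dataclasses import dataclass

    @dataclass
    class FrameAnnotation:
        """One frame of a support-affordance scenario video (hypothetical schema)."""
        video_id: str            # which action-scenario video the frame belongs to
        frame_index: int         # position of the frame within the video
        affordance_exists: bool  # whether the support affordance is present in this frame
        region_mask_path: str    # path to the binary mask of the affordance region

    # Example: the affordance appears only after an action in the scenario makes it available.
    frames = [
        FrameAnnotation("scenario_001", 0, False, "masks/scenario_001/0000.png"),
        FrameAnnotation("scenario_001", 1, True, "masks/scenario_001/0001.png"),
    ]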


    This thesis proposes a new task on affordance: detecting the affordance
    region and predicting the existence of affordance for each frame of a video
    sequence. Previous research on affordance has focused only on detection in
    single images. For this new task of affordance detection in videos, we build
    a new affordance dataset, the Support Affordance Video (SAV) dataset. The
    dataset consists of support affordance videos that follow a series of action
    scenarios, so that the affordance existence status changes as the actions
    and environments in each scenario change. We propose a network architecture
    that uses two different branches and temporal modules to predict the
    affordance attention area, the affordance region, and the affordance
    existence label for each frame of a video. The experimental results on the
    SAV dataset provide a baseline for the new task and validate the
    effectiveness of our method.
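
    As a rough sketch of the kind of architecture the abstract describes (two
    branches plus a temporal module, producing a per-frame attention map, region
    mask, and existence label), the following is a minimal PyTorch example. The
    module choices and sizes here (a small convolutional backbone, a 3D
    convolution as the temporal module, sigmoid heads) are assumptions for
    illustration, not the thesis's actual network.

    import torch
    import torch.nn as nn

    class TwoBranchAffordanceNet(nn.Module):
        """Hypothetical two-branch affordance predictor with a temporal module."""

        def __init__(self, feat_dim=64):
            super().__init__()
            # Shared per-frame feature extractor (stand-in for a real backbone).
            self.backbone = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Temporal module: a 3D convolution that mixes features across frames.
            self.temporal = nn.Conv3d(feat_dim, feat_dim, kernel_size=(3, 1, 1), padding=(1, 0, 0))
            # Branch 1: spatial heads for the attention map and the region mask.
            self.attention_head = nn.Conv2d(feat_dim, 1, 1)
            self.region_head = nn.Conv2d(feat_dim, 1, 1)
            # Branch 2: per-frame affordance existence classifier.
            self.existence_head = nn.Linear(feat_dim, 1)

        def forward(self, video):  # video: (batch, time, 3, H, W)
            b, t, c, h, w = video.shape
            feats = self.backbone(video.reshape(b * t, c, h, w))            # (B*T, F, h', w')
            f, fh, fw = feats.shape[1:]
            feats = feats.reshape(b, t, f, fh, fw).permute(0, 2, 1, 3, 4)   # (B, F, T, h', w')
            feats = self.temporal(feats)                                    # mix across frames
            feats = feats.permute(0, 2, 1, 3, 4).reshape(b * t, f, fh, fw)  # back to per-frame
            attention = torch.sigmoid(self.attention_head(feats))           # attention map
            region = torch.sigmoid(self.region_head(feats * attention))     # attention-gated region mask
            existence = torch.sigmoid(self.existence_head(feats.mean(dim=(2, 3))))  # existence score
            return attention, region, existence

    # Usage: two clips of eight 128x128 RGB frames.
    model = TwoBranchAffordanceNet()
    attention, region, existence = model(torch.randn(2, 8, 3, 128, 128))

    In the actual method, such branches would be trained jointly on the SAV
    labels; this sketch only fixes the per-frame input and the three outputs
    named in the abstract.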

    Table of Contents
    List of Tables
    List of Figures
    摘要 (Abstract in Chinese)
    Abstract
    1 Introduction
    2 Related Work
    3 Dataset
    4 Our Approach
      4.1 Overview
      4.2 Affordance Features Extraction
      4.3 Affordance Attention Predictor
      4.4 Affordance Existence Predictor
      4.5 Affordance Region Prediction
    5 Experiments
      5.1 Settings
        5.1.1 Implementation Details
        5.1.2 Evaluation Metrics
      5.2 Quantitative Results
        5.2.1 Evaluation on SAV Dataset
        5.2.2 Ablation Study
      5.3 Qualitative Results
    6 Conclusion and Future Work
    Bibliography

