
Graduate Student: 俞尚毅 (Yu, Shang-Yi)
Thesis Title: 探尋影片中的承擔特質 (Affordance Detection in Videos)
Advisor: 陳煥宗 (Chen, Hwann-Tzong)
Committee Members: 邱維辰 (Chiu, Wei-Chen), 胡敏君 (Hu, Min-Chun)
Degree: Master
Department: Computer Science, College of Electrical Engineering and Computer Science
Publication Year: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 27
Keywords (Chinese): 影片承擔特質偵測 (video affordance detection)
Keywords (English): Video, Affordance, Detection
    This thesis proposes a new task concerning affordance: finding the regions
    that carry an affordance and determining whether the affordance exists in
    every frame of a video. Previous studies on affordance have focused only on
    detection in single images. For this new task of detecting affordance in
    videos, we present a new affordance dataset, the Support Affordance Video
    (SAV) dataset, which collects videos of support affordance and designs a
    series of action scenarios so that the existence of the affordance changes
    with the actions and the environment in each scenario. We propose a network
    architecture that uses two different branches together with a temporal
    module to predict the affordance attention area, the affordance region, and
    the affordance existence label in a video. We examine the test results on
    the SAV dataset to verify the effectiveness of the method.
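
    The dataset design described above implies frame-level labels whose
    affordance existence status can flip within a single video. As a rough
    illustration only, the following is a minimal, hypothetical sketch of how
    such per-frame annotations could be organized; the field names (video_id,
    frame_index, affordance_exists, region_mask_path) are assumptions made for
    this sketch and not the actual SAV annotation format.

    from dataclasses import dataclass

    @dataclass
    class FrameAnnotation:
        """One frame of a support-affordance scenario video (hypothetical schema)."""
        video_id: str            # which action-scenario video the frame belongs to
        frame_index: int         # position of the frame within the video
        affordance_exists: bool  # whether the support affordance is present in this frame
        region_mask_path: str    # path to the binary mask of the affordance region

    # Example: the affordance appears only after an action in the scenario makes it available.
    frames = [
        FrameAnnotation("scenario_001", 0, False, "masks/scenario_001/0000.png"),
        FrameAnnotation("scenario_001", 1, True, "masks/scenario_001/0001.png"),
    ]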


    This thesis proposes a new task on affordance: detecting the affordance
    region and predicting the existence of affordance for each frame of a video
    sequence. Previous research on affordance has focused only on detection in
    single images. For this new task of affordance detection in videos, we build
    a new affordance dataset, the Support Affordance Video (SAV) dataset. The
    dataset consists of support affordance videos that follow a series of action
    scenarios, so that the affordance existence status changes as the actions
    and environments in each scenario change. We propose a network architecture
    that uses two different branches and temporal modules to predict the
    affordance attention area, the affordance region, and the affordance
    existence label for each frame of a video. The experimental results on the
    SAV dataset provide a baseline for the new task and validate the
    effectiveness of our method.
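
    As a rough sketch of the kind of architecture the abstract describes (two
    branches plus a temporal module, producing a per-frame attention map, region
    mask, and existence label), the following is a minimal PyTorch example. The
    module choices and sizes here (a small convolutional backbone, a 3D
    convolution as the temporal module, sigmoid heads) are assumptions for
    illustration, not the thesis's actual network.

    import torch
    import torch.nn as nn

    class TwoBranchAffordanceNet(nn.Module):
        """Hypothetical two-branch affordance predictor with a temporal module."""

        def __init__(self, feat_dim=64):
            super().__init__()
            # Shared per-frame feature extractor (stand-in for a real backbone).
            self.backbone = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Temporal module: a 3D convolution that mixes features across frames.
            self.temporal = nn.Conv3d(feat_dim, feat_dim, kernel_size=(3, 1, 1), padding=(1, 0, 0))
            # Branch 1: spatial heads for the attention map and the region mask.
            self.attention_head = nn.Conv2d(feat_dim, 1, 1)
            self.region_head = nn.Conv2d(feat_dim, 1, 1)
            # Branch 2: per-frame affordance existence classifier.
            self.existence_head = nn.Linear(feat_dim, 1)

        def forward(self, video):  # video: (batch, time, 3, H, W)
            b, t, c, h, w = video.shape
            feats = self.backbone(video.reshape(b * t, c, h, w))            # (B*T, F, h', w')
            f, fh, fw = feats.shape[1:]
            feats = feats.reshape(b, t, f, fh, fw).permute(0, 2, 1, 3, 4)   # (B, F, T, h', w')
            feats = self.temporal(feats)                                    # mix across frames
            feats = feats.permute(0, 2, 1, 3, 4).reshape(b * t, f, fh, fw)  # back to per-frame
            attention = torch.sigmoid(self.attention_head(feats))           # attention map
            region = torch.sigmoid(self.region_head(feats * attention))     # attention-gated region mask
            existence = torch.sigmoid(self.existence_head(feats.mean(dim=(2, 3))))  # existence score
            return attention, region, existence

    # Usage: two clips of eight 128x128 RGB frames.
    model = TwoBranchAffordanceNet()
    attention, region, existence = model(torch.randn(2, 8, 3, 128, 128))

    In the actual method, such branches would be trained jointly on the SAV
    labels; this sketch only fixes the per-frame input and the three outputs
    named in the abstract.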

    Table of Contents
    List of Tables
    List of Figures
    摘要 (Abstract in Chinese)
    Abstract
    1 Introduction
    2 Related Work
    3 Dataset
    4 Our Approach
      4.1 Overview
      4.2 Affordance Features Extraction
      4.3 Affordance Attention Predictor
      4.4 Affordance Existence Predictor
      4.5 Affordance Region Prediction
    5 Experiments
      5.1 Settings
        5.1.1 Implementation Details
        5.1.2 Evaluation Metrics
      5.2 Quantitative Results
        5.2.1 Evaluation on SAV Dataset
        5.2.2 Ablation Study
      5.3 Qualitative Results
    6 Conclusion and Future Work
    Bibliography

