| Field | Value |
|---|---|
| Author | 何品萱 Ho, Pin-Hsuan |
| Thesis title | 應用於影片中動作辨識之運動和外觀的分解表示法學習 (Decomposed Representation Learning of Motion and Appearance for Video Action Recognition) |
| Advisor | 許秋婷 Hsu, Chiou-Ting |
| Committee members | 彭文孝 Peng, Wen-Hsiao; 王聖智 Wang, Sheng-Jyh |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science (電機資訊學院 資訊工程學系) |
| Year of publication | 2019 |
| Graduation academic year | 107 (ROC calendar; 2018–2019) |
| Language | English |
| Number of pages | 24 |
| Keywords (Chinese) | 動作辨識、分解表示法學習、運動表示法 |
| Keywords (English) | Action Recognition, Decomposed Representation Learning, Motion Representation |
Spatiotemporal representation learning in videos is essential to action recognition in computer vision. We address representation learning for video action recognition and understanding by learning (1) the static appearance within each frame and (2) the temporal motion across consecutive frames. In this thesis, we propose a Flow-based Motion and Appearance Network (FMA-Net), which consists of a generator network, a classification network, and a discriminator network, to learn a decomposed representation of motion and appearance in videos. Furthermore, to capture motion details, we propose to learn from optical flow prediction so that no flow computation is required at test time. The proposed FMA-Net is an end-to-end framework that simultaneously learns the classification network and generates accurate optical flow through adversarial training. We conduct experiments on two action recognition benchmarks, UCF101 and HMDB51. Under the same experimental settings, the results show that the proposed FMA-Net not only outperforms the baseline network but also achieves results competitive with state-of-the-art methods on both datasets.
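The abstract describes a three-part design: a generator that predicts optical flow from RGB frames, a classifier that combines appearance and motion cues, and a discriminator used to train the flow predictor adversarially. The following is a minimal PyTorch-style sketch of that layout only; all module names, layer sizes, and shapes are illustrative assumptions and do not reflect the thesis's actual FMA-Net architecture.

```python
# Minimal sketch (not the thesis's actual design) of the three networks named in
# the abstract: a flow generator, an action classifier over appearance + motion,
# and a discriminator that judges predicted flow against flow precomputed only
# during training. Layer sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class FlowGenerator(nn.Module):
    """Predicts a 2-channel (dx, dy) flow map from two consecutive RGB frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class ActionClassifier(nn.Module):
    """Fuses an appearance branch (RGB) and a motion branch (flow) for classification."""
    def __init__(self, num_classes=101):  # e.g., 101 action classes for UCF101
        super().__init__()
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.motion = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, frame, flow):
        feat = torch.cat([self.appearance(frame), self.motion(flow)], dim=1)
        return self.head(feat)

class FlowDiscriminator(nn.Module):
    """Scores a flow map as real (precomputed) or fake (generated) for adversarial training."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, flow):
        return self.net(flow)

# At test time only the generator and classifier are run on RGB frames, so no
# optical-flow computation is needed for inference, consistent with the abstract.
```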