Student: 龔致珩 Kung, Chih-Heng
Thesis Title: Learning Motion Features Corresponding to Actors from Images for Joint Actor-Action Semantic Segmentation (從連續照片中學習對應演員之動作特徵以進行演員和動作語義分割)
Advisor: 林嘉文 Lin, Chia-Wen
Committee Members: 胡敏君 Hu, Min-Chun; 林彥宇 Lin, Yen-Yu; 黃敬群 Huang, Ching-Chun
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Pages: 33
Keywords: Actor-Action Semantic Segmentation; Global and Local Motion Feature Aggregation
Abstract:
Video understanding is an important topic in computer vision, encompassing tasks such as human action recognition and video object segmentation. Combining action recognition with object semantic segmentation yields joint actor-action semantic segmentation, which produces segmentation masks labeled with both actor and action classes simultaneously; this remains a challenging task, yet it is essential for many applications. Recent works typically use two types of input and a two-stream network to capture appearance and motion features. Ji et al. [1] proposed a two-stream network based on Mask R-CNN [2] that takes RGB clips and optical flow clips as input and uses temporal convolutional layers, two extractors, and two classifiers to capture appearance and temporal features, achieving state-of-the-art performance. In this work, we propose an architecture with a learning-based motion representation module and a modified feature pyramid network (FPN) that captures the partial motion features of objects from RGB clips alone. We then fuse these partial motion features with features containing actor information, so that the network learns motion features corresponding to each actor. Our model efficiently leverages contextual information, appearance information, and multitask learning to produce semantic segmentation masks with joint actor-action labels. We train and test on the Actor-Action Dataset (A2D) [3], outperforming the state-of-the-art model by 3.1% in class-average accuracy (ave) and 9.7% in mean IoU on the joint actor-action task, and by 6.9% in class-average accuracy on the action task. Our approach thus provides sharper, more complete masks and more precise actor-action predictions for joint actor-action semantic segmentation.
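As a rough illustration of the pipeline the abstract describes, the sketch below assembles an RGB-only model with a learnable motion branch, feature fusion, and two per-pixel classification heads. It is a minimal PyTorch sketch, not the thesis's implementation: the module name (ActorActionSegNet), the single-conv stand-ins for the backbone, motion module, and FPN, and the class counts are all placeholders invented for illustration.

```python
# Minimal PyTorch sketch of the described pipeline; module names and class
# counts are illustrative placeholders, not the thesis's actual code.
import torch
import torch.nn as nn


class ActorActionSegNet(nn.Module):
    """Joint actor-action segmentation from RGB clips only (no optical flow)."""

    def __init__(self, feat_ch=64, num_actors=8, num_actions=10):
        super().__init__()
        # Stand-in for the appearance backbone + modified FPN (a deeper
        # network in the thesis); one 3D conv produces per-frame features.
        self.backbone = nn.Conv3d(3, feat_ch, kernel_size=3, padding=1)
        # Learnable motion representation: a temporal-only convolution that
        # lets the network extract motion cues directly from RGB features.
        self.motion = nn.Conv3d(feat_ch, feat_ch, kernel_size=(3, 1, 1),
                                padding=(1, 0, 0))
        # Fuse appearance (actor) features with the motion features.
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=1)
        # Multitask heads: per-pixel actor classes and action classes.
        self.actor_head = nn.Conv2d(feat_ch, num_actors, kernel_size=1)
        self.action_head = nn.Conv2d(feat_ch, num_actions, kernel_size=1)

    def forward(self, clip):
        # clip: (B, 3, T, H, W) RGB frames; predictions are for the key frame.
        feats = self.backbone(clip)      # (B, C, T, H, W) appearance features
        motion = self.motion(feats)      # temporal motion cues, same shape
        t_mid = clip.shape[2] // 2       # key (middle) frame index
        fused = self.fuse(torch.cat([feats[:, :, t_mid],
                                     motion[:, :, t_mid]], dim=1))
        return self.actor_head(fused), self.action_head(fused)


# Usage: a 5-frame 64x64 clip yields per-pixel actor and action logits.
net = ActorActionSegNet()
actor_logits, action_logits = net(torch.randn(1, 3, 5, 64, 64))
print(actor_logits.shape, action_logits.shape)
# torch.Size([1, 8, 64, 64]) torch.Size([1, 10, 64, 64])
```

The design point carried over from the abstract is that motion cues are learned from the RGB features themselves via temporal convolution, rather than read from precomputed optical flow, and that they are fused with appearance (actor) features before the multitask heads classify each pixel.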
References:
[1] J. Ji, S. Buch, A. Soto, and J. C. Niebles, “End-to-end joint semantic segmentation of actors and actions in video,” European Conference on Computer Vision (ECCV), 2018.
[2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE International Conference on Computer Vision (ICCV), 2017.
[3] C. Xu, S.-H. Hsieh, C. Xiong, and J. J. Corso, “Can humans fly? Action understanding with multiple classes of actors,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[4] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, “Joint learning of object and action detectors,” IEEE International Conference on Computer Vision (ICCV), 2017.
[6] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” European Conference on Computer Vision (ECCV), 2016.
[7] K. Gavrilyuk, A. Ghodrati, Z. Li, and C. G. M. Snoek, “Actor and action video segmentation from a sentence,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[8] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam, “MaskLab: Instance segmentation by refining object detection with semantic and direction features,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[10] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by guided anchoring,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[12] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra R-CNN: Towards balanced learning for object detection,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[13] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in Neural Information Processing Systems (NIPS), 2014.
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” European Conference on Computer Vision (ECCV), 2018.
[16] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang, “End-to-end learning of motion representation for video understanding,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[17] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime TV-L1 optical flow,” Joint Pattern Recognition Symposium, pp. 214–223, 2007.
[18] M. Lee, S. Lee, S. Son, G. Park, and N. Kwak, “Motion feature network: Fixed motion filter for action recognition,” European Conference on Computer Vision (ECCV), 2018.
[19] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang, “Optical flow guided feature: A fast and robust motion representation for video action recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] A. Piergiovanni and M. S. Ryoo, “Representation flow for action recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[21] C. Xu and J. J. Corso, “Actor-action semantic segmentation with grouping process models,” arXiv preprint arXiv:1512.09041, 2015.
[22] Y. Yan, C. Xu, D. Cai, and J. Corso, “Weakly supervised actor-action segmentation via robust multi-task ranking,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[23] Z. Qiu, T. Yao, and T. Mei, “Learning deep spatio-temporal dependency for semantic video segmentation,” IEEE Transactions on Multimedia, 2018.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[26] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” IEEE International Conference on Computer Vision (ICCV), 2017.
[27] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems (NIPS), pp. 91–99, 2015.
[29] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets v2: More deformable, better results,” arXiv preprint arXiv:1811.11168, 2018.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE International Conference on Computer Vision (ICCV), 2017.
[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” NIPS-W, 2017.
[32] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common objects in context,” European Conference on Computer Vision (ECCV), 2014.
[33] Y. Wu and K. He, “Group normalization,” European Conference on Computer Vision (ECCV), 2018.
[34] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille, “Weight standardization,” arXiv preprint arXiv:1903.10520, 2019.