| Field | Value |
|---|---|
| Author | 陳星宇 (Chen, Hsing-Yu) |
| Thesis Title | Pose Motion History for Multi-person Individual Action and Collective Activity Recognition (基於人體姿勢動態歷史的多個體動作以及群體活動辨識) |
| Advisor | 賴尚宏 (Lai, Shang-Hong) |
| Committee Members | 陳煥宗 (Chen, Hwann-Tzong), 劉庭祿 (Liu, Tyng-Luh), 江振國 (Chiang, Chen-Kuo) |
| Degree | Master |
| Department | |
| Year of Publication | 2018 |
| Graduating Academic Year | 106 (ROC calendar) |
| Language | English |
| Pages | 40 |
| Keywords | Human Pose, Multi-person Action Recognition, Group Activity Recognition |
In this thesis, we propose a deep-learning-based approach that exploits multi-person pose estimation from an image sequence to predict individual actions as well as the collective activity in a group scene. We first apply multi-person pose estimation to extract pose information from the image sequence, without relying on human detection. Second, we propose a novel representation called “pose motion history” (PMH), which aggregates the spatio-temporal dynamics of all human joints in the scene into a single stack of feature maps. Individual pose motion history stacks (Indi-PMH) are then cropped from the whole-scene stack and fed into a simple CNN to obtain individual action predictions. Based on these individual predictions, we construct another novel representation called the “collective map”, which encodes both the position and the action of each individual in the group scene into a simple feature map stack. The final collective activity prediction is obtained by fusing the outputs of two simple CNNs: one takes the whole-scene pose motion history stack as input, and the other takes the collective map stack. We evaluate the proposed approach on a challenging multi-person activity dataset, Volleyball, and achieve competitive performance with a much simpler network architecture than those of previous works.
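To make the two representations concrete, here is a minimal NumPy sketch of how a whole-scene PMH stack and a collective map could be assembled from per-frame joint coordinates and per-person action scores. The function names, the linear temporal weighting, the NaN convention for missing joints, and the box-filling scheme are all illustrative assumptions; the abstract does not fully specify the thesis's exact encoding.

```python
import numpy as np

def pose_motion_history(joint_coords, frame_size, num_joints):
    """Aggregate multi-person joint positions over T frames into a
    whole-scene PMH stack with one H x W map per joint type.

    joint_coords: list over T frames; each frame is a list of persons,
        each an array of shape (num_joints, 2) holding (x, y) pixel
        coordinates, with NaN marking an undetected joint.
    """
    H, W = frame_size
    T = len(joint_coords)
    pmh = np.zeros((num_joints, H, W), dtype=np.float32)
    for t, persons in enumerate(joint_coords):
        weight = (t + 1) / T              # later frames leave brighter marks
        for person in persons:
            for j in range(num_joints):
                x, y = person[j]
                if np.isnan(x) or np.isnan(y):
                    continue              # joint not detected in this frame
                xi, yi = int(round(x)), int(round(y))
                if 0 <= xi < W and 0 <= yi < H:
                    pmh[j, yi, xi] = max(pmh[j, yi, xi], weight)
    return pmh

def collective_map(person_boxes, action_probs, frame_size, num_actions):
    """Encode every person's position and predicted action into a stack
    with one H x W channel per action class: person i's box region in
    channel a is filled with that person's score for action a.

    person_boxes: array of shape (N, 4), boxes as (x1, y1, x2, y2).
    action_probs: array of shape (N, num_actions), per-person scores.
    """
    H, W = frame_size
    cmap = np.zeros((num_actions, H, W), dtype=np.float32)
    for box, probs in zip(person_boxes, action_probs):
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, W), min(y2, H)
        for a in range(num_actions):
            cmap[a, y1:y2, x1:x2] = np.maximum(cmap[a, y1:y2, x1:x2], probs[a])
    return cmap
```

Under these assumptions, the Indi-PMH inputs would simply be fixed-size crops of the `pose_motion_history` output around each person, and the two stacks would feed the two CNNs whose outputs are fused for the final collective activity prediction.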