| Field | Value |
|---|---|
| Graduate Student | 葉政賢 Yeh, Jheng-Hsien |
| Thesis Title | M-CLIP: Zero-Shot Spatio-Temporal Action Detection with Motion-Enhanced Text Prompting and Feature Fusion (以人體動作增強特徵融合與文字提示進行零樣本空間時間動作偵測) |
| Advisor | 賴尚宏 Lai, Shang-Hong |
| Committee Members | 胡敏君 Hu, Min-Chun; 邱瀞德 Chiu, Ching-Te; 陳敏弘 Chen, Min-Hung |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science (電機資訊學院 資訊工程學系) |
| Year of Publication | 2025 |
| Academic Year of Graduation | 113 |
| Language | English |
| Pages | 35 |
| Chinese Keywords | 動作識別 (action recognition), 動作偵測 (action detection), 電腦視覺 (computer vision) |
| English Keywords | action, detection, vision |
Abstract (translated from Chinese): In current zero-shot action detection frameworks, features are extracted from frames and combined spatially and temporally. Beyond scene analysis, changes in the actor's body over time should also serve as a basis for determining the action. However, although many supervised methods incorporate skeleton-based actor pose features, no prior work has explored zero-shot action detection in this setting. We propose M-CLIP, which strengthens actor-scene relationships and exploits skeletal motion to determine actions. Our approach extracts action cues from objects, scenes, and consecutive frames, and combines their mutual relations to generate a distinctive feature for each actor. We further extract motion features from actor skeletons, integrate them into the actor features, and combine these with interaction and textual features to derive action semantics, yielding better matching between the visual and textual sides. Experimental results show that our method and feature processing outperform current state-of-the-art approaches in zero-shot scenarios.
Abstract (English): In current zero-shot action detection frameworks, features are extracted from frames both spatially and temporally and then combined. Besides scene analysis, the actor's body changes over time should also be a basis for action determination. However, while many supervised methods combine skeleton-based actor pose features, no work has explored zero-shot action detection in this context. We propose M-CLIP, which enhances actor-scene relationships and exploits skeletal movements for action determination. Our approach strengthens action cues from objects, scenes, and adjacent frames. We extract motion features from actor skeletons and combine them with interaction and textual features to derive action semantics. Experimental results show that our method and feature processing outperform current state-of-the-art approaches in zero-shot scenarios.
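The abstract describes fusing per-actor visual features with skeleton-motion features and matching the result against text embeddings of action prompts. The sketch below illustrates that general pattern only: the weighted fusion, the random vectors, and the 512-dimensional embedding size are assumptions for the example, not the thesis's actual learned fusion layers or CLIP encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def fuse_actor_motion(actor_feat, motion_feat, w=0.5):
    """Toy fusion of a per-actor visual feature with a skeleton-motion
    feature (a plain weighted sum; the thesis uses learned fusion)."""
    return l2_normalize(w * actor_feat + (1.0 - w) * motion_feat)

def zero_shot_classify(fused, text_embeds, class_names):
    """Score the fused actor feature against each class-prompt embedding
    by cosine similarity and return the best-matching class name."""
    sims = l2_normalize(fused) @ l2_normalize(text_embeds).T
    return class_names[int(np.argmax(sims))], sims

# Stand-in embeddings; in practice these would come from a CLIP-style
# image/text encoder and a skeleton-motion encoder.
rng = np.random.default_rng(0)
d = 512  # assumed embedding dimension
actor = rng.normal(size=d)
motion = rng.normal(size=d)
texts = rng.normal(size=(3, d))   # one prompt embedding per class
names = ["run", "jump", "throw"]  # hypothetical action classes

label, sims = zero_shot_classify(fuse_actor_motion(actor, motion), texts, names)
```

Because classes are represented only by text embeddings, unseen actions can be scored at test time without retraining, which is the core of the zero-shot setting.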