
Graduate Student: 葉政賢 (Yeh, Jheng-Hsien)
Thesis Title (Chinese): 以人體動作增強特徵融合與文字提示進行零樣本空間時間動作偵測
Thesis Title (English): M-CLIP: Zero-Shot Spatio-Temporal Action Detection with Motion-Enhanced Text Prompting and Feature Fusion
Advisor: 賴尚宏 (Lai, Shang-Hong)
Committee Members: 胡敏君 (Hu, Min-Chun), 邱瀞德 (Chiu, Ching-Te), 陳敏弘 (Chen, Min-Hung)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of Publication: 2025
Academic Year of Graduation: 113
Language: English
Number of Pages: 35
Keywords (Chinese): 動作識別 (action recognition), 動作偵測 (action detection), 電腦視覺 (computer vision)
Keywords (English): action, detection, vision
In current zero-shot action detection frameworks, features are extracted from frames and combined spatially and temporally. Beyond scene analysis, the changes in an actor's body should also serve as a basis for judging the action. However, although many supervised methods incorporate skeleton-based actor pose features, no work has explored zero-shot action detection in this context. We propose M-CLIP, which strengthens actor-scene relationships and exploits skeletal motion to determine actions. Our method extracts action cues from objects, the scene, and consecutive frames, and combines their interrelations to generate a distinctive feature for each actor. Motion features extracted from the actor's skeleton are fused into the actor feature and further combined with interaction and textual features to derive action semantics, yielding better matching between the visual and textual sides. Experimental results show that our method and feature processing outperform current state-of-the-art approaches in the zero-shot setting.
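The multi-context aggregation described above (relating each actor to objects, the scene, and adjacent frames) could be sketched roughly as below. This is a minimal illustration, not the thesis implementation; the module names, the cross-attention design, and the dimensions are all assumptions.

# Illustrative sketch of multi-context aggregation for per-actor features.
# Each actor embedding attends to object, scene, and adjacent-frame (temporal)
# context features via cross-attention; the three views are then fused.
import torch
import torch.nn as nn

class MultiContextAggregator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # One cross-attention block per context source (assumed design).
        self.object_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, actor, objects, scene, temporal):
        # actor:    (B, N_actors, D)  per-actor features
        # objects:  (B, N_obj, D)     object features from the keyframe
        # scene:    (B, N_patch, D)   scene/patch features
        # temporal: (B, T, D)         features from adjacent frames
        obj_ctx, _ = self.object_attn(actor, objects, objects)
        scn_ctx, _ = self.scene_attn(actor, scene, scene)
        tmp_ctx, _ = self.temporal_attn(actor, temporal, temporal)
        # Concatenate the three context-enhanced views and project back to D.
        return self.fuse(torch.cat([obj_ctx, scn_ctx, tmp_ctx], dim=-1))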


In current zero-shot action detection frameworks, features are extracted from frames and combined both spatially and temporally. Besides scene analysis, the actor's body changes over time should also serve as a basis for action determination. However, while many supervised methods combine skeleton-based actor pose features, no work has explored zero-shot action detection in this context. We propose M-CLIP, which enhances actor-scene relationships and exploits skeletal movements for action determination. Our approach strengthens action cues from objects, scenes, and adjacent frames. We extract motion features from actor skeletons and combine them with interaction and textual features to derive action semantics. Experimental results show that our method and feature processing outperform current state-of-the-art approaches in zero-shot scenarios.
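Zero-shot classification in this setting amounts to scoring the fused actor representation against text embeddings of the candidate action prompts. The function below is a minimal sketch under stated assumptions: motion_feat is assumed to come from a skeleton encoder (e.g., a MotionBERT-style model), text_feat from the CLIP text encoder, and simple addition plus cosine similarity stand in for the thesis's actual fusion and matching modules.

# Illustrative sketch of CLIP-style zero-shot matching with a motion cue.
import torch
import torch.nn.functional as F

def zero_shot_action_scores(actor_feat: torch.Tensor,
                            motion_feat: torch.Tensor,
                            text_feat: torch.Tensor,
                            temperature: float = 0.01) -> torch.Tensor:
    """actor_feat:  (N_actors, D) context-enhanced actor embeddings
       motion_feat: (N_actors, D) skeleton-motion embeddings per actor
       text_feat:   (N_classes, D) text embeddings of action prompts
       returns:     (N_actors, N_classes) per-actor class probabilities."""
    # Fuse visual and motion cues (simple addition as an illustrative choice).
    fused = F.normalize(actor_feat + motion_feat, dim=-1)
    text = F.normalize(text_feat, dim=-1)
    # Cosine similarity scaled by a temperature, as in CLIP-style matching.
    logits = fused @ text.t() / temperature
    return logits.softmax(dim=-1)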

摘要 (Chinese Abstract) 1
Abstract 2
1 Introduction 7
1.1 Problem Definition 7
1.2 Motivation 8
1.3 Contributions 9
1.4 Thesis Organization 9
2 Related Work 10
2.1 Vision-Language Models for Video Understanding 10
2.2 Skeleton-Based Action Recognition 11
2.3 Zero-Shot Action Detection 11
3 Methodology 12
3.1 Overview 12
3.2 Multimodal Feature Alignment 13
3.3 Multi-Context Aggregator 15
3.4 Motion-Aware Prompting 18
3.5 Loss Function 18
4 Experiment 20
4.1 Datasets 20
4.2 Implementation Details 20
4.2.1 Label Group Splitting 21
4.3 Adapting CLIP-Based Models for Detection 23
4.4 Performance Comparison 23
4.5 Ablation Study 25
4.6 More Experiment Details 26
4.6.1 Effects of Unclear Data 26
4.6.2 Motion-Aware Prompting Clusters 29
4.7 Quantitative Results 30
5 Conclusion 32
References 32

