| Field | Value |
|---|---|
| Graduate Student | 黃偉哲 (Huang, Wei-Jhe) |
| Thesis Title | 基於時空背景提示的零樣本動作偵測 / ST-CLIP: Spatio-Temporal Context Prompting for Zero-Shot Action Detection |
| Advisor | 賴尚宏 (Lai, Shang-Hong) |
| Committee Members | 邱瀞德 (Chiu, Ching-Te), 胡敏君 (Hu, Min-Chun), 陳敏弘 (Chen, Min-Hung) |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science (電機資訊學院 - 資訊工程學系) |
| Year of Publication | 2024 |
| Graduation Academic Year | 113 (ROC calendar) |
| Language | English |
| Number of Pages | 41 |
| Keywords | Action Detection, Video Understanding, Visual-Language Model, Computer Vision |
Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully supervised learning, and their main limitation is the lack of generalization to unseen action categories. In this work, we aim to adapt pretrained image-language models to detect unseen actions. To this end, we propose a method that effectively leverages the rich knowledge of visual-language models to perform Person-Context Interaction. Our Context Prompting module further utilizes contextual information to prompt the labels, generating more representative text features. Moreover, to recognize distinct actions performed by multiple people at the same timestamp, we design an Interest Token Spotting mechanism that employs pretrained visual knowledge to find each person's interest context tokens; these tokens are then used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on the J-HMDB, UCF101-24, and AVA datasets. Experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications.
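The abstract describes two components: spotting the context tokens most relevant to each person, and using those tokens to prompt the action labels into person-specific text features. The snippet below is a minimal, hypothetical PyTorch sketch of that data flow only, not the thesis implementation: it assumes CLIP-style visual context tokens, an RoI-pooled person feature, and label text embeddings are already available, and the function names, the top-k selection, and the additive prompting are illustrative assumptions.

```python
# Minimal sketch (not the ST-CLIP implementation) of Interest Token Spotting and
# Context Prompting as named in the abstract. Shapes, k, and the fusion rule are
# illustrative assumptions only.
import torch
import torch.nn.functional as F

def spot_interest_tokens(person_feat, context_tokens, k=8):
    """Select the k context tokens most similar to one person's visual feature.

    person_feat:    (D,)   e.g. an RoI-pooled CLIP feature for one person
    context_tokens: (N, D) patch/context tokens from the CLIP visual encoder
    returns:        (k, D) the person's "interest" context tokens
    """
    scores = F.cosine_similarity(context_tokens, person_feat.unsqueeze(0), dim=-1)
    topk = scores.topk(k).indices
    return context_tokens[topk]

def prompt_label_features(label_feats, interest_tokens):
    """Condition label text features on a person's interest tokens.

    label_feats:     (C, D) CLIP text embeddings of the action label prompts
    interest_tokens: (k, D) output of spot_interest_tokens
    returns:         (C, D) person-specific text features
    """
    context_vec = interest_tokens.mean(dim=0, keepdim=True)   # (1, D) pooled context
    return F.normalize(label_feats + context_vec, dim=-1)     # stand-in additive prompting

def classify_person(person_feat, context_tokens, label_feats, k=8):
    """Score action classes for one person via cosine similarity."""
    interest = spot_interest_tokens(person_feat, context_tokens, k)
    text = prompt_label_features(label_feats, interest)
    return F.normalize(person_feat, dim=-1) @ text.T           # (C,) class scores

if __name__ == "__main__":
    D, N, C = 512, 196, 10                                     # toy dimensions
    scores = classify_person(torch.randn(D), torch.randn(N, D), torch.randn(C, D))
    print(scores.shape)                                        # torch.Size([10])
```

In the thesis the Context Prompting module is described as a learned component; the fixed additive fusion above merely stands in for it to show how per-person interest tokens could yield per-person text features.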