
Graduate Student: 劉兆寧 (Liu, Chao-Ning)
Thesis Title: 電影問答之交互注意力推論機制
A2A: Attention to Attention Reasoning for Movie Question Answering
Advisors: 陳煥宗 (Chen, Hwann-Tzong); 劉庭祿 (Liu, Tyng-Luh)
Committee Members: 邱維辰 (Chiu, Wei-Chen); 李哲榮 (Lee, Che-Rung)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 106 (2017–2018)
Language: English
Number of Pages: 34
Keywords (Chinese): 電影問答, 注意力機制, 視覺問答, 影片問答, 電腦視覺, 自然語言處理, 物體偵測
Keywords (English): movie question answering, attention, visual question answering, video question answering, computer vision, natural language processing, object detection
This thesis proposes an attention-to-attention reasoning mechanism aimed at building a movie question answering system, a highly challenging research problem. We propose two ways of analyzing attention. The first is attention propagation, which discovers latent but useful information by relating the cue clips to content outside those clips. The second is QA attention, which locates all content associated with the question and the candidate answers. Moreover, the proposed attention-to-attention reasoning mechanism can effectively consult both the visual and the textual content when answering, and can be conveniently built with standard neural network architectures. To address the out-of-vocabulary problem caused by rare names in movies, we adopt GloVe word vectors as a teacher model and construct a novel and flexible word embedding based on character n-gram token vectors. Our method is evaluated on the MovieQA benchmark dataset and achieves the best performance on the 'Video+Subtitles' subtask.
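To make the two attention analyses concrete, the following is a minimal, hypothetical sketch (in PyTorch) of how attention propagation and QA attention could operate over pre-computed clip and question/answer embeddings. The tensor shapes, random features, helper names, and the final scoring step are illustrative assumptions and do not reproduce the exact architecture of the thesis.

```python
import torch
import torch.nn.functional as F

def attention_propagation(cues, story):
    """Spread attention from cue clips to the rest of the story.

    cues:  (num_cues, d)  embeddings of clips aligned with the question
    story: (num_clips, d) embeddings of every clip/subtitle in the movie
    Returns per-clip weights that highlight latent but related content.
    """
    sim = cues @ story.t()                    # (num_cues, num_clips) cue-to-story similarity
    weights = F.softmax(sim, dim=-1).mean(0)  # average attention propagated from all cues
    return weights                            # (num_clips,)

def qa_attention(question, answers, story):
    """Attend to story content relevant to the question and each candidate answer."""
    q_att = F.softmax(story @ question, dim=0)       # (num_clips,)
    a_att = F.softmax(story @ answers.t(), dim=0)    # (num_clips, num_answers)
    return q_att.unsqueeze(1) * a_att                # joint question-answer attention

# Toy usage with random features (d = 300, 5 candidate answers).
story    = torch.randn(50, 300)
cues     = story[:3]               # pretend the first 3 clips are the cue segment
question = torch.randn(300)
answers  = torch.randn(5, 300)

prop   = attention_propagation(cues, story)       # (50,)
joint  = qa_attention(question, answers, story)   # (50, 5)
scores = (prop.unsqueeze(1) * joint).sum(0)       # per-answer score
prediction = scores.argmax().item()
```

The point of the sketch is only the flow of information: attention is first spread from the cue clips to the rest of the story, and the propagated weights are then combined with question- and answer-conditioned attention to score each candidate answer.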


This thesis presents the Attention to Attention (A2A) reasoning mechanism to address the challenging task of movie question answering (MQA). By focusing on the various aspects of attention cues, we establish the technique of attention propagation to uncover latent but useful information for solving the underlying QA task. In addition, the proposed A2A reasoning seamlessly leads to an effective fusion of the different representation modalities of the data, and can be conveniently constructed with popular neural network architectures. To tackle the out-of-vocabulary issue caused by the diverse language usage in present-day movies, we adopt the GloVe mapping as a teacher model and establish a new and flexible word embedding based on character n-gram learning. Our method is evaluated on the MovieQA benchmark dataset and achieves state-of-the-art accuracy for the 'Video+Subtitles' entry.
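As a rough illustration of the out-of-vocabulary strategy, the sketch below trains a character n-gram "student" embedding to mimic a GloVe "teacher" on in-vocabulary words, so that rare names absent from the vocabulary can still receive vectors built from their n-grams. The hashing scheme, dimensions, toy teacher dictionary, and training loop are assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

class NgramEmbedder(nn.Module):
    """Student model: a word vector is the mean of hashed n-gram vectors."""
    def __init__(self, dim=300, buckets=20_000):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, word):
        # hash() is stable within a single run, which is enough for this sketch
        ids = torch.tensor([hash(g) % self.buckets for g in char_ngrams(word)])
        return self.table(ids).mean(dim=0)

# Distill from a (toy) GloVe teacher: {word: 300-d vector}.
teacher = {"movie": torch.randn(300), "question": torch.randn(300)}
student = NgramEmbedder()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    loss = sum(F.mse_loss(student(w), v) for w, v in teacher.items())
    opt.zero_grad()
    loss.backward()
    opt.step()

# An out-of-vocabulary name still gets a vector composed from its n-grams.
vec = student("deckard")
```

The design choice illustrated here is that the student never needs a fixed vocabulary: any string, including character names never seen by GloVe, decomposes into n-grams that already have learned vectors.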

Contents
1 Introduction
2 Related Work
  2.1 Visual Captioning and Question Datasets
  2.2 Memory Network
  2.3 Attention Model
3 Proposed Method
  3.1 Visual and Linguistic Embedding
  3.2 Joint Embedding
  3.3 Attention Propagation
  3.4 QA Attention
  3.5 Optimal Answer Response
4 Experiments and Discussions
  4.1 Ablation Study on Key Components
  4.2 Leaderboard Comparison
  4.3 Model Selection
  4.4 Question Types Comparison
5 Conclusion

