
Graduate Student: 劉兆寧 (Liu, Chao-Ning)
Thesis Title: 電影問答之交互注意力推論機制
A2A: Attention to Attention Reasoning for Movie Question Answering
Advisors: 陳煥宗 (Chen, Hwann-Tzong); 劉庭祿 (Liu, Tyng-Luh)
Committee Members: 邱維辰 (Chiu, Wei-Chen); 李哲榮 (Lee, Che-Rung)
Degree: Master
Department:
Year of Publication: 2018
Graduation Academic Year: 106 (2017–2018)
Language: English
Number of Pages: 34
Keywords (Chinese): 電影問答, 注意力機制, 視覺問答, 影片問答, 電腦視覺, 自然語言處理, 物體偵測
Keywords (English): movie question answering, attention, visual question answering, video question answering, computer vision, natural language processing, object detection
This thesis proposes an attention-to-attention reasoning mechanism aimed at building a movie question answering system, a highly challenging research problem. We propose two ways of analyzing attention. The first is attention propagation, which discovers latent but useful information by relating the cue clips to content outside those clips. The second is QA attention, which locates all content associated with the question and the candidate answers. Moreover, the proposed attention-to-attention reasoning mechanism can effectively consult both the visual and the textual content when answering, and can be conveniently built with standard neural network architectures. To address the out-of-vocabulary problem caused by rare names in movies, we adopt GloVe word vectors as a teacher model and construct a novel and flexible word embedding based on character n-gram token vectors. Our method is evaluated on the MovieQA benchmark dataset and achieves the best performance on the 'Video+Subtitles' subtask.
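To make the two attention analyses concrete, the following is a minimal, hypothetical sketch (in PyTorch) of how attention propagation and QA attention could operate over pre-computed clip and question/answer embeddings. The tensor shapes, random features, helper names, and the final scoring step are illustrative assumptions and do not reproduce the exact architecture of the thesis.

```python
import torch
import torch.nn.functional as F

def attention_propagation(cues, story):
    """Spread attention from cue clips to the rest of the story.

    cues:  (num_cues, d)  embeddings of clips aligned with the question
    story: (num_clips, d) embeddings of every clip/subtitle in the movie
    Returns per-clip weights that highlight latent but related content.
    """
    sim = cues @ story.t()                    # (num_cues, num_clips) cue-to-story similarity
    weights = F.softmax(sim, dim=-1).mean(0)  # average attention propagated from all cues
    return weights                            # (num_clips,)

def qa_attention(question, answers, story):
    """Attend to story content relevant to the question and each candidate answer."""
    q_att = F.softmax(story @ question, dim=0)       # (num_clips,)
    a_att = F.softmax(story @ answers.t(), dim=0)    # (num_clips, num_answers)
    return q_att.unsqueeze(1) * a_att                # joint question-answer attention

# Toy usage with random features (d = 300, 5 candidate answers).
story    = torch.randn(50, 300)
cues     = story[:3]               # pretend the first 3 clips are the cue segment
question = torch.randn(300)
answers  = torch.randn(5, 300)

prop   = attention_propagation(cues, story)       # (50,)
joint  = qa_attention(question, answers, story)   # (50, 5)
scores = (prop.unsqueeze(1) * joint).sum(0)       # per-answer score
prediction = scores.argmax().item()
```

The point of the sketch is only the flow of information: attention is first spread from the cue clips to the rest of the story, and the propagated weights are then combined with question- and answer-conditioned attention to score each candidate answer.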


This thesis presents the Attention to Attention (A2A) reasoning mechanism to address the challenging task of movie question answering (MQA). By focusing on the various aspects of attention cues, we establish the technique of attention propagation to uncover latent but useful information for solving the underlying QA task. In addition, the proposed A2A reasoning seamlessly leads to an effective fusion of the different representation modalities of the data, and can be conveniently constructed with popular neural network architectures. To tackle the out-of-vocabulary issue caused by the diverse language usage in present-day movies, we adopt the GloVe mapping as a teacher model and establish a new and flexible word embedding based on character n-gram learning. Our method is evaluated on the MovieQA benchmark dataset and achieves state-of-the-art accuracy for the 'Video+Subtitles' entry.
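As a rough illustration of the out-of-vocabulary strategy, the sketch below trains a character n-gram "student" embedding to mimic a GloVe "teacher" on in-vocabulary words, so that rare names absent from the vocabulary can still receive vectors built from their n-grams. The hashing scheme, dimensions, toy teacher dictionary, and training loop are assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

class NgramEmbedder(nn.Module):
    """Student model: a word vector is the mean of hashed n-gram vectors."""
    def __init__(self, dim=300, buckets=20_000):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, dim)

    def forward(self, word):
        # hash() is stable within a single run, which is enough for this sketch
        ids = torch.tensor([hash(g) % self.buckets for g in char_ngrams(word)])
        return self.table(ids).mean(dim=0)

# Distill from a (toy) GloVe teacher: {word: 300-d vector}.
teacher = {"movie": torch.randn(300), "question": torch.randn(300)}
student = NgramEmbedder()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    loss = sum(F.mse_loss(student(w), v) for w, v in teacher.items())
    opt.zero_grad()
    loss.backward()
    opt.step()

# An out-of-vocabulary name still gets a vector composed from its n-grams.
vec = student("deckard")
```

The design choice illustrated here is that the student never needs a fixed vocabulary: any string, including character names never seen by GloVe, decomposes into n-grams that already have learned vectors.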

Contents
1 Introduction
2 Related Work
  2.1 Visual Captioning and Question Datasets
  2.2 Memory Network
  2.3 Attention Model
3 Proposed Method
  3.1 Visual and Linguistic Embedding
  3.2 Joint Embedding
  3.3 Attention Propagation
  3.4 QA Attention
  3.5 Optimal Answer Response
4 Experiments and Discussions
  4.1 Ablation Study on Key Components
  4.2 Leaderboard Comparison
  4.3 Model Selection
  4.4 Question Types Comparison
5 Conclusion

