
Author: Lu, Shao-Hao (呂紹豪)
Title: Dialogical speaker decoder with transition detection for next speaker prediction (轉換偵測解碼器於語者預測)
Advisor: Lee, Chi-Chun (李祈均)
Committee members: Chi, Tai-Shih (冀泰石); Chen, Kuan-Yu (陳冠宇); Tsao, Yu (曹昱)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2023
Academic year of graduation: 111
Language: Chinese
Number of pages: 33
Chinese keywords: 語者預測, 轉換偵測 (next speaker prediction, transition detection)
English keywords: next speaker prediction, transition detection
Abstract (translated from Chinese): Next speaker prediction and turn-taking prediction are important tasks in small-group interaction and human-machine interaction. The fluent conversations and interactions of everyday life require us to resolve three questions at once: who is speaking now, who the next speaker will be, and when the next speaker will take over the turn. The subtle behavioral differences that humans pick up during conversation are very difficult for machine learning models to capture. Many researchers have therefore used different behavioral features as cues for predicting the next speaker, such as gaze direction, speaking prosody, and body posture or gestures. In this work, I propose a model that jointly considers speakers' past speaking information and whether a change of speaking state is about to occur. It uses speakers' talk tendency and gaze direction to predict who the next speaker will be, and then combines past speaking information with turn-transition detection to refine the prediction. The model reaches a UAR of 78.11%, outperforming the champion model of the MultiMediate challenge 2021 by 3.41%.
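The abstract above reports performance as UAR (unweighted average recall), i.e. per-class recall averaged with equal weight per class. As a point of reference only, the toy Python sketch below shows the conventional way UAR is computed; the labels are invented for illustration and are not the thesis's data.

```python
# Minimal sketch: UAR is the macro-averaged per-class recall, so every
# next-speaker class counts equally regardless of how often it occurs.
# The label lists below are made-up toy data, not results from the thesis.
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 2, 2, 2, 3]   # ground-truth next-speaker IDs
y_pred = [0, 1, 1, 1, 2, 2, 0, 3]   # model predictions

uar = recall_score(y_true, y_pred, average="macro")  # macro recall == UAR
print(f"UAR = {uar:.2%}")
```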


English abstract: Next speaker prediction and turn-change prediction are two important tasks in group interaction and human-agent interaction. To hold a fluent and mutually understandable conversation, we need to coordinate three questions: who is currently speaking, who the next speaker will be, and when the next speaker should start to speak. Many researchers have therefore investigated the subtle human behaviors that surface in these interactions. Behaviors such as gaze direction, speaking prosody, and gestures have been used as turn-taking cues for models that predict the next speaker. In this work, I propose a decoder-based model, the dialogical speaker decoder (DSD), for next speaker prediction. It coordinates speakers' behavioral features (such as talk tendency and gaze pattern), speakers' past speaking history, and a speaking-state transition detection model equipped with time awareness and behavior divergence. The proposed model achieves a UAR of 78.11% on next speaker prediction, a 3.41% improvement over the champion model of the MultiMediate challenge 2021.
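The abstract describes the DSD as combining a base next-speaker prediction (driven by talk tendency and gaze) with a detector of speaking-state transitions. The sketch below is one plausible, simplified reading of such a decoding step under stated assumptions: the function name, the greedy keep-or-switch rule, and the 0.5 threshold are hypothetical illustrations, not the thesis's actual dialogical speaker decoder.

```python
# Hypothetical sketch of combining a base next-speaker model with a
# transition detector at decoding time. Names, threshold, and the greedy
# rule are illustrative assumptions, not the thesis implementation.
from typing import Sequence

def decode_next_speaker(
    base_scores: Sequence[float],   # base model score per participant
    p_transition: float,            # transition model: P(turn change)
    current_speaker: int,           # participant currently holding the floor
    threshold: float = 0.5,
) -> int:
    """Keep the floor unless a transition is detected; else pick the best other speaker."""
    if p_transition < threshold:
        # No predicted turn change: the current speaker keeps talking.
        return current_speaker
    # Predicted turn change: choose the best-scoring other participant.
    candidates = [i for i in range(len(base_scores)) if i != current_speaker]
    return max(candidates, key=lambda i: base_scores[i])

# Toy usage: participant 2 holds the floor and a transition is likely.
print(decode_next_speaker([0.1, 0.6, 0.2, 0.1], p_transition=0.8, current_speaker=2))  # -> 1
```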

Table of contents:
Acknowledgements
摘要 (Chinese abstract)
Abstract
1 Introduction
2 Data Corpus
2.1 Annotation method
2.1.1 Eye contact
2.1.2 Next speaker
3 Behavior Analysis
3.1 Talk Tendency
3.2 Gaze Behavior
3.2.1 Speaker gaze pattern and turn-changing
3.2.2 Listener gaze pattern and turn-changing
3.2.3 Timing structure of eye contact and turn-changing
4 Method
4.1 Task definition
4.2 Feature extraction
4.2.1 Active speaker detection model
4.2.2 Gaze model
4.3 Dialogical Speaker Decoder
4.3.1 Base prediction model
4.3.2 Transition model
4.3.3 Speaking assignment process
4.3.4 Decoding
5 Results
5.1 Dialogical speaker decoder performance
5.2 Base model performance
5.3 Transition model performance
5.3.1 Transition model analysis
6 Conclusion
References

