
Author: 葉松霖 (Yeh, Sung-Lin)
Title: 利用自動語音辨識特徵開發對話中的語音情感辨識系統 (ASR Dependent Speech Emotion Recognition in Spoken Dialog)
Advisor: 李祈均 (Lee, Chi-Chun)
Committee Members: 王新民 (Wang, Hsin-Min), 李宏毅 (Lee, Hung-Yi), 曹昱 (Tsao, Yu)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2020
Graduation Academic Year: 109
Language: English
Number of Pages: 42
Keywords (Chinese): 語音情感辨識、語音對話、前後文、注意力機制、情緒解碼、端到端語音辨識、領域自適應
Keywords (English): speech emotion recognition, spoken dialogs, context, attention mechanism, emotion decoding, end-to-end ASR, domain adaptation
    Speech emotion recognition (SER) plays an important role in future human-machine interaction and user experience. This thesis improves SER systems for spoken dialogs in three directions: context-based attention neural network modeling, emotion decoding in dialogs, and ASR-based speech features. First, we improve the conventional modeling approach for SER. Unlike previous work that applies neural networks to the acoustic features of a single utterance, we exploit the speakers' conversational information for more accurate recognition. We design an interaction-aware attention network (IAAN) that effectively fuses the contextual information of the speakers in a dialog into the acoustic features of the target utterance, allowing emotion to be predicted more accurately. On the benchmark emotional speech corpus IEMOCAP, our results surpass utterance-based emotion recognition models by 9%, and a systematic analysis of how different emotion transitions affect accuracy confirms that the architecture effectively uses interaction information to improve recognition.

    For emotion decoding, we propose an inference-time decoding algorithm that decodes the emotion state of every utterance in a dialog along the conversation order. We abstract the emotion recognition system into two independent modules: an utterance-based recognition model and a conversation-flow emotion decoder, analogous to an acoustic model paired with a language decoder in an ASR system. The prediction therefore jointly considers each speaker's emotional consistency and the overall emotional state of the dialog to reweight the probability distribution produced by the base recognizer, avoiding the emotional inconsistency across a dialog that arises in utterance-level recognition and achieving higher accuracy. Moreover, the algorithm operates entirely on already-trained models, so it can be integrated with different emotion recognition architectures rather than being tied to a fixed one.

    Finally, we use end-to-end (E2E) automatic speech recognition (ASR) systems to extract speech representations that replace the acoustic feature sets of the SER task. We propose a domain adaptation method for the ASR system to reduce the acoustic mismatch between the corpora in the ASR domain and the emotional speech corpora in the SER domain. With domain adaptation, the adapted ASR features achieve better accuracy than features extracted from the pre-trained ASR. In addition, we compare ASR representations from layers of different depths and analyze their effect on SER; we find that, after adaptation, features from the shallow ASR layers are comparable to hand-crafted acoustic features.
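    As an illustration of the interaction-aware attention idea described above, the following is a minimal PyTorch sketch, not the thesis's released implementation; the module layout, single-layer GRU encoders, and feature dimensions are assumptions. The target utterance's encoding attends over frame-level features of the two context utterances, one from the same speaker and one from the interlocutor, and the fused representation is classified into emotion categories:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttentionFusion(nn.Module):
    """Sketch of context-aware attention for SER: attend over the
    frame-level features of the two preceding utterances (same speaker
    and interlocutor), conditioned on the current utterance."""

    def __init__(self, feat_dim=45, hidden=128, n_classes=4):
        super().__init__()
        self.enc_cur = nn.GRU(feat_dim, hidden, batch_first=True)
        self.enc_ctx = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.clf = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def _attend(self, query, ctx_frames):
        # query: (B, H); ctx_frames: (B, T, H)
        q = query.unsqueeze(1).expand_as(ctx_frames)
        scores = self.attn(torch.cat([q, ctx_frames], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)                 # (B, T)
        return torch.bmm(weights.unsqueeze(1), ctx_frames).squeeze(1)

    def forward(self, cur, prev_own, prev_other):
        # each input: (B, T, feat_dim) frame-level acoustic features
        _, h_cur = self.enc_cur(cur)                        # (1, B, H)
        h_cur = h_cur.squeeze(0)
        own_frames, _ = self.enc_ctx(prev_own)              # (B, T, H)
        oth_frames, _ = self.enc_ctx(prev_other)
        c_own = self._attend(h_cur, own_frames)             # (B, H)
        c_oth = self._attend(h_cur, oth_frames)
        return self.clf(torch.cat([h_cur, c_own, c_oth], dim=-1))
```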


    This thesis targets improving the performance of speech emotion recognition (SER) systems in spoken dialogs in three areas: emotion modeling with context-based neural networks, emotion decoding, and ASR-based features for SER. We first aim to improve traditional modeling approaches for speech emotion recognition, which mainly model the emotion of each utterance in a dialog in isolation. We propose an attention-based neural network called the interaction-aware attention network (IAAN) that incorporates contextual information from the speakers in a dialog. When evaluated on IEMOCAP, a benchmark emotional speech corpus, our baselines outperform previous works on the 4-class speech emotion recognition task. By considering the interaction context, we improve recognition accuracy by 8% over approaches that recognize emotion from a single utterance. In the area of emotion decoding, we propose an inference algorithm, the dialogical emotion decoder (DED), that decodes the emotion state of each utterance in a dialog over time. We abstract emotion recognition systems into two separate modules: an utterance-based recognition engine and a conversation-flow decoder, analogous to an acoustic model with a language decoder in an ASR system. We compare this approach against recognition models that rely on a single utterance or on context alone. The proposed DED further improves IAAN by over 4%.
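    To make the decoding step concrete, the following is a hedged, greedy sketch of the idea: per-utterance emotion posteriors from any trained recognizer are reweighted by a speaker-consistency prior and decoded along the dialog order. The actual DED is more elaborate (it also tracks dialog-level emotion flow and keeps multiple hypotheses); the transition value and greedy search here are illustrative assumptions:

```python
import numpy as np

def decode_dialog(posteriors, speakers, self_trans=0.7, n_classes=4):
    """Greedy sketch of dialog-level emotion decoding: rescore each
    utterance's posterior with a prior that favors emotional
    consistency within the same speaker, in conversation order.

    posteriors: (N, n_classes) softmax outputs of an utterance-level SER model
    speakers:   length-N sequence of speaker ids, in dialog order
    """
    switch = (1.0 - self_trans) / (n_classes - 1)   # prob. of changing emotion
    last_emotion = {}                               # speaker id -> last decoded label
    decoded = []
    for post, spk in zip(posteriors, speakers):
        prior = np.full(n_classes, 1.0 / n_classes)
        if spk in last_emotion:                     # speaker-consistency prior
            prior = np.full(n_classes, switch)
            prior[last_emotion[spk]] = self_trans
        rescored = post * prior
        label = int(np.argmax(rescored / rescored.sum()))
        decoded.append(label)
        last_emotion[spk] = label
    return decoded
```

In practice, the transition statistics would be estimated from data and decoding would keep a beam of hypotheses rather than committing greedily; the sketch only shows how a conversation-flow prior can reshape the recognizer's per-utterance distribution.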
    Finally, we leverage end-to-end (E2E) automatic speech recognition (ASR) systems to extract speech representations as front-end feature sets for the SER task. We propose domain-adaptive approaches for ASR systems to reduce the acoustic mismatch between the read-speech corpora in the ASR domain and the emotional speech corpora in the SER domain. With domain adaptation, representations extracted from the adapted ASR perform better than representations from the pre-trained ASR. Moreover, we compare the effect of ASR representations from different layer depths on SER; the performance of low-layer ASR-based features is comparable to that of handcrafted acoustic features.
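    As a rough illustration of probing ASR representations for SER, the sketch below is written under stated assumptions: `asr_encoder` is a hypothetical adapted or pre-trained E2E ASR encoder that returns a list of per-layer hidden states of shape (B, T, D), and the layer index and dimensions are placeholders. An utterance-level classifier is trained on time-pooled hidden states of one chosen encoder layer:

```python
import torch
import torch.nn as nn

class LayerwiseSERProbe(nn.Module):
    """Sketch: use one encoder layer of an E2E ASR model as the
    front-end feature for SER, with mean pooling over time."""

    def __init__(self, asr_encoder, layer=2, feat_dim=512, n_classes=4):
        super().__init__()
        self.asr_encoder = asr_encoder      # hypothetical frozen or adapted encoder
        self.layer = layer                  # shallow layers are competitive after adaptation
        self.clf = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, audio_features):
        with torch.no_grad():               # keep the ASR front end fixed
            hidden_states = self.asr_encoder(audio_features)
        feats = hidden_states[self.layer]   # (B, T, D) from the chosen layer
        pooled = feats.mean(dim=1)          # utterance-level pooling over time
        return self.clf(pooled)
```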

    Acknowledgements
    摘要 (Chinese Abstract)
    Abstract
    Chapter 1  Introduction
      1.1  Previous Work
      1.2  Proposed Method
    Chapter 2  Database
      2.1  IEMOCAP
      2.2  MELD
      2.3  LibriSpeech
    Chapter 3  Task 1: Context-Based Emotion Recognition
      3.1  Interaction-Aware Attention Network
      3.2  Experimental Setup and Results
    Chapter 4  Task 2: Dialogical Emotion Decoding
      4.1  Task Definition
      4.2  Dialogical Emotion Decoder
      4.3  Experimental Setup and Results
    Chapter 5  Task 3: ASR-Based Features for SER
      5.1  ASR Model
      5.2  Domain Adaptation
      5.3  SER Model
      5.4  Experimental Setup and Results
    Chapter 6  Conclusion
    References

