
Author: Chao, Gao-Yi (趙高逸)
Title: Improving the Speech Emotion Recognition System by Considering Individual and Context Differences
Advisor: Lee, Chi-Chun (李祈均)
Committee Members: Tsao, Yu (曹昱); Chien, Jen-Tzung (簡仁宗); Chen, Kuan-Yu (陳冠宇)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 108
Language: English
Number of Pages: 51
Keywords (Chinese, translated): speech emotion recognition; cross-corpus speech emotion analysis; multi-speaker model
Keywords (English): speech emotion recognition, multi-speaker, cross-corpus SER
Abstract (Chinese, translated): Human factors play a very important role in the process of emotion perception. Psychological research shows that, across different personality traits and situations, the ways we express emotions and the degrees to which we feel them are not the same. Consequently, if we measure and quantify these behaviors in a single uniform way, a model with poor recognition performance is to be expected, especially with data-driven models such as neural networks, where these differences cut even deeper. On the perceiver side, building sub-models for individual annotators to improve recognition performance is by now a mature research topic, but for cross-context (cross-corpus) and individualized emotion recognition the gains have been limited. In this work we therefore propose two methods, a maximum regression discrepancy (MRD) model and a multi-speaker mixture-of-experts (MoE) model, for the cross-corpus and individualized speech emotion recognition scenarios respectively; compared with previously proposed methods, both achieve significant improvements on the USC-IEMOCAP and MSP-IMPROV databases. For the mixture-of-experts model, we additionally compare the output gating weights against the accuracies of the pre-trained expert models and find that the proposed model indeed assigns, according to speakers' different characteristics, appropriate weights to the individual sub-models for the final prediction. Summarizing Experiments 1 and 2, we conclude that, beyond the emotion perceiver, accounting for differences in context and in the expresser makes the model more human-like and hence more effective.


Abstract (English): Human factors play a very important part in the process of emotion perception. Past psychological research shows that how we express and feel emotions differs depending on personality and situation. Therefore, if we measure and quantify these behaviors in a single uniform way, a model with poor recognition ability is to be expected. This is especially true when we use a data-driven model such as a neural network, where these differences have a deep impact on the extracted feature set. Based on these factors, building sub-models for different annotators to improve recognition ability has become a popular topic on the perceiver side; on the cross-corpus and personalized emotion recognition sides, however, the results obtained so far are limited. In this study we therefore propose a maximum regression discrepancy (MRD) model, evaluated against related methods in a cross-corpus speech emotion recognition scenario, and, to further personalize the SER system, a multi-speaker mixture-of-experts (MoE) model. Both proposed methods yield significant improvements on the USC-IEMOCAP and MSP-IMPROV databases. In addition, for the mixture-of-experts model, we compare the gating weights produced by the MoE with the recognition results of the pre-trained expert models, and find that the proposed model assigns each expert a weight appropriate to the speaker of a given sample. Summarizing Experiment 1 and Experiment 2, we find that, in addition to the emotion perceiver, considering human factors on the context and expresser sides makes the model more human-like and, in turn, more robust.
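To make the MRD idea concrete, below is a minimal PyTorch sketch of a three-step adversarial discrepancy loop in the style of maximum classifier discrepancy, adapted to regression: fit two regressor heads on the labeled source corpus, push the heads apart on target-corpus data with the encoder frozen, then update the encoder to pull the heads back together. The 384-dimensional input, layer sizes, learning rates, and single-attribute output are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(384, 128), nn.ReLU())  # shared feature extractor
reg1 = nn.Linear(128, 1)  # regressor head 1
reg2 = nn.Linear(128, 1)  # regressor head 2
mse = nn.MSELoss()

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_reg = torch.optim.Adam(list(reg1.parameters()) + list(reg2.parameters()), lr=1e-4)

def discrepancy(h):
    # disagreement between the two heads on the same features
    return (reg1(h) - reg2(h)).abs().mean()

def train_step(x_src, y_src, x_tgt):
    # Step 1: fit the encoder and both heads on labeled source data.
    opt_enc.zero_grad(); opt_reg.zero_grad()
    h = encoder(x_src)
    (mse(reg1(h), y_src) + mse(reg2(h), y_src)).backward()
    opt_enc.step(); opt_reg.step()

    # Step 2: with encoder features detached (encoder frozen), maximize the
    # heads' disagreement on target data while staying accurate on source.
    opt_enc.zero_grad(); opt_reg.zero_grad()
    h_src, h_tgt = encoder(x_src).detach(), encoder(x_tgt).detach()
    (mse(reg1(h_src), y_src) + mse(reg2(h_src), y_src) - discrepancy(h_tgt)).backward()
    opt_reg.step()

    # Step 3: with the heads frozen, update the encoder to shrink the
    # disagreement, pulling target features toward the source manifold.
    opt_enc.zero_grad(); opt_reg.zero_grad()
    discrepancy(encoder(x_tgt)).backward()
    opt_enc.step()

# toy usage with random stand-in features and labels
train_step(torch.randn(32, 384), torch.randn(32, 1), torch.randn(32, 384))
```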

Contents

Acknowledgements
Abstract (Chinese)
Abstract
Contents
CHAPTER 1 INTRODUCTION
CHAPTER 2 DATABASE AND FEATURES
  2.1 DATABASE
    2.1.1 IEMOCAP
    2.1.2 MSP-IMPROV
  2.2 FEATURES
CHAPTER 3 MRD RESEARCH METHODOLOGY
  3.1 OVERALL IDEA OF MRD APPROACH
  3.2 ADVERSARIAL DISCREPANCY LEARNING PROCEDURE
    3.2.1 Step 1
    3.2.2 Step 2
    3.2.3 Step 3
  3.3 EXPERIMENTS & RESULTS
  3.4 DISCUSSION & CONCLUSION
CHAPTER 4 MULTI-SPEAKER MIXTURE OF EXPERTS (MOE)
  4.1 OVERALL IDEA OF MOE APPROACH
  4.2 MIXTURE OF EXPERTS (MOE)
    4.2.1 Mahalanobis distance
    4.2.2 Mixture of Experts (MoE)
    4.2.3 Binary classification
  4.3 INDIVIDUAL DIFFERENCES & MMD
  4.4 TRAINING & OBJECTIVE FUNCTION
    4.4.1 MoE objective
    4.4.2 MTL objective
    4.4.3 MMD objective
    4.4.4 Entropy (smooth) objective
    4.4.5 Objective
  4.5 EXPERIMENT & RESULTS
  4.6 ANALYSIS & CONCLUSION
    4.6.1 Analytical method
    4.6.2 Analysis on USC-IEMOCAP & MSP-IMPROV
CHAPTER 5 CONCLUSION
REFERENCES
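For orientation against the Chapter 4 outline above, the sketch below shows the general shape of a multi-speaker mixture of experts: frozen pre-trained per-speaker expert classifiers, a gating network that weights their predictions per sample, an entropy penalty of the kind listed as the "Entropy (smooth)" objective, and a linear-kernel stand-in for the MMD two-sample statistic named in Sections 4.3 and 4.4. The expert count, feature dimension, and class count are assumptions for illustration; this is not the thesis's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerMoE(nn.Module):
    """Gating network over frozen per-speaker expert classifiers."""
    def __init__(self, feat_dim=384, n_experts=10, n_classes=4):
        super().__init__()
        # one pre-trained expert per training speaker (frozen in this sketch)
        self.experts = nn.ModuleList(
            nn.Linear(feat_dim, n_classes) for _ in range(n_experts))
        for e in self.experts:
            e.requires_grad_(False)
        self.gate = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_experts))

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=-1)                  # (B, E) gate weights
        preds = torch.stack([F.softmax(e(x), dim=-1)
                             for e in self.experts], dim=1)  # (B, E, C)
        return (w.unsqueeze(-1) * preds).sum(dim=1), w       # weighted vote

def gate_entropy(w, eps=1e-8):
    # entropy penalty on the gate weights, discouraging degenerate gating
    return -(w * (w + eps).log()).sum(dim=-1).mean()

def linear_mmd(a, b):
    # crude linear-kernel estimate of the MMD two-sample statistic
    return (a.mean(0) - b.mean(0)).pow(2).sum()

model = SpeakerMoE()
fused, w = model(torch.randn(8, 384))  # fused: (8, 4) posteriors, w: (8, 10)
```

Comparing the per-sample weights `w` against each expert's held-out accuracy is the kind of analysis the abstract describes for relating the gate outputs to the pre-trained models.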

