Author: Lin, Hsiao-Pin (林曉彬)
Thesis Title: Adapt a New Emotion Class Detection by Speech using Mixture of Emotional Experts (使用多情緒專家模型偵測新進語音情緒類別)
Advisor: Lee, Chi-Chun (李祈均)
Committee Members: Chi, Tai-Shih (冀泰石); Tsao, Yu (曹昱)
Degree: Master (碩士)
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 111
Language: English
Number of Pages: 41
Keywords (Chinese): 語音情緒辨識, 多專家模型, 小樣本學習
Keywords (English): speech emotion recognition, mixture of experts, few-shot learning
Most speech emotion recognition (SER) research focuses on four emotion classes: neutral, angry, sad, and happy. To apply SER in real-life settings, however, the remaining emotions cannot be ignored. Humans express hundreds of emotions, and retraining on large amounts of data for every new emotion would be far too time-consuming. The usual remedy is to fine-tune an existing pre-trained model with a small amount of target data, but a model pre-trained only on categorical emotion labels transfers poorly. Fortunately, recent work has shown that dimensional emotion labels can aid the classification of categorical emotions. Building on this idea, this thesis proposes a mixture of emotional experts (MOEE) for few-shot detection of a new speech emotion class: expert models pre-trained on the four categorical emotions and on dimensional emotion labels are fine-tuned with a small number of target-emotion samples, and a gating network learns the expert weights from the audio combined with the distances between experts. On the IEMOCAP corpus, frustration detection reaches a UAR of 63.26%. On the MSP-PODCAST corpus, detection of surprise, disgust, and contempt fine-tuned on only 10 samples already exceeds training on the full data. For analysis, the per-expert weights that MOEE outputs can be used to measure the similarity between emotions, which distinguishes this approach from other few-shot learning methods.
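To make the mixture idea above concrete, the sketch below shows how pre-trained emotion experts can be combined through a gating network that weights each expert per utterance. This is only an illustration of the mechanism described in the abstract, not the thesis's implementation: all module names, dimensions, the number of experts, and the sigmoid detection head are assumptions, and the thesis's additional use of inter-expert distances inside the gate is omitted for brevity.

```python
# Minimal sketch of a mixture of emotional experts (MOEE), assuming PyTorch.
# The expert modules stand in for models pre-trained on categorical (4-class)
# and dimensional emotion labels and then fine-tuned on the new target emotion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatingNetwork(nn.Module):
    """Maps an utterance-level acoustic embedding to one weight per expert."""

    def __init__(self, feat_dim, num_experts):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_experts)

    def forward(self, x):
        # Softmax keeps the expert weights non-negative and summing to one.
        return F.softmax(self.fc(x), dim=-1)


class MixtureOfEmotionalExperts(nn.Module):
    """Mixes per-expert scores for the new emotion class with learned weights."""

    def __init__(self, experts, feat_dim):
        super().__init__()
        self.experts = nn.ModuleList(experts)           # each expert outputs one logit
        self.gate = GatingNetwork(feat_dim, len(experts))

    def forward(self, x):
        weights = self.gate(x)                                       # (batch, num_experts)
        scores = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, num_experts, 1)
        mixed = (weights.unsqueeze(-1) * scores).sum(dim=1)         # (batch, 1)
        return torch.sigmoid(mixed)                                  # probability of the target emotion


# Toy usage with two hypothetical experts over 128-dim utterance embeddings.
experts = [nn.Linear(128, 1), nn.Linear(128, 1)]
model = MixtureOfEmotionalExperts(experts, feat_dim=128)
probs = model(torch.randn(4, 128))                                   # shape (4, 1)
```

The per-utterance weights produced by the gate are also what the analysis section uses to compare emotion similarity. The reported UAR (unweighted average recall) is simply the mean of per-class recall; in scikit-learn this corresponds to recall_score(y_true, y_pred, average="macro").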