| Field | Value |
|---|---|
| Graduate student | 黃姿云 Huang, Tzu-Yun |
| Thesis title | A Dual Complementary Acoustic Embedding Network: Mining Discriminative Characteristics from Raw-waveform for Speech Emotion Recognition (Chinese title: 雙重互補聲學嵌入網絡：從原始波形挖掘與特徵集相異的語音情感識別特徵) |
| Advisor | 李祈均 Lee, Chi-Chun |
| Oral defense committee | 曹昱 Tsao, Yu; 李宏毅 Lee, Hung-Yi; 陳冠宇 Chen, Kuan-Yu |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2019 |
| Academic year of graduation | 107 (ROC calendar, 2018–2019) |
| Language | English |
| Number of pages | 43 |
| Keywords | speech emotion recognition, raw waveform, end-to-end learning, acoustic space augmentation |
Speech emotion recognition has recently gained traction across a broad range of fields and has achieved impressive performance with deep learning. However, end-to-end learning from the complex, time-varying raw waveform still struggles to surpass carefully designed hand-crafted feature sets. A feature space augmentation approach based on complementary feature elicitation can leverage the discriminative power of both representations at once. In this study we propose a Dual Complementary Acoustic Embedding Network (DCaEN) that jointly models hand-crafted features and the raw waveform to improve emotion recognition. Taking an expert-designed acoustic feature set for affective computing as the reference, we constrain the cosine similarity between the two embeddings toward negative values; this complementary constraint drives the network to mine information from the raw waveform that the feature set does not capture.
Experimental results for emotion category prediction on the IEMOCAP and MSP-IMPROV databases show that the proposed model achieves 59.31% and 46.22% accuracy, respectively, outperforming networks that use either the raw waveform or the feature set alone. Moreover, we present a visualization analysis of the learned complementary space to further illustrate the effect of the complementary constraint.
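To make the complementary constraint concrete, the sketch below shows one way such a two-branch model and its loss could be written in PyTorch. It is a minimal illustration only: the branch architectures, layer sizes, the 88-dimensional feature input (an eGeMAPS-style functional vector), and the loss weight `alpha` are assumptions made for this example, not the configuration described in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCaENSketch(nn.Module):
    """Illustrative two-branch model: one embedding from hand-crafted
    features, one from the raw waveform, concatenated for classification.
    Architectures and sizes are placeholders, not the thesis settings."""

    def __init__(self, feat_dim=88, emb_dim=128, num_classes=4):
        super().__init__()
        # Branch 1: embeds an utterance-level hand-crafted feature vector.
        self.feat_branch = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        # Branch 2: embeds the raw waveform (shape: batch x 1 x samples)
        # with 1-D convolutions and global average pooling.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, emb_dim),
        )
        # Classifier over the concatenated (augmented) acoustic space.
        self.classifier = nn.Linear(2 * emb_dim, num_classes)

    def forward(self, feats, wave):
        e_feat = self.feat_branch(feats)                       # (B, emb_dim)
        e_wave = self.wave_branch(wave)                        # (B, emb_dim)
        logits = self.classifier(torch.cat([e_feat, e_wave], dim=-1))
        return logits, e_feat, e_wave

def dcaen_loss(logits, labels, e_feat, e_wave, alpha=1.0):
    """Cross-entropy plus a term that pushes the cosine similarity between
    the two embeddings toward negative values (the complementary constraint)."""
    ce = F.cross_entropy(logits, labels)
    cos = F.cosine_similarity(e_feat, e_wave, dim=-1).mean()
    return ce + alpha * cos  # minimizing cos drives similarity below zero
```

Minimizing the mean cosine similarity pushes the two embeddings toward opposing directions in the shared space, so the waveform branch is encouraged to encode information absent from the hand-crafted features; the hypothetical weight `alpha` trades this pressure off against classification accuracy.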