Author: 陳媛媛
Title: 應用支援向量機之情緒語音辨識 (The Recognition of Emotional Speech Using Support Vector Machine)
Advisor: 唐文華 (Tarng Wernhuar)
Degree: Master (碩士)
Department: 資訊工程學系 (Computer Science), 電機資訊學院
Year of publication: 2010
Graduation academic year: 98 (2009-2010)
Language: Chinese
Number of pages: 51
Keywords (Chinese): 語音情緒辨識、支援向量機、音高、梅爾倒頻譜參數係數
Keywords (English): Emotional speech recognition, support vector machine, pitch, Mel-scale Frequency Cepstral Coefficient
Chinese abstract (translated):
This study uses a Support Vector Machine (SVM) as the tool for speech emotion recognition, classifying speech data into four emotion categories: happiness, anger, sadness, and neutral (normal). A hierarchical classification method is used to adjust the feature weights, with the speech features trained separately from the energy and frequency perspectives. A total of 29 speech features are used, combining pitch, spectral and related measures, and a feature obtained with the Continuous Wavelet Transform, capturing the relation between instantaneous frequency and position within the utterance, is proposed as one of the training parameters, so as to strengthen the role of the overall intonation during recognition.
The Berlin emotional speech database is used, with the speech data split into male and female sets for training. The average accuracy on the test set is 63.53%, and the average accuracy on the overall dataset is 90.89%. Among the emotion categories, anger achieves the best precision, with an average precision of 83.37%. After adding the frequency-time feature to the happiness/anger classification, the accuracy on the male test set rises from 72.73% to 77.27%, and on the female test set from 63.16% to 68.42%.
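As an illustration of the kind of utterance-level features described above, the following is a minimal sketch, assuming the librosa and PyWavelets libraries. The pitch, energy and MFCC statistics and the CWT-based frequency-position correlation are illustrative stand-ins and do not reproduce the exact 29 features used in the thesis.

```python
import numpy as np
import librosa
import pywt

def extract_features(path, n_mfcc=12, sr=16000):
    """Utterance-level feature vector: pitch/energy statistics, mean MFCCs,
    and a CWT-based temporal-spectrum value (illustrative, not the thesis's set)."""
    y, sr = librosa.load(path, sr=sr)

    # Pitch contour (YIN estimator); statistics summarize the intonation.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    pitch_stats = [np.mean(f0), np.std(f0), np.max(f0), np.min(f0)]

    # Short-time energy statistics.
    rms = librosa.feature.rms(y=y)[0]
    energy_stats = [np.mean(rms), np.std(rms), np.max(rms)]

    # Mean MFCCs describe the spectral envelope (the "frequency" view).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_means = list(mfcc.mean(axis=1))

    # CWT-based temporal-spectrum value: dominant instantaneous frequency over
    # time, correlated with the relative position in the utterance (a rough
    # stand-in for the frequency-time feature proposed in the thesis).
    target_sr = 800
    y_low = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
    coefs, freqs = pywt.cwt(y_low, np.arange(2, 64), "morl",
                            sampling_period=1.0 / target_sr)
    dominant = freqs[np.abs(coefs).argmax(axis=0)]      # dominant frequency per sample
    position = np.linspace(0.0, 1.0, dominant.size)     # relative position, 0 to 1
    temporal_spectrum = [float(np.corrcoef(position, dominant)[0, 1])]

    return np.array(pitch_stats + energy_stats + mfcc_means + temporal_spectrum)
```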
English abstract:
In this study, a support vector machine was used as a classifier to divide emotional speech into four categories: happiness, anger, sadness, and normal. A hierarchical classification method was designed along two dimensions, arousal and pleasure, to adjust the feature weights. In total, 29 acoustic features were selected for training. To emphasize the tone of a sentence, we propose a temporal-spectrum feature obtained with the continuous wavelet transform.
The Berlin emotional speech database, divided into male and female speech, was chosen as the training and testing data. In the experimental results, the average accuracy on the test set was 63.53%, and the average accuracy on the overall dataset was 90.89%. The highest precision among the emotion categories was obtained for anger, at 83.37%. After adding the temporal-spectrum feature to the classification of happiness versus anger, the accuracy on the male test data increased from 72.73% to 77.27%, and on the female test data from 63.16% to 68.42%.
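The sketch below illustrates the hierarchical classification scheme described in the abstract, assuming scikit-learn's SVC and utterance-level feature vectors (for example, from extract_features above). The arousal grouping (anger/happiness versus sadness/normal), the kernel choice, and the C/gamma values are assumptions for illustration, not the tuned configuration of the thesis.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

HIGH_AROUSAL = {"anger", "happiness"}   # low arousal: "sadness", "normal"

def make_svm():
    # RBF-kernel SVM with feature standardization; C and gamma are placeholders,
    # not the weights tuned in the thesis.
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))

class HierarchicalEmotionSVM:
    """Stage 1 separates high- from low-arousal speech; stage 2 resolves each pair."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y, dtype=object)
        high = np.array([label in HIGH_AROUSAL for label in y])
        self.arousal_svm = make_svm().fit(X, high)          # anger/happiness vs. sadness/normal
        self.high_svm = make_svm().fit(X[high], y[high])    # anger vs. happiness
        self.low_svm = make_svm().fit(X[~high], y[~high])   # sadness vs. normal
        return self

    def predict(self, X):
        X = np.asarray(X)
        high = self.arousal_svm.predict(X).astype(bool)
        labels = np.empty(len(X), dtype=object)
        if high.any():
            labels[high] = self.high_svm.predict(X[high])
        if (~high).any():
            labels[~high] = self.low_svm.predict(X[~high])
        return labels
```

Since the thesis trains male and female speech separately, such a classifier would simply be fitted twice, once per gender subset.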