
Graduate Student: 李芝宇 (Lee, Chih-Yu)
Thesis Title: Combining Different Levels of Features for Emotion Recognition in Speech (結合不同層級特徵於語音情緒辨識之研究)
Advisor: 張智星 (Jang, Jyh-Shing Roger)
Oral Defense Committee:
Degree: Master
Department: Department of Computer Science, College of Electrical Engineering and Computer Science
Year of Publication: 2011
Graduation Academic Year: 99 (ROC calendar; 2010–2011)
Language: English
Number of Pages: 39
Chinese Keywords: 情緒辨識 (emotion recognition)
Abstract (Chinese): Emotion recognition is widely applied in many fields, and recognizing emotions correctly has always been the goal of this research. We argue that speech features extracted at different timing levels carry different information about the speech signal, and we expect that combining features across levels can raise the recognition rate. The experimental results show that combining features from different levels effectively compensates for the information that each individual level lacks. We propose several ways of combining features from different levels, and our experiments confirm that these combinations do improve the recognition rate. The experiments use the Berlin database of German emotional speech, which covers seven emotion classes, and the English eNTERFACE emotional database, which covers six. The extracted features fall into two kinds: the first is extracted at the frame level and comprises energy, pitch, and Mel-frequency cepstral coefficients; the second is extracted at the segment and utterance levels, in the form of low-level descriptors (LLDs). Compared with features from any single level, combining features from multiple levels effectively improves the recognition rate.
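To make the frame-level features named above concrete, here is a minimal sketch of extracting per-frame MFCCs, pitch, and energy in Python with the librosa library. This is an illustration, not the thesis's actual pipeline; the 13-coefficient MFCC count, the 25 ms frame / 10 ms hop, and the 50–500 Hz pitch search range are assumptions the abstract does not specify.

    # Illustrative only: frame-level MFCC + pitch + energy extraction.
    import numpy as np
    import librosa

    def frame_level_features(path, sr=16000, frame_len=400, hop=160):
        """Return a (num_frames, 15) matrix: 13 MFCCs, F0, log energy."""
        y, _ = librosa.load(path, sr=sr)

        # 13 Mel-frequency cepstral coefficients per 25 ms frame, 10 ms hop.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=frame_len, hop_length=hop)

        # Per-frame pitch (F0) from the pYIN tracker; unvoiced frames are NaN.
        f0, _, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr,
                                frame_length=frame_len, hop_length=hop)
        f0 = np.nan_to_num(f0)  # represent unvoiced frames as 0 Hz

        # Per-frame log energy from the short-time RMS.
        rms = librosa.feature.rms(y=y, frame_length=frame_len,
                                  hop_length=hop)[0]
        log_energy = np.log(rms + 1e-10)

        # The three trackers can disagree by a frame or two; align and stack.
        n = min(mfcc.shape[1], len(f0), len(log_energy))
        return np.column_stack([mfcc[:, :n].T, f0[:n], log_energy[:n]])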


Abstract (English): Emotion recognition has been successfully applied in many fields. It is believed that features extracted at each timing level provide different information about the emotional speech signal and can therefore compensate for one another. To achieve promising recognition accuracy, this thesis proposes several methods for combining features extracted at different timing levels: likelihood combination, weighted likelihood combination, raw feature combination, and partial raw feature combination. We extract spectral and prosodic features at the frame level, and low-level descriptors (LLDs) at the segment and utterance levels. The Berlin Emotion Database and the eNTERFACE emotional database are used in the experiments. Compared with conventional features from one or two timing levels, the combination of features from all three timing levels yields a higher recognition rate.
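Of the four combination schemes named above, the first three lend themselves to a short sketch. Assuming each timing level already produces a per-class log-likelihood vector from its own classifier, likelihood combination sums the vectors (i.e., multiplies the likelihoods), weighted likelihood combination sums them with per-level weights, and raw feature combination instead concatenates the levels' feature vectors before a single downstream classifier. This is a plausible reading of the method names, not the thesis's code; partial raw feature combination is omitted because the abstract does not define it, and all numbers in the usage example are made up.

    import numpy as np

    def likelihood_combination(loglikes):
        """loglikes: list of (num_classes,) log-likelihood vectors, one per
        timing level. Summing logs = multiplying likelihoods, treating the
        levels as independent. Returns the winning class index."""
        return int(np.argmax(np.sum(loglikes, axis=0)))

    def weighted_likelihood_combination(loglikes, weights):
        """Same idea, but each level's log-likelihoods are scaled by a
        per-level weight (e.g., tuned on held-out data) before summing."""
        return int(np.argmax(sum(w * ll for w, ll in zip(weights, loglikes))))

    def raw_feature_combination(feature_vectors):
        """Concatenate per-level feature vectors into one vector, to be fed
        to a single classifier."""
        return np.concatenate(feature_vectors)

    # Toy usage: three timing levels, seven emotion classes (as in the
    # Berlin database). Class 2 has the strongest combined evidence.
    frame_ll = np.log([.05, .10, .40, .15, .10, .10, .10])
    seg_ll   = np.log([.10, .05, .30, .25, .10, .10, .10])
    utt_ll   = np.log([.05, .05, .50, .20, .05, .05, .10])
    print(likelihood_combination([frame_ll, seg_ll, utt_ll]))        # class 2
    print(weighted_likelihood_combination([frame_ll, seg_ll, utt_ll],
                                          [0.5, 0.3, 0.2]))          # class 2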

Table of Contents:
摘要 (Chinese Abstract)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Literature
1.3 Proposed Frameworks
1.4 Thesis Organization
Chapter 2 Feature Extraction
2.1 Preprocessing
2.2 Frame-Level Features
2.2.1 Mel Frequency Cepstral Coefficients
2.2.2 Pitch
2.2.3 Energy
2.3 Segment-Level Features
2.4 Utterance-Level Features
2.5 Feature Selection
Chapter 3 Combination Methods
3.1 Likelihood Combination
3.2 Weighted Likelihood Combination
3.3 Raw Feature Combination
3.4 Partial Raw Feature Combination
Chapter 4 Experiments
4.1 Emotional Databases
4.2 Experimental Setup
4.3 Likelihood Combination
4.4 Weighted Likelihood Combination
4.5 Raw Feature Combination
4.6 Partial Raw Feature Combination
4.7 Analysis of Experimental Results
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
Appendix A
Appendix B


Full-Text Availability: The full text has not been authorized for public release (on-campus or off-campus networks).
