
Author: 黃大祐 (Huang, Da-Yu)
Thesis title: 應用於中文語音情緒辨識的聲學特徵萃取
(Acoustic Feature Extraction for Mandarin Speech Emotion Recognition)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Committee members: 曾建維 (Tzeng, Jian-Wei), 李夢麟 (Li, Meng-Lin), 王道維 (Wang, Daw-Wei)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication year: 2023
Graduating academic year: 111 (ROC calendar, i.e., 2022-2023)
Language: Chinese
Number of pages: 68
Chinese keywords: 語音情緒辨識 (speech emotion recognition), 聲學特徵萃取 (acoustic feature extraction)
Foreign keywords: Speech Emotion Recognition, Feature Extraction
Abstract:

At the end of 2020, almost two and a half years ago, during the severe outbreak of COVID-19, universities across Taiwan transitioned to remote teaching. Staying home to help contain the virus, students lost opportunities to maintain emotional ties with their peers, and it became even harder for them to relieve the worries accumulating inside. At the end of November of that same year, several university students took their own lives within a span of about a week and a half. Students who were already at a low point struggled even more in such an atmosphere and could only seek help from psychological counseling, leaving the counseling center at National Tsing Hua University short-staffed.

In an attempt to help improve this situation, this thesis constructs a speech emotion recognition database from scratch and uses basic statistics to verify which acoustic features are particularly helpful for Mandarin speech emotion recognition. The experimental results show that specific statistics of the acoustic features have relatively significant positive or negative correlations with the participants' responses on the psychological scales, and some of these correlation coefficients agree with clinical experience in psychology.
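The abstract describes the method only at a high level: extract per-recording statistics of acoustic features, then correlate them with scale scores. The following is a minimal sketch of that pipeline, not the thesis's actual code, assuming librosa and SciPy are available; the file names, feature subset (pYIN fundamental frequency, RMS energy, zero-crossing rate), and scale scores are hypothetical placeholders.

import numpy as np
import librosa
from scipy.stats import pearsonr

def utterance_stats(wav_path):
    """Return mean/std of a few per-frame acoustic features for one recording."""
    y, sr = librosa.load(wav_path, sr=None)
    # Fundamental frequency via pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'), sr=sr)
    rms = librosa.feature.rms(y=y)[0]                # root-mean-square energy
    zcr = librosa.feature.zero_crossing_rate(y)[0]   # zero-crossing rate
    stats = {}
    for name, x in [('f0', f0[~np.isnan(f0)]), ('rms', rms), ('zcr', zcr)]:
        stats[f'{name}_mean'] = float(np.mean(x))
        stats[f'{name}_std'] = float(np.std(x))
    return stats

# Hypothetical inputs: one recording per participant and one scale total each.
wav_paths = ['p01.wav', 'p02.wav', 'p03.wav']   # placeholder file names
scale_scores = np.array([12.0, 25.0, 7.0])      # e.g., depression-scale totals

all_stats = [utterance_stats(p) for p in wav_paths]
for key in all_stats[0]:
    values = np.array([s[key] for s in all_stats])
    r, p_value = pearsonr(values, scale_scores)  # Pearson's r per feature statistic
    print(f'{key}: r = {r:+.2f}, p = {p_value:.3f}')

In this framing, each feature statistic that shows a strong positive or negative r against a scale score is a candidate "helpful" feature in the sense the abstract describes.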

Table of contents:
1 Introduction
  1.1 Literature review
    1.1.1 Emotion labels and databases
    1.1.2 Feature extraction
    1.1.3 Model design and optimization
  1.2 Research direction
  1.3 Organization of this thesis
2 Construction of the speech emotion recognition database
  2.1 Participant personal information form
    2.1.1 Real-life factors that may cause stress
    2.1.2 Compound emotions
  2.2 Psychological scales
    2.2.1 Depression scale
    2.2.2 Suicidal ideation scale
    2.2.3 Mania scale
    2.2.4 Impulsiveness scale
    2.2.5 Hopelessness scale
  2.3 Speech data
    2.3.1 One-minute neutral passage: "Weather News"
    2.3.2 Three-minute personal monologue: "A Few Words to Yourself Three Years from Now"
3 Speech analysis methods
  3.1 Acoustic features
    3.1.1 Fundamental frequency
    3.1.2 Formants
    3.1.3 Root mean square
    3.1.4 Zero-crossing rate
    3.1.5 Harmonics-to-noise ratio
    3.1.6 Distribution of acoustic feature values
  3.2 Statistical methods
    3.2.1 Basic statistics
    3.2.2 Pearson correlation coefficient
4 Results and discussion
  4.1 Correlation coefficients
    4.1.1 Three-minute monologue vs. emotion scales
    4.1.2 One-minute neutral passage vs. emotion scales
    4.1.3 Personal information form vs. emotion scales
  4.2 Agreement with clinical experience in psychology
5 Conclusion
6 Future work
References

