Graduate Student: 黃大祐 Huang, Da-Yu
Thesis Title: 應用於中文語音情緒辨識的聲學特徵萃取 (Acoustic Feature Extraction for Mandarin Speech Emotion Recognition)
Advisor: 劉奕汶 Liu, Yi-Wen
Committee Members: 曾建維 Tzeng, Jian-Wei; 李夢麟 Li, Meng-Lin; 王道維 Wang, Daw-Wei
Degree: Master
Department: 電機資訊學院 - 電機工程學系 (Department of Electrical Engineering)
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: Chinese
Number of Pages: 68
Keywords (Chinese): 語音情緒辨識、聲學特徵萃取
Keywords (English): Speech Emotion Recognition, Feature Extraction
At the end of 2020, almost two and a half years ago, during the severe outbreak of COVID-19, universities across Taiwan switched to remote teaching. To help contain the virus, students avoided going out and thus lost opportunities to maintain emotional connections with their peers. Under these circumstances, it became even harder for students to relieve the worries accumulating inside them; within a span of about one and a half weeks at the end of November of that year, several university students took their own lives in quick succession. Students who were already in low spirits found it even more difficult to cope in such an atmosphere and could only turn to psychological counseling for help, which led to a staffing shortage at the on-campus counseling center of National Tsing Hua University.
In an attempt to help improve this situation, this thesis constructs a speech emotion recognition database from scratch and uses basic statistical methods to examine which acoustic features are particularly helpful for Mandarin speech emotion recognition. The experimental results show that specific statistics of the acoustic features exhibit relatively significant positive or negative correlations with the participants' responses on psychological scales. Moreover, some of these correlation coefficients are consistent with clinical experience in psychology.
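The analysis summarized above follows a simple pattern: extract acoustic features from each recording, reduce them to summary statistics, and correlate those statistics with the participants' psychological-scale scores. The sketch below illustrates that kind of analysis in Python; it is not the thesis's actual pipeline. The file names and scale scores are hypothetical placeholders, F0 is extracted with librosa's pYIN implementation, and Pearson's r is computed with SciPy; the thesis may use different features, tools, and functionals.

```python
# Minimal sketch: correlate an F0 statistic with a questionnaire score per participant.
import numpy as np
import librosa
from scipy.stats import pearsonr


def f0_statistics(wav_path):
    """Return the mean and standard deviation of F0 (Hz) for one recording."""
    y, sr = librosa.load(wav_path, sr=None)  # keep the file's native sampling rate
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz, lower bound for speech F0
        fmax=librosa.note_to_hz("C7"),  # generous upper bound
        sr=sr,
    )
    return np.nanmean(f0), np.nanstd(f0)  # unvoiced frames are NaN and ignored


# Hypothetical data: one recording and one questionnaire total per participant.
wav_paths = ["subj01.wav", "subj02.wav", "subj03.wav", "subj04.wav"]
scale_scores = [12, 25, 7, 18]  # e.g., depression-scale totals (made up)

f0_means = [f0_statistics(p)[0] for p in wav_paths]
r, p_value = pearsonr(f0_means, scale_scores)  # Pearson's r and two-sided p-value
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```

In practice one would repeat this for each feature statistic (jitter, formants, HNR, and so on) and each scale, then inspect which pairs show consistently strong positive or negative correlations.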