
Graduate Student: Upadhyay, Shreya G. (徐雅)
Thesis Title: Linguistically Anchored Transfer Learning for Enhancing Cross-Lingual or Cross-Domain Speech Emotion Recognition
Advisor: Lee, Chi-Chun (李祈均)
Committee Members: Liu, Yi-Wen (劉奕汶); Huang, Yuan-Hao (黃元豪); Wang, Hsin-Min (王新民); Chen, Berlin (陳柏琳)
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2025
Academic Year of Graduation: 113
Language: English
Number of Pages: 89
Keywords: Speech Emotion Recognition, Machine Learning, Cross-Lingual, Transfer Learning

    The increasing focus on cross-lingual and cross-domain speech emotion recognition (SER) stems from its diverse applications across multiple fields. While prior research has primarily concentrated on adapting features, domains, and labels, it often overlooks fundamental linguistic elements that can enhance adaptation efficiency. This study addresses the challenge from three interconnected perspectives: phonetics, model architecture, and articulation. From the phonetic perspective, we explore vowel-phonetic similarities tied to emotional expression across languages, identifying shared phonemes, particularly vowels, as anchors for cross-lingual transfer learning. From the model architecture perspective, we leverage the layers of large pretrained models that exhibit higher linguistic similarity across languages, using them as anchors to enhance adaptation. From the articulatory perspective, we investigate mouth articulatory gestures as stable, language-independent units for emotion modeling, recognizing them as reliable anchors for cross-lingual or cross-domain SER.
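
    To make the layer-anchoring idea concrete, the sketch below is an illustration under assumed choices, not the dissertation's exact procedure: it extracts a time-averaged representation from every transformer layer of a pretrained wav2vec 2.0 model for source- and target-language utterances, scores each layer with a CORAL-style covariance distance, and ranks the layers so that the most similar ones can serve as anchor candidates. The checkpoint name, the distance measure, and the helper functions are assumptions made for illustration.

        # Illustrative sketch only: rank wav2vec 2.0 layers by cross-lingual
        # feature-statistics similarity to pick anchor-layer candidates.
        # The checkpoint and the CORAL-style distance are assumed choices.
        import numpy as np
        import torch
        from transformers import Wav2Vec2Model

        model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

        def layer_features(waveform_16k):
            """Return one time-averaged vector per transformer layer for a 16 kHz waveform."""
            with torch.no_grad():
                out = model(torch.tensor(waveform_16k, dtype=torch.float32)[None, :],
                            output_hidden_states=True)
            # hidden_states[0] is the CNN front-end output; the rest are transformer layers.
            return [h.squeeze(0).mean(dim=0).numpy() for h in out.hidden_states[1:]]

        def coral_distance(src_feats, tgt_feats):
            """Frobenius distance between feature covariance matrices (smaller = more similar)."""
            return float(np.linalg.norm(np.cov(src_feats, rowvar=False)
                                        - np.cov(tgt_feats, rowvar=False), ord="fro"))

        def rank_anchor_layers(src_waveforms, tgt_waveforms):
            """Order layers from most to least source/target similarity."""
            src = [layer_features(w) for w in src_waveforms]
            tgt = [layer_features(w) for w in tgt_waveforms]
            n_layers = model.config.num_hidden_layers
            dists = [coral_distance(np.stack([u[l] for u in src]),
                                    np.stack([u[l] for u in tgt]))
                     for l in range(n_layers)]
            return sorted(range(n_layers), key=lambda l: dists[l])

    In such a setup, the highest-ranked layers would serve as anchor candidates that stay fixed or lightly regularized while the remaining layers adapt to the target language.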

    To validate our approach, we conduct extensive experiments on multiple naturalistic emotional speech corpora. By integrating insights from phonetics, model architecture, and articulatory gestures, our cross-lingual or cross-domain SER models consistently surpass baseline approaches. The results highlight the significance of leveraging distinct linguistic units as anchors to enhance adaptation across languages. This research presents a comprehensive and interdisciplinary framework for advancing SER adaptation, offering a novel pathway for developing more effective and linguistically informed emotion recognition systems.
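
    As a schematic of how such anchors could enter training, the following sketch gives a generic formulation under assumed names, layer sizes, and loss weighting, not the dissertation's exact objective: a supervised emotion loss on labeled source data is combined with an alignment term that pulls source- and target-domain embeddings of shared anchor units, such as common vowels, toward each other.

        # Generic sketch of anchor-based adaptation: supervised emotion loss on the
        # source domain plus an alignment loss over shared anchor-unit embeddings.
        # Layer sizes, the MSE alignment term, and the 0.1 weight are assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AnchoredSER(nn.Module):
            def __init__(self, feat_dim=768, n_emotions=4):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
                self.classifier = nn.Linear(256, n_emotions)

            def forward(self, feats):
                z = self.encoder(feats)           # shared embedding space
                return z, self.classifier(z)      # embeddings and emotion logits

        def training_step(model, src_feats, src_labels,
                          src_anchor_feats, tgt_anchor_feats, align_weight=0.1):
            """One step: emotion loss on labeled source data + anchor alignment loss."""
            _, logits = model(src_feats)
            emo_loss = F.cross_entropy(logits, src_labels)
            # Pull the centroids of anchor-unit (e.g., shared vowel) embeddings
            # from the source and target domains toward each other.
            z_src, _ = model(src_anchor_feats)
            z_tgt, _ = model(tgt_anchor_feats)
            align_loss = F.mse_loss(z_src.mean(dim=0), z_tgt.mean(dim=0))
            return emo_loss + align_weight * align_loss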

    Table of Contents:
    Front matter: 摘要 (Chinese Abstract), Abstract, Acknowledgements, Contents, List of Figures, List of Tables
    1 Introduction
        1.1 Background and Motivation
            1.1.1 Computational Study of Cross-Lingual or Cross-Domain Data
            1.1.2 Related Affective Study on Linguistic Behavior
        1.2 Research Goal
            1.2.1 Linguistically Anchored Cross-Lingual or Cross-Domain SER
        1.3 Dissertation Organization
    2 Phonetically Anchored Transfer Learning for Improved Cross-Lingual Speech Emotion Recognition
        2.1 Introduction
        2.2 Related Affective Study of Vowel Behavior
            2.2.1 Emotion-Specific Vowel Behavior
            2.2.2 Vowel-Specific Emotion Encoding
        2.3 Phonetic Commonality Analyses
            2.3.1 Naturalistic Speech Emotion Corpora
            2.3.2 Formant-Based Phonetic Analyses
            2.3.3 Wav2vec2.0 Phonetic Commonality
            2.3.4 Anchor Candidates Selection
        2.4 Cross-Lingual SER Modelling
            2.4.1 Feature Extraction and Encoding
            2.4.2 Emotion Classification Task
            2.4.3 Phoneme-Based Anchoring Mechanism
        2.5 Experiment Results and Analyses
            2.5.1 Experimental Settings
            2.5.2 Cross-Lingual SER Performance Comparison
            2.5.3 Before/After Anchoring Feature Space Analyses
            2.5.4 Extended Analyses
        2.6 Discussion and Insights
    3 A Layer Anchoring Mechanism for Enhancing Cross-Lingual Speech Emotion Recognition
        3.1 Introduction
        3.2 Pretrained Model's Layer Similarity Analyses
            3.2.1 Naturalistic Affective Corpora
            3.2.2 Layer Similarity Analyses
            3.2.3 Unified Layer Selection
        3.3 Layer Anchored Cross-Lingual SER
            3.3.1 Emotion Classification Task
            3.3.2 Layer Anchoring Mechanism
        3.4 Experiment Results and Analyses
            3.4.1 Experimental Settings
            3.4.2 Cross-Lingual SER Performance Comparison
            3.4.3 CL-SER with Different Layer Selection Strategies
        3.5 Discussion and Insights
    4 Mouth Articulation-Based Anchoring for Improved Cross-Domain Speech Emotion Recognition
        4.1 Introduction
        4.2 Articulatory Gesture Commonality Analyses
            4.2.1 Multi-Modal Affective Corpora
            4.2.2 Articulatory Gesture Feature Extraction and Preprocessing
            4.2.3 Mouth Articulatory-Gesture Clustering
        4.3 Mouth Articulatory Gesture-Anchored Cross-Domain SER
            4.3.1 Emotion Classification Task
            4.3.2 Mouth Articulatory Gesture Anchoring Mechanism
        4.4 Experiment Results and Analyses
            4.4.1 Experimental Settings
            4.4.2 Cross-Domain SER Performance Comparison
            4.4.3 AG-Acoustic Features Association Analysis
        4.5 Discussion and Insights
    5 Conclusion
        5.1 Discussions
        5.2 Summary
        5.3 Future Works
    References

