
Graduate Student: Chou, Huang-Cheng (周惶振)
Thesis Title: Revisiting Modeling and Evaluation Approaches in Speech Emotion Recognition: Considering Subjectivity of Annotators and Ambiguity of Emotions (重新探討語音情緒辨識的建模與評量方法:考慮標註者的主觀性與情緒的模糊性)
Advisor: Lee, Chi-Chun (李祈均)
Committee Members: Lin, Chia-Wen (林嘉文); Ma, Hsi-Pin (馬席彬); Chen, Berlin (陳柏琳); Chi, Tai-Shih (冀泰石); Wang, Hsin-Min (王新民)
Degree: Doctor
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Academic Year of Graduation: 112
Language: English
Number of Pages: 94
Keywords: speech emotion recognition, affective computing, subjective perception, ambiguous emotions, soft-label learning, multi-label classification, co-occurrence of emotions, distributional-label learning, learning from disagreement
    Speech emotion recognition (SER) has received increasing attention over the past two decades. Building an SER system requires emotion databases that contain human speech together with labels of human emotional perception. Researchers train crowdsourced or in-house annotators to describe and report their emotional perception, after listening to or watching emotional recordings, by selecting from predefined emotion categories. However, when annotators are asked to choose among predefined emotions, disagreement among them is common. To handle such disagreement, most researchers treat it as noise and use label aggregation methods to obtain a single consensus emotion label as the learning target for training SER systems. Although this common practice simplifies the task into single-label emotion recognition, it ignores the natural behavior of human emotion perception. In this dissertation, we argue that the modeling and evaluation approaches in speech emotion recognition should be re-examined. The main research questions are: (1) Should we remove minority emotional ratings? (2) Should SER systems learn the emotional perception of only a few people? (3) Should SER systems predict only one emotion category at a time?

    Findings from psychology show that emotion perception is subjective: each individual may perceive the same emotional stimulus differently. Moreover, the boundaries between emotion categories in human perception are overlapping, blended, and ambiguous. These findings on the ambiguity of emotions and the subjectivity of emotion perception motivate us to re-examine the modeling and evaluation approaches in SER. This dissertation explores novel perspectives on three aspects of building SER systems. First, we embrace the subjectivity of emotion perception and consider all emotional ratings from the annotators. The conventional approach allows each annotator to cast only one vote per sample, but by considering all ratings from all annotators, we re-compute the label representation with existing soft-label methods. In addition, we directly use the emotional ratings of individual annotators to train annotator-specific SER systems and jointly train these individual systems with the standard SER system (trained on consensus labels). When tested against consensus labels obtained by majority vote, the individual-annotator modeling approach improves the performance of SER systems.

    Second, we rethink how SER systems are evaluated and how the SER task is formulated and defined. We argue that no data or emotional ratings should be removed when evaluating the performance of SER systems. Furthermore, we argue that the definition of the SER task can include the co-occurrence of emotions (e.g., sad and angry). Therefore, the ground truth of a sample should not be a single emotion label but a distributional label that captures more of the diversity of emotion perception. We propose a new label aggregation rule, called the "all-inclusive rule," for selecting the data and emotional ratings of the training and test sets. Results on four public English emotion databases show that SER systems trained on the training set determined by the all-inclusive rule outperform those trained with conventional approaches, including the majority rule and the plurality rule, under various testing conditions.

    Last but not least, we are inspired by psychological findings on the co-occurrence of emotions. We estimate the co-occurrence frequency of emotions from the emotional ratings in the training set of an emotion database and normalize the matrix by the count of each emotion class. We then subtract the normalized matrix from the identity matrix to obtain a penalization matrix. The idea is to penalize the SER system during training when it predicts rarely co-occurring emotions. The penalization matrix is therefore integrated into existing objective functions, such as the cross-entropy loss. Results on the largest English emotion database show that the penalization matrix improves the performance of SER systems even under single-label testing conditions.


    Over the past twenty years, speech emotion recognition (SER) has attracted growing attention. To develop SER systems capable of identifying emotions in speech, researchers must gather emotional databases for training. This process involves training crowdsourced raters or in-house annotators to report their emotional perception, after listening to or watching emotional recordings, by selecting from a predefined list of emotions. Nevertheless, it is common for raters to disagree when selecting from these predefined categories. To address this issue, many researchers treat such disagreements as noise and apply label aggregation techniques to produce a single consensus label, which serves as the target for training SER systems. While this common practice simplifies the task into single-label recognition, it ignores the natural behavior of human emotion perception. In this dissertation, we contend that the modeling and evaluation approaches in SER should be revisited. The driving research questions are: (1) Should we remove minority emotional ratings? (2) Should SER systems learn the emotional perception of only a few people? (3) Should SER systems predict only one emotion per utterance?

    Based on findings from psychological studies, emotion perception is subjective: each individual may respond differently to the same emotional stimulus. Additionally, the boundaries between emotion categories in human perception are overlapping, blended, and ambiguous. This ambiguity of emotions and subjectivity of emotion perception inspire us to revisit the modeling and evaluation approaches in SER. This dissertation explores novel perspectives on three aspects of building SER systems. First, we embrace the subjectivity of emotional perception and consider every emotional rating from the annotators. Whereas the conventional approach allows each rater only one vote per sample, we re-compute the label representation in a distributional format with existing soft-label methods by considering all ratings from all raters. Moreover, we directly use the ratings of individual annotators to train annotator-specific SER systems and jointly train these individual systems with the standard SER system (trained on consensus labels). Modeling individual annotators improves the performance of SER systems on test sets whose consensus labels are obtained by majority vote.
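    The soft-label re-computation described above can be sketched in a few lines. The snippet below is a minimal illustration, not the dissertation's implementation: the four-class label set, the input format (one list of chosen emotions per rater), and the function name soft_label are assumptions made for this example.

```python
import numpy as np

# Hypothetical four-class setup; the databases used in the dissertation have their own label sets.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(ratings, classes=EMOTIONS):
    """Turn every vote from every rater into one distributional (soft) label.

    `ratings` holds one list of chosen emotion names per rater; unlike the
    conventional one-vote-per-rater consensus, no vote is discarded.
    """
    counts = np.zeros(len(classes), dtype=float)
    for rater_choices in ratings:
        for emotion in rater_choices:
            counts[classes.index(emotion)] += 1.0
    return counts / counts.sum()

# Three raters disagree; the soft label keeps all of their perceptions.
print(soft_label([["happy"], ["happy", "neutral"], ["sad"]]))
# -> [0.   0.5  0.25 0.25]
```

    In the joint rater-modeling approach, each annotator's own ratings would additionally supervise an annotator-specific output branch trained together with the consensus branch; that part is omitted from this sketch.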

    Secondly, we rethink how SER systems should be evaluated and how the SER task should be formulated and defined. We argue that no data or emotional ratings should be removed when assessing the performance of SER systems. We also argue that the definition of the SER task can accommodate the co-occurrence of emotions (e.g., sad and angry). Therefore, the ground truth of a sample should not be a one-hot single label; it can instead be distributional, capturing more of the diversity of emotion perception. We propose a novel label aggregation rule, named the "all-inclusive rule," that uses all data and retains all emotional ratings when constructing the training and test sets. Results across four public English emotion databases show that SER systems trained on the training set defined by the proposed rule outperform those trained with conventional techniques, including the majority rule and the plurality rule, under various testing conditions.
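    To make the contrast between aggregation rules concrete, the sketch below compares a majority rule, a plurality rule, and an all-inclusive style aggregation on a single utterance. The function names and tie-breaking details are assumptions for illustration; the dissertation's formal definition of the all-inclusive rule may differ.

```python
from collections import Counter

def majority_rule(ratings):
    """Keep one label only if more than half of the raters chose it; otherwise drop the sample."""
    votes = Counter(emotion for rater in ratings for emotion in rater)
    label, count = votes.most_common(1)[0]
    return label if count > len(ratings) / 2 else None

def plurality_rule(ratings):
    """Keep the single most frequent label; a tie still forces the sample to be dropped."""
    ranked = Counter(emotion for rater in ratings for emotion in rater).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def all_inclusive_rule(ratings):
    """Keep every sample and every rated emotion as a distributional, possibly multi-label target."""
    votes = Counter(emotion for rater in ratings for emotion in rater)
    total = sum(votes.values())
    return {emotion: n / total for emotion, n in votes.items()}

ratings = [["sad"], ["angry"], ["neutral"]]             # three raters, complete disagreement
print(majority_rule(ratings), plurality_rule(ratings))  # None None -> the sample is discarded
print(all_inclusive_rule(ratings))                      # roughly {'sad': 0.33, 'angry': 0.33, 'neutral': 0.33}
```

    Under the conventional rules this utterance would be removed from both training and evaluation, whereas the all-inclusive target keeps it along with every rating it received.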

    Finally, we draw inspiration from psychological research on the co-occurrence of emotions. Using the emotional ratings in the training partition of an emotion database, we estimate how frequently different categorical emotions are rated together. The resulting co-occurrence matrix is normalized by the count of each emotion class, and a penalization matrix is derived by subtracting the normalized matrix from the identity matrix. The goal is to penalize SER systems during training when they predict rarely co-occurring combinations of emotions. This penalization matrix is integrated into existing objective functions, such as the cross-entropy loss. Results on the largest English emotion database indicate that the penalization matrix improves the performance of SER systems, even under single-label testing conditions.
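    The construction of the penalization matrix follows directly from the description above; how the penalty enters the loss is not detailed in this abstract, so the quadratic form below is only one plausible choice, and all function names and numbers are illustrative.

```python
import numpy as np

def penalization_matrix(label_sets, num_classes):
    """Count how often emotion classes are rated together, normalize by each class's count,
    and subtract the normalized matrix from the identity, so rarely co-occurring pairs
    end up with a larger penalty entry than frequently co-occurring ones."""
    co = np.zeros((num_classes, num_classes))
    for labels in label_sets:                      # `labels`: class indices rated for one sample
        for i in labels:
            for j in labels:
                co[i, j] += 1.0
    class_counts = np.maximum(np.diag(co), 1.0)    # guard against unseen classes
    return np.eye(num_classes) - co / class_counts[:, None]

def penalized_cross_entropy(pred_probs, soft_target, penalty, alpha=1.0):
    """Cross-entropy on the soft target plus a term that is larger when the prediction
    spreads its mass over emotion pairs that rarely co-occur in the training ratings."""
    ce = -np.sum(soft_target * np.log(pred_probs + 1e-8))
    return ce + alpha * float(pred_probs @ penalty @ pred_probs)

# Toy ratings: classes 0 and 1 are often rated together, class 2 never co-occurs with them.
P = penalization_matrix([[0, 1], [0, 1], [0], [2]], num_classes=3)
frequent_pair = np.array([0.5, 0.5, 0.0])   # prediction on a frequently co-occurring pair
rare_pair = np.array([0.5, 0.0, 0.5])       # prediction on a pair never rated together
print(frequent_pair @ P @ frequent_pair)    # ~ -0.42 -> smaller penalty term
print(rare_pair @ P @ rare_pair)            # 0.0     -> larger penalty term
```

    Integrated into the training objective this way, the extra term discourages the classifier from assigning probability mass to combinations of emotions that annotators almost never reported together.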

    Based on these extensive results, we conclude that (1) we should retain minority emotional ratings rather than remove them in order to build better-performing SER systems; (2) we should consider emotional ratings from more people, rather than fewer, when training SER systems; and (3) we should allow SER systems to predict multiple emotions so they can handle co-occurring emotions in real-life scenarios. In future work, we plan to investigate training emotion recognition systems with multiple modalities (e.g., video, text, and audio) to further improve performance. We are also interested in the relationship between the amount of human-labeled training data and the performance of SER systems. Furthermore, we aim to understand performance bias across demographic groups such as gender, race, and age. Last but not least, we plan to build a multilingual emotion recognition system.

    Abstract (Chinese), Acknowledgements (Chinese), Abstract, Acknowledgements, Contents, List of Figures, List of Tables
    1 Introduction
      1.1 Motivation
      1.2 Background, Related Works, and Challenges
        1.2.1 Emotion Representations
        1.2.2 Emotion Recognition Systems using Multi-/Uni-modality
        1.2.3 Evaluation of SER Systems
        1.2.4 Label Preprocessing for Training SER Systems
        1.2.5 Co-occurrence of Emotions
        1.2.6 Disagreement between Raters on the Emotion Datasets
      1.3 Contributions
        1.3.1 Every Rating Matters Considering the Subjectivity of Annotators
        1.3.2 Novel Evaluation Method by an All-Inclusive Aggregation Rule
        1.3.3 Training Loss Using Co-occurrence Frequency of Emotions
      1.4 Outline of the Dissertation
    2 Emotion Databases
      2.1 IEMOCAP
      2.2 IMPROV
      2.3 CREMA-D
      2.4 MSP-PODCAST
      2.5 Standard Partition
        2.5.1 Standard Partition of the IEMOCAP
        2.5.2 Standard Partition of the MSP-IMPROV
        2.5.3 Standard Partition of the CREMA-D
    3 Every Rating Matters Considering Subjectivity of Annotators
      3.1 Motivation
      3.2 Background and Related Works
        3.2.1 Subjectivity of Emotion Perception
        3.2.2 Mixture of Annotators
        3.2.3 Soft-label Training Method for SER Systems
      3.3 Resource and Task Formulation
      3.4 Speech Emotion Classifier
        3.4.1 Input Features
        3.4.2 Model Structure
        3.4.3 Training Labels
      3.5 Proposed Method
        3.5.1 Rater-Modeling
        3.5.2 Final Concatenation Layer
      3.6 Experimental Setup
        3.6.1 Foundational Component
        3.6.2 All Model Comparison
      3.7 Other Training Details
      3.8 Experimental Results and Analyses
      3.9 Summary
    4 Novel Evaluation Method by an All-Inclusive Aggregation Rule
      4.1 Motivation and Background
      4.2 Previous Literature
        4.2.1 Evaluation of SER Systems
        4.2.2 Curriculum Learning for Emotion Recognition
      4.3 Methodology
        4.3.1 Proposed All-inclusive Rule
        4.3.2 Employing the All-Inclusive Rule for Test Set Construction
      4.4 Experimental Setup
        4.4.1 Resource
        4.4.2 Speech Emotion Classifier
        4.4.3 Train/Test Set Defined by Aggregation Rules
        4.4.4 Label Learning for SER
        4.4.5 Evaluation Metrics and Statistical Significance
      4.5 Experimental Results and Analyses
        4.5.1 Comparison of Results with Prior SOTA Methods
        4.5.2 Assessment with Full and Partial Test Data
        4.5.3 Evaluation on the Ambiguous Set
        4.5.4 What is the most effective label learning method for SER?
      4.6 Summary
    5 Training Loss by Using Co-occurrence Frequency of Emotions
      5.1 Motivation
      5.2 Background and Related Works
        5.2.1 Contrastive Learning in Emotion Recognition
        5.2.2 Label Learning in Emotion Recognition
      5.3 Proposed Method
        5.3.1 Penalization Weights based on the Counts of Co-Existing Emotions
        5.3.2 Label Processing to Train SER Systems
        5.3.3 Loss Functions Integrated by the Proposed Penalization Matrix
      5.4 Experimental Setup
        5.4.1 Resource
        5.4.2 Acoustic Features
        5.4.3 SER Models and Other Details
        5.4.4 Evaluation Metrics
        5.4.5 Statistical Significance
      5.5 Experimental Results and Analyses
        5.5.1 Does incorporating the penalty loss (LP + loss) benefit SER systems?
        5.5.2 Effect of Co-occurrence Matrix
      5.6 Summary
    6 Conclusion
      6.1 Discussion and Limitation
      6.2 Future Works
    Bibliography

