Graduate Student: 蘇柏豪 (Su, Bo-Hao)
Thesis Title: 基於循環式對抗性網路的數據增量及梯度察覺純化防禦機制建立強健式語音情緒辨識模型 / Building a Robust Speech Emotion Recognition using Cycle-GAN-based Data Augmentation and Gradient-Aware Purification Adversarial Defense Mechanism
Advisor: 李祈均 (Lee, Chi-Chun)
Oral Defense Committee: 廖元甫 (Liao, Yuan-Fu), 王新民 (Wang, Hsin-Min), 陳柏琳 (Chen, Po-Lin), 林嘉文 (Lin, Chia-Wen), 劉奕汶 (Liu, Yi-Wen)
Degree: Doctoral (博士)
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2023
Academic Year of Graduation: 112
Language: English
Number of Pages: 86
Keywords (Chinese): 語音情緒辨識、強健式學習、循環式對抗性網路、對抗式學習
Keywords (English): Speech Emotion Recognition, Robustness Learning, Cycle Generative Adversarial Learning, Adversarial Learning
Speech emotion recognition is an important technology that plays an indispensable role in human-machine interaction: it lets machines understand how people feel, making the interaction smoother and appropriately empathetic. Emotion, however, is a highly complex phenomenon, and modeling it requires data with high variability gathered from diverse environments. Building a model on a single corpus and applying it directly in daily life is therefore a very difficult task. Moreover, the highly contextualized nature of speech emotion corpora is a serious obstacle to developing robust speech emotion recognition models. In this dissertation we therefore focus on the robustness of speech emotion recognition and examine it from two perspectives: first, building a robust model that can be applied directly in cross-corpus (cross-domain) settings; second, improving the ability of speech emotion recognition models to defend against adversarial examples.
In this dissertation we design two computational frameworks to address these two robustness issues. First, rather than simply shifting the source and target distributions toward each other, we exploit the strengths of generative models to enlarge the variability of the source corpora so that it covers the target corpus. The corpus-aware mechanism we design captures the characteristics of the different source corpora while generating target-like samples, uses them to synthesize samples closer to the target corpus, and finally raises the accuracy of the speech emotion recognition model through data augmentation. Second, the weaknesses of deep models, particularly their robustness to adversarial examples, have recently been brought to light and form a fast-growing research area. We therefore comprehensively investigate the impact of adversarial examples on speech emotion recognition models and propose a gradient-aware purification defense mechanism that strengthens purification at the feature locations that are relatively easy to attack. In addition, we design a metric that can fairly measure a model's defense capability for researchers in related fields to use. Overall, both proposed frameworks show marked improvements over state-of-the-art models and reveal forward-looking findings on the robustness of speech emotion recognition.
Speech emotion recognition (SER) is a crucial technology that equips machines with an understanding of human feelings and makes human-machine interaction smoother and more appropriately empathetic. However, emotion is a sophisticated phenomenon whose modeling requires a wide variety of data collected in diverse environments. It is hard to build a model on a single corpus and apply it directly to different scenarios. Furthermore, the highly contextualized characteristics of speech emotion datasets severely hinder the robustness of SER models. Hence, in this dissertation, we address the robustness of SER models from two aspects: building a robust SER model that can be applied directly to cross-corpus scenarios, and building a model that can defend against adversarial attacks.
Specifically, we design two frameworks to tackle these robustness issues of SER models. First, instead of shifting the distributions of the source and target domains toward each other, we exploit the strengths of generative models to increase the variety of the source corpus so that it covers the target. While synthesizing samples, the corpus-aware mechanism captures the uniqueness of each source dataset and provides better target-like samples for augmentation. Second, the vulnerability of deep-learning models, especially under adversarial attacks, has only recently been revealed and remains an active research topic. We therefore comprehensively probe the effects of adversarial attacks in the SER domain and provide a gradient-aware purification module that reinforces the sanitization of easily poisoned feature locations. Furthermore, we introduce a new metric that fairly evaluates the defense ability of each model. Overall, both proposed frameworks show significant improvements over state-of-the-art models and offer many insights into SER robustness.
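As a rough illustration of the general recipe behind the first framework (not the dissertation's corpus-aware model), the sketch below trains a minimal CycleGAN-style translator over utterance-level acoustic feature vectors in PyTorch: two generators map features between the source and target corpora, adversarial losses push synthesized features toward the target distribution, and a cycle-consistency loss preserves content so the original emotion label can be reused for augmentation. The feature dimension, network sizes, and loss weights are illustrative assumptions.

```python
# Illustrative sketch only: a generic CycleGAN-style feature translator used to
# synthesize target-like samples for augmentation. Dimensions, architectures, and
# loss weights are assumptions, not the dissertation's corpus-aware model.
import torch
import torch.nn as nn

FEAT_DIM = 88        # e.g., an eGeMAPS-sized functional vector (assumed)
LAMBDA_CYCLE = 10.0  # cycle-consistency weight (assumed)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

G_s2t, G_t2s = mlp(FEAT_DIM, FEAT_DIM), mlp(FEAT_DIM, FEAT_DIM)  # source<->target mappings
D_t, D_s = mlp(FEAT_DIM, 1), mlp(FEAT_DIM, 1)                    # real-vs-fake critics

adv, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
opt_g = torch.optim.Adam(list(G_s2t.parameters()) + list(G_t2s.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(list(D_t.parameters()) + list(D_s.parameters()), lr=2e-4)

def train_step(x_src, x_tgt):
    # Generator update: fool both discriminators while staying cycle-consistent.
    fake_tgt, fake_src = G_s2t(x_src), G_t2s(x_tgt)
    rec_src, rec_tgt = G_t2s(fake_tgt), G_s2t(fake_src)
    pred_ft, pred_fs = D_t(fake_tgt), D_s(fake_src)
    g_loss = (adv(pred_ft, torch.ones_like(pred_ft))
              + adv(pred_fs, torch.ones_like(pred_fs))
              + LAMBDA_CYCLE * (l1(rec_src, x_src) + l1(rec_tgt, x_tgt)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Discriminator update: real corpus features vs. synthesized ones.
    real_t, fake_t = D_t(x_tgt), D_t(fake_tgt.detach())
    real_s, fake_s = D_s(x_src), D_s(fake_src.detach())
    d_loss = (adv(real_t, torch.ones_like(real_t)) + adv(fake_t, torch.zeros_like(fake_t))
              + adv(real_s, torch.ones_like(real_s)) + adv(fake_s, torch.zeros_like(fake_s)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    return g_loss.item(), d_loss.item()

# Toy usage: after training, G_s2t(x_src) yields target-like features that can be added,
# with their original emotion labels, to the SER training set.
x_src, x_tgt = torch.randn(32, FEAT_DIM), torch.randn(32, FEAT_DIM)
print(train_step(x_src, x_tgt))
```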
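Similarly, the two ingredients of the second framework, an adversarial attack and a purification step guided by input gradients, can be pictured with the generic sketch below. It mounts a single-step FGSM attack on a toy SER classifier and then smooths only the feature positions carrying the largest loss-gradient magnitude before re-classifying. The epsilon, the 10% mask ratio, the averaging filter, and the untrained toy model are all assumptions made for demonstration; this is not the dissertation's gradient-aware purification module or its evaluation metric.

```python
# Illustrative sketch only: FGSM attack on a toy SER classifier plus a generic
# "gradient-aware" purification step that smooths the most gradient-sensitive
# feature positions. All hyperparameters are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
NUM_EMOTIONS, FEAT_DIM = 4, 88               # assumed label set and feature size
model = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(),
                      nn.Linear(128, NUM_EMOTIONS))  # untrained toy classifier

def fgsm_attack(x, y, epsilon=0.05):
    """Single-step attack: perturb x along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

def gradient_aware_purify(x, y_guess, ratio=0.1):
    """Smooth only the most gradient-sensitive ("easily poisoned") positions."""
    x_probe = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_probe), y_guess).backward()
    sensitivity = x_probe.grad.abs()                      # per-feature saliency
    k = max(1, int(ratio * x.shape[1]))
    topk = sensitivity.topk(k, dim=1).indices             # most sensitive positions
    smoothed = F.avg_pool1d(x.unsqueeze(1), 3, stride=1, padding=1).squeeze(1)
    purified = x.clone()
    purified.scatter_(1, topk, smoothed.gather(1, topk))  # replace only those positions
    return purified

x = torch.randn(16, FEAT_DIM)
y = torch.randint(0, NUM_EMOTIONS, (16,))
x_adv = fgsm_attack(x, y)
y_guess = model(x_adv).argmax(dim=1)                      # defender has no true labels
x_pure = gradient_aware_purify(x_adv, y_guess)
for name, inp in [("clean", x), ("attacked", x_adv), ("purified", x_pure)]:
    acc = (model(inp).argmax(dim=1) == y).float().mean().item()
    print(f"{name:9s} accuracy: {acc:.2f}")
```

Comparing the three printed accuracies (on a properly trained model) loosely mirrors how defense ability is commonly reported; the fair-evaluation metric mentioned in the abstract is not reproduced here.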