Student: | 林蔭澤 Lin, Yin-Tse
---|---
Thesis Title: | Deep Learning-Based Noise-Robust Bandwidth Expansion for 8 kHz Speech Recordings
Advisor: | 李祈均 Lee, Chi-Chun
Committee Members: | 李夢麟 Li, Meng-Lin; 冀泰石 Chi, Tai-Shih; 曹昱 Tsao, Yu
Degree: | Master
Department: | College of Electrical Engineering and Computer Science - Institute of Communications Engineering
Year of Publication: | 2024
Academic Year: | 113
Language: | English
Pages: | 46
Keywords: | Bandwidth expansion, Speech enhancement, Automated speech recognition, Robust speech representation learning, Score-based generative modeling
Speech recordings in call centers are often narrowband and mixed with various noises, creating challenges for automated speech recognition (ASR) systems. Addressing these challenges through noise-robust bandwidth expansion (BWE) is essential both for bridging the gap between low and high sampling rate speech data and for eliminating a variety of distortions within the recordings. In this work, we propose two novel frameworks toward this goal: Embedding-Polished Wave-U-Net (EP-WUN) and Step-Wised Bandwidth Expansion (SWiBE).
The first model, EP-WUN, utilizes a speech quality classifier to handle noise removal and bandwidth expansion simultaneously in the model's embedding domain. This approach shows improved speech quality metrics on the VoiceBank-Demand corpus with 33% fewer parameters than the current state-of-the-art (SOTA) noise-robust BWE model, and achieves an 11.71% word error rate (WER) reduction on real-world data from an interactive voice response system used by E.SUN Bank.
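To make the embedding-domain idea concrete, here is a minimal, hypothetical PyTorch sketch: a Wave-U-Net-style encoder-decoder whose bottleneck embedding is additionally supervised by a small speech-quality classifier, so that denoising and expansion are trained jointly in the latent space. All module names, layer sizes, and the two-class quality target are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (illustrative only, not EP-WUN itself): an encoder-decoder
# whose bottleneck embedding also feeds a speech-quality classifier head.
import torch
import torch.nn as nn

class EmbeddingPolishedBWE(nn.Module):
    def __init__(self, channels=32, emb_dim=128, n_quality_classes=2):
        super().__init__()
        # Strided 1-D convolutions stand in for the Wave-U-Net encoder.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, emb_dim, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
        )
        # Transposed convolutions mirror the encoder to emit wideband audio.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(emb_dim, channels, kernel_size=16, stride=4, padding=6),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=4, padding=6),
        )
        # Quality head: classifies the pooled embedding (e.g. clean vs. noisy),
        # pushing the bottleneck toward a noise-robust representation.
        self.quality_head = nn.Linear(emb_dim, n_quality_classes)

    def forward(self, narrowband):                 # (batch, 1, samples)
        emb = self.encoder(narrowband)             # (batch, emb_dim, frames)
        quality_logits = self.quality_head(emb.mean(dim=-1))
        wideband = self.decoder(emb)               # (batch, 1, samples)
        return wideband, quality_logits

model = EmbeddingPolishedBWE()
x = torch.randn(2, 1, 8000)                        # 1 s of 8 kHz audio
wideband, logits = model(x)
# Joint objective: waveform reconstruction plus embedding-domain quality loss
# (random targets here, purely to show the training signal's shape).
loss = nn.functional.l1_loss(wideband, torch.randn_like(wideband)) \
     + nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
```

The design point the sketch illustrates is that the classifier's gradient flows into the shared bottleneck, so the same embedding is shaped for both tasks rather than chaining a separate denoiser and expander.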
The second model, SWiBE, leverages score-based generative modeling (SGM) through a parameterized stochastic diffusion process, performing stepwise bandwidth expansion in the spectrogram domain. This approach outperforms baseline methods, including diffusion-based and GAN-based models, on various metrics of perceptual quality and spectral reconstruction.
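For reference, the SGM framework that SWiBE builds on describes generation with a pair of stochastic differential equations. The formulation below is the standard one from score-based generative modeling, not SWiBE's specific drift parameterization:

```latex
% Standard SGM SDE pair (generic formulation, not SWiBE's exact drift).
% Forward SDE: clean data x_0 is progressively diffused toward a prior.
\[
  \mathrm{d}\mathbf{x}_t = \mathbf{f}(\mathbf{x}_t, t)\,\mathrm{d}t
                         + g(t)\,\mathrm{d}\mathbf{w}_t
\]
% Reverse-time SDE: integrating from t = T back to t = 0 generates samples,
% with the score \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) approximated by a
% network s_\theta trained via denoising score matching.
\[
  \mathrm{d}\mathbf{x}_t = \bigl[\mathbf{f}(\mathbf{x}_t, t)
      - g(t)^2\, s_\theta(\mathbf{x}_t, t)\bigr]\,\mathrm{d}t
      + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t
\]
```

In SWiBE's setting the states are spectrograms, and, as the abstract indicates, the diffusion process is parameterized so that the trajectory widens the bandwidth step by step rather than generating from pure noise.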
Through evaluations of both speech quality and ASR performance, we show that our combined exploration of the two frameworks enhances the robustness of BWE in noisy environments.