
Graduate student: Chuang, Cia-Lun (莊家倫)
Title: Microphone Array Speech Enhancement on Edge Devices based on Deep Learning (實現於邊緣裝置之基於深度學習模型之麥克風陣列語音增強系統)
Advisor: Liu, Yi-Wen (劉奕汶)
Committee members: Li, Meng-Lin (李夢麟); Liao, Yuan-Fu (廖元甫)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2024
Academic year of graduation: 112 (2023-2024)
Language: English
Pages: 52
Keywords (Chinese): 語音增強、邊緣運算、麥克風陣列、深度學習、語音處理
Keywords (English): Speech enhancement, Edge computing, Microphone array, Deep learning, Speech processing
Abstract (Chinese): Speech enhancement aims to remove interfering sources from a degraded recording while preserving the target speaker's speech. In speech enhancement and source separation, a common approach is to predict a mask and multiply it with a time-frequency (T-F) feature to recover the clean speech; the Wiener filter is a typical early algorithm of this kind. In recent years, driven by the rapid development of hardware and the spread of computational resources, deep learning algorithms have also been adopted for speech enhancement, and powerful models have appeared one after another. However, not every device can carry expensive computing hardware, which makes the need for edge computing increasingly apparent. This work takes the dual-signal transformation LSTM network (DTLN), built on long short-term memory (LSTM) units, as the prototype for an edge-computing network and replaces the LSTM with the functionally similar gated recurrent unit (GRU), aiming to accelerate inference and reduce the number of model parameters. The training and test sets were generated with the corpora and synthesis procedure provided by the DNS challenge. The generated data not only cover speech at various signal-to-noise ratios but also account for room impulse responses, which improves the generalizability of the model. The final results show that, compared with the original architecture, our model slightly improves performance on most metrics while reducing the parameter count and the execution time on a Raspberry Pi by 10% and 48%, respectively. We also thoroughly evaluate the microphone-array pre-processing pipeline, applying several localization algorithms to the delay-and-sum beamformer and introducing the weighted prediction error (WPE) algorithm for dereverberation. Evaluated with STOI and PESQ on real recordings, the steered response power phase transform (SRP-PHAT) localization algorithm combined with the beamformer performs best when no dereverberation is applied.
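The sketch below is a minimal illustration of the pre-processing idea mentioned at the end of the abstract, assuming a four-microphone array, integer sample delays, and a broadband toy source standing in for speech; it is not the thesis' implementation. Per-channel delays are estimated with GCC-PHAT against a reference microphone, the channels are aligned, and their average forms the delay-and-sum output.

```python
# A minimal sketch of GCC-PHAT time-delay estimation feeding a delay-and-sum
# beamformer. Illustration only, not the thesis' code: the array size, delays,
# broadband toy source, and noise level are assumptions for this example.
import numpy as np

fs = 16000

def gcc_phat_delay(x, ref, max_shift):
    """Estimate the delay (in samples) of x relative to ref with GCC-PHAT."""
    n = 2 * max(len(x), len(ref))
    X = np.fft.rfft(x, n)
    R = np.fft.rfft(ref, n)
    cross = X * np.conj(R)
    cross /= np.abs(cross) + 1e-12            # phase transform (PHAT) weighting
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift

def delay_and_sum(channels, ref_idx=0, max_shift=64):
    """Align every channel to the reference microphone and average them."""
    ref = channels[ref_idx]
    aligned = [np.roll(ch, -gcc_phat_delay(ch, ref, max_shift)) for ch in channels]
    return np.mean(aligned, axis=0)

def snr_db(x, ref):
    """SNR of x against the aligned reference signal, in dB."""
    residual = x - ref
    return 10 * np.log10(np.dot(ref, ref) / np.dot(residual, residual))

# Toy 4-channel recording: the same broadband source arrives with different
# integer delays plus independent noise on each channel.
rng = np.random.default_rng(1)
source = rng.standard_normal(fs)
true_delays = [0, 3, 7, 12]                   # samples
mics = [np.roll(source, d) + 0.7 * rng.standard_normal(fs) for d in true_delays]

enhanced = delay_and_sum(mics)
print(f"single-microphone SNR: {snr_db(mics[0], source):5.1f} dB")
print(f"delay-and-sum SNR:     {snr_db(enhanced, source):5.1f} dB")
```

In the thesis, the steering information can instead come from SRP-PHAT or MUSIC localization; the reported results favor SRP-PHAT on real recordings when no dereverberation is applied.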


Abstract (English): The purpose of speech enhancement is to eliminate interference from a degraded speech signal while retaining the speech of the target speaker. For speech enhancement and source separation, the masking method is widely adopted: a predicted mask is multiplied element-wise with a time-frequency (T-F) feature to recover the clean speech, and the Wiener filter is a typical early method of this kind. In recent years, with the development of high-performance computational devices, deep learning algorithms have also been adopted in speech enhancement, and powerful deep neural networks have been released in succession. However, not every device can afford a high computational cost, so the demand for edge computing has risen. In this thesis, we developed an edge deep learning model based on the dual-signal transformation long short-term memory (LSTM) network (DTLN). To accelerate inference and lower the number of network parameters, we replaced the LSTM with the functionally similar gated recurrent unit (GRU). The training and testing sets were generated using several corpora and the synthesis procedure offered by the DNS challenge. The generated datasets not only consist of degraded audio at various signal-to-noise ratios (SNR) but also take into account the effect of the room impulse response (RIR), which improves the generalizability of the model. The final outcomes show that our architecture outperformed the baseline model on most of the metrics; moreover, we lowered the number of model parameters and the processing time on a Raspberry Pi by 10% and 48%, respectively. In this research, we also thoroughly evaluated the microphone-array pre-processing procedures, introducing several localization algorithms for the delay-and-sum beamformer and the weighted prediction error (WPE) algorithm for dereverberation. In terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) measured on our own recordings, the best performance was achieved by simply using the delay-and-sum beamformer coupled with steered response power-phase transform (SRP-PHAT) localization, without any dereverberation.
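As a concrete illustration of the masking method the abstract starts from, the sketch below applies an oracle ideal ratio mask, standing in for the mask a trained network would predict, to a noisy time-frequency representation and resynthesizes the waveform. The STFT settings, toy signals, and the SI-SNR helper are assumptions made for the example, not the thesis' DTLN configuration.

```python
# A minimal sketch of mask-based enhancement: an oracle ideal ratio mask (a
# stand-in for the network's predicted mask) is multiplied element-wise with
# the noisy T-F representation and transformed back to the time domain. The
# STFT settings and toy signals are assumptions, not the DTLN configuration.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)

# Toy "clean speech" (a slowly sweeping tone) and additive noise at about 0 dB SNR.
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * (200 + 300 * t) * t)
noise = rng.standard_normal(fs)
noise *= np.sqrt(np.mean(clean ** 2) / np.mean(noise ** 2))
noisy = clean + noise

# T-F features via the STFT (512-sample frames, 50% overlap).
_, _, S_noisy = stft(noisy, fs=fs, nperseg=512, noverlap=256)
_, _, S_clean = stft(clean, fs=fs, nperseg=512, noverlap=256)
_, _, S_noise = stft(noise, fs=fs, nperseg=512, noverlap=256)

# Oracle ideal ratio mask; a trained model predicts a similar mask from the
# noisy features alone.
irm = np.abs(S_clean) / (np.abs(S_clean) + np.abs(S_noise) + 1e-8)

# Element-wise masking of the noisy spectrogram, then back to a waveform.
_, enhanced = istft(irm * S_noisy, fs=fs, nperseg=512, noverlap=256)

def si_snr_db(est, ref):
    """Scale-invariant SNR in dB, a rough proxy for enhancement quality."""
    est, ref = est[: len(ref)], ref[: len(est)]
    proj = np.dot(est, ref) / np.dot(ref, ref) * ref
    return 10 * np.log10(np.dot(proj, proj) / np.dot(est - proj, est - proj))

print(f"SI-SNR, noisy input:   {si_snr_db(noisy, clean):5.1f} dB")
print(f"SI-SNR, masked output: {si_snr_db(enhanced, clean):5.1f} dB")
```

Swapping the oracle mask for the output of the GRU-based DTLN variant gives the enhancement path described above; STOI and PESQ would then be computed on the enhanced signal against the clean reference.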

Table of contents:
1 Introduction
  1.1 Related Work
  1.2 Thesis Organization
2 Database and Alignment
  2.1 Recording Experiments
  2.2 Data Alignment
3 The Pre-processing System
  3.1 Delay and Sum Beamformer
  3.2 Localization Algorithms
    3.2.1 The Generalized Cross-Correlation-Phase Transform
    3.2.2 The SRP-PHAT Localization Algorithm
    3.2.3 The Multiple Signal Classification Algorithm
  3.3 Dereverberation Algorithms
4 Deep Learning Methods
  4.1 Deep Learning Models
    4.1.1 The Baseline Model
    4.1.2 Model Architecture
  4.2 Dataset
    4.2.1 Speech Dataset
    4.2.2 Noise Dataset
    4.2.3 Synthesis Procedure
  4.3 Performance Evaluation
    4.3.1 Metrics
    4.3.2 Automatic Speech Recognition
  4.4 The Experimental Setup
5 Results and Discussion
  5.1 The Recording System
    5.1.1 Evaluation of Localization Algorithm
    5.1.2 The Evaluation of Delay and Sum Beamformer
    5.1.3 Evaluation of Dereverberation Algorithm
  5.2 Model Performance
    5.2.1 Comparison of DTLN Based Models
    5.2.2 Comparison with SOTA Model
6 Conclusions
7 Future Work
References
Appendix
  A.1 Lemma
    A.1.1 Lemma 1
    A.1.2 Lemma 2
  A.2 Localization Results under Different SNR
  A.3 Suggestions from the Oral Defense Committee

    N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press, 1964.

    D. FitzGerald, “Vocal separation using nearest neighbours and median filtering,” in IET Irish Signals and Systems Conference (ISSC 2012), pp. 1–5, June 2012.

    R. Yu, “Speech enhancement based on soft audible noise masking and noise power estimation,” Speech Communication, vol. 55, pp. 964–974, nov 2013.

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, (Cham), pp. 234–241, Springer, October 2015.

    A. A. Nugraha, A. Liutkus, and E. Vincent, “Deep neural network based multichannel audio source separation,” in Audio Source Separation, Signals and communication technology, (Cham), pp. 157–185, Springer International Publishing, 2018.

    R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, p. 2154, 2020.

    A. Defossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” in Interspeech, 2020.

    S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y. Tsao, “MetricGAN+: An improved version of MetricGAN for speech enhancement,” in Interspeech 2021, pp. 201–205, International Speech Communication Association (ISCA), Aug. 2021.

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.

    International Telecommunication Union, “Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs,” ITU-T Recommendation P.862.2, 2007.

    C. Veaux, J. Yamagishi, and S. King, “The voice bank corpus: Design, collection and data analysis of a large regional accent speech database,” in Oriental COCOSDA International Conference on Speech Database and Assessments, pp. 1–4, IEEE, 2013.

    J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in International Congress on Acoustics (ICA) 2013 Montreal, vol. 19, (Montreal, Canada), p. 035081, June 2013.

    C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25, IEEE, 2021.

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

    J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35, IEEE, 2016.

    C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6493–6497, 2021.

    J. Chen, W. Rao, Z. Wang, Z. Wu, Y. Wang, T. Yu, S. Shang, and H. Meng, “Speech enhancement with fullband-subband cross-attention network,” CoRR, vol. abs/2211.05432, 2022.

    N. L. Westhausen and B. T. Meyer, “Dual-signal transformation lstm network for real-time noise suppression,” CoRR, vol. abs/2005.07551, 2020.

    Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5709–5713, IEEE, 2018.

    A. Li, W. Liu, C. Zheng, and X. Li, “Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement,” in ICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6487–6491, IEEE, 2022.

    C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating rnn-based speech enhancement methods for noise-robust text-to-speech,” in Speech Synthesis Workshop (SSW), pp. 146–152, ISCA, 2016.

    K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 23rd Annual ACM Conference on Multimedia, (New York, NY, USA), pp. 1015–1018, Association for Computing Machinery, 2015.

    C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4214–4217, IEEE, 2010.

    T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

    J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014.

    H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, A. Ju, M. Zohourian, M. Tang, H. Gamper, M. Golestaneh, and R. Aichner, “ICASSP 2023 deep noise suppression challenge,” CoRR, vol. abs/2303.11510, 2023.

    H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.

    G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf, “A pitch tracking corpus with evaluation on multipitch tracking scenario,” in Twelfth Annual Conference of the International Speech Communication Association, August 2011.

    J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.

    J. Wilkins, P. Seetharaman, A. Wahl, and B. Pardo, “Vocalset: A singing voice dataset.,” in International Society for Music Information Retrieval (ISMIR), pp. 468–474, 2018.

    J. S. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, IEEE, 2017.

    F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in Proceedings of the 21st ACM International Conference on Multimedia, pp. 411–412, 2013.

    T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5220–5224, IEEE, 2017.

    S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, “Acoustical sound database in real environments for sound scene understanding and handsfree speech recognition.,” in International Conference on Language Resources and Evaluation (LREC), pp. 965–968, 2000.

    K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, et al., “The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech,” in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4, IEEE, 2013.

    M. Jeub, M. Schafer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in 16th International Conference on Digital Signal Processing, pp. 1–5, IEEE, 2009.

    E. A. Habets, “Room impulse response generator,” Technische Universiteit Eindhoven, Tech. Rep, vol. 2, no. 2.4, p. 1, 2006.

    J. H. L. Hansen and B. L. Pellom, “An effective quality evaluation protocol for speech enhancement algorithms,” in 5th International Conference on Spoken Language Processing (ICSLP 1998), pp. 2819–2822, ISCA, Nov. 1998.

    Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008.

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, Curran Associates, Inc., 2020.

    V. Pratap, A. Tjandra, B. Shi, P. Tomasello, A. Babu, S. Kundu, A. Elkahky, Z. Ni, A. Vyas, M. Fazel-Zarandi, A. Baevski, Y. Adi, X. Zhang, W.-N. Hsu, A. Conneau, and M. Auli, “Scaling speech technology to 1,000+ languages,” CoRR, vol. abs/2305.13516, 2023.

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
