| Author: | Lin, Zi-Jie (林子傑) |
| --- | --- |
| Thesis Title: | Multi-Channel Separation and Localization of Speech Signals Using Transformer Network in Reverberant Environments |
| Advisor: | Bai, Ming-sian R. (白明憲) |
| Committee Members: | Liu, Yi-Wen (劉奕汶); Chien, Jen-Tzung (簡仁宗) |
| Degree: | Master |
| Department: | Department of Power Mechanical Engineering, College of Engineering |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 (ROC calendar) |
| Language: | English |
| Number of Pages: | 54 |
| Keywords (Chinese, translated): | deep learning; dereverberation; sound source localization; speech separation |
| Keywords (English): | Multichannel learning-based network; Signal separation |
Abstract (Chinese, translated):
This thesis investigates how to localize and separate sound sources in severely reverberant environments. In Chapter 2, we compare conventional localization algorithms and improve on their weaknesses so that they handle multi-source localization more effectively. In Chapter 3, we discuss how to estimate the relative transfer function (RTF) of the system in single-source and multi-source scenarios, and then apply beamforming algorithms to separate the sources and to suppress reverberation and noise. In Chapter 4, we propose a deep neural network (DNN) model that uses a self-attention network to perform speech separation and dereverberation jointly, and we add data augmentation to make the training more stable. We use an acoustic array model together with the image source method to simulate sound propagation and generate speech signals in highly reverberant environments as the test set. The proposed network is compared with array-based algorithms and with other neural networks, using SI-SNR, PESQ, and STOI as the evaluation metrics for separation performance. The results show that the proposed model separates speech effectively in highly reverberant environments. In addition, we show that the separation results of the model can be combined with conventional localization methods; by reducing the influence of interfering sources and reverberation, the localization accuracy is greatly improved.
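For concreteness, the SI-SNR (scale-invariant signal-to-noise ratio) metric mentioned above can be computed as in the minimal NumPy sketch below. This is the standard definition of the metric, not code from the thesis; the function name `si_snr` and its signature are illustrative.

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    # Remove the mean so the metric ignores any DC offset.
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Project the estimate onto the reference to get the "target" component;
    # the residual is treated as distortion/noise.
    s_target = np.dot(estimate, reference) * reference / (np.dot(reference, reference) + eps)
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
```

Because the target component is obtained by projection, the score is invariant to the overall gain of the estimate, which is why SI-SNR is preferred over plain SNR for separation benchmarks.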
Abstract (English):
This thesis addresses the problems of source localization and separation in reverberant environments. For the localization task, the Multiple Signal Classification (MUSIC) algorithm and the Steered Response Power with Phase Transform (SRP-PHAT) algorithm are introduced and compared. The experimental results indicate that the SRP-PHAT algorithm outperforms the MUSIC algorithm in single-source reverberant scenarios. However, the SRP-PHAT algorithm suffers from performance degradation in multi-speaker, reverberant environments. To overcome this degradation, the expectation-maximization (EM) algorithm is introduced and combined with SRP-PHAT. The results demonstrate robust and effective localization of multiple sources in reverberant environments.

For the separation task, the global and local simplex separation (GLOSS) method is introduced to estimate the spectral mask and the relative transfer function (RTF); separation can then be performed by spectral masking or by beamforming. A learning-based system, the Filter-and-Sum Network (FaSNet), for joint dereverberation and separation is also introduced. Furthermore, a learning-based speech enhancement system, the multi-channel separation transformer (mSepformer), is developed. The model is trained end-to-end, and data augmentation (DA) is applied to prevent the network from overfitting. The proposed mSepformer is compared with several learning-based and array-based baselines. The experimental results demonstrate that the proposed network significantly outperforms the baselines on joint dereverberation and separation. In addition, the separated signals output by the proposed network can be used to localize the sources with the SRP-PHAT algorithm.
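As a rough illustration of the SRP-PHAT grid search referred to in both abstracts, the sketch below accumulates PHAT-weighted cross-power spectra over all microphone pairs for a grid of candidate far-field azimuths. It is only a schematic single-source, single-frame example under an assumed planar geometry; the thesis additionally combines SRP-PHAT with an EM step for multiple speakers, which is not shown here, and the function name, grid, and array layout are assumptions rather than the thesis implementation.

```python
import numpy as np
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s

def srp_phat_doa(frames: np.ndarray, mic_pos: np.ndarray, fs: int,
                 n_fft: int = 1024, n_angles: int = 360) -> float:
    """Single-source far-field azimuth estimate (degrees) via an SRP-PHAT grid search.

    frames  : (n_mics, n_samples) synchronized time-domain snapshot
    mic_pos : (n_mics, 2) microphone coordinates in the horizontal plane (m)
    """
    X = np.fft.rfft(frames, n=n_fft, axis=1)                    # per-channel spectra
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    angles = np.linspace(0.0, 360.0, n_angles, endpoint=False)
    # Unit look directions on the candidate azimuth grid.
    dirs = np.stack([np.cos(np.deg2rad(angles)), np.sin(np.deg2rad(angles))], axis=1)
    power = np.zeros(n_angles)
    for i, j in combinations(range(frames.shape[0]), 2):
        # PHAT weighting: keep only the phase of the cross-power spectrum.
        cross = X[i] * np.conj(X[j])
        cross /= np.abs(cross) + 1e-12
        # Expected TDOA between mics i and j for each candidate direction.
        tau = dirs @ (mic_pos[j] - mic_pos[i]) / SPEED_OF_SOUND  # (n_angles,)
        # Steered GCC-PHAT values, accumulated over all microphone pairs.
        power += np.real(np.exp(2j * np.pi * np.outer(tau, freqs)) @ cross)
    return float(angles[np.argmax(power)])
```

In use, one would call something like `srp_phat_doa(frame, mic_xy, fs=16000)` on a short synchronized multichannel frame and read off the azimuth that maximizes the accumulated steered response power.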