Graduate student: | 孔繁傑 Kung, Fan-Jie |
---|---|
Thesis title: | 應用麥克風陣列訊號處理於惡劣環境下的語者數目估計、語音增強和聲源分離 Microphone array signal processing for speaker counting, speech enhancement, and source separation in adverse environments |
Advisor: | 白明憲 Bai, Ming-Sian R. |
Oral defense committee: | 劉奕汶 Liu, Yi-Wen; 黃元豪 Huang, Yuan-Hao; 冀泰石 Chi, Tai-Shih; 吳炤民 Wu, Chao-Min |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2024 |
Academic year of graduation: | 112 (ROC calendar) |
Language: | English |
Number of pages: | 127 |
Chinese keywords: | 巢式廣義旁辦消除、最小變異無失真響應、線性限制最小變異 |
English keywords: | Nested generalized sidelobe cancellation (NGSC), minimum variance distortionless response (MVDR), linearly constrained minimum variance (LCMV) |
Speech quality is prone to degradation under adverse acoustic conditions, such as noisy and reverberant environments. Microphone array signal processing plays a vital role in improving speech quality under such conditions, where source counting, speech enhancement, and source separation can be particularly challenging. In this thesis, several array-based techniques are proposed to address the problems of source counting, speech enhancement, and source separation in adverse acoustic environments. First, we focus on speech enhancement with prior knowledge of the direction-of-arrival (DOA) of the target source. A minimum variance distortionless response (MVDR) beamformer embedded in a generalized sidelobe canceller (GSC) structure with a cascaded Wiener-based postfilter (MVDR-GSC-PF) is implemented. The noise and reverberation covariance matrices in MVDR-GSC-PF are estimated using the weighted prediction error method and a blocking matrix. The adaptive weights of the GSC beamformer are used to mitigate the interfering source. A Wiener-based postfilter is cascaded at the output of the GSC beamformer to further reduce the residual noise and reverberation. Experiments designed using Taguchi orthogonal arrays show that the MVDR-GSC-PF algorithm outperforms the blocking-based multichannel Wiener filter (BMWF) algorithm, the two-stage beamforming approach (TSBA), and the integrated sidelobe cancellation and linear prediction Kalman filter (ISCLP) algorithm in terms of the perceptual evaluation of speech quality (PESQ), the short-time objective intelligibility (STOI), and the signal-to-distortion ratio (SDR). For source counting, we also propose a nested GSC (NGSC) approach based on power monitoring using a successive blocking procedure. For source separation, a linearly constrained minimum variance beamformer with postfiltering (LCMV-PF) is proposed based on the number of sources and the DOA information.
The cosine similarity between the free-field steering vector and the relative transfer function (RTF) vector estimated via NGSC is employed for DOA estimation. The golden section search (GSS) algorithm is combined with the cosine similarity to accelerate the DOA search. By combining the techniques of recursive averaging and eigenvalue decomposition (EVD), the noise covariance matrix in LCMV-PF can be reconstructed with full rank. Monte Carlo simulations and experiments with objective quality measures are used to evaluate the proposed NGSC and LCMV-PF methods.
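
The abstract gives no implementation details, so the following is only a minimal sketch of the DOA search idea: score each candidate angle by the cosine similarity between a free-field steering vector and an RTF vector, and locate the peak with a golden section search. It assumes a uniform linear array, a single frequency bin, a speed of sound of 343 m/s, and a similarity curve that is unimodal over the search bracket. The RTF vector here is synthesized from the steering vector rather than estimated via NGSC, and all function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value, not taken from the thesis)


def steering_vector(theta_deg, freq_hz, mic_pos_m):
    """Free-field plane-wave steering vector of a linear array, relative to mic 0."""
    rel_pos = mic_pos_m - mic_pos_m[0]
    delays = rel_pos * np.cos(np.deg2rad(theta_deg)) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def cosine_similarity(a, h):
    """Magnitude of the normalized inner product of two complex vectors."""
    return np.abs(np.vdot(a, h)) / (np.linalg.norm(a) * np.linalg.norm(h))


def golden_section_doa(rtf, freq_hz, mic_pos_m, lo=0.0, hi=180.0, tol=1e-3):
    """Maximize the cosine similarity over [lo, hi] degrees by golden section search.

    Assumes the similarity is unimodal on the bracket, which holds only for
    suitable array geometries and frequencies.
    """
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0  # about 0.618

    def score(theta):
        return cosine_similarity(steering_vector(theta, freq_hz, mic_pos_m), rtf)

    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = score(x1), score(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]; reuse x2 as the new lower probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = score(x2)
        else:
            # The maximum lies in [lo, x2]; reuse x1 as the new upper probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = score(x1)
    return 0.5 * (lo + hi)


# Illustrative setup: 4 microphones, 5 cm spacing, 1 kHz bin, source at 60 degrees.
mic_positions = np.arange(4) * 0.05
rtf_vector = steering_vector(60.0, 1000.0, mic_positions)
estimated_doa = golden_section_doa(rtf_vector, 1000.0, mic_positions)
print(f"Estimated DOA: {estimated_doa:.2f} degrees")
```

Each iteration reuses one of the two previous probe points, so the search needs one new similarity evaluation per step, whereas an exhaustive grid scan evaluates every candidate angle. With wider arrays or higher frequencies the similarity curve can have several peaks, in which case the bracket would need to be narrowed first, for example with a coarse grid.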