| Student: | 劉美忻 Liu, Elaine Mei-Hsin |
|---|---|
| Thesis Title: | 以諧振時間結構分群作含有音樂背景的單聲道語音增強 (Single-Channel Speech Enhancement with Background Music Based on Harmonic-Temporal Structured Clustering) |
| Advisor: | 王小川 Wang, Hsiao-Chuan |
| Oral Examination Committee: | |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2009 |
| Academic Year: | 97 (2008-2009) |
| Language: | English |
| Pages: | 87 |
| Keywords (Chinese): | 諧振時間結構、分群、音樂背景、單聲道、語音增強、短時段傅立葉轉換、高斯混合模型、音高頻率 |
| Keywords (English): | Harmonic-temporal structured clustering, HTC, clustering, background music, single-channel, speech enhancement, short-time Fourier transform, Gaussian mixture model, GMM, F0 contour, pitch |
Harmonic-temporal structured clustering (HTC) is a method for separating the sounds in a single-channel recording. This study applies HTC to speech enhancement, separating a single speaker's voice from background music produced by one or two musical instruments. The separation is performed in the time-frequency domain after the short-time Fourier transform (STFT), and the separated sounds are then synthesized with the overlap-add (OLA) method. HTC approximates the power spectrum of the mixed signal with a Gaussian mixture model (GMM); in this study the GMM consists of a speech model and a music model, which estimate the contributions of speech and music to the power spectrum, respectively. Each sound model produces a masking function that is multiplied with the observed power spectrum to extract that sound's power spectrum. The speech and music power spectra are assumed to be additive. HTC estimates only the magnitude of the STFT; the phase of the original mixture is used in the OLA procedure to synthesize the waveform of each extracted sound.
The speech and music models are weighted sums of Gaussians whose time-frequency locations represent a harmonic-temporal structure. This structure models the pitch frequency (F0), onset time, timbre, and duration of voiced speech and musical notes. A sound model estimates the observed notes and phones by finding the components of the power spectrum that best fit this structure. The speech and music models differ in the type and initialization of their pitch contours. The music model estimates the F0 of a note as a straight line of constant frequency, while the speech model estimates the F0 of voiced speech as a continuous, slightly curving contour that stays close to the speaker's average pitch. The speaker's average pitch is used as the initial pitch value in the speech model, helping to anchor the model's pitch near that of the voiced speech. The speech model locates phones through voicing, but it can also roughly estimate unvoiced speech. These different F0 contour types and initial values are the basis on which the two sounds, speech and music, are distinguished.
The sound models are fitted to the observed power spectrum with the expectation-maximization (EM) algorithm, which iteratively updates the model parameters. The goal of the algorithm is to maximize an objective function that measures the similarity between the observed power spectrum and the overall GMM. This objective function is biased toward models with certain characteristics.
Experiments show that HTC is an effective method for speech enhancement. When HTC is applied to mixtures with low SNR, the speech improves noticeably; for mixtures with high SNR, the improvement is small. HTC also provides an estimate of the music signal, but the results show that the separated music has good quality only when music dominates the mixture, and poorer quality otherwise. HTC speech-music separation is therefore better suited to speech enhancement than to music enhancement.
Harmonic-temporal structured clustering (HTC) is a method for sound separation in a single-channel recording. In this work, HTC is applied to speech enhancement by separating single-speaker speech from additive background music produced by one or two musical instruments. Separation is done in the short-time Fourier transform (STFT) time-frequency domain, and the separated sounds are reconstructed by overlap-add (OLA). HTC essentially fits a Gaussian mixture model (GMM) to the mixture’s power spectrum. The GMM is composed of a speech sound model and a music sound model that respectively approximate the speech and music power spectrum contributions. A masking function is created from each sound model and applied to the observed power spectrum to extract an estimate of the clean sound’s power spectrum. The speech and music power spectra are assumed additive, and HTC estimates only the STFT magnitude. The mixture’s phase is used in OLA to reconstruct an estimate of the clean sound’s waveform.
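As a rough illustration of this mask-and-resynthesize step (a minimal sketch, not the thesis's implementation), the Python fragment below derives soft masks from two hypothetical model spectrograms, applies them to the mixture STFT, and resynthesizes each source with the mixture phase via overlap-add. The names `speech_model_power` and `music_model_power` are placeholders for the spectrograms produced by the fitted HTC sound models.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, fs, speech_model_power, music_model_power,
                        nperseg=1024, noverlap=768):
    """Soft-mask the mixture STFT with two model power spectrograms.

    speech_model_power and music_model_power are assumed to be nonnegative
    arrays with the same shape as the mixture spectrogram (in HTC they would
    come from the fitted speech and music sound models).
    """
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)

    total = speech_model_power + music_model_power + 1e-12
    speech_mask = speech_model_power / total   # masking function for speech
    music_mask = music_model_power / total     # masking function for music

    # Scaling the complex STFT keeps the mixture phase, as in the OLA step.
    _, speech_hat = istft(speech_mask * X, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, music_hat = istft(music_mask * X, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return speech_hat, music_hat
```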
The speech and music sound models are sums of weighted Gaussians placed at time-frequency locations corresponding to a harmonic and temporal structure. This structure models the pitch (F0), onset time, timbre, and duration of musical notes or voicing in phones. Sound models approximate notes and phones in the observation by finding components in the power spectrum that best fit the structure.
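To make the structure concrete, here is a small, self-contained sketch of a single note/phone model: weighted Gaussians placed at harmonic multiples of F0 along frequency and at evenly spaced points after the onset along time. It is greatly simplified relative to the actual HTC model (which works on a log-frequency axis with tied parameters), and all parameter names and values are illustrative.

```python
import numpy as np

def harmonic_temporal_model(freqs, times, f0, onset, duration,
                            harmonic_weights, temporal_weights,
                            sigma_f=20.0, sigma_t=0.05):
    """Toy harmonic-temporal kernel for one note or voiced phone."""
    F, T = np.meshgrid(freqs, times, indexing="ij")
    model = np.zeros_like(F)
    step = duration / max(len(temporal_weights) - 1, 1)
    for n, w_n in enumerate(harmonic_weights, start=1):      # harmonic number
        peak = np.exp(-0.5 * ((F - n * f0) / sigma_f) ** 2)  # spectral peak
        for y, u_y in enumerate(temporal_weights):           # temporal kernel
            env = np.exp(-0.5 * ((T - (onset + y * step)) / sigma_t) ** 2)
            model += w_n * u_y * peak * env
    return model

# Example: a 200 Hz "note" starting at 0.5 s and lasting about 0.4 s.
freqs = np.linspace(0.0, 4000.0, 513)
times = np.linspace(0.0, 2.0, 200)
note = harmonic_temporal_model(freqs, times, f0=200.0, onset=0.5, duration=0.4,
                               harmonic_weights=[1.0, 0.6, 0.3, 0.15],
                               temporal_weights=[0.5, 1.0, 0.8, 0.3])
```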
The speech sound model is distinguished from the music sound model by the type and initialization of its F0 contour. The music model estimates the F0 of music notes as lines of constant frequency. The speech model estimates the F0s of voiced speech as segments along a continuous, slightly curving contour that does not deviate much from the speaker’s average voicing pitch. The speaker’s average F0 is assumed to be given, so it can be used to initialize the speech model’s parameters and help the model fit voiced speech with pitch near that average. The speech model uses voicing to find phones, but it can also roughly estimate unvoiced speech. These different F0 contour types and initializations are what allow the two sound models to distinguish between speech and music and thus separate the two sounds.
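A minimal sketch of the two contour types and their initialization (an illustrative parameterization only; the thesis's actual contour model may differ): music notes get flat, constant-frequency tracks, while the speech track is a smooth deviation around the speaker's given average F0, which also serves as its initial value.

```python
import numpy as np

def music_f0_contours(times, note_f0s):
    """One flat (constant-frequency) F0 track per music note."""
    return [np.full_like(times, f0) for f0 in note_f0s]

def speech_f0_contour(times, speaker_avg_f0, dev_coeffs=(0.0, 0.0)):
    """Speech F0 as a smooth, slightly curving deviation around the speaker's
    average pitch; a low-order polynomial stands in for the smooth contour.
    Zero coefficients give the flat initialization at the average F0."""
    deviation = np.polyval(dev_coeffs, times - times.mean())
    return speaker_avg_f0 + deviation

# Example: three notes and a speech track for a speaker averaging 180 Hz.
t = np.linspace(0.0, 2.0, 200)
note_tracks = music_f0_contours(t, [220.0, 330.0, 440.0])
speech_track = speech_f0_contour(t, 180.0)
```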
The sound models are fit to the observed spectrogram by iteratively updating their parameters with the expectation-maximization (EM) algorithm. This algorithm seeks to maximize an objective function that measures the similarity between the total GMM and the observed power spectrum. The objective function is biased to favor models with certain preferred characteristics.
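Schematically, the fitting loop alternates between assigning each model a masked share of the observation and re-estimating that model from its share, as sketched below. This is only the shape of the procedure, not the HTC update equations: `refit_fns` is a placeholder for each model's M-step (re-estimating its F0 contour, onset times, and weights), and the prior terms that bias the objective are omitted.

```python
def em_style_fit(observed_power, models, refit_fns, n_iter=30, eps=1e-12):
    """Schematic EM-style loop for additive sound models.

    observed_power: nonnegative spectrogram of the mixture (NumPy array).
    models:         list of current model spectrograms (same shape).
    refit_fns:      placeholder M-steps; refit_fns[k] maps model k's masked
                    share of the observation to an updated model spectrogram.
    """
    for _ in range(n_iter):
        total = sum(models) + eps
        # E-step: soft masks split the observation among the sound models.
        shares = [m / total * observed_power for m in models]
        # M-step: each model is re-estimated from its own share.
        models = [refit(share) for refit, share in zip(refit_fns, shares)]
    return models
```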
An experiment demonstrates that HTC is an effective method for speech enhancement. When HTC is applied to mixtures at low signal-to-noise ratio (SNR), there is a significant improvement in speech quality; for mixtures at high SNR, there is little improvement. HTC also provides an estimate of the music signal, but the results show that the estimated music has good quality only when the mixture is dominated by music, and poor quality otherwise. Thus HTC speech-music separation is more applicable to speech enhancement than to music enhancement.