
研究生 (Author): 劉美忻 (Liu, Elaine Mei-Hsin)
論文名稱 (Title): 以諧振時間結構分群作含有音樂背景的單聲道語音增強
Single-Channel Speech Enhancement with Background Music Based on Harmonic-Temporal Structured Clustering
指導教授 (Advisor): 王小川 (Wang, Hsiao-Chuan)
口試委員 (Committee Members):
學位類別 (Degree): 碩士 (Master)
系所名稱 (Department): 電機資訊學院 - 電機工程學系 (Department of Electrical Engineering)
論文出版年 (Year of Publication): 2009
畢業學年度 (Graduation Academic Year): 97
語文別 (Language): 英文 (English)
論文頁數 (Pages): 87
中文關鍵詞 (Keywords, Chinese): 諧振時間結構分群、音樂背景、單聲道、語音增強、短時段傅立葉轉換、高斯混合模型、音高頻率
外文關鍵詞 (Keywords, English): Harmonic-temporal structured clustering, HTC, cluster, background music, single-channel, speech enhancement, short-time Fourier transform, Gaussian mixture model, GMM, F0 contour, pitch
    諧振時間結構分群 (harmonic-temporal structured clustering, HTC) 是一種把單聲道錄音作聲音分離的方法,本研究即應用HTC技術作語音增強,把單人語音從一或兩種樂器產生的背景音樂中分離出來。分離的操作是在短時段傅立葉轉換(short-time Fourier transform, STFT)之後的時頻域中進行,然後用重疊相加(overlap-add, OLA)方法合成分離之後的聲音。HTC方法是用高斯混合模型(Gaussian mixture model, GMM)來近似混合訊號的能量頻譜;在本研究中,此GMM由語音模型和音樂模型組成,分別估計語音和音樂對能量頻譜的貢獻。每一個聲音模型製作一個遮蔽函數(masking function),乘上觀察的能量頻譜,用以抽出該聲音的能量頻譜。語音和音樂的能量頻譜假設具有可加性,HTC只估計訊號做STFT之後的大小值,原混合訊號的相位則用於OLA演算法中,以合成所抽出聲音的波形。

    語音和音樂模型是數個高斯函數的加權總和,高斯函數的時間頻率位置用以表示諧振時間結構。這個結構塑造了有聲語音和音樂音符中的音高頻率(pitch frequency, F0)、開始時間、音色和持續時間。聲音模型在能量頻譜內尋找最符合此結構的成份,藉此估計觀察到的音符和音素。語音模型和音樂模型的區別在於它們的音高曲線有不同的類型和初始值。音樂模型以固定頻率的直線估計音符的F0;語音模型估計有聲語音的F0,將其描述成一條與說話者平均音高相差不遠、連續而略微彎曲的線段。說話者的平均音高用來作為語音模型中音高的初始值,幫助將模型的音高定在有聲語音音高附近。語音模型利用有聲語音找到音素,但也可以粗略估計無聲語音。不同的F0估計類型和初始值是區分語音和音樂這兩種聲音的基礎。

    以聲音模型來近似觀察到的聲音能量頻譜,是用期望-最大化演算法(Expectation-Maximization algorithm, EM algorithm)反覆更新模型的參數。此演算法的目的是找出目標函數的最大值,此目標函數衡量觀察的能量頻譜和總GMM的相似度,並且被設計成偏好具有某些特質的模型。

    實驗顯示HTC是語音增強的有效方法:當HTC應用在低SNR的混合訊號時,語音有明顯的改進;對於高SNR的混合訊號,語音的改進就很少。HTC也提供音樂訊號的估計,但是結果顯示只有當音樂支配混合訊號時,分離出來的音樂才有好的品質,否則品質較差。因此以HTC技術作語音和音樂的分離,應用在語音增強勝於應用在音樂增強。


    Harmonic-temporal structured clustering (HTC) is a method for sound separation in a single-channel recording. In this work, HTC is applied to speech enhancement by separating single-speaker speech from additive background music produced by one or two musical instruments. Separation is done in the short-time Fourier transform (STFT) time-frequency domain, and the separated sounds are reconstructed by overlap-add (OLA). HTC essentially fits a Gaussian mixture model (GMM) to the mixture's power spectrum. The GMM is composed of a speech sound model and a music sound model that respectively approximate the speech and music contributions to the power spectrum. A masking function is created from each sound model and applied to the observed power spectrum to extract an estimate of that clean sound's power spectrum. The speech and music power spectra are assumed additive, and HTC estimates only the STFT magnitude; the mixture's phase is reused in OLA to reconstruct an estimate of each clean sound's waveform.
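    As a minimal sketch of this analysis-masking-resynthesis pipeline (not the thesis's implementation), the fragment below assumes hypothetical callables `speech_model_power` and `music_model_power` that stand in for the fitted HTC sound models and return nonnegative model power spectra of the same shape as the mixture's spectrogram:

```python
import numpy as np
from scipy.signal import stft, istft

def separate(mixture, fs, speech_model_power, music_model_power):
    """Separate a single-channel mixture into speech and music estimates.

    speech_model_power / music_model_power are hypothetical stand-ins for the
    fitted HTC sound models: callables returning nonnegative power spectra of
    shape (len(f), len(t))."""
    # STFT analysis of the mixture (window/hop sizes are illustrative).
    f, t, X = stft(mixture, fs=fs, nperseg=1024, noverlap=768)
    power = np.abs(X) ** 2

    # Model power spectra contributed by each source.
    P_speech = speech_model_power(f, t)
    P_music = music_model_power(f, t)
    total = P_speech + P_music + 1e-12          # guard against division by zero

    # Masking functions derived from the sound models, applied to the observed
    # power spectrum to extract each source's power spectrum.
    speech_power = (P_speech / total) * power
    music_power = (P_music / total) * power

    # Reuse the mixture's phase, since HTC estimates only the STFT magnitude.
    phase = np.exp(1j * np.angle(X))
    _, speech_hat = istft(np.sqrt(speech_power) * phase, fs=fs,
                          nperseg=1024, noverlap=768)
    _, music_hat = istft(np.sqrt(music_power) * phase, fs=fs,
                         nperseg=1024, noverlap=768)
    return speech_hat, music_hat
```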
    The speech and music sound models are sums of weighted Gaussians placed at time-frequency locations corresponding to a harmonic and temporal structure. This structure models the pitch (F0), onset time, timbre, and duration of musical notes or voicing in phones. Sound models approximate notes and phones in the observation by finding components in the power spectrum that best fit the structure.
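    As a rough illustration (an illustrative parameterization, not the thesis's exact source model), a single note or voiced segment can be approximated by weighted Gaussians centered at harmonic multiples of F0 along the frequency axis and tiling the interval from onset to offset along the time axis:

```python
import numpy as np

def harmonic_temporal_power(f, t, f0, onset, duration, harmonic_weights,
                            sigma_f=20.0, sigma_t=0.05, n_frames=4):
    """Illustrative power-spectrum model of one harmonic sound: a grid of
    weighted Gaussians at the harmonics of f0 (frequency axis) spread over the
    sound's duration (time axis). Parameter names are stand-ins, not the
    thesis's notation."""
    F, T = np.meshgrid(np.asarray(f, float), np.asarray(t, float), indexing="ij")
    model = np.zeros(F.shape)
    # Temporal envelope: Gaussian kernels tiling [onset, onset + duration].
    centers_t = onset + duration * (np.arange(n_frames) + 0.5) / n_frames
    for n, w in enumerate(harmonic_weights, start=1):   # n-th harmonic
        for tc in centers_t:
            model += (w / n_frames) * np.exp(
                -((F - n * f0) ** 2) / (2 * sigma_f ** 2)
                - ((T - tc) ** 2) / (2 * sigma_t ** 2))
    return model
```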
    The speech sound model is distinguished from the music sound model by the type and initialization of its F0 contour. The music model estimates the F0 of each music note as a line of constant frequency. The speech model estimates the F0 of voiced speech as segments along a continuous, slightly curving contour that does not deviate much from the speaker's average voicing pitch. The speaker's average F0 is assumed given, so it can be used to initialize the speech model's parameters and help the model fit voiced speech whose pitch lies near that average. The speech model uses voicing to find phones, but it can also roughly estimate unvoiced speech. These different types and initializations of the F0 estimates are what allow the two sound models to distinguish speech from music and thus separate the two sounds.
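    The two contour types can be sketched as follows; the piecewise-linear speech contour anchored at the speaker's given average F0 is only an illustrative stand-in for the contour model described in Section 2.3.3:

```python
import numpy as np

def music_f0_contour(t, note_f0):
    """Music note: F0 modeled as a line of constant frequency."""
    return np.full(len(t), float(note_f0))

def speech_f0_contour(t, average_f0, anchor_times, anchor_offsets):
    """Voiced speech: a continuous, slightly curving contour that stays close
    to the speaker's average F0, built here by linear interpolation between a
    few anchor offsets (illustrative, not the thesis's exact formulation)."""
    return average_f0 + np.interp(t, anchor_times, anchor_offsets)

# Initializing the speech contour flat at the given average F0 (e.g. 180 Hz)
# keeps the model's pitch near the voiced-speech pitch during fitting.
t = np.linspace(0.0, 1.0, 100)
initial_contour = speech_f0_contour(t, average_f0=180.0,
                                    anchor_times=[0.0, 0.5, 1.0],
                                    anchor_offsets=[0.0, 0.0, 0.0])
```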
    The sound models are fit to the observed spectrogram by iteratively updating their parameters with the expectation-maximization (EM) algorithm. The algorithm seeks to maximize an objective function that measures the similarity between the total GMM and the observed power spectrum; the objective is also biased to favor models with certain preferred characteristics.
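    A minimal sketch of one such iteration is shown below, under the simplifying assumptions that the Gaussian kernels are held fixed, only their mixture weights are re-estimated, and the prior/penalty terms are omitted; the M-step reassigns the observed energy to the kernels according to the E-step memberships, which decreases a KL-divergence-like mismatch between the observed power spectrum and the total GMM:

```python
import numpy as np

def em_weight_update(observed, kernels, weights, n_iter=50):
    """observed: nonnegative power spectrogram, shape (F, T).
    kernels:  fixed Gaussian kernels, shape (K, F, T), each summing to 1.
    kernels are held fixed here; only the mixture weights (K,) are updated."""
    for _ in range(n_iter):
        # E-step: membership (responsibility) of each time-frequency bin
        # with respect to each kernel.
        model = np.einsum("k,kft->ft", weights, kernels) + 1e-12
        membership = weights[:, None, None] * kernels / model      # (K, F, T)
        # M-step: redistribute the observed energy according to memberships.
        weights = np.einsum("ft,kft->k", observed, membership)
    return weights
```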
    An experiment demonstrates that HTC is an effective method for speech enhancement. When HTC is applied to mixtures at low signal-to-noise ratio (SNR), the speech is significantly improved; for mixtures at high SNR, there is little improvement. HTC also provides an estimate of the music signal, but the results show that the estimated music has good quality only when the mixture is dominated by music, and poor quality otherwise. Thus HTC speech-music separation is more applicable to speech enhancement than to music enhancement.
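    For reference, a sketch of a segmental SNR measure of the kind listed in Section 4.1.3 (the frame length and per-frame clamping range are typical choices, not necessarily those used in the experiments):

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len=512, snr_min=-10.0, snr_max=35.0):
    """Average per-frame SNR in dB between a clean signal and its estimate,
    with each frame's SNR clamped to a conventional range."""
    n_frames = min(len(clean), len(estimate)) // frame_len
    frame_snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = estimate[i * frame_len:(i + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-12
        snr = 10.0 * np.log10(np.sum(s ** 2) / noise_energy + 1e-12)
        frame_snrs.append(float(np.clip(snr, snr_min, snr_max)))
    return float(np.mean(frame_snrs))
```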

Abstract (in Mandarin)
Abstract
Chapter 1 Introduction
  1.1 Background
    1.1.1 Overview
    1.1.2 Introduction of HTC
  1.2 STFT Time-Frequency Analysis
    1.2.1 Energy Density
    1.2.2 Sounds with Pitch
    1.2.3 Additivity of Power Spectra
  1.3 Implementation
    1.3.1 STFT Analysis
    1.3.2 OLA Reconstruction
    1.3.3 Implementing HTC with the STFT
Chapter 2 HTC Sound Models
  2.1 Total GMM
  2.2 The Source Model
  2.3 Two Types of Sound Models
    2.3.1 Music Sound Model
    2.3.2 Speech Sound Model
    2.3.3 Linear Interpolation for Speech F0 Contour
  2.4 Notation
  2.5 Extraction of lth Sound by Masking Function
Chapter 3 Unsupervised Gaussian Fitting by the EM Algorithm
  3.1 Introduction
  3.2 Relative Entropy (Kullback-Leibler Distance)
  3.3 Introduction of the Penalty Function
    3.3.1 Dirichlet Distribution
    3.3.2 Prior for Harmonic Envelope
    3.3.3 Prior for Temporal Envelope
    3.3.4 Prior for Speech F0 Contour
  3.4 Summary: The Objective Function
  3.5 Derivation of the EM Algorithm
  3.6 Implementation of the EM Algorithm
  3.7 Relation to Pattern Classification
Chapter 4 Experimental Evaluation
  4.1 Methodology
    4.1.1 Experiment Setup
    4.1.2 Discussion of Initial Model Parameters
    4.1.3 Segmental SNR
  4.2 Results
    4.2.1 General Analysis
    4.2.2 Discussion of SNR
    4.2.3 Example
    4.2.4 Illustration of Mask and Convergence
Chapter 5 Conclusion
  5.1 Summary and Conclusion
  5.2 Possible Improvements
Appendix
  Overview
  I. M-Step Update Equations
  II. Notes on Concavity
  III. The Lagrange Multiplier Theorem
References

