Graduate Student: Li, Han-Xuan (李函軒)
Thesis Title: Mandarin Chinese Text-to-Speech System Based on Hidden Semi-Markov Models and Its Model Adaptation and Voice Conversion (基於隱藏式半馬可夫模型之中文文句轉語音系統及其模型調適與聲音轉換)
Advisors: Wang, Hsiao-Chuan (王小川); Jong, Tai-Lang (鐘太郎)
Oral Examination Committee: Chen, Sin-Horng (陳信宏); Wang, Yih-Ru (王逸如)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering (電機資訊學院 - 電機工程學系)
Year of Publication: 2013
Academic Year of Graduation: 101
Language: Chinese
Pages: 76
Keywords: hidden semi-Markov model; model adaptation; voice conversion
Abstract:
A text-to-speech (TTS) system based on hidden semi-Markov models (HSMMs) uses statistical models to describe the speech synthesis units and their state durations. An input sentence is first represented as a sequence of synthesis units and then converted into speech output. Because changing the model parameters of the synthesis units changes the synthesized voice, model adaptation can be applied to make the synthesized speech approach a target speaker's voice characteristics, emotional features, or speaking rhythm and prosody, thereby achieving voice conversion.
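The defining feature of an HSMM, as opposed to a plain HMM, is that each state's duration is modeled explicitly (e.g., by a per-state Gaussian) rather than implicitly through geometric self-transition probabilities. A minimal sketch of duration-explicit state sequence generation; the unit names and Gaussian duration parameters below are invented purely for illustration:

```python
import random

# Hypothetical synthesis units, each with per-state Gaussian duration
# models (mean, std in frames). These numbers are illustrative only.
UNIT_DURATIONS = {
    "a": [(8.0, 2.0), (12.0, 3.0), (6.0, 1.5)],  # 3-state unit
    "n": [(5.0, 1.0), (7.0, 2.0), (4.0, 1.0)],   # 3-state unit
}

def generate_state_sequence(units, rng=None):
    """For each unit, draw an explicit duration for every state from its
    Gaussian duration model (the HSMM view), instead of accumulating
    frames through self-transitions (the plain-HMM view)."""
    rng = rng or random.Random(0)
    sequence = []  # list of (unit, state_index, duration_in_frames)
    for u in units:
        for s, (mean, std) in enumerate(UNIT_DURATIONS[u]):
            d = max(1, round(rng.gauss(mean, std)))  # durations >= 1 frame
            sequence.append((u, s, d))
    return sequence

seq = generate_state_sequence(["a", "n", "a"])
total_frames = sum(d for _, _, d in seq)
```

In a full synthesizer each (unit, state, duration) triple would then drive the generation of that many frames of spectral and excitation parameters; model adaptation modifies the unit models themselves, which is why the same pipeline can produce a different target voice.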
This thesis goes a step further by taking the residual signal of the target speaker's speech and adding it to the excitation signal of the speech production model, so that the synthesized speech more closely resembles the target speaker. Two methods of adding the residual signal are proposed, and the synthesized speech is evaluated both subjectively and objectively. In the subjective experiments, one of the residual-addition methods produced audible discontinuities, while the other did not. In the objective evaluation, Gaussian mixture models (GMMs) were estimated from the synthesized speech and from the target speaker's speech, and the KL divergence between the GMMs was measured; both residual-addition methods brought the synthesized speech closer to the target speaker's speech.
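The KL divergence between two GMMs has no closed form, so it is typically approximated, for example by Monte-Carlo sampling. A minimal sketch for one-dimensional GMMs, with all mixture parameters invented for illustration (a real evaluation would fit the GMMs to speech features):

```python
import numpy as np

def gmm_sample(weights, means, stds, n, rng):
    # Draw n scalar samples from a 1-D GMM: pick a component, then sample it.
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], stds[comps])

def gmm_logpdf(x, weights, means, stds):
    # log p(x) for a 1-D GMM, evaluated pointwise via log-sum-exp.
    x = np.asarray(x)[:, None]
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * stds ** 2)
                - (x - means) ** 2 / (2 * stds ** 2))
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def mc_kl(p, q, n=100_000, seed=0):
    """Monte-Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)],
    since no closed form exists for GMMs."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(*p, n, rng)
    return float(np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q)))

# Two identical two-component mixtures: the divergence should be zero.
p = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([0.5, 0.5]))
q = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([0.5, 0.5]))
kl = mc_kl(p, q)
```

A smaller estimated KL between the synthesized-speech GMM and the target-speaker GMM indicates that the two feature distributions, and hence the two voices, are closer.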