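| Graduate student | 石文俐 Wen-Li Shih |
|---|---|
| Thesis title | 中文語音合成之韻律產生器的改良與研究 Prosodic Modeling for Mandarin Text-To-Speech |
| Advisor | 張智星 Jyh-Shing Roger Jang |
| Oral examination committee | |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of publication | 2006 |
| Academic year of graduation | 94 (ROC calendar) |
| Language | Chinese |
| Pages | 38 |
| Keywords (Chinese) | 語音合成器 (speech synthesizer), 韻律參數 (prosodic parameters), 迴歸模型 (regression model) |
| Keywords (English) | TTS, Prosody, Regression Model |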
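Corpus-based Mandarin text-to-speech (TTS) systems that rely on a large speech corpus tend to lose naturalness at segment joins because unit selection is inconsistent, and the corpus itself is hard to collect. Departing from earlier designs, we therefore developed a Mandarin TTS system that takes its basic synthesis units from a carrier-sentence corpus, which eases corpus collection. We also built suitable prosodic parameter models so that the implementation can reach a good level of synthesis speed, corpus size, and naturalness at the same time.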
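This thesis examines several common methods for generating the prosodic parameters of Mandarin TTS and uses them to design prosodic models for Mandarin speech: (1) a prosody generator based on a neural network, (2) linear regression, and (3) regression models trained with a support vector machine (SVM).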
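As a rough illustration of the simplest of the three methods, the sketch below fits an ordinary least-squares linear regression that maps per-syllable features to one prosodic parameter. The feature set (tone number, relative position in the sentence), the target (duration in milliseconds), the toy data, and the function names are all assumptions made for this example; the abstract does not list the thesis's actual features. Treating tone as a plain number is also a simplification.

```python
from typing import List, Sequence


def fit_linear_regression(X: Sequence[Sequence[float]],
                          y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (A^T A) w = A^T y.

    Returns [bias, w_1, ..., w_d]. Assumes A^T A is non-singular.
    """
    # Prepend a constant 1.0 to every row so the first weight acts as the bias.
    A = [[1.0, *row] for row in X]
    n = len(A[0])
    # Build the augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in A) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(A, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Apply the fitted weights to one feature row."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Hypothetical features per syllable: [tone (1-5), position in sentence (0-1)].
features = [[1, 0.0], [2, 0.25], [3, 0.5], [4, 0.75], [5, 1.0], [1, 1.0], [3, 0.0]]
# Hypothetical target: syllable duration in ms, generated from a known linear rule
# so that the fit can be checked against it.
durations = [200 + 10 * tone + 40 * pos for tone, pos in features]

weights = fit_linear_regression(features, durations)
print([round(w, 3) for w in weights])
print(round(predict(weights, [2, 0.5]), 3))
```

Because the toy targets are generated from a known linear rule, the fitted weights should recover that rule up to floating-point error. Real prosodic data would leave a residual, which is what an RMSE comparison measures.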
To optimize the models, we ran experiments on the settings of the back-propagation network (BPN) and the SVM, and trained each prosodic parameter separately to improve predictive power.
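For orientation, the SVM settings that are typically tuned in a regression setup can be read off the standard ε-SVR formulation below. This is only a sketch under assumptions: the abstract does not say which SVM regression variant or kernel the thesis used, so the ε-insensitive form and the RBF kernel shown here are illustrative choices, not the author's stated configuration.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad
  & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - \left( w^{\top}\phi(x_i) + b \right) \le \varepsilon + \xi_i, \\
  & \left( w^{\top}\phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, N, \\[4pt]
K(x, x') \;=\; & \phi(x)^{\top}\phi(x') \;=\; \exp\!\left( -\gamma \lVert x - x' \rVert^{2} \right).
\end{aligned}
```

Here $x_i$ is the feature vector of one training sample and $y_i$ the prosodic parameter to predict. Under these assumptions the tunable settings are the penalty $C$, the tube width $\varepsilon$, and the kernel width $\gamma$.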
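We ran inside and outside tests on each model, compared the models by root mean square error (RMSE), and selected the best one for a listening evaluation. In terms of RMSE, the neural network and the SVM scored lower, while linear regression was slightly higher than both. Because the SVM model was more stable, we further tested its settings to improve prediction accuracy. The listening test shows that sentences synthesized with the prosodic models sound more natural than those from the original carrier-sentence-based TTS system.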
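The comparison procedure described above can be sketched as follows: fit a model on the training set, then report RMSE on that same set (inside test) and on held-out data (outside test). The two toy models (a mean-only baseline and a one-feature least-squares line), the data, and every function name are hypothetical stand-ins chosen to keep the example self-contained; they are not the models or data from the thesis.

```python
import math
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]
Sample = Tuple[float, float]  # (feature, target)


def rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(target))


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """One-feature least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def inside_outside_rmse(fit: Callable[..., Model],
                        train: List[Sample],
                        test: List[Sample]) -> Tuple[float, float]:
    """Fit on `train`; return (inside RMSE on train, outside RMSE on test)."""
    train_x, train_y = zip(*train)
    test_x, test_y = zip(*test)
    model = fit(train_x, train_y)
    inside = rmse([model(x) for x in train_x], train_y)
    outside = rmse([model(x) for x in test_x], test_y)
    return inside, outside


# Hypothetical data: relative position in the sentence -> mean pitch in Hz.
train = [(0.0, 220.0), (0.2, 212.0), (0.4, 207.0), (0.6, 199.0), (0.8, 193.0)]
test = [(0.1, 217.0), (0.5, 203.0), (0.9, 188.0)]

mean_in, mean_out = inside_outside_rmse(fit_mean, train, test)
line_in, line_out = inside_outside_rmse(fit_line, train, test)
print(f"mean baseline: inside={mean_in:.2f} outside={mean_out:.2f}")
print(f"line fit:      inside={line_in:.2f} outside={line_out:.2f}")
```

A large gap between the inside and outside values would suggest overfitting, which is one reason to report both numbers rather than the training error alone.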
A corpus-based TTS system is prone to a loss of naturalness caused by acoustic mismatch between the selected synthesis units. Moreover, collecting the speech corpus is a labor-intensive task. We have therefore developed a carrier-sentence-based TTS system for Mandarin Chinese. Our lab continues to improve this system so that it balances synthesis speed, corpus size, and the naturalness of the output utterances.
In this thesis, we investigate several methods for generating the prosodic parameters of a Mandarin TTS system: linear regression, artificial neural networks, and support vector machine (SVM) regression. We compare the root mean square error (RMSE) of these methods on both inside and outside tests to find the best regression model for prosody generation, and then carry out a listening test. The neural network and the SVM achieve better performance in terms of RMSE. We have also performed additional optimization of the SVM parameters.
The listening test shows that, after our prosody modification, the TTS system generates more natural-sounding utterances.