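| Author: | 江克敬 Ke-Ching Chiang |
|---|---|
| Thesis title: | 華語韻律轉換之研究與實作 Research and Implementation of Prosody Conversion for Mandarin Chinese |
| Advisor: | 張智星 Jyh-Shing Roger Jang |
| Committee members: | |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊工程學系 (College of Electrical Engineering and Computer Science - Department of Computer Science) |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 |
| Language: | Chinese |
| Pages: | 33 |
| Chinese keywords: | 韻律轉換 (prosody conversion), 語音合成 (speech synthesis) |
| Foreign-language keywords: | voice conversion, speech synthesis |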
本論文建置一個華語韻律轉換的系統,可達成語音的:(1)聲調轉換,(2)長度轉換與(3)音量轉換。
我們先對整句語句使用UPDUDP (unbroken pitch determination using dynamic programming)進行音高追蹤,對整個語句取出平均幅度差函數法的(average magnitude difference function, AMDF)時域特徵,再以動態程式最佳化(DP, dynamic programming)擷取出連續而不中斷的音高特徵曲線。同時我們也用維特比譯碼器(viterbi decoding)進行強制對位(forced alignment)切出音素。
接下來取出非氣音的部分進行韻律轉換,合成方式是在時域上作處理,利用基週同步疊加法(PSOLA, pitch synchronous overlap and add)調整基週頻率以達成聲調轉換。用波形相似性疊加法(WSOLA, waveform similarity overlap and add)調整音長。再將原始語音與目標語音的每個基週標位到下一個基週標位的間隔當成音框大小(frame size),求取其音量,兩者進行線性轉換(linear mapping)之後,調整原始語音的音量以達成音量轉換。
我們還嘗試了另一種基頻擷取與合成的方法STRAIGHT(speech transformation and representation using adaptive interpolation of weighted spectrum)進行聲調轉換,並比較STRAIGHT與PSOLA兩種不同方法的合成結果。STRAIGHT可以達到比較好的合成效果,但是會花費較長的程式執行時間。
This thesis develops a prosody conversion system for Mandarin Chinese that performs (1) tone (pitch) conversion, (2) duration conversion, and (3) volume (energy) conversion.
UPDUDP (unbroken pitch determination using dynamic programming) is first applied to each utterance for pitch tracking: the time-domain AMDF (average magnitude difference function) feature is computed over the whole utterance, and dynamic programming (DP) then extracts a continuous, unbroken pitch contour. At the same time, each utterance is segmented into phonemes by forced alignment with Viterbi decoding.
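The idea of combining AMDF with dynamic programming can be sketched as follows. This is a minimal illustration, not the thesis's UPDUDP implementation: the function names, the `penalty` weight, and the absolute-difference transition cost are all assumptions made for the example. Each frame gets an AMDF curve over candidate lags, and DP picks one lag per frame so that the summed AMDF values plus a penalty on lag jumps between neighboring frames is smallest, which yields a contour with no gaps.

```python
def amdf(frame, min_lag, max_lag):
    """Average magnitude difference of one frame for each lag in [min_lag, max_lag]."""
    n = len(frame)
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n - lag)) / (n - lag)
        for lag in range(min_lag, max_lag + 1)
    ]


def track_pitch(frames, sample_rate, min_lag, max_lag, penalty=0.01):
    """Choose one lag per frame by DP and return the pitch contour in Hz.

    Path cost = sum of AMDF values + penalty * |lag change| between frames
    (an assumed smoothness term; the cost used by UPDUDP may differ).
    """
    costs = [amdf(f, min_lag, max_lag) for f in frames]
    n_lags = max_lag - min_lag + 1
    total = list(costs[0])      # best accumulated cost ending at each lag
    back = []                   # back-pointers, one list per frame after the first
    for t in range(1, len(costs)):
        new_total, pointers = [], []
        for j in range(n_lags):
            best = min(range(n_lags), key=lambda k: total[k] + penalty * abs(j - k))
            new_total.append(total[best] + penalty * abs(j - best) + costs[t][j])
            pointers.append(best)
        total = new_total
        back.append(pointers)
    # Trace the cheapest path backwards from the last frame.
    j = min(range(n_lags), key=lambda k: total[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [sample_rate / (min_lag + j) for j in path]
```

Because every frame is assigned a lag, the result never contains missing pitch values; a larger `penalty` gives a smoother contour at the cost of following fast pitch changes less closely.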
The voiced parts of each utterance are then extracted for prosody conversion, and synthesis is carried out in the time domain. PSOLA (pitch synchronous overlap and add) adjusts the fundamental frequency to achieve tone conversion, and WSOLA (waveform similarity overlap and add) adjusts duration. For volume conversion, the interval from each pitch mark to the next, in both the source and the target speech, is taken as the frame size. The volume of each frame is computed, a linear mapping is applied between the two, and the volume of the source speech is adjusted accordingly.
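The volume conversion step can be sketched as below. The abstract does not spell out the details, so this example rests on assumptions: volume is taken as the mean absolute amplitude of a frame, "linear mapping" is read as mapping source frame indices linearly onto target frame indices, and the names `frame_volumes` and `convert_volume` are hypothetical. Pitch marks are given as strictly increasing sample indices.

```python
def frame_volumes(signal, pitch_marks):
    """Mean absolute amplitude of each frame between consecutive pitch marks."""
    return [
        sum(abs(s) for s in signal[a:b]) / (b - a)
        for a, b in zip(pitch_marks, pitch_marks[1:])
    ]


def convert_volume(source, source_marks, target, target_marks):
    """Scale each source frame so its volume matches the mapped target frame."""
    src_vol = frame_volumes(source, source_marks)
    tgt_vol = frame_volumes(target, target_marks)
    out = list(source)
    for i, (a, b) in enumerate(zip(source_marks, source_marks[1:])):
        # Assumed linear index mapping from source frame i to target frame j.
        j = round(i * (len(tgt_vol) - 1) / max(len(src_vol) - 1, 1))
        gain = tgt_vol[j] / src_vol[i] if src_vol[i] > 0 else 1.0
        for n in range(a, b):
            out[n] = source[n] * gain
    return out
```

Samples outside the first and last pitch mark are left unchanged, and silent source frames keep a gain of 1 to avoid division by zero.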
We also tried another pitch extraction and speech synthesis method, STRAIGHT (speech transformation and representation using adaptive interpolation of weighted spectrum), for tone conversion, and compared its synthesized results with those of PSOLA. STRAIGHT achieves better synthesis quality than PSOLA but requires more computation time.
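For the time-domain side of this comparison, a much simplified PSOLA-style pitch modification might look like the sketch below. It is an illustration under assumptions, not the thesis's code: `psola_pitch_shift` and `pitch_scale` are hypothetical names, pitch marks are assumed to be given as increasing sample indices (at least two), and duration is preserved simply by reusing or skipping the nearest analysis mark. Each output grain is a two-period, Hann-windowed segment centered on an analysis pitch mark and overlap-added at a new synthesis mark.

```python
import math


def psola_pitch_shift(signal, marks, pitch_scale):
    """Overlap-add two-period grains at synthesis marks spaced period / pitch_scale apart."""
    out = [0.0] * len(signal)
    t = float(marks[0])                       # current synthesis mark position
    while t <= marks[-1]:
        # Nearest analysis pitch mark to the current synthesis position.
        k = min(range(len(marks)), key=lambda i: abs(marks[i] - t))
        if k + 1 < len(marks):
            period = marks[k + 1] - marks[k]
        else:
            period = marks[k] - marks[k - 1]
        center = marks[k]
        pos = int(round(t))
        for n in range(-period, period + 1):
            src, dst = center + n, pos + n
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                window = 0.5 * (1.0 + math.cos(math.pi * n / period))  # Hann weight
                out[dst] += signal[src] * window
        t += period / pitch_scale             # closer marks give a higher pitch
    return out
```

With `pitch_scale = 1.0` and evenly spaced marks, the overlapping Hann windows sum to one, so the signal between the first and last mark is reproduced unchanged; values above 1 raise the pitch and values below 1 lower it.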