| Graduate Student: | Pei-Chi Jao (饒珮綺) |
| --- | --- |
| Thesis Title: | A Study on Initial/Final Duration Prediction and Energy Modeling for Corpus-based Mandarin Singing Voice Synthesis |
| Advisor: | Jyh-Shing Roger Jang (張智星) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication: | 2007 |
| Academic Year: | 95 (2006-2007) |
| Language: | English |
| Pages: | 39 |
| Keywords: | Singing Voice Synthesis, Initial/Final (I/F) Duration Prediction, Energy Modeling |
In this research, we propose several effective methods for initial/final (I/F) duration prediction and energy modeling for corpus-based Mandarin singing voice synthesis (SVS). Our goal is to improve the clarity and naturalness of the synthesized singing voices.
Firstly, the framework of the I/F duration prediction model is presented. We construct an individual I/F duration prediction model for each category of Mandarin consonants, using both linguistic/phonetic attributes and music-score information as input features, and employ support vector machine (SVM) regression to train each model.
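To make the input side of this model concrete, the sketch below assembles a per-syllable feature record from phonetic attributes and music-score information, and splits a note's duration into initial and final parts. The consonant grouping, feature names, and per-category fractions are invented stand-ins for illustration; in the thesis, the prediction would come from a trained per-category SVM regressor, not a fixed table.

```python
# Illustrative grouping of Mandarin initials into consonant categories;
# the thesis trains one duration model per category, but this grouping
# and every number below are assumptions for the sketch.
CATEGORY = {
    "b": "stop", "p": "stop", "d": "stop", "t": "stop", "g": "stop", "k": "stop",
    "f": "fricative", "s": "fricative", "sh": "fricative", "x": "fricative",
    "h": "fricative",
    "z": "affricate", "c": "affricate", "zh": "affricate", "ch": "affricate",
    "j": "affricate", "q": "affricate",
    "m": "nasal", "n": "nasal", "l": "liquid", "r": "liquid",
}

# Stand-in for the trained SVM regressor: a fixed fraction of the note
# duration per consonant category (fractions are illustrative only).
INITIAL_FRACTION = {"stop": 0.08, "fricative": 0.15, "affricate": 0.18,
                    "nasal": 0.12, "liquid": 0.10}

def make_features(initial, tone, note_pitch, note_ms):
    """One example: phonetic attributes plus music-score information."""
    return {"category": CATEGORY[initial], "tone": tone,
            "pitch": note_pitch, "note_ms": note_ms}

def split_note(initial, note_ms):
    """Predict the initial's duration; the final takes the remainder."""
    ini = INITIAL_FRACTION[CATEGORY[initial]] * note_ms
    return ini, note_ms - ini
```

For example, `split_note("s", 500)` allots 75 ms to the fricative /s/ and the remaining 425 ms to the final; modeling the initial explicitly and letting the final absorb the rest of the note is what keeps a sung syllable intelligible when note durations stretch far beyond spoken ones.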
Secondly, three methods for energy modeling are proposed. In the first method, an identical volume is assigned to every syllable. In the second method, classification and regression trees (CART) trained on the same features as the I/F duration prediction are used to predict the energy. In the third method, a rule-based approach modifies the energy according to different combinations of pitch and duration.
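The second and third methods can be sketched as follows: a minimal from-scratch regression tree (leaves store the mean energy, splits minimize the sum of squared errors, as in CART) standing in for a CART library, and a toy rule function for the pitch/duration adjustment. The toy data, rule thresholds, and scaling factors are invented for illustration and are not the thesis's actual rules.

```python
def sse(ys):
    """Sum of squared errors around the mean (CART regression impurity)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(X, y):
    """Return (gain, feature_index, threshold) of the best binary split."""
    best, base = None, sse(y)
    for f in range(len(X[0])):
        vals = sorted(set(row[f] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2  # midpoint between adjacent feature values
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            gain = base - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, thr)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_leaf=2):
    """Recursively grow a regression tree; leaves store the mean target."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return sum(y) / len(y)
    split = best_split(X, y)
    if split is None or split[0] <= 0:
        return sum(y) / len(y)
    _, f, thr = split
    L = [i for i, row in enumerate(X) if row[f] <= thr]
    R = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            build_tree([X[i] for i in L], [y[i] for i in L],
                       depth + 1, max_depth, min_leaf),
            build_tree([X[i] for i in R], [y[i] for i in R],
                       depth + 1, max_depth, min_leaf))

def predict(tree, x):
    """Walk from the root to a leaf and return its stored mean."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def rule_adjust(energy, pitch, note_ms):
    """Sketch of the third, rule-based method: boost long high notes and
    attenuate short low ones. Thresholds and factors are invented."""
    if pitch >= 70 and note_ms >= 450:
        return energy * 1.1
    if pitch < 65 and note_ms < 350:
        return energy * 0.9
    return energy
```

Trained on toy (pitch, duration) pairs where higher notes carry more energy, the tree recovers the two regimes: for `X = [[60, 300], [62, 320], [64, 280], [70, 500], [72, 480], [74, 520]]` and `y = [0.40, 0.42, 0.41, 0.80, 0.82, 0.81]`, `predict(build_tree(X, y), [63, 300])` returns the low-register mean 0.41.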
Finally, several experiments and listening tests are conducted to demonstrate the feasibility of the proposed methods. The experimental results indicate that our methods are able to improve both the clarity and naturalness of the synthesized singing voices.