| Graduate Student: | Pei-Chi Jao (饒珮綺) |
| --- | --- |
| Thesis Title: | A Study on Initial/Final Duration Prediction and Energy Modeling for Corpus-based Mandarin Singing Voice Synthesis |
| Advisor: | Jyh-Shing Roger Jang (張智星) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication: | 2007 |
| Academic Year: | 95 (2006-2007) |
| Language: | English |
| Pages: | 39 |
| Keywords: | Singing Voice Synthesis, Initial/Final (I/F) Duration Prediction, Energy Modeling |
In this research, we propose several effective methods for initial/final (I/F) duration prediction and energy modeling for corpus-based Mandarin singing voice synthesis (SVS). Our goal is to improve the clarity and naturalness of the synthesized singing voices.
Firstly, the framework of the I/F duration prediction model is presented. We construct an individual I/F duration prediction model for each category of Mandarin consonants, using both linguistic/phonetic attributes and music-score information as input features, and employ support vector machine (SVM) regression to train each model.
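To make the input side of this model concrete, the sketch below assembles a per-syllable feature record from phonetic attributes and music-score information, and splits a note's duration into initial and final parts. The consonant grouping, feature names, and per-category fractions are invented stand-ins for illustration; in the thesis, the prediction would come from a trained per-category SVM regressor, not a fixed table.

```python
# Illustrative grouping of Mandarin initials into consonant categories;
# the thesis trains one duration model per category, but this grouping
# and every number below are assumptions for the sketch.
CATEGORY = {
    "b": "stop", "p": "stop", "d": "stop", "t": "stop", "g": "stop", "k": "stop",
    "f": "fricative", "s": "fricative", "sh": "fricative", "x": "fricative",
    "h": "fricative",
    "z": "affricate", "c": "affricate", "zh": "affricate", "ch": "affricate",
    "j": "affricate", "q": "affricate",
    "m": "nasal", "n": "nasal", "l": "liquid", "r": "liquid",
}

# Stand-in for the trained SVM regressor: a fixed fraction of the note
# duration per consonant category (fractions are illustrative only).
INITIAL_FRACTION = {"stop": 0.08, "fricative": 0.15, "affricate": 0.18,
                    "nasal": 0.12, "liquid": 0.10}

def make_features(initial, tone, note_pitch, note_ms):
    """One example: phonetic attributes plus music-score information."""
    return {"category": CATEGORY[initial], "tone": tone,
            "pitch": note_pitch, "note_ms": note_ms}

def split_note(initial, note_ms):
    """Predict the initial's duration; the final takes the remainder."""
    ini = INITIAL_FRACTION[CATEGORY[initial]] * note_ms
    return ini, note_ms - ini
```

For example, `split_note("s", 500)` allots 75 ms to the fricative /s/ and the remaining 425 ms to the final; modeling the initial explicitly and letting the final absorb the rest of the note is what keeps a sung syllable intelligible when note durations stretch far beyond spoken ones.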
Secondly, three methods for energy modeling are proposed. In the first method, an identical volume is assigned to every syllable. In the second method, classification and regression trees (CART) trained on the same features as the I/F duration prediction are used to predict the energy. In the third method, a rule-based approach modifies the energy according to different combinations of pitch and duration.
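The second and third methods can be sketched as follows: a minimal from-scratch regression tree (leaves store the mean energy, splits minimize the sum of squared errors, as in CART) standing in for a CART library, and a toy rule function for the pitch/duration adjustment. The toy data, rule thresholds, and scaling factors are invented for illustration and are not the thesis's actual rules.

```python
def sse(ys):
    """Sum of squared errors around the mean (CART regression impurity)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(X, y):
    """Return (gain, feature_index, threshold) of the best binary split."""
    best, base = None, sse(y)
    for f in range(len(X[0])):
        vals = sorted(set(row[f] for row in X))
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2  # midpoint between adjacent feature values
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            gain = base - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, thr)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_leaf=2):
    """Recursively grow a regression tree; leaves store the mean target."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return sum(y) / len(y)
    split = best_split(X, y)
    if split is None or split[0] <= 0:
        return sum(y) / len(y)
    _, f, thr = split
    L = [i for i, row in enumerate(X) if row[f] <= thr]
    R = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            build_tree([X[i] for i in L], [y[i] for i in L],
                       depth + 1, max_depth, min_leaf),
            build_tree([X[i] for i in R], [y[i] for i in R],
                       depth + 1, max_depth, min_leaf))

def predict(tree, x):
    """Walk from the root to a leaf and return its stored mean."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def rule_adjust(energy, pitch, note_ms):
    """Sketch of the third, rule-based method: boost long high notes and
    attenuate short low ones. Thresholds and factors are invented."""
    if pitch >= 70 and note_ms >= 450:
        return energy * 1.1
    if pitch < 65 and note_ms < 350:
        return energy * 0.9
    return energy
```

Trained on toy (pitch, duration) pairs where higher notes carry more energy, the tree recovers the two regimes: for `X = [[60, 300], [62, 320], [64, 280], [70, 500], [72, 480], [74, 520]]` and `y = [0.40, 0.42, 0.41, 0.80, 0.82, 0.81]`, `predict(build_tree(X, y), [63, 300])` returns the low-register mean 0.41.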
Finally, several experiments and listening tests are conducted to demonstrate the feasibility of the proposed methods. The experimental results indicate that our methods are able to improve both the clarity and naturalness of the synthesized singing voices.