
Graduate Student: 李依哲 (Lee, Yi-Jhe)
Thesis Title: 基於雙向時間遞歸神經網路之中文歌聲合成
(Mandarin Singing Voice Synthesis Based on Bidirectional LSTM Recurrent Neural Networks)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Oral Defense Committee: 吳誠文 (Wu, Cheng-Wen); 吳尚鴻 (Wu, Shan-Hung); 楊奕軒 (Yang, Yi-Hsuan)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2019
Graduation Academic Year: 108
Language: English
Number of Pages: 49
Chinese Keywords: 歌聲合成, 歌聲資料庫, 聲調語言
English Keywords: Singing voice synthesis, Singing voice database, Tone language
    Singing voice synthesis is a technique that generates singing voice from a musical score. Many synthesis methods have been proposed for languages such as Japanese, English, Korean, and Spanish. Unlike these languages, however, Mandarin is a tone language: the tone of a syllable affects word meaning. Research on Mandarin singing voice synthesis remains relatively scarce, and no public database suitable for Mandarin singing voice synthesis is available yet. In this thesis, we therefore first build a Mandarin singing voice database, and then propose and implement a singing voice synthesis framework based on bidirectional recurrent neural networks. The database collects 600 Mandarin pop songs sung by four experienced singers, with linguistic and musical information annotated semi-automatically. In the synthesis framework, we propose a set of contextual factors that takes Mandarin tones into account and serves as the input to the synthesis system. We also propose a parametric rhythm-modeling method that can represent rest and breath segments. The framework then learns the fundamental frequency and the other acoustic features with separate models and generates Mandarin singing voice. Finally, inspection of the generated acoustic features shows that the proposed models can reproduce the fundamental-frequency characteristics of human singing, and that pronunciation and timbre are also learned. According to the objective evaluation, synthesis results improve when Mandarin tones are taken into account, and the generative model based on bidirectional recurrent neural networks yields the best results compared with models based on deep neural networks or unidirectional recurrent neural networks.
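The abstract above describes tone-aware contextual factors that combine linguistic and musical information as the synthesizer's input. As a rough illustration only (the thesis defines its actual factor set in Appendix A.1), the sketch below shows one plausible way to assemble such an input vector; the phoneme inventory, field names, and dimensions are assumptions, not the author's definitions.

```python
# Illustrative only: one plausible encoding of tone-aware contextual factors.
# The phoneme inventory, tone coding, and musical fields below are assumptions,
# not the thesis's actual contextual factor definitions (see its Appendix A.1).
import numpy as np

PHONEMES = ["sil", "a", "i", "u", "e", "o", "b", "p", "m", "f"]  # toy inventory
NUM_TONES = 6  # tones 1-4, the neutral tone, and a "no tone" symbol for rests

def encode_context(phoneme: str, tone: int, note_number: int, note_dur_beats: float) -> np.ndarray:
    """Concatenate one-hot phoneme and tone codes with normalized musical features."""
    phone_vec = np.zeros(len(PHONEMES))
    phone_vec[PHONEMES.index(phoneme)] = 1.0
    tone_vec = np.zeros(NUM_TONES)
    tone_vec[tone] = 1.0  # index 0 reserved for "no tone" (e.g., rests)
    # MIDI note number scaled to [0, 1]; note duration in beats passed through.
    music_vec = np.array([note_number / 127.0, note_dur_beats])
    return np.concatenate([phone_vec, tone_vec, music_vec])

# Example: the onset /m/ of a tone-3 syllable, sung on MIDI note 64 for one beat.
x = encode_context("m", 3, 64, 1.0)
print(x.shape)  # (18,)
```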


    A singing voice synthesis system generates singing voice from a given musical score. Several approaches have been proposed to synthesize singing voice in Japanese, English, Korean, Spanish, and other languages. However, unlike these languages, Mandarin is tonal, which means that the lexical tone of a syllable affects word meaning. In addition, Mandarin singing voice synthesis has been investigated far less than synthesis for other languages, and no database suitable for Mandarin singing voice synthesis is publicly available yet. In this research, we create a Mandarin singing voice database and present a Mandarin singing voice synthesis system based on a bidirectional long short-term memory recurrent neural network (BiLSTM). The database consists of 600 Mandarin pop songs sung by four experienced singers, with linguistic and musical information annotated semi-automatically. In the proposed synthesis framework, we design a new set of contextual factor definitions that takes linguistic tonality into account and serves as the input to the synthesis system. We also propose a new parametric method for modeling rhythm and tempo, including additional rests. The singing voice is then generated by training separate models for the fundamental frequency (F0) and for the other acoustic features. Inspection of the generated acoustic features shows that the proposed system can model the characteristics of human-like F0 fluctuation, and that pronunciation and timbre are also learned by the models. According to the objective evaluations, taking linguistic tonality into account produces better results, and compared with models based on a deep neural network (DNN) or a unidirectional LSTM, the BiLSTM-based model performs best.
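The system described above trains separate BiLSTM models for F0 and for the remaining acoustic features, with contextual factor sequences as input. The following is a minimal sketch of such a model in Keras, assuming frame-level contextual-factor inputs; the layer sizes, output dimensions, and optimizer shown here are illustrative assumptions, not the configuration reported in Chapter 5 of the thesis.

```python
# Minimal sketch of a BiLSTM acoustic model, assuming frame-level contextual
# factors as input. Layer sizes, output dimensions, and the optimizer are
# illustrative assumptions, not the thesis's reported architecture.
import tensorflow as tf
from tensorflow.keras import layers

def build_bilstm_model(input_dim: int, output_dim: int) -> tf.keras.Model:
    """Map a sequence of contextual-factor vectors to acoustic feature frames."""
    inputs = tf.keras.Input(shape=(None, input_dim))  # variable-length frame sequence
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(output_dim))(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Two separate models, as the abstract describes: one for F0 and one for the
# other acoustic features (e.g., spectral envelope and aperiodicity parameters).
f0_model = build_bilstm_model(input_dim=18, output_dim=1)
acoustic_model = build_bilstm_model(input_dim=18, output_dim=60)
```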

    CHAPTER 1 Introduction 1
      1.1 Motivation 1
      1.2 Problem Statement 2
        1.2.1 Database creation 2
        1.2.2 Contextual factors definition 2
        1.2.3 Rhythm and tempo modeling 3
        1.2.4 Mandarin singing voice synthesis system 3
      1.3 Goals and Contributions 4
      1.4 Thesis Organization 4
    CHAPTER 2 Related Work 5
      2.1 Singing voice synthesis 5
        2.1.1 Performance Driven Approaches 5
        2.1.2 Unit Concatenation Approaches 5
        2.1.3 Statistical Approaches 6
      2.2 Rhythm and tempo modeling approaches 7
    CHAPTER 3 Database creation 9
      3.1 Design and recordings 10
      3.2 Labeling 11
        3.2.1 Singing voice segmentation 12
        3.2.2 Phoneme labeling 12
        3.2.3 Automatic singing voice transcription 12
      3.3 Selected dataset coverage 14
    CHAPTER 4 The proposed system 16
      4.1 System design 16
      4.2 Contextual factors 18
      4.3 Acoustic features 20
      4.4 Rhythm and tempo modeling: Preliminary design of methods 21
      4.5 F0 modeling 21
      4.6 Acoustic modeling 23
    CHAPTER 5 Experiments 24
      5.1 Training 24
        5.1.1 Training technique 24
        5.1.2 Architecture of the models 24
      5.2 Observations of example synthesis results 25
      5.3 Objective evaluation 26
        5.3.1 Comparison of different models 26
        5.3.2 Evaluation of the linguistic tonality influence 26
    CHAPTER 6 Results 27
      6.1 Observation of synthesis results 27
        6.1.1 F0 modeling 27
        6.1.2 Acoustic features modeling 30
      6.2 Objective evaluation 35
    CHAPTER 7 Conclusions 36
    CHAPTER 8 Future works 37
      8.1 Manual correction for the score 37
      8.2 Rhythm and tempo modeling 37
      8.3 Phoneme duration modeling 37
      8.4 A general singing voice synthesis system 38
    References 39
    Appendix 42
      A.1 Contextual factors for Mandarin 42

