| Field | Value |
|---|---|
| Graduate Student | 李依哲 (Lee, Yi-Jhe) |
| Thesis Title | 基於雙向時間遞歸神經網路之中文歌聲合成 (Mandarin Singing Voice Synthesis Based on Bidirectional LSTM Recurrent Neural Networks) |
| Advisor | 劉奕汶 (Liu, Yi-Wen) |
| Committee Members | 吳誠文 (Wu, Cheng-Wen), 吳尚鴻 (Wu, Shan-Hung), 楊奕軒 (Yang, Yi-Hsuan) |
| Degree | Master |
| Department | 電機資訊學院 College of Electrical Engineering and Computer Science – Department of Electrical Engineering |
| Year of Publication | 2019 |
| Graduation Academic Year | 108 (ROC calendar) |
| Language | English |
| Pages | 49 |
| Keywords (Chinese) | 歌聲合成, 歌聲資料庫, 聲調語言 |
| Keywords (English) | Singing voice synthesis, Singing voice database, Tone language |
Abstract (translated from the Chinese): Singing voice synthesis is a technique for generating singing voice from a musical score, and many synthesis methods have been proposed for languages such as Japanese, English, Korean, and Spanish. Unlike these languages, however, Mandarin is a tone language, meaning that tones affect word meaning; research on Mandarin singing voice synthesis remains relatively scarce, and no public database suitable for Mandarin singing voice synthesis exists yet. In this thesis, we therefore first build a Mandarin singing voice database, and then propose and implement a singing voice synthesis framework based on bidirectional recurrent neural networks. The database contains 600 Mandarin pop songs performed by four experienced singers, with the linguistic and musical information annotated semi-automatically. In the synthesis framework, we propose a tone-aware composition of contextual factors as the input to the synthesis system. We also propose a parametric rhythm-modeling method that can handle segments containing rests or breaths. In addition, the framework generates Mandarin singing after training separate models for the fundamental frequency (F0) and for the other acoustic features. Finally, inspection of the generated acoustic features shows that the proposed models can reproduce the F0 characteristics of human singing, and that pronunciation and timbre are also learned. According to objective evaluations, taking Mandarin tones into account improves the synthesis results, and the BiLSTM-based generative model outperforms models based on deep neural networks or unidirectional recurrent neural networks.
Abstract (English): A singing voice synthesis system generates singing voice from a given musical score. Several approaches have been proposed to synthesize singing in Japanese, English, Korean, Spanish, and other languages. Unlike those languages, however, Mandarin is tonal: the linguistic tone of a syllable can change the meaning of a word. In addition, Mandarin singing voice synthesis has rarely been investigated compared to other languages, and no appropriate database for Mandarin singing voice synthesis is publicly available yet. In this research, we create a Mandarin singing voice database and present a Mandarin singing voice synthesis system based on a bidirectional long short-term memory recurrent neural network (BiLSTM). The database consists of 600 Mandarin pop songs sung by 4 experienced singers, with the linguistic and musical information annotated semi-automatically. In the proposed synthesis framework, we design a new set of contextual factor definitions that take linguistic tonality into account, which serve as the input to the synthesis system. We then propose a new parametric method for modeling rhythm and tempo, including additional rests. Next, the singing voice is learned and generated by training models of the fundamental frequency (F0) and of the other acoustic features separately. Finally, inspection of the generated acoustic features shows that the proposed system can model the characteristics of humanlike F0 fluctuation, and that the pronunciation and timbre are also learned by the models. According to the objective evaluations, taking linguistic tonality into account produces better results, and the model based on the BiLSTM outperforms models based on a deep neural network (DNN) or a unidirectional LSTM.
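As a reading aid only, and not the thesis's actual implementation, the following is a minimal PyTorch sketch of the separate-model design the abstract describes: one BiLSTM maps frame-level contextual factors (including a Mandarin tone feature) to F0, and a second BiLSTM maps the same factors to the remaining acoustic features. The input layout, all dimensions, and the `BiLSTMRegressor` name are illustrative assumptions.

```python
# Sketch of the two-model BiLSTM design from the abstract (assumed details).
import torch
import torch.nn as nn

class BiLSTMRegressor(nn.Module):
    """Frame-level regressor: contextual factors -> acoustic targets."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)  # 2x for both directions

    def forward(self, x):           # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)          # (batch, frames, 2 * hidden_dim)
        return self.proj(h)         # (batch, frames, out_dim)

# Hypothetical contextual-factor layout per frame: phoneme identity (one-hot),
# Mandarin tone of the current syllable (tones 1-4 plus neutral tone),
# and note pitch/position features.
IN_DIM = 60 + 5 + 2

f0_model   = BiLSTMRegressor(IN_DIM, 128, out_dim=1)   # log-F0 per frame
spec_model = BiLSTMRegressor(IN_DIM, 256, out_dim=60)  # e.g., spectral features

x = torch.randn(8, 400, IN_DIM)    # 8 utterances, 400 frames each
f0_hat = f0_model(x)               # (8, 400, 1)
spec_hat = spec_model(x)           # (8, 400, 60)
```

In such a setup, each model would typically be trained with a frame-wise regression loss against acoustic features extracted from the recorded singing, and the predicted F0 and spectral features would be recombined by a vocoder at synthesis time.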