Graduate Student: 廖亭鈞 Liao, Ting-Chun
Thesis Title: Automatic Mandarin Speech-to-singing Conversion via the World Vocoder
Advisor: 劉奕汶 Liu, Yi-Wen
Committee Members: 鄭博泰 Cheng, Po-Tai; 蘇豐文 Soo, Feng-Wen; 陳宜欣 Chen, Yi-Hsin
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Publication Year: 2019
Academic Year of Graduation: 108
Language: Chinese
Pages: 46
Keywords: speech-to-singing conversion, Mandarin, World vocoder
Singing voice synthesis has been a popular research area in recent years: given lyrics and a melody, a system automatically generates singing in a target speaker's voice. If only the song's information and the target speaker's vocal characteristics were needed to synthesize that person singing the song, the cost in time and money would drop substantially; in everyday use, such a system could also give creators a rough outline of their work that they could then adjust and refine. Given a short recording of continuous Mandarin speech and a song melody, this thesis builds an automatic Mandarin speech-to-singing system by adjusting and recomputing the parameters that the World vocoder extracts from the speech. The system has three parts. The first is a syllable segmentation subsystem, which automatically determines the start and end time of each character in the speech. The second is a time-scale modification subsystem, which stretches each Mandarin character in the time domain while keeping the result perceptually undistorted; in subjective listening tests, the proposed LI-VD and ODP methods scored 4.43 and 3.73 (out of 5), respectively, mitigating the mechanical sound and unnatural articulation caused by stretching speech. The third is a pitch modification subsystem, which changes the pitch by adjusting the frequency parameter extracted by the vocoder. After processing by the three subsystems, the modified parameters are fed back into the World vocoder to synthesize the song sung in the target speaker's voice.
Singing voice synthesis has become a popular research area in recent years. Given lyrics and a melody as input, a system can automatically synthesize a target singer's voice. If the only requirements for generating someone's singing of a given song were information about the song and that person's speech features, the cost in time and money would decrease substantially. In everyday use, such a system could also help users refine their creative work. This thesis proposes a Mandarin speech-to-singing system: by adjusting and recomputing the parameters that the World vocoder extracts from input speech, it generates a synthesized singing voice. The proposed system consists of three parts. The first is the syllable segmentation subsystem, which automatically determines the start and end time of each word in the speech. The second is the time-scale modification subsystem, which stretches or shortens each Mandarin syllable in the time domain while minimizing perceptible distortion. In subjective listening tests, the proposed LI-VD and ODP methods received average scores of 4.43 and 3.73 (out of 5), respectively; LI-VD removed the mechanical sound and ODP improved the unnatural pronunciation of stretched Mandarin speech. The third is the pitch-shifting subsystem, which changes the pitch by adjusting the frequency parameter extracted by the vocoder. After modification by the three subsystems, the World vocoder synthesizes the target singer's singing voice.
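The core idea of the last two subsystems — stretch each syllable's vocoder parameters to the note's duration, then move its fundamental frequency onto the melody — can be sketched in a few lines of numpy. This is a minimal illustration under simplifying assumptions, not the thesis's LI-VD or ODP algorithms: it uses plain linear interpolation for time stretching and a mean shift for pitch; the function names and the (MIDI note, frame count) melody format are hypothetical. In a full system the modified f0, together with the spectral envelope and aperiodicity, would be passed back to the WORLD vocoder for waveform synthesis.

```python
import numpy as np

def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def stretch_frames(frames, target_len):
    """Time-stretch a per-frame parameter track to target_len frames by
    linear interpolation (a simple stand-in for the thesis's
    time-scale modification subsystem)."""
    src = np.linspace(0.0, 1.0, num=len(frames))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, frames)

def speech_to_singing_f0(syllable_f0s, melody):
    """For each spoken syllable's f0 contour (Hz per frame), stretch it to
    the note's duration in frames and re-center it on the note's pitch,
    keeping the contour's relative fluctuation around its mean."""
    out = []
    for f0, (note, n_frames) in zip(syllable_f0s, melody):
        stretched = stretch_frames(f0, n_frames)
        # shift the contour so its mean lands on the target pitch
        out.append(stretched - stretched.mean() + midi_to_hz(note))
    return np.concatenate(out)
```

For example, a three-frame spoken syllable around 200 Hz mapped to an A4 half note of six frames yields a six-frame contour whose mean is exactly 440 Hz, while the speaker's natural f0 wobble is preserved on top of the note.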