
Graduate student: Liao, Ting-Chun (廖亭鈞)
Thesis title: 透過World vocoder進行中文語音至歌聲之自動轉換
(Automatic Mandarin Speech-to-Singing Conversion via the World Vocoder)
Advisor: Liu, Yi-Wen (劉奕汶)
Committee members: Cheng, Po-Tai (鄭博泰); Soo, Feng-Wen (蘇豐文); Chen, Yi-Hsin (陳宜欣)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2019
Graduating academic year: 108
Language: Chinese
Number of pages: 46
Keywords: speech-to-singing conversion, Mandarin, World vocoder
  • Singing voice synthesis has been a popular research area in recent years: given lyrics, a melody, and related information, a system automatically generates singing in a target speaker's voice. If a song could be synthesized in someone's voice from only the song information and that person's vocal characteristics, the cost in time and money would drop substantially; in everyday use, such a system would also let users adjust and refine their creations, providing a rough outline of the finished work. This study builds an automatic Mandarin speech-to-singing system that takes a short stretch of continuous Mandarin speech and a song melody as input, and adjusts and operates on the parameters that the World vocoder extracts from the speech. The system has three parts. The first is a word segmentation subsystem, which automatically determines the start and end time of each word in the speech. The second is a time-scale modification subsystem, which lengthens each Mandarin syllable in the time domain while keeping the result perceptually undistorted; in subjective listening tests, the proposed LI-VD and ODP methods scored 4.43 and 3.73 out of 5, respectively, mitigating the mechanical sound and unnatural articulation caused by stretching speech. The third is a pitch modification subsystem, which changes pitch by adjusting the frequency parameter extracted by the vocoder. After processing by the three subsystems, the adjusted parameters are fed back into the World vocoder to synthesize the song sung in the speaker's voice.
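The first subsystem, determining each word's start and end time, can be illustrated with a generic short-time-energy segmenter. This is only a sketch under the assumption of energy-based activity detection; the thesis's actual segmentation rules (Chapter 3) are not reproduced in this record, and the frame, hop, and threshold values below are illustrative choices.

```python
import numpy as np

def segment_words(x, fs, frame_ms=25, hop_ms=10, thresh_ratio=0.1):
    """Return (start, end) sample indices of active regions found by
    short-time energy thresholding. A generic stand-in for the thesis's
    word segmentation; frame/hop sizes and the threshold are assumptions."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = max(1, 1 + (len(x) - frame) // hop)
    energy = np.array([np.sum(x[i * hop:i * hop + frame] ** 2) for i in range(n)])
    active = energy > thresh_ratio * energy.max()
    bounds, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # a run of active frames begins
        elif not a and start is not None:  # run just ended at frame i - 1
            bounds.append((start * hop, (i - 1) * hop + frame))
            start = None
    if start is not None:                  # run extends to the end of the signal
        bounds.append((start * hop, len(x)))
    return bounds
```

Each returned pair would then delimit one candidate syllable for the later stretching and pitch steps.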


    Singing voice synthesis has become a popular research area in recent years. Given lyrics and a melody as input, a system can automatically synthesize a target singer's voice. If generating someone's singing of a particular song required only information about the song and his or her speech features, the cost in time and money would drop substantially. In everyday use, such a system could also help users adjust and refine their creative works. This research proposes a Mandarin speech-to-singing system: by adjusting and operating on the parameters that the World vocoder extracts from input speech, it generates a synthesized singing voice. The proposed system consists of three parts. The first is the word segmentation subsystem, which automatically determines the start and end time of each word in the speech. The second is the time-scale modification subsystem, which lengthens or shortens each Mandarin syllable in the time domain while minimizing perceptible distortion; in subjective listening tests, the proposed LI-VD and ODP methods received average scores of 4.43 and 3.73 out of 5, respectively, with LI-VD removing the mechanical sound and ODP improving the unnatural pronunciation of stretched Mandarin speech. The third is the pitch-shifting subsystem, which changes pitch by adjusting the frequency parameter extracted by the vocoder. After modification by the three subsystems, the World vocoder synthesizes the target singer's singing voice.
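The pitch-shifting subsystem adjusts the vocoder's frequency parameter toward the melody. The thesis's exact pitch formula (Chapter 5) is not reproduced in this record; as a hedged illustration, a common realization maps each melody note to Hz with the standard MIDI equal-temperament formula and rescales the extracted F0 contour onto it:

```python
import numpy as np

def midi_to_hz(note):
    """Equal-temperament mapping: MIDI note 69 = A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def shift_f0_to_note(f0, note):
    """Scale an F0 contour so its log-mean lands on the target note while
    preserving natural micro-variation. Unvoiced frames (f0 == 0) are left
    untouched. This is an assumed scheme, not the thesis's exact formula."""
    voiced = f0 > 0
    out = f0.copy()
    if voiced.any():
        mean_f0 = np.exp(np.mean(np.log(f0[voiced])))   # geometric mean pitch
        out[voiced] = f0[voiced] * (midi_to_hz(note) / mean_f0)
    return out
```

Applying this per melody note, syllable by syllable, yields the modified F0 track that a vocoder can resynthesize.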

    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Motivation
      1.2  System architecture
      1.3  Literature review
      1.4  Chapter outline
    Chapter 2  The World vocoder
      2.1  Introduction
      2.2  Fundamental frequency
      2.3  Spectral envelope
        2.3.1  F0-adaptive windowing
        2.3.2  Smoothing of the power spectrum
        2.3.3  Liftering in the quefrency domain
      2.4  Aperiodicity parameters
    Chapter 3  Automatic Mandarin speech segmentation
      3.1  Introduction
      3.2  Mandarin speech segmentation method
      3.3  Experimental results
      3.4  Discussion
    Chapter 4  Automatic Mandarin speech lengthening
      4.1  Introduction
      4.2  System flow
      4.3  Mandarin speech lengthening method
      4.4  Building the optimal dividing point dictionary
        4.4.1  Suitable lengthening regions for Mandarin
        4.4.2  Compiling Mandarin pinyin data
        4.4.3  Recording environment, equipment, and procedure
        4.4.4  Euclidean distance
        4.4.5  Optimal dividing point
      4.5  Subjective listening test
      4.6  Results and discussion
    Chapter 5  Pitch modification
      5.1  Introduction
      5.2  Pitch formula
      5.3  Pitch modification results and discussion
    Chapter 6  Synthesis results and discussion
      6.1  Mandarin speech-to-singing synthesis results
      6.2  Subjective listening test results and discussion
    Chapter 7  Conclusion and future work
    References
    Appendix
    Figure 1.1  Architecture of Mandarin speech-to-singing conversion
    Figure 2.1  The World vocoder system
    Figure 2.2  Taking four wavelengths of a sine wave
    Figure 2.3  Fundamental frequency candidates
    Figure 2.4  Voiced-sound decision
    Figure 2.5  Example spectral envelope (SP) of speech
    Figure 2.6  Example aperiodicity (AP) of speech
    Figure 3.1  Automatic segmentation vs. ground-truth boundaries for 「當一陣風吹來」
    Figure 3.2  Automatic segmentation vs. ground-truth boundaries for 「風箏飛上天空」
    Figure 3.3  Automatic segmentation vs. ground-truth boundaries for 「蝴蝶眨幾次眼睛」
    Figure 4.1  Flow chart of Mandarin speech lengthening
    Figure 4.2  DTW path of utterance 1
    Figure 4.3  DTW path of utterance 2
    Figure 4.4  DTW path of utterance 3
    Figure 4.5  Optimal dividing point of the character 「幾」
    Figure 4.6  Optimal dividing point of the character 「次」
    Figure 4.7  The optimal dividing point dictionary file
    Figure 5.1  MIDI note number correspondence table
    Figure 5.2  F0 contour of 「蝴蝶眨幾次眼睛」 before pitch modification
    Figure 5.3  F0 contour of 「蝴蝶眨幾次眼睛」 after pitch modification
    Figure 6.1  Spectrogram of the speech 「蝴蝶眨幾次眼睛」
    Figure 6.2  Spectrogram of the synthesized singing 「蝴蝶眨幾次眼睛」
    Figure 6.3  Spectrogram of the speech 「當一陣風吹來」
    Figure 6.4  Spectrogram of the synthesized singing 「當一陣風吹來」
    Figure 6.5  Spectrogram of the speech 「風箏飛上天空」
    Figure 6.6  Spectrogram of the synthesized singing 「風箏飛上天空」
    Figure 6.7  Subjective listening results rated on three criteria
    Table 4.1  Mean scores and standard deviations of the DTW listening test
    Table 4.2  Subjective listening test results, part 1
    Table 4.3  Subjective listening test results, part 2
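The time-scale modification of Chapter 4 operates on vocoder parameter tracks rather than the raw waveform. As a hedged illustration (the LI-VD and ODP specifics live in the thesis, not in this record), stretching a per-syllable parameter track by linear interpolation of frame indices can be sketched as follows:

```python
import numpy as np

def stretch_frames(param, factor):
    """Time-stretch a vocoder parameter track (frames x dims) by linearly
    interpolating frame indices. A generic sketch of vocoder-domain
    stretching; it is not the thesis's LI-VD or ODP method."""
    n = param.shape[0]
    new_n = max(1, int(round(n * factor)))
    src = np.linspace(0.0, n - 1, new_n)          # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    w = (src - lo)[:, None]                        # blend weight per new frame
    return (1 - w) * param[lo] + w * param[hi]
```

Applied to the F0, spectral-envelope, and aperiodicity tracks of one syllable with the factor implied by the melody's note duration, this yields frames a vocoder can resynthesize at the new length.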

