| Field | Value |
|---|---|
| Graduate Student | GIANG, KIM-LAM (江甘霖) |
| Thesis Title | Multi-Speaker Singing Voice Synthesis from Musical Scores using Voice Conversion (使用語音轉換從樂譜資訊進行多說話者的歌聲合成) |
| Advisor | Soo, Von-Wun (蘇豐文) |
| Committee Members | Liu, Rey-Long (劉瑞瓏); Hu, Min-Chun (胡敏君) |
| Degree | Master |
| Department | Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science |
| Publication Year | 2021 |
| Academic Year | 109 |
| Language | English |
| Pages | 28 |
| Keywords | Singing Synthesis, Voice Conversion, Deep Learning |
In this thesis, we propose a method for synthesizing multiple speakers' Chinese singing voices from musical scores. Our approach combines a concatenative singing synthesizer with a multi-speaker voice conversion model. We first use the concatenative synthesizer to produce a virtual singing voice from MusicXML, a symbolic representation of music, and then apply the voice conversion model to convert that virtual voice into different speakers' timbres. To realize the concatenative synthesizer, we prerecord a complete set of Chinese syllable pronunciations; for voice conversion, we employ StarGAN-VC to convert between speakers. Our experiments use speech data from eight speakers in a Chinese corpus together with the prerecorded virtual voice mentioned above. An objective evaluation and a subjective evaluation show that the proposed method performs well in the voice conversion stage, enabling it to generate multiple distinguishable voices.
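The two-stage pipeline described in the abstract — score-to-voice concatenative synthesis followed by speaker conversion — can be sketched in simplified form. This is an illustrative Python sketch, not the thesis implementation: the function names (`extract_notes`, `synthesize_virtual`, `convert_voice`) are hypothetical, and the prerecorded syllable waveforms and the StarGAN-VC model are replaced by symbolic placeholders so that only the data flow is shown.

```python
import xml.etree.ElementTree as ET

# A tiny MusicXML fragment: two notes carrying Mandarin syllables as lyrics.
MUSICXML = """
<score-partwise>
 <part id="P1">
  <measure number="1">
   <note>
    <pitch><step>C</step><octave>4</octave></pitch>
    <duration>4</duration>
    <lyric><text>ni</text></lyric>
   </note>
   <note>
    <pitch><step>E</step><octave>4</octave></pitch>
    <duration>4</duration>
    <lyric><text>hao</text></lyric>
   </note>
  </measure>
 </part>
</score-partwise>
"""

STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi_number(step, octave):
    """Convert a MusicXML pitch (step + octave) to a MIDI note number (C4 = 60)."""
    return 12 * (octave + 1) + STEP_TO_SEMITONE[step]

def extract_notes(xml_text):
    """Parse the score into (syllable, pitch, duration) events."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "syllable": note.findtext("lyric/text"),
            "step": note.findtext("pitch/step"),
            "octave": int(note.findtext("pitch/octave")),
            "duration": int(note.findtext("duration")),
        }
        for note in root.iter("note")
    ]

def synthesize_virtual(notes):
    """Stage 1 (concatenative synthesis): in the thesis, each syllable selects a
    prerecorded Mandarin pronunciation unit that is adjusted to the score pitch;
    here a unit is represented symbolically as a (syllable, MIDI pitch) pair."""
    return [(n["syllable"], midi_number(n["step"], n["octave"])) for n in notes]

def convert_voice(virtual_voice, target_speaker):
    """Stage 2 (voice conversion): placeholder standing in for StarGAN-VC,
    which maps the virtual voice's timbre to the target speaker."""
    return [(target_speaker, syl, pitch) for syl, pitch in virtual_voice]

notes = extract_notes(MUSICXML)
virtual = synthesize_virtual(notes)        # [('ni', 60), ('hao', 64)]
converted = convert_voice(virtual, "speaker_3")
```

Because the virtual voice is always the same prerecorded singer, the conversion stage only ever needs to learn mappings from that one source timbre to each target speaker, which is what makes a many-to-many model like StarGAN-VC a natural fit here.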