| Author: | 詹秉宸 Chan, Ping-Chen |
|---|---|
| Thesis Title: | Virtual Singers Using Reinforcement Learning with Transfer Learning of a Human MOS Predictor (藉由轉移學習的人類平均意見分數預測器來增強學習的虛擬歌手) |
| Advisor: | 蘇豐文 Soo, Von-Wun |
| Committee Members: | 邱瀞德 Chiu, Ching-Te; 林桂傑 Lin, Kwei-Jay |
| Degree: | Master |
| Department: | Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2023 |
| Academic Year of Graduation: | 112 |
| Language: | English |
| Number of Pages: | 72 |
| Keywords: | singing voice synthesis, virtual singer, reinforcement learning, self-supervised learning, transfer learning, singing voice evaluation |
| Access Count: | Views: 55, Downloads: 0 |
In the traditional domain of virtual singers, singing voice synthesis models are typically employed to generate human-like singing; however, once such a model is established, altering the virtual singer's singing style is difficult. This work introduces a reinforcement learning method for the Multi-Layer Perceptron Singer (MLP Singer) synthesis model to improve the virtual singer's performance. The method lets the model improve the quality of its own synthesized singing, thereby producing more pleasing results. We first apply transfer learning, using features from a self-supervised model pre-trained on speech datasets to improve an existing Mean Opinion Score (MOS) predictor for singing. This predictor simulates human preference scores for singing quality and serves as the reward function that provides quality feedback during reinforcement learning. Through iterative interaction between the virtual singer and the MOS predictor, the virtual singer learns to maximize the reward, optimizing the decision parameters of singing synthesis.
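To make the described loop concrete, below is a minimal, hypothetical PyTorch sketch of the reward mechanism: a MOS predictor built on a self-supervised speech encoder (torchaudio's pre-trained WAV2VEC2_BASE bundle standing in for the transfer-learned predictor) and a REINFORCE-style update that nudges a stand-in synthesizer policy toward higher predicted MOS. The `SingerPolicy` class, the `synthesize` placeholder, the pooling choice, and all hyperparameters are illustrative assumptions, not the thesis's actual MLP Singer implementation.

```python
import torch
import torch.nn as nn
import torchaudio


class MOSPredictor(nn.Module):
    """Quality scorer: self-supervised speech features pooled into a scalar MOS estimate."""
    def __init__(self):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE       # SSL model pre-trained on speech, not singing
        self.encoder = bundle.get_model()
        self.head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, wav):                                # wav: (batch, samples) at 16 kHz
        feats, _ = self.encoder.extract_features(wav)
        pooled = feats[-1].mean(dim=1)                     # average last-layer features over time
        return self.head(pooled).squeeze(-1)               # predicted MOS per utterance


class SingerPolicy(nn.Module):
    """Stand-in for the synthesizer's tunable decision parameters: a Gaussian over
    acoustic controls, so sampled actions carry log-probabilities for REINFORCE."""
    def __init__(self, dim=80):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(dim))
        self.log_std = nn.Parameter(torch.zeros(dim))

    def sample(self):
        dist = torch.distributions.Normal(self.mean, self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum()


def synthesize(action):
    # Placeholder for the MLP Singer + vocoder stage: map sampled controls to audio.
    # Here it simply returns one second of silence with the expected shape.
    return torch.zeros(1, 16000)


mos_predictor = MOSPredictor().eval()     # assumed already fine-tuned on singing MOS labels
policy = SingerPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
baseline = 3.0                            # running-average reward baseline to reduce variance

for step in range(100):
    action, log_prob = policy.sample()
    wav = synthesize(action)
    with torch.no_grad():
        reward = mos_predictor(wav).item()        # predicted MOS acts as the reward
    loss = -(reward - baseline) * log_prob        # REINFORCE: raise log-prob of high-MOS actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    baseline = 0.9 * baseline + 0.1 * reward
```

In the thesis the synthesizer role is played by the MLP Singer model and the MOS predictor is fine-tuned on singing data; this sketch only illustrates how a learned quality score can close the reinforcement learning loop as a reward signal.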