Graduate Student: | 卓引平 Cho, Yin-Ping |
Thesis Title: | 基於降噪擴散 Wasserstein 對抗式神經網路之中文歌聲合成 (Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN) |
Advisor: | 劉奕汶 Liu, Yi-Wen |
Oral Defense Committee: | 丁川康 Ting, Chuan-Kang; 吳尚鴻 Wu, Shan-Hung; 曹昱 Tsao, Yu; 王新民 Wang, Hsin-Min |
Degree: | 碩士 Master |
Department: | 電機資訊學院 College of Electrical Engineering and Computer Science - 電機工程學系 Department of Electrical Engineering |
Year of Publication: | 2022 |
Academic Year of Graduation: | 111 |
Language: | English |
Number of Pages: | 69 |
Chinese Keywords: | 深度神經網路 (deep neural network)、對抗式神經網路 (generative adversarial network)、降噪擴散機率模型 (denoising diffusion probabilistic model)、歌聲合成 (singing voice synthesis) |
English Keywords: | deep neural network, Wasserstein generative adversarial network, denoising diffusion probabilistic model, singing voice synthesis |
The task of singing voice synthesis is to have a computer synthesize, from a given musical score, an audio signal that approximates human singing, with naturalness and rich expressiveness as the primary goals. This work adopts the acoustic model-neural vocoder architecture that has been widely used and empirically proven effective in speech and singing voice synthesis. To surpass existing models in perceived naturalness, the acoustic model is designed by combining two architectures: the denoising diffusion probabilistic model and the Wasserstein generative adversarial network. The denoising diffusion probabilistic model is a generative model that has drawn attention in recent years, and its derived architectures have achieved state-of-the-art results across a variety of generation tasks. On top of the denoising diffusion model, this work adds a Wasserstein GAN and designs a discriminator conditioned on musical-score information; by having the discriminator approximate the true data-generating distribution of the training data, it replaces the usual minimum-reconstruction-error loss and serves as a means of enhancing the detail and variation of the acoustic features. The acoustic model is paired with a HiFiGAN neural vocoder, and this work proposes integrating the two for joint fine-tuning, yielding an end-to-end singing voice synthesis system. The system is trained and evaluated on the Mpop600 multi-singer Mandarin singing voice dataset, collected and annotated by our lab in this line of research. In experiments, the proposed system obtained richer musicality and more high-frequency detail in subjective listening tests than previous landmark acoustic model architectures. Moreover, the proposed acoustic model trains stably without any minimum-reconstruction-error objective, demonstrating the convergence of the proposed combination of the denoising diffusion model and the Wasserstein GAN, an advantage over other GAN-based singing voice synthesis systems in this respect.
Singing voice synthesis (SVS) is the computer production of a human-like singing voice from given musical scores. In particular, an SVS system aims to generate singing voices as natural and expressive as those of human singers. To accomplish end-to-end SVS tasks effectively and efficiently, this work adopts the acoustic model-neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work pursues a higher level of expressiveness in the synthesized voices by combining a denoising diffusion probabilistic model (DDPM) and a Wasserstein generative adversarial network (WGAN) to construct the backbone of the acoustic model. With the DDPM formulated in an adversarial training setup, the proposed acoustic model enhances the details and variations of the synthesized acoustic features by replacing the DDPM's usual acoustic-feature reconstruction objectives with the true data-generating distribution estimated by a musical-score-conditioned discriminator. On top of the proposed acoustic model, a HiFiGAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. The end-to-end system was evaluated on the multi-singer Mpop600 Mandarin singing voice dataset produced by our lab in this line of research. In the experiments, the proposed system shows improvements over previous landmark counterparts in musical expressiveness and high-frequency acoustic detail. Moreover, the adversarial acoustic model converged stably without enforcing reconstruction objectives, which demonstrates the convergence stability of the proposed combined DDPM-WGAN architecture over alternative GAN-based SVS systems.
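For reference, the two building blocks named above have standard formulations; the following is a minimal sketch in conventional notation (the symbols \(x_t\), \(\beta_t\), \(T\), \(c\), \(G\), and \(D\) are assumed here and are not defined in this abstract), not the thesis's exact conditioned objective. A DDPM corrupts an acoustic-feature frame \(x_0\) through a fixed Gaussian forward process, while a WGAN trains the generator against a 1-Lipschitz critic:
\[
q(x_t \mid x_{t-1}) \;=\; \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad t = 1,\dots,T,
\]
\[
\min_{G}\ \max_{D:\ \lVert D\rVert_{L}\le 1}\
\mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\!\big[D(x_0 \mid c)\big]
\;-\;
\mathbb{E}_{\hat{x}_0 \sim p_{G}}\!\big[D(\hat{x}_0 \mid c)\big].
\]
In the setting described above, the critic \(D\) is conditioned on the musical-score information \(c\), so this adversarial term stands in for the usual minimum-reconstruction-error loss on the denoised acoustic features.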