Graduate Student: 盧人豪 (Lu, Jen-Hao)
Thesis Title: 日語與中文的跨語言歌唱口音適應:訓練策略的比較分析 (Cross-lingual Singing Accent Adaptation in Japanese and Mandarin: Comparative Analysis of Training Strategies)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Committee Members: 蘇宜青 (Su, Yi-Ching), 白明憲 (Bai, Ming-Sian), 簡仁宗 (Chien, Jen-Tzung)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2025
Academic Year of Graduation: 113
Language: English
Number of Pages: 81
Keywords (Chinese): 跨語言轉換、歌聲轉換、歌聲合成、口音轉換
Keywords (English): Cross-lingual conversion, singing voice synthesis, singing voice conversion, accent conversion
The goal of cross-lingual singing voice conversion is to switch the singing accent between languages while preserving the integrity of the musical content. This study focuses on multilingual singing voice synthesis, and in particular on how the training order affects accent conversion and naturalness. Building on a Variational Autoencoder (VAE) and flow-based architecture, we choose Japanese and Mandarin as the languages of study and design three training strategies: sequential training with Japanese first and Mandarin second, sequential training with Mandarin first and Japanese second, and joint training on the mixed Japanese and Mandarin data.
Subjective listening tests show that the training order affects accent adaptation. Sequential training excels at adapting to the later-learned language, but usually sacrifices some characteristics of the earlier-learned one; joint training is stable but lacks standout performance in any particular scenario. Objective evaluations further quantify phoneme accuracy, showing how the different training strategies affect phonetic details such as Mandarin retroflexes and Japanese long vowels.
This study demonstrates the multifaceted influence of training strategies on cross-lingual singing voice synthesis, with consistent trends across both subjective and objective metrics; joint training in particular provides stable cross-lingual singing accent adaptability. Future work will explore adjusting the data proportions, more accurate phoneme extraction models, and extension to more languages, to further improve the system's adaptability and cross-lingual synthesis performance.
Cross-lingual singing voice conversion aims to adapt singing accents across different languages while preserving the musical content. This study explores the application of artificial intelligence models in multilingual singing voice synthesis, focusing on the impact of training order on accent transfer and naturalness. Experiments were conducted on a Variational Autoencoder (VAE) and flow-based framework, with Japanese and Mandarin as the target languages and three training strategies: sequential training with Japanese followed by Mandarin, sequential training with Mandarin followed by Japanese, and joint training on the combined datasets.
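As a concrete illustration of how the three strategies differ, the following minimal Python sketch shows how per-language training data could be scheduled under sequential versus joint training. The `train_step` callable, the toy per-language sample lists, and the epoch counts are hypothetical stand-ins for illustration only, not the thesis's actual implementation.

```python
# Minimal sketch of the three training schedules compared in this work:
# sequential JA->ZH, sequential ZH->JA, and joint training on mixed data.
import random
from typing import Callable, Iterable, Iterator, Sequence


def sequential_schedule(first: Sequence, second: Sequence,
                        epochs_per_stage: int) -> Iterator:
    """Finish all epochs on the first language, then move to the second."""
    for _ in range(epochs_per_stage):
        yield from first
    for _ in range(epochs_per_stage):
        yield from second


def joint_schedule(lang_a: Sequence, lang_b: Sequence,
                   epochs: int, seed: int = 0) -> Iterator:
    """Pool both languages and shuffle, so every epoch mixes them."""
    rng = random.Random(seed)
    pool = list(lang_a) + list(lang_b)
    for _ in range(epochs):
        rng.shuffle(pool)
        yield from pool


def run(schedule: Iterable, train_step: Callable) -> None:
    """Consume a schedule, applying one (hypothetical) update per sample."""
    for sample in schedule:
        train_step(sample)


if __name__ == "__main__":
    japanese = [("ja", i) for i in range(3)]   # stand-ins for Japanese songs
    mandarin = [("zh", i) for i in range(3)]   # stand-ins for Mandarin songs

    seq_log, joint_log = [], []
    run(sequential_schedule(japanese, mandarin, epochs_per_stage=1), seq_log.append)
    run(joint_schedule(japanese, mandarin, epochs=1), joint_log.append)
    print("JA->ZH order:", [lang for lang, _ in seq_log])
    print("joint order: ", [lang for lang, _ in joint_log])
```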
Subjective listening tests show that training order affects accent adaptation performance. Sequential training excels when the listening-test language matches the later-learned language, achieving the highest scores in such scenarios. Joint training, by contrast, provides stable performance across languages but does not outperform sequential training in tests where the language matches the later-learned one. Objective evaluations further quantify phoneme accuracy, revealing how the different training strategies affect phonetic details such as Mandarin retroflexes and Japanese long vowels.
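A phoneme-accuracy measure of this kind can be sketched as follows: align the recognized phoneme sequence against the reference lyrics and report the fraction of reference phonemes reproduced, optionally restricted to a class of interest such as the Mandarin retroflex initials. The phoneme labels, the `RETROFLEXES` set, and the `phoneme_accuracy` helper below are illustrative assumptions, not the thesis's exact phoneme inventory, recognizer, or metric.

```python
# Sketch of a class-aware phoneme accuracy: overall and for one phoneme class.
from difflib import SequenceMatcher

RETROFLEXES = {"zh", "ch", "sh", "r"}  # assumed Mandarin retroflex initials


def phoneme_accuracy(reference, hypothesis, focus=None):
    """Fraction of reference phonemes recovered in the hypothesis,
    optionally restricted to a focus class (e.g. retroflexes, long vowels)."""
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    correct = set()
    for block in matcher.get_matching_blocks():
        correct.update(range(block.a, block.a + block.size))
    indices = [i for i, p in enumerate(reference) if focus is None or p in focus]
    if not indices:
        return float("nan")
    return sum(i in correct for i in indices) / len(indices)


if __name__ == "__main__":
    ref = ["zh", "ong", "g", "uo", "sh", "an"]
    hyp = ["z", "ong", "g", "uo", "s", "an"]   # retroflexes flattened to dentals
    print("overall accuracy:  ", round(phoneme_accuracy(ref, hyp), 2))
    print("retroflex accuracy:", phoneme_accuracy(ref, hyp, RETROFLEXES))
```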
This study demonstrates the multifaceted impact of training strategies on cross-lingual singing voice synthesis, showing consistent trends in both subjective and objective evaluations. Joint training, in particular, provides more stable cross-lingual singing accent adaptability. Future research will explore data-proportion adjustments, more precise phoneme extraction models, and multilingual expansion to further enhance the system's adaptability and cross-lingual synthesis performance.