| 研究生 (Graduate Student) | 劉弘祥 Liu, Hong-Hsiang |
|---|---|
| 論文名稱 (Thesis Title) | 代理驅動之大語言模型在中文歌詞創作中的實踐 Agent-Driven Large Language Models for Mandarin Lyric Generation |
| 指導教授 (Advisor) | 劉奕汶 Liu, Yi-Wen |
| 口試委員 (Committee Members) | 王新民 Wang, Hsin-Min; 王道維 Wang, Daw-Wei; 謝承諭 Hsieh, Chen-Yu Chester |
| 學位類別 (Degree) | 碩士 Master |
| 系所名稱 (Department) | 電機資訊學院 電機工程學系 Department of Electrical Engineering |
| 論文出版年 (Year of Publication) | 2024 |
| 畢業學年度 (Graduation Academic Year) | 113 |
| 語文別 (Language) | 中文 Chinese |
| 論文頁數 (Pages) | 86 |
| 中文關鍵詞 (Keywords, Chinese) | 大語言模型、人工智慧代理、中文歌詞、歌詞生成、旋律轉歌詞、詞曲咬合、多代理合作 |
| 外文關鍵詞 (Keywords, English) | Large Language Model, AI agent, Mandarin lyric, Lyric generation, Melody-to-Lyric, Lyric-melody alignment, Multi-agent collaboration |
Abstract:

In recent years, generative large language models have shown strong in-context learning ability, performing well on many tasks given only a prompt, and have therefore been adopted across a wide range of domains. The melody-to-lyric task generates lyrics that fit a given melody. Prior research on melody-conditioned lyric generation has been limited by the scarcity of high-quality aligned data and by the difficulty of judging creative quality; most earlier work controls only coarse attributes such as theme or emotion, and such purely textual control offers little added value given the current capabilities of large language models. Although lyric writing is highly subjective, different lyrics fit the same melody to different degrees, especially in a tonal language such as Mandarin, where the pitch contour implied by lexical tones must be matched against the melodic contour (the lyric-melody alignment problem, 詞曲咬合); this is also observed in our Mpop600 dataset. Using an AI-agent approach, we decompose the melody-to-lyric task into sub-tasks, each handled by an agent that combines the generative reasoning of a large language model with its own task-specific tools, and the agents collaborate to complete the overall task. Four agents respectively control rhyme, syllable count, lyric-melody alignment, and consistency. By implementing these language-model agents, we build a multi-agent collaborative lyric generation system and demonstrate that the agent-based approach enhances the capabilities of large language models.
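The abstract describes the agent decomposition only at a high level. Purely as an illustration of how such controller agents could be wired together, the following Python sketch shows a single generate-check-revise loop in which rule-based tools validate an LLM draft against syllable-count and rhyme constraints and feed failures back as prompt hints. All names here (`Constraint`, `draft_line`, `count_han`, the `fake_llm` stand-in) are hypothetical; the thesis's actual agents, prompts, tone-contour alignment tool, and consistency controller are not reproduced.

```python
# Minimal sketch of a generate-check-revise loop for constrained Mandarin lyric lines.
# Hypothetical names throughout; not the thesis's implementation.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint


@dataclass
class Constraint:
    name: str                       # e.g. 字數 (syllable count), 押韻 (rhyme)
    check: Callable[[str], bool]    # tool: True if the draft satisfies the constraint
    hint: str                       # feedback appended to the prompt on failure


def count_han(line: str) -> int:
    # In Mandarin lyrics, one Han character corresponds to one sung syllable.
    return sum('\u4e00' <= ch <= '\u9fff' for ch in line)


def draft_line(llm: LLM, theme: str, n_syllables: int,
               constraints: List[Constraint], max_rounds: int = 3) -> str:
    prompt = f"為主題「{theme}」寫一句恰好 {n_syllables} 個字的歌詞,只輸出歌詞本身。"
    line = llm(prompt)
    for _ in range(max_rounds):
        failed = [c for c in constraints if not c.check(line)]
        if not failed:              # every controller is satisfied
            return line
        feedback = ";".join(f"{c.name}: {c.hint}" for c in failed)
        line = llm(prompt + f"\n上一版「{line}」不符合以下要求,請改寫:{feedback}")
    return line


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Offline stand-in for a real chat-completion call.
        return "夜色輕輕落在窗前"

    constraints = [
        Constraint("字數", lambda s: count_han(s) == 8, "需為 8 個字"),
        # Crude stand-in for a rhyme tool; a real checker would compare pinyin finals.
        Constraint("押韻", lambda s: s.endswith(("前", "天", "邊")), "需押 ian/an 韻"),
    ]
    print(draft_line(fake_llm, "想念", 8, constraints))
```

In this sketch each constraint plays the role of one controller, and unsatisfied constraints are returned to the generator as natural-language feedback; a multi-agent version in the spirit of the thesis would additionally give each controller its own LLM and tools (e.g. a pinyin converter or a tone-versus-melody contour comparison) rather than a single shared loop.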