
Graduate Student: 劉弘祥 (Liu, Hong-Hsiang)
Thesis Title: 代理驅動之大語言模型在中文歌詞創作中的實踐
(Agent-Driven Large Language Models for Mandarin Lyric Generation)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Oral Defense Committee: 王新民 (Wang, Hsin-Min), 王道維 (Wang, Daw-Wei), 謝承諭 (Hsieh, Chen-Yu Chester)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Number of Pages: 86
Chinese Keywords: 大語言模型、人工智慧代理、中文歌詞、歌詞生成、旋律轉歌詞、詞曲咬合、多代理合作
Keywords: Large Language Model, AI agent, Mandarin lyric, Lyric generation, Melody-to-Lyric, Lyric-melody alignment, Multi-agent collaboration
  • In recent years, generative large language models have been widely adopted across many domains because, given only a prompt, they exhibit strong in-context learning ability on a wide range of tasks. The melody-to-lyric task generates lyrics that fit a given melody. Prior work on melody-conditioned lyric generation has been sparse, owing to the scarcity of high-quality aligned data and the difficulty of judging creative quality; most systems control only coarse attributes such as theme or emotion, and such purely textual control no longer offers clear value given the current trajectory of large language model capabilities. Although lyric writing is highly subjective, different lyrics fit the same melody to different degrees; in particular, a tonal language such as Mandarin raises the lyric-melody alignment (詞曲咬合) problem, which we also verify in our Mpop600 dataset. Using an AI-agent approach, we decompose the whole melody-to-lyric task among multiple agents, each endowed with the generative reasoning of a large language model and its own corresponding tools, and the agents cooperate to complete this complex task. In this work, four agents respectively achieve rhyme control, character-count control, lyric-melody alignment control, and consistency control. Through these language model agents, we implement a multi-agent collaborative lyric generation system, demonstrating the effectiveness of the agent approach in enhancing large language model capabilities.


    In recent years, Generative Large Language Models have shown impressive in-context learning abilities, performing well across a wide range of tasks given only a prompt. The melody-to-lyric task generates lyrics for a given melody. Previous research has been limited by scarce high-quality aligned data and the lack of clear creative standards: most efforts controlled only broad attributes such as theme or emotion, and such purely textual control has become less valuable given current language model capabilities. Despite the subjective nature of lyric creation, different lyrics fit the same melody to varying degrees, especially in tonal languages like Mandarin, where lexical tones carry their own pitch contours and give rise to the lyric-melody alignment problem; this has been validated in our Mpop600 dataset. Our research decomposes the melody-to-lyric task into sub-tasks, each handled by a different agent equipped with a language model and specific tools. We use four agents to control rhyme, syllable count, lyric-melody alignment, and consistency. Through the implementation of these language model agents, we have developed a multi-agent collaborative lyric generation system, demonstrating the efficacy of the agent-based approach in enhancing the capabilities of large language models.
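The agent decomposition described in the abstract can be sketched in a few lines. This is a hypothetical illustration only, not the thesis's implementation: the `generate` stub, its canned drafts, and the two toy checks stand in for the ChatGPT-backed agents and tools of Chapter 5, and the retry-with-feedback loop corresponds to what the thesis calls backward control.

```python
# Hypothetical sketch of the multi-agent loop: checker agents return a
# complaint string (or None), and the orchestrator regenerates with the
# accumulated feedback until every check passes.

def syllable_agent(line, n_notes):
    """One Mandarin character per note: reject drafts with the wrong length."""
    if len(line) != n_notes:
        return f"need {n_notes} characters, got {len(line)}"
    return None

def rhyme_agent(line, rhyme_char):
    """Toy rhyme check: require a target final character (a real system
    would consult a rhyming dictionary such as AZRhymes)."""
    if rhyme_char and not line.endswith(rhyme_char):
        return f"line should end on the '{rhyme_char}' rhyme"
    return None

def generate(prompt, feedback):
    """Stub for the lyric-generation agent; a real system would prompt an
    LLM with the melody description plus the critics' feedback."""
    drafts = ["想念的歌", "想念你的歌"]  # canned outputs for illustration
    return drafts[min(len(feedback), len(drafts) - 1)]

def write_line(prompt, n_notes, rhyme_char, max_tries=3):
    feedback = []
    for _ in range(max_tries):
        draft = generate(prompt, feedback)
        issues = [msg for msg in (syllable_agent(draft, n_notes),
                                  rhyme_agent(draft, rhyme_char)) if msg]
        if not issues:
            return draft
        feedback.extend(issues)  # backward control: retry with the critique
    return draft

print(write_line("a five-note phrase about longing", 5, "歌"))  # 想念你的歌
```

In the thesis's actual system the checks are performed by separate ChatGPT-based agents with function-calling tools rather than by local Python functions, but the control flow — generate, criticize, regenerate — is the same shape.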

    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Main Challenges
        1.2.1  Lack of Sufficient Aligned Data
        1.2.2  No Single Correct Answer in Lyric Writing
      1.3  Goals
      1.4  Problem Definition
      1.5  Main Contributions
      1.6  Thesis Organization
    Chapter 2  Literature Review
      2.1  Lyric Generation
      2.2  Lyric-Melody Alignment
      2.3  Prompt Engineering
    Chapter 3  Data and Preprocessing
      3.1  The Mpop600 Mandarin Singing Dataset
      3.2  Data Levels
      3.3  Preprocessing
        3.3.1  Merging Data Character by Character
        3.3.2  Extracting Tones Sentence by Sentence
        3.3.3  Obtaining Section Boundaries
        3.3.4  Output Data
    Chapter 4  Tools
      4.1  Generative Language Models
        4.1.1  ChatGPT API
        4.1.2  Function Calling
      4.2  Regular Expressions
      4.3  Chinese Word Segmentation Tools
      4.4  Web Crawling Tools
      4.5  Tone Parsing
        4.5.1  MOE Chinese Romanization Conversion System
        4.5.2  Pypinyin
      4.6  The AZRhymes Rhyming Dictionary
    Chapter 5  System Design
      5.1  Lyric Generation Pipeline
        5.1.1  Section-by-Section Generation
        5.1.2  The Agent Group
      5.2  Controlling Language Model Output
        5.2.1  Forward Control
        5.2.2  Backward Control
      5.3  Rhyme Suggestion Agent
        5.3.1  Prompts
        5.3.2  Execution Flow
      5.4  Lyric Generation Agent
        5.4.1  Prompts
        5.4.2  Character-Count Control
        5.4.3  Execution Flow
      5.5  Alignment Control Agent
        5.5.1  Alignment Rules
        5.5.2  Prompts
        5.5.3  Execution Flow
      5.6  Judging Agent
        5.6.1  Prompts
        5.6.2  Execution Flow
      5.7  Singing Voice Synthesis
        5.7.1  Singing Voice Synthesis Model
        5.7.2  Pronunciation Data for Singing Voice Synthesis
    Chapter 6  Experiments and Results
      6.1  Lyric-Melody Alignment Analysis in Mpop600
        6.1.1  Alignment-Rate Analysis
        6.1.2  Validation of Tonal Alignment Rules
      6.2  Character-Count Control of Language Model Output
        6.2.1  Experimental Design
        6.2.2  Results
      6.3  Listening-Test Evaluation of Generated Lyrics
        6.3.1  Experiment and Questionnaire Design
        6.3.2  Analysis of Results
    Chapter 7  Conclusions
      7.1  Verified the Importance of Lyric-Melody Alignment in Lyric Writing
      7.2  A Method for Controlling the Character Count of Language Model Output
      7.3  Applied the Multi-Agent Approach to Lyric Writing
    Chapter 8  Future Work
      8.1  More Precise Alignment-Difference Computation and Experiments
      8.2  More Complex Agent Workflows
      8.3  Human-AI Collaborative Creation
    References
    Appendix
      A.1  Mpop600 Song List
      A.2  Sample Data Format
      A.3  Listening-Test Questionnaire, Part 1
      A.4  Listening-Test Questionnaire, Part 2
      A.5  Committee Suggestions
        A.5.1  Prof. Wang, Hsin-Min
        A.5.2  Prof. Wang, Daw-Wei
        A.5.3  Prof. Hsieh, Chen-Yu
        A.5.4  Prof. Liu, Yi-Wen
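The alignment-rate (咬合率) analysis of Chapter 6 rests on comparing tonal pitch shapes with melodic movement. The sketch below uses a deliberately simplified stand-in rule from the tone-melody literature — direction matching between the onset heights of consecutive tones and consecutive notes. The thesis's actual alignment rules (Section 5.5.1) are not reproduced here, and the `TONE_LEVEL` heights are illustrative assumptions.

```python
# Hypothetical alignment-rate sketch: a note transition "matches" when
# the melody moves in the same direction as the (assumed) onset heights
# of the corresponding Mandarin tones.

TONE_LEVEL = {1: 3, 2: 1, 3: 0, 4: 3}  # assumed onset height per tone

def direction(a, b):
    return (b > a) - (b < a)  # +1 rising, -1 falling, 0 level

def alignment_rate(tones, midi_notes):
    """Fraction of adjacent-note transitions whose direction agrees with
    the direction implied by the lyrics' tones."""
    assert len(tones) == len(midi_notes) and len(tones) >= 2
    hits = 0
    for (t1, n1), (t2, n2) in zip(zip(tones, midi_notes),
                                  zip(tones[1:], midi_notes[1:])):
        if direction(TONE_LEVEL[t1], TONE_LEVEL[t2]) == direction(n1, n2):
            hits += 1
    return hits / (len(tones) - 1)

print(alignment_rate([1, 2, 4], [67, 64, 69]))  # 1.0: melody follows the tones
```

Running such a score over every phrase in a corpus like Mpop600 gives a per-song alignment rate, which is the style of statistic that Section 6.1.1 reports.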

    [1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
    [2] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” arXiv preprint arXiv:1409.3215, 2014.
    [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [4] J.-Y. Liao, “Artificial intelligence musicians: an adaptive emotion-oriented lyric-to-melody generator via transformer based language models,” Master’s thesis, National Chung Hsing University, 2022.
    [5] J.-W. Chang, J. C. Hung, and K.-C. Lin, “Singability-enhanced lyric generator with music style transfer,” Computer Communications, vol. 168, pp. 33–53, Feb. 2021.
    [6] Y.-F. Huang and K.-C. You, “Automated generation of Chinese lyrics based on melody emotions,” IEEE Access, vol. 9, pp. 98060–98071, 2021.
    [7] K.-Y. Lin, “Chinese lyrics generation using sequence to sequence learning approach,” Master’s thesis, National Taiwan University, Jan 2017.
    [8] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, and Z. Sui, “A survey on in-context learning,” arXiv preprint arXiv:2301.00234, 2024.
    [9] K. P. Murphy, “Machine learning - a probabilistic perspective,” in Adaptive computation and machine learning series, 2012.
    [10] Y. Tian, A. Narayan-Chen, S. Oraby, A. Cervone, G. Sigurdsson, C. Tao, W. Zhao, Y. Chen, T. Chung, J. Huang, et al., “Unsupervised melody-to-lyric generation,” arXiv preprint arXiv:2305.19228, 2023.
    [11] A. Tsaptsinos, “Lyrics-based music genre classification using a hierarchical attention network,” arXiv preprint arXiv:1707.04678, 2017.
    [12] M. Bejan, “Multi-lingual lyrics for genre classification,” Kaggle, 2021.
    [13] D. Edmonds and J. Sedoc, “Multi-emotion classification for song lyrics,” in Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 221–235, 2021.
    [14] Z. Sheng, K. Song, X. Tan, Y. Ren, W. Ye, S. Zhang, and T. Qin, “SongMASS: Automatic song writing with pre-training and alignment constraint,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13798–13806, 2021.
    [15] Y. Chen and A. Lerch, “Melody-conditioned lyrics generation with seqgans,” in 2020 IEEE International Symposium on Multimedia (ISM), pp. 189–196, IEEE, 2020.
    [16] H. R. G. Oliveira, F. A. Cardoso, and F. C. Pereira, “Tra-la-lyrics: An approach to generate text based on rhythm,” in 4th International Joint Workshop on Computational Creativity, (London, UK), pp. 1–8, 2007.
    [17] H. G. Oliveira, “Tra-la-lyrics 2.0: Automatic generation of song lyrics on a semantic domain,” J. Artificial General Intelligence, vol. 6, no. 1, pp. 87–110, 2015.
    [18] K. Watanabe, Y. Matsubayashi, S. Fukayama, M. Goto, K. Inui, and T. Nakano, “A melody-conditioned lyrics language model,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 163–172, 2018.
    [19] P. Potash, A. Romanov, and A. Rumshisky, “Ghostwriter: Using an lstm for automatic rap lyric generation,” in 2015 Conference on Empirical Methods in Natural Language Processing, (Lisbon, Portugal), pp. 1919–1924, Association for Computational Linguistics, 2015.
    [20] L. N. Ferreira and J. Whitehead, “Learning to generate music with sentiment,” arXiv preprint arXiv:2103.06125, 2021.
    [21] H.-P. Lee, J.-S. Fang, and W.-Y. Ma, “iComposer: An automatic songwriting system for Chinese popular music,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 84–88, 2019.
    [22] E. Nichols, D. Morris, S. Basu, and C. Raphael, “Relationships between lyrics and melody in popular music,” in Proceedings of the 10th International Society for Music Information Retrieval Conference, pp. 471–476, 2009.
    [23] L.-H. Shen, P.-L. Tai, C.-C. Wu, and S.-D. Lin, “Controlling sequence-to-sequence models - a demonstration on neural-based acrostic generator,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, pp. 43–48, 2019.
    [24] N. Liu, W. Han, G. Liu, D. Peng, R. Zhang, X. Wang, and H. Ruan, “ChipSong: A Controllable Lyric Generation System for Chinese Popular Song,” in Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022), (Dublin, Ireland), pp. 85–95, Association for Computational Linguistics, 2022.
    [25] J. D. McCawley, “What is a tone language?,” in Tone: a Linguistic Survey (V. A. Fromkin, ed.), New York: Academic Press, 1978.
    [26] Y. R. Chao (趙元任), 現代吳語的研究 [Studies of the Modern Wu Dialects]. Science Press, Nov. 1956.
    [27] S. R. Speer, C.-L. Shih, and M. L. Slowiaczek, “Prosodic structure in language understanding: evidence from tone sandhi in mandarin,” Language and Speech, vol. 32, no. 4, pp. 337–354, 1989.
    [28] L. H. Wee, “Unraveling the relation between mandarin tones and musical melody,” Journal of Chinese Linguistics, vol. 35, no. 1, p. 128, 2007.
    [29] F. Xue (薛范), 歌曲翻譯探索與實踐 [Exploration and Practice of Song Translation]. Wuhan: Hubei Education Press, 2002.
    [30] X.-N. S. Shen, The Prosody of Mandarin Chinese. Los Angeles: University of California Press, 1989.
    [31] C. Y. Sun, “Xiqu changqiang han yuyan de guanxi,” in Yuyan Yu Yinyue (Y. Yang and D.K. Li, eds.), Taipei: Danqing Book co., Ltd., 1988.
    [32] W.-C. Ling, “The competition between contour and register correspondence in music-to-language perception: Evidence from mandarin child songs,” in Proceedings of the 51st International Conference on Sino-Tibetan Languages and Linguistics, Organizing Committee of the 51st ICSTLL and The Hakubi Center, Kyoto University, 2018.
    [33] W. S. V. Ho, “The tone-melody interface of popular songs written in tone languages,” in 9th international conference on music perception and cognition, Bologna, Citeseer, 2006.
    [34] S. en Li, “The interaction between melodies and tones of the lyrics in mandarin folk songs,” Master’s thesis, National Kaohsiung Normal University, 2003.
    [35] P. Pfordresher and S. Brown, “Enhanced production and perception of musical pitch in tone language speakers,” Attention, Perception, & Psychophysics, vol. 71, pp. 1385–1398, Aug 2009.
    [36] T. Sun, Y. Shao, H. Qian, X. Huang, and X. Qiu, “Black-box tuning for language-model-as-a-service,” in International Conference on Machine Learning, pp. 20841–20855, PMLR, 2022.
    [37] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
    [38] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
    [39] Z. Yin, Q. Sun, C. Chang, Q. Guo, J. Dai, X.-J. Huang, and X. Qiu, “Exchange-of-thought: Enhancing large language model capabilities through cross-model communication,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 15135–15153, 2023.
    [40] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, Z. Tu, and S. Shi, “Encouraging divergent thinking in large language models through multi-agent debate,” arXiv preprint arXiv:2305.19118, 2023.
    [41] H. Soudani, E. Kanoulas, and F. Hasibi, “Fine tuning vs. retrieval augmented generation for less popular knowledge,” arXiv preprint arXiv:2403.01432, 2024.
    [42] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
    [43] C.-C. Chu, F.-R. Yang, Y.-J. L. Y.-W. Liu, and S.-H. Wu, “Mpop600: A mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis,” in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1647–1652, IEEE, 2020.
    [44] D. Crystal, A Dictionary of Linguistics and Phonetics. John Wiley & Sons, 2011.
    [45] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
    [46] OpenAI, Function Calling - OpenAI API, 2023. OpenAI API documentation.
    [47] Python Software Foundation, re — Regular expression operations, 2024. Python 3.11.8 documentation.
    [48] Ministry of Education, R.O.C. (Taiwan), “教育部中文譯音轉換系統 [Chinese Romanization Conversion System].” https://crptransfer.moe.gov.tw/index.jsp, 2009.
    [49] J. B. Barney Szabolcs, “AZRhymes 押韻辭典 [AZRhymes rhyming dictionary].” https://zh.azrhymes.com/, 2020.
    [50] OpenAI, “Prompt engineering guide.” https://platform.openai.com/docs/guides/prompt-engineering, 2024. Accessed: 2024-08-20.
    [51] Y.-P. Cho, Y. Tsao, H.-M. Wang, and Y.-W. Liu, “Mandarin singing voice synthesis with denoising diffusion probabilistic Wasserstein GAN,” in 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1956–1963, 2022.
    [52] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, pp. 214–223, PMLR, 2017.
    [53] F.-R. Yang, “Mandarin singing voice synthesis with a phonology-based duration model,” Master’s thesis, National Tsing Hua University, 2021.
    [54] Y.-J. Lee, B.-Y. Chen, Y.-T. Lai, H.-W. Liao, T.-C. Liao, S.-L. Kao, K.-Y. Kang, C.-T. Hsu, and Y.-W. Liu, “Examining the influence of word tonality on pitch contours when singing in mandarin,” in 2018 Oriental COCOSDA-International Conference on Speech Database and Assessments, pp. 89–94, IEEE, 2018.
