
Graduate Student: 施德睿 (Derrick Seaven Bol)
Thesis Title: Learning to compose music with mixed genres by genre-fusion games (用流派融合賽局來學習混合音樂流派的作曲)
Advisor: Soo, Von-Wun (蘇豐文)
Committee Members: 朱宏國, 蘇英俊
Degree: Master
Department: Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 75
Keywords: Music Generation, Style Transfer, Genre Fusion, Adversarial Latent Autoencoders

    Music composition is an evolutionary, creative art that can produce new pieces of music distinct from previous ones. In this work, we view the music composition process as a genre-fusion game in which a new genre evolves from old ones by fusing them together in a subtle way using style transfer, resulting in music that is both musical and a fusion of the previous genres.

    In other words, we view music composition as two games: a musicality game and a genre-fusion game. We propose an ALAE (Adversarial Latent Autoencoder), a GAN-like autoencoder deep learning architecture, to implement this novel idea of music composition. We thus present two variants of our system: 1) a musicality-only system and 2) a musicality-genre-fusion system. In the latter, we adopt a triplet loss function to train the genre-fusion game process.
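
    To make the triplet-loss component concrete, the following is a minimal sketch of the standard triplet loss in PyTorch. The margin value, the Euclidean distance, and the hypothetical encoder that maps bars of music to latent embeddings are illustrative assumptions rather than the exact settings of this work.

        import torch
        import torch.nn.functional as F

        def triplet_loss(anchor, positive, negative, margin=1.0):
            # Standard triplet loss: pull the anchor toward the positive (same-genre)
            # embedding and push it at least margin away from the negative (other-genre) one.
            d_pos = F.pairwise_distance(anchor, positive)  # anchor-to-positive distance
            d_neg = F.pairwise_distance(anchor, negative)  # anchor-to-negative distance
            return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

        # Hypothetical usage with an encoder that embeds pianoroll bars:
        #   a = encoder(pop_bar)        # anchor from the pop dataset
        #   p = encoder(other_pop_bar)  # positive: another pop sample
        #   n = encoder(rock_bar)       # negative: a rock sample
        #   loss = triplet_loss(a, p, n)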

    Our full musicality-genre-fusion system is compared with two baselines: a random-sampling fusion model and the musicality-only model. The musicality-only model uses the same ALAE architecture but focuses only on matching the real music distribution.

    We conducted both objective and subjective evaluation experiments. In the objective experiments, we used various metrics such as reconstruction accuracy, empty-bar ratio, polyphony, tonal distance, and a genre classifier to evaluate our system against the baselines. We found that our proposed musicality-genre-fusion system was able to learn to generate music in a style characterized by a fusion of two genres. The polyphony rate and empty-bar ratio were close to the values of the real data distribution, while the increased tonal distances showed that our system did indeed explore regions beyond the real music data distribution in order to satisfy the additional constraints imposed by the genre-fusion game.
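
    As a rough illustration of two of these metrics, the sketch below computes an empty-bar ratio and a polyphony rate from a binary pianoroll array. The resolution of 48 time steps per bar and the exact thresholds are assumptions for illustration and may differ from the definitions used in this work.

        import numpy as np

        def empty_bar_ratio(pianoroll, steps_per_bar=48):
            # pianoroll: (time_steps, 128) binary array; a bar is empty if no pitch is active in it.
            n_bars = pianoroll.shape[0] // steps_per_bar
            bars = pianoroll[:n_bars * steps_per_bar].reshape(n_bars, steps_per_bar, -1)
            return float(np.mean(bars.sum(axis=(1, 2)) == 0))

        def polyphony_rate(pianoroll, threshold=2):
            # Fraction of active time steps in which at least threshold pitches sound at once.
            notes_per_step = pianoroll.sum(axis=1)
            active = notes_per_step > 0
            if not active.any():
                return 0.0
            return float(np.mean(notes_per_step[active] >= threshold))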

    In the subjective experiments, we found that listeners were able to quickly identify pop and rock genre samples when they came from their respective datasets. However, when asked to characterize samples from the musicality-genre-fusion system, they struggled to pick a specific genre and preferred "a mixture of pop and rock". We also found that both our full musicality-genre-fusion system and the musicality game were able to outperform the baselines used in our work.

    Table of Contents:
    1 Introduction
    2 Related Works
      2.1 Multi-track Symbolic Music Generation
      2.2 Music Style Transfer
    3 Background
      3.1 Musical Instrument Digital Interface
      3.2 Music Representation
      3.3 Multi-Track Music
      3.4 Style Transfer in Music
        3.4.1 Music Style Transfer
        3.4.2 Music Genre Fusion
      3.5 Convolutional Neural Networks (CNNs)
      3.6 Generative Models
        3.6.1 Autoencoders
        3.6.2 Generative Adversarial Networks (GANs)
        3.6.3 The Adversarial Latent Autoencoder (ALAE)
      3.7 The Triplet Loss
    4 Methodology
      4.1 Dataset
      4.2 Baseline Architecture
      4.3 System Overview
      4.4 Musicality Game
      4.5 Genre-Fusion Game
    5 Experiments
      5.1 The Musicality-Only System
      5.2 The Musicality-Genre-Fusion System
        5.2.1 The Musicality Game
        5.2.2 The Genre-Fusion Game
    6 Results
      6.1 Objective Evaluation
        6.1.1 Objective Metrics
        6.1.2 Metrics Comparison
        6.1.3 t-SNE Plots to Visualize Multi-track Samples Embeddings
      6.2 Subjective Evaluation
    7 Conclusion and Future Work
      7.1 Conclusion
      7.2 Future Work
    8 Appendix
      8.1 Subjective Listening Test

