
Graduate student: 蕭文逸 (Hsiao, Wen-Yi)
Thesis title: 基於捲積式生成對抗網路之自動作曲系統之探討
(Automatic Symbolic Music Generation Based on Convolutional Generative Adversarial Networks)
Advisors: 黃婷婷 (Hwang, Ting-Ting); 楊奕軒 (Yang, Yi-Hsuan)
Committee members: 陳煥宗 (Chen, Hwann-Tzong); 劉奕汶 (Liu, Yi-Wen)
Degree: Master (碩士)
Department:
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: English
Number of pages: 68
Keywords (Chinese): 音樂自動作曲、音樂資訊檢索、深度學習、生成對抗模型
Keywords (English): automatic music generation, music information retrieval, deep learning, generative adversarial nets
Generating music differs from generating images and videos in several notable ways. First, music is an art of time, so a temporal model is needed. Second, music is usually composed of multiple instruments/tracks, each with its own texture and playing patterns, yet they must echo one another harmoniously when played together. Finally, notes are not merely sequential: neighboring groups of notes form various musical constructs such as chords, arpeggios, and scales. In this thesis, under the framework of convolutional generative adversarial networks, we investigate several issues in multi-track and polyphonic music generation, including track controllability, automatic accompaniment, network design, and temporal modeling. We also train the models on two common music formats: the lead sheet and the band score. For analysis, we propose several metrics to measure the quality of the generated music and the harmony between tracks. This thesis provides a comprehensive study, from music representation and pre-processing to quantitative comparison between models, in the hope of gaining deeper insights and understanding the effectiveness and limitations of deep learning techniques.


Generating music differs in a few notable ways from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks, each with its own temporal dynamics, yet collectively they unfold over time interdependently. Lastly, in polyphonic music, notes are often grouped into chords, arpeggios, or melodies, so imposing a strict chronological ordering on the notes is not natural. In this thesis, we investigate several topics in symbolic multi-track polyphonic music generation under the framework of convolutional generative adversarial networks (GANs), including controllability, accompaniment, network architecture, and temporal modeling. We train and compare the models on two common formats: the lead sheet and the band score. To evaluate the generated results, we also propose several intra-track and inter-track objective metrics. This integrated study, spanning data representation, pre-processing, and quantitative evaluation across various architectures, offers insights into composing music and allows us to re-examine the effectiveness and limitations of deep learning models.
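
The abstract refers to the data representation and to the intra-track and inter-track objective metrics only at a high level. As a rough, self-contained illustration of what such measures can look like (not the exact definitions used in the thesis), the Python sketch below encodes bars as binary multi-track piano-roll tensors and computes three toy metrics: an intra-track empty-bar ratio, the average number of used pitch classes per track-bar, and a simple inter-track pitch-class overlap as a stand-in for harmonicity measures such as tonal distance. The tensor shape (4 tracks × 96 time steps × 84 pitches) and all function names are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) dimensions: 4 tracks, 96 time steps per bar, 84 pitches,
# with the lowest pitch assumed to be a C so that (index % 12) is the pitch class.
N_TRACKS, N_STEPS, N_PITCHES = 4, 96, 84


def random_pianoroll_bars(rng, num_bars=8, density=0.05):
    """Toy binary piano-roll tensor of shape (num_bars, tracks, time, pitch)."""
    shape = (num_bars, N_TRACKS, N_STEPS, N_PITCHES)
    return rng.random(shape) < density


def empty_bar_ratio(bars):
    """Intra-track metric: fraction of (bar, track) slices containing no notes."""
    notes_per_track_bar = bars.sum(axis=(2, 3))          # (num_bars, tracks)
    return float((notes_per_track_bar == 0).mean())


def _pitch_classes(bars):
    """Fold the pitch axis onto 12 pitch classes: (num_bars, tracks, time, 12)."""
    num_bars, n_tracks, n_steps, n_pitches = bars.shape
    folded = bars.reshape(num_bars, n_tracks, n_steps, n_pitches // 12, 12)
    return folded.any(axis=3)                            # collapse the octave axis


def used_pitch_classes(bars):
    """Intra-track metric: average number of distinct pitch classes per (bar, track)."""
    pc = _pitch_classes(bars).any(axis=2)                # (num_bars, tracks, 12)
    return float(pc.sum(axis=2).mean())


def intertrack_pc_overlap(bars, track_a=0, track_b=1):
    """Toy inter-track metric: mean Jaccard overlap of active pitch classes
    between two tracks, over the time steps where at least one of them sounds.
    (A simple stand-in for harmonicity measures such as tonal distance.)"""
    pc = _pitch_classes(bars)
    a, b = pc[:, track_a], pc[:, track_b]                # (num_bars, time, 12)
    inter = (a & b).sum(axis=2).astype(float)
    union = (a | b).sum(axis=2).astype(float)
    mask = union > 0
    return float((inter[mask] / union[mask]).mean()) if mask.any() else 0.0


if __name__ == "__main__":
    bars = random_pianoroll_bars(np.random.default_rng(0))
    print("empty-bar ratio:       ", empty_bar_ratio(bars))
    print("used pitch classes:    ", used_pitch_classes(bars))
    print("inter-track PC overlap:", intertrack_pc_overlap(bars))
```

The thesis's own metric definitions are covered in Chapter 5 (Objective Metrics for Evaluation), with tonal distance revisited in Section 6.3.2 of the table of contents below.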

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
    1.2.1 Multi-track and Polyphony
    1.2.2 Multi-track Interdependency
    1.2.3 Temporal Structure
    1.2.4 Design of Networks
    1.2.5 Temporal Networks
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Generative Adversarial Networks
  2.2 Video Generation using GANs
  2.3 Symbolic Music Generation
3 Proposed System
  3.1 Data Representation
  3.2 Multi-track Interdependency
    3.2.1 Jamming Model
    3.2.2 Composer Model
    3.2.3 Hybrid Model
  3.3 Modeling the Temporal Structure
    3.3.1 Generation from Scratch
    3.3.2 Track-conditional Generation
  3.4 Integrated System
4 Implementation
  4.1 Dataset
  4.2 Data Preprocessing
    4.2.1 LPD dataset
    4.2.2 Lead Sheet dataset
  4.3 Model Settings
    4.3.1 Random Vectors
    4.3.2 Generator
    4.3.3 Discriminator
    4.3.4 Encoder
    4.3.5 Training
5 Experiments
  5.1 Objective Metrics for Evaluation
  5.2 Experiments
    5.2.1 Analysis of Training Data
    5.2.2 Example Results
    5.2.3 Quantitative Evaluation
    5.2.4 Training Process
  5.3 User Study
  5.4 Interpolation
    5.4.1 Interpolation on inter-track random vectors
    5.4.2 Interpolation on intra-track random vectors
    5.4.3 Bilinear interpolation
6 Design of Networks
  6.1 Networks
  6.2 Experiments
  6.3 Discussion
    6.3.1 Filter and Rhythm
    6.3.2 Revisit the Tonal Distance
7 Temporal Networks
  7.1 Networks
  7.2 Evaluation and Discussion
    7.2.1 CNN
    7.2.2 RNN
8 Conclusions
References

