
Graduate student: 蕭文逸 (Hsiao, Wen-Yi)
Thesis title: 基於捲積式生成對抗網路之自動作曲系統之探討
(Automatic Symbolic Music Generation Based on Convolutional Generative Adversarial Networks)
Advisors: 黃婷婷 (Hwang, Ting-Ting); 楊奕軒 (Yang, Yi-Hsuan)
Committee members: 陳煥宗 (Chen, Hwann-Tzong); 劉奕汶 (Liu, Yi-Wen)
Degree: Master (碩士)
Department:
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: English
Number of pages: 68
Keywords (Chinese): 音樂自動作曲、音樂資訊檢索、深度學習、生成對抗模型
Keywords (English): automatic music generation, music information retrieval, deep learning, generative adversarial nets
Generating music differs from generating images and videos in several notable ways. First, music is an art of time, so a temporal model is needed. Second, music is usually composed of multiple instruments/tracks, each with its own texture and playing patterns, yet they must echo one another harmoniously when played together. Finally, notes are not merely sequential: neighboring groups of notes form various musical constructs such as chords, arpeggios, and scales. In this thesis, under the framework of convolutional generative adversarial networks, we investigate several issues in multi-track and polyphonic music generation, including track controllability, automatic accompaniment, network design, and temporal modeling. We also train the models on two common music formats: the lead sheet and the band score. For analysis, we propose several metrics to measure the quality of the generated music and the harmony between tracks. This thesis provides a comprehensive study, from music representation and pre-processing to quantitative comparison between models, in the hope of gaining deeper insights and understanding the effectiveness and limitations of deep learning techniques.


Generating music differs in a few notable ways from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks, each with its own temporal dynamics, yet collectively they unfold over time interdependently. Lastly, in polyphonic music, notes are often grouped into chords, arpeggios, or melodies, so imposing a strict chronological ordering on the notes is not natural. In this thesis, we investigate several topics in symbolic multi-track polyphonic music generation under the framework of convolutional generative adversarial networks (GANs), including controllability, accompaniment, network architecture, and temporal modeling. We train and compare the models on two common formats: the lead sheet and the band score. To evaluate the generated results, we also propose several intra-track and inter-track objective metrics. This integrated study, spanning data representation, pre-processing, and quantitative evaluation across various architectures, offers insights into composing music and allows us to re-examine the effectiveness and limitations of deep learning models.
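
The abstract refers to the data representation and to the intra-track and inter-track objective metrics only at a high level. As a rough, self-contained illustration of what such measures can look like (not the exact definitions used in the thesis), the Python sketch below encodes bars as binary multi-track piano-roll tensors and computes three toy metrics: an intra-track empty-bar ratio, the average number of used pitch classes per track-bar, and a simple inter-track pitch-class overlap as a stand-in for harmonicity measures such as tonal distance. The tensor shape (4 tracks × 96 time steps × 84 pitches) and all function names are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) dimensions: 4 tracks, 96 time steps per bar, 84 pitches,
# with the lowest pitch assumed to be a C so that (index % 12) is the pitch class.
N_TRACKS, N_STEPS, N_PITCHES = 4, 96, 84


def random_pianoroll_bars(rng, num_bars=8, density=0.05):
    """Toy binary piano-roll tensor of shape (num_bars, tracks, time, pitch)."""
    shape = (num_bars, N_TRACKS, N_STEPS, N_PITCHES)
    return rng.random(shape) < density


def empty_bar_ratio(bars):
    """Intra-track metric: fraction of (bar, track) slices containing no notes."""
    notes_per_track_bar = bars.sum(axis=(2, 3))          # (num_bars, tracks)
    return float((notes_per_track_bar == 0).mean())


def _pitch_classes(bars):
    """Fold the pitch axis onto 12 pitch classes: (num_bars, tracks, time, 12)."""
    num_bars, n_tracks, n_steps, n_pitches = bars.shape
    folded = bars.reshape(num_bars, n_tracks, n_steps, n_pitches // 12, 12)
    return folded.any(axis=3)                            # collapse the octave axis


def used_pitch_classes(bars):
    """Intra-track metric: average number of distinct pitch classes per (bar, track)."""
    pc = _pitch_classes(bars).any(axis=2)                # (num_bars, tracks, 12)
    return float(pc.sum(axis=2).mean())


def intertrack_pc_overlap(bars, track_a=0, track_b=1):
    """Toy inter-track metric: mean Jaccard overlap of active pitch classes
    between two tracks, over the time steps where at least one of them sounds.
    (A simple stand-in for harmonicity measures such as tonal distance.)"""
    pc = _pitch_classes(bars)
    a, b = pc[:, track_a], pc[:, track_b]                # (num_bars, time, 12)
    inter = (a & b).sum(axis=2).astype(float)
    union = (a | b).sum(axis=2).astype(float)
    mask = union > 0
    return float((inter[mask] / union[mask]).mean()) if mask.any() else 0.0


if __name__ == "__main__":
    bars = random_pianoroll_bars(np.random.default_rng(0))
    print("empty-bar ratio:       ", empty_bar_ratio(bars))
    print("used pitch classes:    ", used_pitch_classes(bars))
    print("inter-track PC overlap:", intertrack_pc_overlap(bars))
```

The thesis's own metric definitions are covered in Chapter 5 (Objective Metrics for Evaluation), with tonal distance revisited in Section 6.3.2 of the table of contents below.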

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
    1.2.1 Multi-track and Polyphony
    1.2.2 Multi-track Interdependency
    1.2.3 Temporal Structure
    1.2.4 Design of Networks
    1.2.5 Temporal Networks
  1.3 Contributions
  1.4 Thesis Organization
2 Related Work
  2.1 Generative Adversarial Networks
  2.2 Video Generation using GANs
  2.3 Symbolic Music Generation
3 Proposed System
  3.1 Data Representation
  3.2 Multi-track Interdependency
    3.2.1 Jamming Model
    3.2.2 Composer Model
    3.2.3 Hybrid Model
  3.3 Modeling the Temporal Structure
    3.3.1 Generation from Scratch
    3.3.2 Track-conditional Generation
  3.4 Integrated System
4 Implementation
  4.1 Dataset
  4.2 Data Preprocessing
    4.2.1 LPD dataset
    4.2.2 Lead Sheet dataset
  4.3 Model Settings
    4.3.1 Random Vectors
    4.3.2 Generator
    4.3.3 Discriminator
    4.3.4 Encoder
    4.3.5 Training
5 Experiments
  5.1 Objective Metrics for Evaluation
  5.2 Experiments
    5.2.1 Analysis of Training Data
    5.2.2 Example Results
    5.2.3 Quantitative Evaluation
    5.2.4 Training Process
  5.3 User Study
  5.4 Interpolation
    5.4.1 Interpolation on inter-track random vectors
    5.4.2 Interpolation on intra-track random vectors
    5.4.3 Bilinear interpolation
6 Design of Networks
  6.1 Networks
  6.2 Experiments
  6.3 Discussion
    6.3.1 Filter and Rhythm
    6.3.2 Revisit the Tonal Distance
7 Temporal Networks
  7.1 Networks
  7.2 Evaluation and Discussion
    7.2.1 CNN
    7.2.2 RNN
8 Conclusions
References

