Graduate Student: 蕭文逸 Hsiao, Wen-Yi
Thesis Title: 基於捲積式生成對抗網路之自動作曲系統之探討 (Automatic Symbolic Music Generation Based on Convolutional Generative Adversarial Networks)
Advisors: 黃婷婷 Hwang, Ting-Ting; 楊奕軒 Yang, Yi-Hsuan
Committee Members: 陳煥宗 Chen, Hwann-Tzong; 劉奕汶 Liu, Yi-Wen
Degree: Master (碩士)
Department:
Year of Publication: 2018
Academic Year: 106
Language: English
Number of Pages: 68
Keywords (Chinese): 音樂自動作曲, 音樂資訊檢索, 深度學習, 生成對抗模型
Keywords (English): automatic music generation, music information retrieval, deep learning, generative adversarial nets
Generating music differs in several notable ways from generating images and videos. First, music is an art of time, so a temporal model is needed. Second, music is usually composed of multiple instruments/tracks, each with its own texture and playing patterns, yet they must respond to one another harmoniously when performed together. Finally, notes are not related merely in time: neighboring groups of notes form various musical constructs such as chords, arpeggios, and scales. In this thesis, under the framework of convolutional generative adversarial networks, we investigate several issues in multi-track, polyphonic music generation, including track controllability, automatic accompaniment, network design, and temporal modeling. We also train the models on two common music formats: lead sheets and band scores. For analysis, we propose several metrics that measure the quality of the generated music and the harmonicity between tracks. The thesis provides a comprehensive treatment from music representation and pre-processing to quantitative comparison between models, in the hope of gaining deeper insights and understanding the effectiveness and limitations of deep learning techniques.
Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, yet collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios, or melodies in polyphonic music, so imposing a strict chronological ordering of notes is not naturally suitable. In this thesis, we investigate several topics in symbolic multi-track polyphonic music generation under the framework of convolutional generative adversarial networks (GANs), including track controllability, accompaniment, network architecture, and temporal modeling. We train and compare the models on two common formats: lead sheet and band score. To evaluate the generated results, we also propose several intra-track and inter-track objective metrics. This integrated study, covering data representation, pre-processing, and quantitative comparison across architectures, offers deeper insights into composing music and a re-examination of the effectiveness and limitations of deep learning models.
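To make the idea of intra-track and inter-track objective metrics more concrete, below is a minimal, hypothetical sketch of how such measures could be computed on a binary piano-roll of shape (time steps, pitches). The function names, the 48-steps-per-bar resolution, and the specific measures (empty-bar ratio, pitch classes used per bar, and a simple inter-track co-activity ratio) are illustrative assumptions, not the exact metrics defined in the thesis.

```python
import numpy as np


def empty_bar_ratio(pianoroll, steps_per_bar=48):
    """Intra-track metric: fraction of bars that contain no active notes.

    `pianoroll` is a boolean array of shape (n_time_steps, n_pitches)
    for a single track; `steps_per_bar` is the temporal resolution per bar.
    """
    n_bars = pianoroll.shape[0] // steps_per_bar
    bars = pianoroll[:n_bars * steps_per_bar].reshape(n_bars, steps_per_bar, -1)
    return float(np.mean(~bars.any(axis=(1, 2))))


def pitch_classes_per_bar(pianoroll, steps_per_bar=48):
    """Intra-track metric: average number of distinct pitch classes per bar.

    Assumes pitch index 0 corresponds to a C, so index % 12 is the pitch class.
    """
    n_bars = pianoroll.shape[0] // steps_per_bar
    bars = pianoroll[:n_bars * steps_per_bar].reshape(n_bars, steps_per_bar, -1)
    counts = [len(set(np.flatnonzero(bar.any(axis=0)) % 12)) for bar in bars]
    return float(np.mean(counts)) if counts else 0.0


def co_activity_ratio(track_a, track_b):
    """Inter-track metric: fraction of time steps where both tracks sound,
    relative to the steps where at least one of them sounds."""
    on_a, on_b = track_a.any(axis=1), track_b.any(axis=1)
    union = np.logical_or(on_a, on_b).sum()
    return float(np.logical_and(on_a, on_b).sum() / union) if union else 0.0


# Example with two random 4-bar piano-rolls (48 steps per bar, 84 pitches).
rng = np.random.default_rng(0)
melody = rng.random((4 * 48, 84)) > 0.97
bass = rng.random((4 * 48, 84)) > 0.98
print(empty_bar_ratio(melody), pitch_classes_per_bar(melody))
print(co_activity_ratio(melody, bass))
```

In practice, metrics of this kind would be averaged over many generated samples and compared against the same statistics measured on the training data, so that a model's output can be judged against a realistic reference point rather than in isolation.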