Author: | 范綵均 Fan, Tsai-Jyun
---|---
Thesis Title: | 優化多峰分部之音色特徵強化音訊風格轉換 Improved Timbre-enhanced Multi-modal Music Style Transfer
Advisors: | 李哲榮 Lee, Che-Rung; 蘇黎 Su, Li
Committee Members: | 邱維辰 Chiu, Wei-Chen; 周志遠 Chou, Chi-Yuan
Degree: | 碩士 Master
Department: | 電機資訊學院 - 資訊工程學系 (Department of Computer Science, College of Electrical Engineering and Computer Science)
Year of Publication: | 2020
Graduation Academic Year: | 108
Language: | English
Number of Pages: | 24
Keywords: | Deep Learning, Style Transfer, Music
Style transfer between polyphonic music recordings has long been a challenging task. To achieve it, the key is to learn the domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner. Our previous work proposed an unsupervised style transfer method that requires no paired data, built on the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework. It successfully showed that applying image-to-image style transfer models to style transfer between music recordings can likewise produce promising results. However, there is still a large gap between the music generated by style transfer and real music, and closing this gap remains a challenge. To reduce it, we propose a simple yet efficient method to improve the transferred results. Our experiments cover four genres, namely piano solo, guitar solo, string quartet, and chiptune, and perform bilateral style transfer among them. We evaluate the proposed method with a subjective test, whose results show the improvement it brings to music style transfer. We also provide analyses of a series of other experiments, which we believe will be helpful for future research.
Style transfer of polyphonic music recordings has always been a challenging task, because it is hard to learn stable representations for both the domain-invariant (i.e., content) and the domain-variant (i.e., style) aspects of recorded music. Our previous work, which employs the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework, is an unsupervised music style transfer method. Although it successfully shows that techniques for image-to-image translation can also generate promising results for music style transfer, there is a notable gap between the real target and the transferred music. To reduce this gap, we propose a simple yet effective loss that improves the transferred results. We conduct experiments on bilateral style transfer tasks among four genres, namely piano solo, guitar solo, string quartet, and chiptune. The proposed methods are evaluated through a subjective test, whose results demonstrate the effectiveness of the proposed methods in music style transfer. We also design a novel objective test method and give some analysis based on a series of other experiments, which we believe will be helpful for future research.
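To make the content/style disentanglement described in the abstract concrete, the sketch below shows a minimal MUNIT-style setup on spectrograms: a content encoder, a style encoder, and a decoder that renders one domain's content with another domain's style. This is an illustrative assumption only, written in PyTorch with hypothetical module names, layer sizes, and spectrogram dimensions; it does not reproduce the network architecture or the timbre-enhanced loss proposed in the thesis.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps a (batch, 1, freq, time) spectrogram to a domain-invariant content code."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """Maps a spectrogram to a low-dimensional, domain-variant style code."""
    def __init__(self, style_dim=8, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(channels, style_dim)
    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class Decoder(nn.Module):
    """Reconstructs a spectrogram from a content code and a style code."""
    def __init__(self, style_dim=8, channels=64):
        super().__init__()
        self.style_proj = nn.Linear(style_dim, channels)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 1, 4, stride=2, padding=1),
        )
    def forward(self, content, style):
        # Inject the style code as a per-channel bias on the content code
        # (a simplified stand-in for the style modulation used in MUNIT-style models).
        bias = self.style_proj(style).unsqueeze(-1).unsqueeze(-1)
        return self.net(content + bias)

# Cross-domain translation: encode content from domain A, style from domain B, decode.
enc_c, enc_s, dec = ContentEncoder(), StyleEncoder(), Decoder()
x_a = torch.randn(2, 1, 128, 256)   # spectrogram batch from domain A (e.g., piano solo)
x_b = torch.randn(2, 1, 128, 256)   # spectrogram batch from domain B (e.g., guitar solo)
x_ab = dec(enc_c(x_a), enc_s(x_b))  # A's content rendered with B's style
recon_loss = nn.functional.l1_loss(dec(enc_c(x_a), enc_s(x_a)), x_a)  # within-domain reconstruction
```

In MUNIT-like models the style code usually modulates the decoder through adaptive instance normalization; the additive bias above is only a simplified stand-in for illustration.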
[1] Shuqi Dai, Zheng Zhang, and Gus Xia. “Music Style Transfer Issues: A Position Paper” (Mar. 2018).
[2] Marcelo Caetano and Xavier Rodet. “Sound morphing by feature interpolation”. June 2011, pp. 161–164. doi: 10.1109/ICASSP.2011.5946365.
[3] Jonathan Driedger, Thomas Prätzlich, and Meinard Müller. “Let It Bee – Towards NMF-Inspired Audio Mosaicing”. Jan. 2015, pp. 350–356.
[4] Shih-Yang Su et al. “Automatic conversion of Pop music into chiptunes for 8-bit pixel art”. Mar. 2017, pp. 411–415. doi: 10.1109/ICASSP.2017.7952188.
[5] Ian J. Goodfellow et al. “Generative Adversarial Nets”. Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 2014, pp. 2672–2680.
[6] D. Ulyanov and V. Lebedev. “Singing style transfer” (2016). url: https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
[7] O. B. Bohan. “Singing style transfer” (2017). url: http://madebyoll.in/posts/singing_style_transfer/.
[8] Cheng-Wei Wu et al. Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks. July 2018.
[9] Prateek Verma and Julius Smith. “Neural Style Transfer for Audio Spectograms” (Jan. 2018).
[10] Albert Haque, Michelle Guo, and Prateek Verma. “Conditional End-to-End Audio Transforms”. Sept. 2018, pp. 2295–2299. doi: 10.21437/Interspeech.2018-38.
[11] Noam Mor et al. A Universal Music Translation Network. May 2018.
[12] Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola. “A Matlab Toolbox for Music Information Retrieval”. Vol. 4. Jan. 2007, pp. 261–268. doi: 10.1007/978-3-540-78246-9_31.
[13] Geoffroy Peeters et al. “The Timbre Toolbox: Extracting audio descriptors from musical signals”. The Journal of the Acoustical Society of America 130 (Nov. 2011), pp. 2902–2916. doi: 10.1121/1.3642604.
[14] John Grey. “Multidimensional perceptual scaling of musical timbre”. The Journal of the Acoustical Society of America 61 (June 1977), pp. 1270–1277. doi: 10.1121/1.381428.
[15] Vinoo Alluri and Petri Toiviainen. “Exploring Perceptual and Acoustical Correlates of Polyphonic Timbre”. Music Perception 27 (Feb. 2010), pp. 223–242. doi: 10.1525/mp.2010.27.3.223.
[16] Anne Caclin et al. “Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones”. The Journal of the Acoustical Society of America 118 (Aug. 2005), pp. 471–482. doi: 10.1121/1.1929229.
[17] Kai Siedenburg, Ichiro Fujinaga, and Stephen Mcadams. “A Comparison of Approaches to Timbre Descriptors in Music Information Retrieval and Music Psychology”. Journal of New Music Research 45 (Jan. 2016), pp. 1–15. doi: 10.1080/09298215.2015.1132737.
[18] Jean-Julien Aucouturier and Emmanuel Bigand. “Seven problems that keep MIR from attracting the interest of cognition and neuroscience”. Journal of Intelligent Information Systems 41 (Dec. 2013). doi: 10.1007/s10844-013-0251-x.
[19] Aaron van den Oord et al. “WaveNet: A Generative Model for Raw Audio” (Sept. 2016).
[20] Chien-Yu Lu et al. “Play as You Like: Timbre-Enhanced Multi-Modal Music Style Transfer”. Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 2019), pp. 1061–1068. doi: 10.1609/aaai.v33i01.33011061.
[21] Xun Huang et al. “Multimodal Unsupervised Image-to-Image Translation” (Apr. 2018).
[22] Lantao Yu et al. “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient” (Sept. 2016).
[23] JunYoung Gwak et al. “Weakly Supervised Generative Adversarial Networks for 3D Reconstruction” (May 2017).
[24] Yijun Li et al. “Generative Face Completion”. July 2017, pp. 5892–5900. doi: 10.1109/CVPR.2017.624.
[25] Ishaan Gulrajani et al. “Improved Training of Wasserstein GANs” (Mar. 2017).
[26] Tero Karras et al. “Progressive Growing of GANs for Improved Quality, Stability, and Variation” (Oct. 2017).
[27] Tero Karras, Samuli Laine, and Timo Aila. “A Style-Based Generator Architecture for Generative Adversarial Networks”. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (Jan. 2020), pp. 1–1. doi: 10.1109/TPAMI.2020.2970919.
[28] Phillip Isola et al. “Image-to-Image Translation with Conditional Adversarial Networks”. July 2017, pp. 5967–5976. doi: 10.1109/CVPR.2017.632.
[29] Jun-Yan Zhu et al. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks”. Oct. 2017, pp. 2242–2251. doi: 10.1109/ICCV.2017.244.
[30] Ting-Chun Wang et al. “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs”. June 2018, pp. 8798–8807. doi: 10.1109/CVPR.2018.00917.
[31] Caroline Chan et al. “Everybody Dance Now”. Oct. 2019, pp. 5932–5941. doi: 10.1109/ICCV.2019.00603.
[32] Ting-Chun Wang et al. Video-to-Video Synthesis. Aug. 2018.
[33] Chris Donahue, Julian McAuley, and Miller Puckette. “Synthesizing Audio with Generative Adversarial Networks” (Feb. 2018).
[34] Alexia Jolicoeur-Martineau. The relativistic discriminator: a key element missing from standard GAN. July 2018.
[35] Aayush Bansal et al. Recycle-GAN: Unsupervised Video Retargeting. Aug. 2018.
[36] Hsin-Ying Lee et al. “DRIT++: Diverse Image-to-Image Translation via Disentangled Representations”. International Journal of Computer Vision (Feb. 2020). doi: 10.1007/s11263-019-01284-z.
[37] Ju-chieh Chou et al. “Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations”. Sept. 2018, pp. 501–505. doi: 10.21437/Interspeech.2018-1830.