| Field | Value |
|---|---|
| Author | Lee, Shiang (李享) |
| Thesis Title | Using a Long Range U-Net to Deal with Overtones and Phase Restoration in Singing Voice Separation Problems (使用超長程關聯卷積來處理泛音與相位回復以解決歌聲分離問題) |
| Advisor | Soo, Von-Wun (蘇豐文) |
| Committee | Shen, Chih-Ya (沈之涯); Chiu, Ching-Te (邱瀞德) |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication | 2020 |
| Academic Year | 108 |
| Language | English |
| Pages | 34 |
| Keywords (Chinese) | singing voice separation, convolution, neural networks, phase restoration |
| Keywords (English) | singing voice separation, convolutional layers, deep learning, phase reconstruction |
Abstract (Chinese, translated): Audio source separation is a challenging research direction that has not only attracted many researchers to develop related techniques, but has also motivated the Signal Separation Evaluation Campaign (SiSEC), held almost annually since 2008. After examining the 2018 participants and the organizers' analysis report, we found that most competitors, lacking a data compression technique that is both effective and efficient, discarded the higher-frequency content and the spectrogram phase in their training data. Motivated by this observation, we developed a new deep learning model, OvertoneNet (OveNet), which introduces two new techniques: frequency 1x1 convolution layers (F1x1 convolution layers) and complex-spectrogram channels. These let us process full 44.1 kHz (high-resolution) audio and exploit the overtone relations that pervade music to improve both the efficiency and the effectiveness of training, an advantage that other models cannot realize. Our experimental results show that our separation quality surpasses all SiSEC 2018 competitors in both objective and subjective evaluations, demonstrating the effectiveness of our approach.
Abstract (English): Audio source separation is a challenging topic that attracted various research teams to attend the Signal Separation Evaluation Campaign (SiSEC) in 2018. Most top-ranked competitors based on deep learning methods ignored the higher-frequency harmonics and the phase information due to the lack of an efficient data compression method. We propose a new deep learning model named OvertoneNet (OveNet) that adopts two novel concepts, frequency 1x1 convolution layers and complex-spectrogram channels, to handle 44.1 kHz audio signals (Hi-Res audio signals) with a wide range of overtones. The results of our experiment show that OveNet performs well in both objective and subjective evaluations of interference using limited training data from SiSEC 2018.
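The two concepts named in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction under assumptions (STFT size, hop, and layer widths are illustrative, not taken from the thesis): complex-spectrogram channels keep the real and imaginary STFT parts as separate input channels so phase is not discarded, and a "frequency 1x1" layer moves the frequency axis into the channel dimension so one 1x1 kernel can mix every frequency bin at each time frame, letting a fundamental interact with its distant integer-multiple overtones in a single layer.

```python
import numpy as np

def complex_spectrogram(x, n_fft=2048, hop=512):
    """Return the STFT as two channels (real, imag) instead of a
    magnitude-only spectrogram, so phase information is preserved.
    STFT parameters here are illustrative assumptions."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1)  # (frames, bins)
    return np.stack([spec.real, spec.imag])                 # (2, frames, bins)

def f1x1_over_frequency(channels, weight):
    """Hypothetical frequency 1x1 convolution: fold the frequency axis
    into the channel dimension, then apply a 1x1 kernel per time frame.
    Every output feature sees ALL frequency bins at once, so overtone
    relations across the full 44.1 kHz band are directly learnable."""
    c, t, f = channels.shape
    x = channels.transpose(1, 0, 2).reshape(t, c * f)  # (time, channels*bins)
    return x @ weight.T                                # (time, out_features)

audio = np.random.randn(44100)           # one second at 44.1 kHz
spec = complex_spectrogram(audio)        # (2, 83, 1025)
w = np.random.randn(512, 2 * 1025) * 0.01
out = f1x1_over_frequency(spec, w)
print(out.shape)                         # (83, 512)
```

In contrast, an ordinary small 2-D kernel sliding over the frequency axis only relates neighboring bins, which is why harmonics far above the fundamental are hard for it to capture.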