
Author: Jorge Andre de Carvalho Coelho (卡橋安)
Title: Music Structural Segmentation from Audio Signals using CNN Bidirectional LSTM (音訊的音樂結構區隔)
Advisor: Soo, Von-Wun (蘇豐文)
Committee: Shen, Chih-Ya (沈之涯); Huang, Zhi-Fang (黃志方)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Year of publication: 2019
Graduation academic year: 107
Language: English
Pages: 115
Chinese keywords: 音樂結構 (music structure)
English keywords: Music Structural Segmentation
  • In this thesis, we segment a piece of music into its structural components from its audio signals, and design a deep learning architecture called CNN Bidirectional LSTM, which combines convolutional neural networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to detect structural boundaries in music.

    In our experiments, the input music audio signal is converted into one spectrogram and two self-similarity matrices (SSM), which are then classified by the deep neural network.

    We also use Chroma Energy Normalized Statistics (CENS), a method that maps pitch and timbre into a feature representation, and obtain better precision and recall than previous work, with F1-score improvements of 11.2% and 6.58% at tolerances of ±0.5 and ±3 seconds, respectively.


    In this paper, we investigate the problem of segmenting a piece of music into its structural components from its audio signals.
    We devise a deep learning neural network architecture called the CNN Bidirectional LSTM model, which combines convolutional neural networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks to perform music boundary detection.
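The abstract does not spell out the recurrence itself; as an illustration of the bidirectional pass at the core of such a model, here is a minimal NumPy sketch of a BiLSTM layer. All dimensions, weight shapes, and the random initialization are hypothetical placeholders, not the thesis's actual configuration, which would be built in a deep learning framework:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b):
    """Run a single-direction LSTM over a (T, d_in) sequence.

    W: (4*h, d_in) input weights, U: (4*h, h) recurrent weights,
    b: (4*h,) biases; gates ordered [input, forget, cell, output].
    Returns the (T, h) sequence of hidden states.
    """
    h_dim = U.shape[1]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    out = []
    for t in range(x.shape[0]):
        z = W @ x[t] + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)   # cell state update
        h = o * np.tanh(c)           # hidden state
        out.append(h)
    return np.stack(out)

def bilstm(x, params_fwd, params_bwd):
    """Concatenate a forward pass and a time-reversed backward pass."""
    fwd = lstm_pass(x, *params_fwd)
    bwd = lstm_pass(x[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2*h)

rng = np.random.default_rng(0)
d_in, h_dim, T = 8, 16, 20
make = lambda: (rng.normal(size=(4 * h_dim, d_in)) * 0.1,
                rng.normal(size=(4 * h_dim, h_dim)) * 0.1,
                np.zeros(4 * h_dim))
x = rng.normal(size=(T, d_in))
y = bilstm(x, make(), make())
print(y.shape)  # (20, 32)
```

The backward pass sees the sequence reversed and is re-reversed before concatenation, so each output frame carries context from both past and future frames.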

    The music audio input to the model is first converted into one spectrogram and two self-similarity matrices (SSMs) that can be classified by the deep neural network.
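A self-similarity matrix compares every pair of feature frames of a track with itself. A minimal sketch using cosine similarity over an arbitrary feature matrix (the thesis builds its SSMs from MFCC and CENS features; the random frames here are placeholders):

```python
import numpy as np

def self_similarity_matrix(features):
    """Cosine self-similarity of a (num_frames, dim) feature matrix.

    Entry (i, j) is the cosine similarity between frames i and j,
    so the matrix is symmetric with ones on the diagonal.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero frames
    return unit @ unit.T

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 20))   # e.g. 100 frames of 20-d MFCCs
ssm = self_similarity_matrix(frames)
print(ssm.shape)  # (100, 100)
```

Repeated sections show up as off-diagonal stripes in such a matrix, which is what makes it a useful input image for a convolutional front end.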
    We also propose the use of Chroma Energy Normalized Statistics (CENS) for this task, and show the resulting improvements over previous work with respect to precision and recall.
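CENS features smooth and quantize a chromagram so that harmonic progressions dominate over local timbre and dynamics. A rough sketch of that pipeline, with thresholds and window length following the commonly cited CENS recipe but not necessarily the exact parameters used in the thesis:

```python
import numpy as np

def cens(chroma, win=41, downsample=10):
    """Chroma Energy Normalized Statistics from a (12, T) chromagram.

    Steps: L1-normalize each frame, quantize with coarse thresholds,
    smooth each pitch-class band over time, downsample, L2-normalize.
    """
    # 1. L1 normalization per frame
    chroma = chroma / np.maximum(chroma.sum(axis=0, keepdims=True), 1e-12)
    # 2. quantization into five coarse levels
    quant = np.zeros_like(chroma)
    for th in (0.05, 0.1, 0.2, 0.4):
        quant += (chroma >= th)
    # 3. temporal smoothing with a Hann window per pitch class
    window = np.hanning(win)
    smooth = np.array([np.convolve(q, window, mode="same") for q in quant])
    # 4. downsampling and L2 normalization per frame
    feat = smooth[:, ::downsample]
    return feat / np.maximum(np.linalg.norm(feat, axis=0, keepdims=True), 1e-12)

rng = np.random.default_rng(2)
chroma = np.abs(rng.normal(size=(12, 500)))  # placeholder chromagram
features = cens(chroma)
print(features.shape)  # (12, 50)
```

The smoothing and downsampling make the representation robust to short-time variations, which is why CENS-based SSMs highlight section-level repetition.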

    We verified improvements of 11.2% and 6.58% in F1-score at ±0.5 seconds and ±3 seconds tolerance, respectively.
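Boundary detection is typically scored by matching each estimated boundary to at most one annotated boundary within a tolerance window. A small sketch of such a hit-rate F1 computation, using a simple greedy one-to-one matching (the exact matching procedure used in the thesis's evaluation may differ):

```python
def boundary_f1(estimated, annotated, tolerance=0.5):
    """Precision/recall/F1 for boundary times (seconds) under a tolerance.

    Each annotated boundary may be matched by at most one estimate lying
    within `tolerance` seconds; matching is greedy in time order.
    """
    unmatched = sorted(annotated)
    hits = 0
    for est in sorted(estimated):
        for ann in unmatched:
            if abs(est - ann) <= tolerance:
                unmatched.remove(ann)  # each annotation matched once
                hits += 1
                break
    precision = hits / len(estimated) if estimated else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# 10.0 s misses the 9.0 s annotation at ±0.5 s, so 2 of 3 boundaries hit.
p, r, f1 = boundary_f1([1.0, 5.2, 10.0], [1.2, 5.0, 9.0], tolerance=0.5)
print(round(f1, 3))  # 0.667
```

Widening the tolerance to ±3 seconds would also accept the third boundary, which is why scores at the looser tolerance are always at least as high.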

    Abstract
    Acknowledgement
    List of Tables
    List of Figures
    1 Introduction
      1.1 Objective
      1.2 Contributions
      1.3 Organization of the paper
    2 Theoretical Background
      2.1 Audio Signal Processing
        2.1.1 From Waveform to Spectrogram Representation
        2.1.2 Loudness and Pitch
        2.1.3 Chromagram Representation
        2.1.4 Mel-Frequency Cepstral Coefficients
      2.2 Deep Learning
        2.2.1 Deep Feedforward Networks
        2.2.2 Convolutional Neural Networks
        2.2.3 Long Short-Term Memory
        2.2.4 Bidirectional LSTM Networks
        2.2.5 Optimization
    3 Related Work
      3.1 Feature Extraction
        3.1.1 Timbre Features
        3.1.2 Pitch-related Features
        3.1.3 Rhythmic Features
      3.2 Structural Segmentation Types of Approaches
        3.2.1 Novelty-based Approaches
        3.2.2 Homogeneity-based Approaches
        3.2.3 Repetition-based Approaches
        3.2.4 Deep Learning Approaches
    4 System Overview
    5 Methodology
      5.1 Feature Extraction
        5.1.1 Computation of Mel Log Spectrogram
        5.1.2 Computation of SSM-MFCC
        5.1.3 Computation of SSM-CENS
        5.1.4 Input Representations
      5.2 Deep Neural Network Architecture
        5.2.1 CNN Bidirectional LSTM
        5.2.2 Training Methodology
      5.3 Post-Processing
    6 Experiments and Results
      6.1 Dataset
      6.2 Performance Measures
      6.3 Initial Setup Experiments
        6.3.1 Evaluation of Different Target Smearing Parameters
        6.3.2 Evaluation of Undersampling Parameters
        6.3.3 Evaluation of Different Chroma Features
        6.3.4 Evaluation of Different Peak Picking Parameters
      6.4 Experiments with Deep Neural Network Architectures
        6.4.1 Network Architectures
        6.4.2 Experiment Results
      6.5 Further Results
    7 Conclusion and Future Work
      7.1 Conclusion
      7.2 Future Work
    References
    A Table of Target Smearing Parameters
    B One Hundred Song Results

