
Graduate Student: Wang, Chun-Hao (王君豪)
Thesis Title: Sound Event Detection Based on Partitioned Autoencoder and Convolutional Recurrent Neural Network (基於分離自動編碼器及卷積遞歸神經網路之聲音事件偵測)
Advisor: Liu, Yi-Wen (劉奕汶)
Oral Defense Committee: Tsao, Yu (曹昱); Huang, Chao-Tsung (黃朝宗); Lin, Shou-De (林守德)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 107 (ROC calendar)
Language: English
Pages: 67
Keywords: Sound Event Detection, Partitioned Autoencoder, Convolutional Recurrent Neural Network
Abstract:

In this thesis, a noise reduction process is combined with a sound event detection (SED) system and evaluated on the DCASE2017 TUT Sound Events 2017 dataset [1], which contains six sound event classes in a total of 32 audio recordings (24 recordings in the development set and 8 in the evaluation set). The task is polyphonic, and the trained SED model has to detect each sound event together with its onset and offset times. The purpose of the noise reduction step is to examine whether denoising is helpful in training the SED model. A partitioned autoencoder [2] is adopted for noise reduction, and the SED system is a convolutional recurrent neural network (CRNN) [3][4], the architecture that won first place in the "sound event detection in real life audio" task of DCASE2017. Three input features to the CRNN are compared: the original log mel-band energies, the denoised log mel-band energies, and augmented log mel-band energies that combine the two. The results show that the SED models trained on the denoised features perform better on some sound event classes, exhibiting lower medians or more favorable distributions of the testing error rate. Furthermore, the models trained on the denoised features achieve best testing error rates of 0.622 on the development set and 0.744 on the evaluation set. The overall testing error rate can be reduced further by selecting, for each sound event class, the best-performing model among the three feature types.
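The denoising stage described above can be made concrete. The following is a minimal sketch, in PyTorch (which the thesis cites [32]), of the partitioned-autoencoder idea of [2]: the latent code is split into a signal partition and a noise partition, frames known to contain only background noise penalize activity in the signal partition, and a denoised reconstruction is obtained by decoding with the noise partition zeroed out, so no clean reference audio is ever needed. The layer sizes, the partition split, and the penalty weight lam are illustrative assumptions, not the thesis's settings.

```python
# A minimal sketch of a partitioned autoencoder in the sense of [2].
# All sizes below are illustrative assumptions, not the thesis's values.
import torch
import torch.nn as nn

class PartitionedAE(nn.Module):
    def __init__(self, n_mels=40, latent=32, signal_dims=24):
        super().__init__()
        self.signal_dims = signal_dims          # units 0..23: signal partition
        self.encoder = nn.Sequential(nn.Linear(n_mels, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, n_mels)

    def forward(self, x, denoise=False):
        z = self.encoder(x)                     # x: (batch, n_mels) frames
        if denoise:                             # zero out the noise partition
            z = torch.cat([z[:, :self.signal_dims],
                           torch.zeros_like(z[:, self.signal_dims:])], dim=1)
        return self.decoder(z), z

def partitioned_loss(model, x, is_noise_only, lam=0.1):
    """Reconstruction error plus a penalty that keeps the signal
    partition inactive on frames labeled as noise-only."""
    recon, z = model(x)
    loss = ((recon - x) ** 2).mean()
    if is_noise_only.any():
        loss = loss + lam * (z[is_noise_only, :model.signal_dims] ** 2).mean()
    return loss
```

At test time, model(x, denoise=True) reconstructs each frame from the signal partition alone; its output plays the role of the denoised log mel-band energies mentioned in the abstract.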
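Likewise, for the detection stage, the sketch below shows how log mel-band energy features might be computed with librosa [31] and passed to a CRNN in the spirit of [3][4]: the convolutional layers pool along the frequency axis only, so the bidirectional GRU sees one feature vector per frame, and sigmoid outputs give frame-wise class activities from which onsets and offsets can be read off by thresholding. The mel-band count, frame parameters, filter counts, and GRU width are assumptions for illustration rather than the thesis's hyperparameters.

```python
# A hedged sketch of the feature extraction and CRNN stages, assuming
# 40 mel bands, a 2048-sample FFT, and small layer sizes; none of these
# values are taken from the thesis itself.
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_band_energies(path, sr=44100, n_fft=2048, hop=1024, n_mels=40):
    """Load one recording and return log mel-band energies, (frames, n_mels)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-10).T

class CRNN(nn.Module):
    """Conv blocks pool along frequency only, preserving time resolution;
    sigmoid outputs give per-frame, per-class activity probabilities."""
    def __init__(self, n_mels=40, n_classes=6, n_filters=64, gru_units=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_filters, 3, padding=1),
            nn.BatchNorm2d(n_filters), nn.ReLU(),
            nn.MaxPool2d((1, 5)),            # pool (time, freq): 40 -> 8 bins
            nn.Conv2d(n_filters, n_filters, 3, padding=1),
            nn.BatchNorm2d(n_filters), nn.ReLU(),
            nn.MaxPool2d((1, 4)),            # 8 -> 2 frequency bins
        )
        self.gru = nn.GRU(n_filters * (n_mels // 5 // 4), gru_units,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * gru_units, n_classes)

    def forward(self, x):                    # x: (batch, frames, n_mels)
        x = self.conv(x.unsqueeze(1))        # -> (batch, C, frames, 2)
        x = x.permute(0, 2, 1, 3).flatten(2) # -> (batch, frames, 2*C)
        x, _ = self.gru(x)
        return torch.sigmoid(self.fc(x))     # -> (batch, frames, n_classes)
```

Such a model would be trained with frame-wise binary cross-entropy (nn.BCELoss), matching Section 2.3.10 of the table of contents, and scored with the segment-based error rate of [33], ER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions against N reference event-active segments.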

Table of Contents:

Abstract (Chinese)
Abstract
Acknowledgements
Contents
1 Introduction
  1.1 Overview
  1.2 Related Works
  1.3 Task Description
2 Methods
  2.1 Feature
    2.1.1 Log Spectrogram
    2.1.2 Log Mel-Band Energy
  2.2 Noise Reduction
    2.2.1 Partitioned Autoencoder
      2.2.1.1 Autoencoder
      2.2.1.2 Partitioned Autoencoder
  2.3 Neural Network Techniques
    2.3.1 Fully-Connected Layer (FC Layer)
    2.3.2 Convolutional Layer (Conv Layer)
    2.3.3 Gated Recurrent Unit Layer (GRU Layer)
    2.3.4 Batch Normalization (BN)
    2.3.5 Dropout
    2.3.6 Sigmoid
    2.3.7 Hyperbolic Tangent (Tanh)
    2.3.8 Rectified Linear Unit (ReLU)
    2.3.9 Leaky Rectified Linear Unit (Leaky ReLU)
    2.3.10 Binary Cross Entropy Loss
    2.3.11 Early Stopping
3 Experiments and Results
  3.1 Dataset
  3.2 Evaluation Metrics
    3.2.1 Segment-Based Metrics
      3.2.1.1 Error Rate
      3.2.1.2 F-Score
  3.3 Validation Setup
    3.3.1 The Amount-Based Files Splitting Method
    3.3.2 The Duration-Based Files Splitting Method
  3.4 Partitioned Autoencoder
    3.4.1 Data Preprocessing
    3.4.2 Denoising Model Architecture
      3.4.2.1 Standard Autoencoder
      3.4.2.2 Partitioned Autoencoder
  3.5 Convolutional Recurrent Neural Network (CRNN)
    3.5.1 Data Preprocessing
    3.5.2 Development Set
    3.5.3 Evaluation Set
4 Discussions and Conclusions
  4.1 Discussions
  4.2 Conclusions
5 Future Works
  5.1 Future Works
References
Appendix
  A.1 Suggestions From The Oral Defense Committees

    [1]  A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in 24th European Signal Processing Conference (EUSIPCO 2016), Budapest, Hungary, 2016. 

    [2]  D. Stowell and R. E. Turner, “Denoising without access to clean data using a partitioned autoencoder,” arXiv preprint arXiv:1509.05982, 2015. 

    [3]  S. Adavanne, P. Pertilä, and T. Virtanen, “Sound event detection using spatial features and convolutional recurrent neural network,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 771–775, March 2017. 

    [4]  S. Adavanne and T. Virtanen, “A report on sound event detection with different binaural features,” tech. rep., DCASE2017 Challenge, September 2017. 

    [5]  X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1759–1763, 2014. 

    [6]  X. Lu, S. Matsuda, C. Hori, and H. Kashioka, “Speech restoration based on deep learning autoencoder with layer-wised pretraining,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012. 

    [7]  X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Interspeech, pp. 436–440, 2013. 

    [8]  I. Sobieraj and M. Plumbley, “Coupled sparse NMF vs. random forest classification for real life acoustic event detection,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), pp. 90–94, 2016. 

    [9]  P. Giannoulis, G. Potamianos, P. Maragos, and A. Katsamanis, “Improved dictionary selection and detection schemes in sparse-CNMF-based overlapping acoustic event detection,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2016. 

    [10]  S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, “Sound event detection in multichannel audio using spatial and harmonic features,” tech. rep., DCASE2016 Challenge, September 2016. 

    [11]  S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al., “The HTK book,” Cambridge University Engineering Department, vol. 3, p. 175, 2002. 

    [12]  M. Slaney, “Auditory toolbox,” tech. rep., Interval Research Corporation, 1998. 

    [13]  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” tech. rep., California Univ San Diego La Jolla Inst for Cognitive Science, 1985. 

    [14]  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006. 

    [15]  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010. 

    [16]  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. 

    [17]  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. 

    [18]  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012. 

    [19]  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, pp. 3111–3119, 2013. 

    [20]  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. 

    [21]  R. N. Bracewell, The Fourier Transform and Its Applications. New York: McGraw-Hill, 1986. 

    [22]  F. J. Pineda, “Generalization of back-propagation to recurrent neural networks,” Physical Review Letters, vol. 59, no. 19, p. 2229, 1987. 

    [23]  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. 

    [24]  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014. 

    [25]  M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997. 

    [26]  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015. 

    [27]  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012. 

    [28]  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010. 

    [29]  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, p. 3, 2013. 

    [30]  N. Morgan and H. Bourlard, “Generalization and parameter estimation in feedforward nets: Some experiments,” in Advances in Neural Information Processing Systems, pp. 630–637, 1990. 

    [31]  B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th Python in Science Conference, pp. 18–25, 2015. 

    [32]  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS 2017 Autodiff Workshop, 2017. 

    [33]  A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016. 

    [34]  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. 

    [35]  I.-Y. Jeong, S. Lee, Y. Han, and K. Lee, “Audio event detection using multiple-input convolutional neural network,” tech. rep., DCASE2017 Challenge, September 2017. 

    [36]  T. Heittola and A. Mesaros, “DCASE 2017 challenge setup: Tasks, datasets and baseline system,” tech. rep., DCASE2017 Challenge, September 2017. 

    [37]  Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985. 

    [38]  Y.-H. Lai, C.-H. Wang, S.-Y. Hou, B.-Y. Chen, Y. Tsao, and Y.-W. Liu, “DCASE report for task 3: Sound event detection in real life audio,” tech. rep., DCASE2016 Challenge, September 2016.
