Graduate Student: 侯詩彥 Hou, Shih Yen
Thesis Title: 聲音事件偵測之特徵萃取與分類方法之測試 (Sound Event Detection Using Different Feature Extraction and Multi-label Classification Methods)
Advisor: 劉奕汶 Liu, Yi Wen
Oral Examination Committee: 白明憲 Bai, Ming Sian; 李夢麟 Lee, Meng Lin; 李祈均 Lee, Chi Chun
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2017
Academic Year of Graduation: 105 (ROC calendar)
Language: Chinese
Number of Pages: 52
Chinese Keywords: 聲音事件偵測 (sound event detection)
English Keywords: Sound Event Detection
In some scenes, such as a street, many different events occur at the same time. The human ability to detect and distinguish these events by listening is called auditory scene analysis (ASA), and research that aims to give computers this ability is called computational auditory scene analysis (CASA). Sound event detection belongs to CASA-related research and refers specifically to the task of converting the acoustic signal of a scene into concrete descriptions of the events it contains. Potential applications of this technology include home security and healthcare. Following the pattern recognition approach, the acoustic signal is first converted into acoustic features through feature extraction, and machine learning algorithms are then used to train the corresponding event models. Because the database contains recordings in which several events overlap, polyphonic sound event detection is required. Compared with monophonic sound event detection, which outputs only the most prominent event at each time instant, polyphonic sound event detection is more complicated and can be viewed as a multi-label classification problem. This thesis uses the polyphonic sound event detection database (TUT Sound Events 2016) released for the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge, together with its baseline system. The baseline system uses mel-frequency cepstral coefficients (MFCC) for feature extraction and Gaussian mixture models (GMM) for multi-label classification. This thesis adds another feature extraction method based on human hearing and multi-label classification methods based on deep neural networks, attempting to improve the system's performance in terms of error rate and F-score. After trying different combinations, the best-performing method lowers the overall error rate by about 0.04 and raises the F-score by about 6.6% compared with the baseline system.
In some scenes, multiple events occur simultaneously. The human ability to detect these events and analyse such scenes by listening is called auditory scene analysis, and the study of giving computers this ability is called computational auditory scene analysis. Sound event detection is a topic related to computational auditory scene analysis that focuses on converting an acoustic signal into concrete descriptions of its corresponding sound events. This technology can be used in many applications, such as home security and healthcare. Through methods developed in pattern recognition, the acoustic signal is first turned into feature vectors, and learning methods are then applied to train models with these feature vectors and their corresponding event labels. Since the data used here were recorded in environments with multiple sound sources, polyphonic sound event detection is required so that the system can detect multiple events at the same time. Compared to monophonic sound event detection, polyphonic sound event detection is more complicated and can be viewed as multi-label classification.
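To make the multi-label framing concrete, the sketch below (a minimal illustration, not code from the thesis) turns event annotations into a frame-level binary target matrix; frames where events overlap simply carry more than one positive label. The annotation tuple format, class names, and frame hop used here are assumptions for illustration only.

```python
import numpy as np

def make_frame_targets(annotations, n_frames, hop_s, event_classes):
    """Binary activity matrix of shape (n_frames, n_classes).

    annotations: list of (onset_s, offset_s, label) tuples -- an assumed
    layout for illustration; the actual DCASE metadata files differ.
    """
    targets = np.zeros((n_frames, len(event_classes)), dtype=np.float32)
    class_index = {name: k for k, name in enumerate(event_classes)}
    for onset, offset, label in annotations:
        start = int(np.floor(onset / hop_s))
        stop = min(n_frames, int(np.ceil(offset / hop_s)))
        targets[start:stop, class_index[label]] = 1.0  # event active in these frames
    return targets

# Two overlapping events: some frames end up with two positive labels.
classes = ["car passing by", "people speaking"]
annotations = [(0.0, 2.0, "car passing by"), (1.0, 3.0, "people speaking")]
Y = make_frame_targets(annotations, n_frames=150, hop_s=0.02, event_classes=classes)
print((Y.sum(axis=1) > 1).sum())  # number of frames with more than one active event
```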
This research used the TUT Sound Events 2016 database, published for the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge. The baseline system of the database uses mel-frequency cepstral coefficients for feature extraction and Gaussian mixture models for multi-label classification. This thesis tries to improve the performance by introducing another feature extraction method based on a human auditory model and multi-label classification methods based on deep neural networks. After trying different combinations of feature extraction and multi-label classification methods, the proposed method reduces the error rate by about 0.04 and increases the F-score by about 6.6% compared to the baseline.
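As a rough sketch of the kind of pipeline described above, the code below extracts MFCC features and trains a deep neural network with sigmoid outputs and binary cross-entropy loss, so that each frame can be assigned several event labels independently. The use of librosa and Keras, the layer sizes, the frame settings, and the 0.5 decision threshold are all assumptions for illustration; they are not the thesis's exact configuration, its auditory-model features, or the GMM baseline.

```python
import numpy as np
import librosa
from tensorflow import keras

def extract_mfcc(path, sr=44100, n_mfcc=20, n_fft=2048, hop=1024):
    """Per-frame MFCC features (frames x n_mfcc); frame sizes are illustrative."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T

def build_multilabel_dnn(n_features, n_classes):
    """Sigmoid outputs give one independent activity probability per class,
    so several events can be detected in the same frame."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(n_classes, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Usage (X: frames x features, Y: frames x classes from annotated training data):
# model = build_multilabel_dnn(X.shape[1], Y.shape[1])
# model.fit(X, Y, epochs=50, batch_size=256)
# detections = model.predict(X_test) > 0.5  # per-class threshold, multiple events allowed
```

For context, the error rate and F-score mentioned in the abstract are, in the DCASE 2016 evaluation, segment-based scores of roughly the form ER = (S + D + I) / N and F = 2TP / (2TP + FP + FN), where S, D, and I count substitutions, deletions, and insertions and N is the number of active reference events; the sketch above does not include that scoring step.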