
Author: 黃仲逸 (Huang, Jong-Yi)
Title: 基於深層類神經網路的人工智慧聲音事件分類
Artificial intelligence (AI) sound event classification based on deep neural networks (DNNs)
Advisor: 白明憲 (Bai, Mingsian R.)
Committee members: 丁川康 (Ting, Chuan-Kang); 洪健中 (Hong, Chien-Chong)
Degree: Master
Department: College of Engineering, Department of Power Mechanical Engineering
Year of publication: 2018
Academic year of graduation: 107
Language: English
Pages: 53
Keywords (Chinese): 深度學習 (deep learning); 居家照護 (home healthcare); 事件音分類 (sound event classification)
Keywords (English): Deep learning; Home healthcare; Sound event classification
    Breakthroughs in parallel-computing hardware and the long-term accumulation of large data sets have driven the current rise of deep learning, yet sound event recognition is still at an early stage of development. This thesis proposes a deep-learning architecture for recognizing sound events such as coughing, glass breaking, and falls of the elderly. Although sound event recognition is still maturing, it can borrow processing techniques from the mature field of speech recognition; within deep learning, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have achieved excellent results in speech recognition and image recognition, respectively, so we build the sound event recognizer on CNN- and RNN-based architectures. A drawback of deep learning is that its complex structure imposes a heavy computational load and hence very long computation times; this thesis addresses the problem with GPU hardware, whose parallel architecture shortens the lengthy computations. Besides computation time, the massive data sets required for training are another major issue, so this thesis adopts audio-modulation techniques as a data-augmentation strategy and investigates the effect of this strategy on network training.
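
    A minimal sketch of the kind of audio-modulation augmentation described above, assuming librosa-style pitch-shift, time-stretch, and additive-noise transforms; the shift amounts, stretch rates, and noise level here are illustrative assumptions, not the values used in the thesis.

        import numpy as np
        import librosa

        def augment(path, sr=16000):
            """Return the original clip plus several modulated variants."""
            y, sr = librosa.load(path, sr=sr)   # load and resample the clip
            variants = [y]
            # Pitch shifts of +/- 2 semitones (illustrative amounts)
            variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=2))
            variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=-2))
            # Tempo perturbations of +/- 10% (illustrative rates)
            variants.append(librosa.effects.time_stretch(y, rate=1.1))
            variants.append(librosa.effects.time_stretch(y, rate=0.9))
            # Low-level additive Gaussian noise (illustrative level)
            variants.append(y + 0.005 * np.random.randn(len(y)))
            return variants

    Each transform yields a new labeled example from an existing one, so a small recorded set can be expanded several-fold before feature extraction.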


    In this study, we explore the sound event classification (SEC) task with several neural-network models: fully connected neural networks (FCNNs), deep CNNs, long short-term memory (LSTM) networks, a combination of CNN and LSTM, convolutional LSTM (ConvLSTM), and a combination of CNN and ConvLSTM. The networks take Mel-spectrograms or 128-dimensional features as input. Because neural-network methods require a sufficient amount of data, we present the procedure for establishing the data set, and we apply data-augmentation techniques during training to compensate for the scarcity of labeled data. Network structures with various combinations of CNN, LSTM, and FCNN layers are compared. The results demonstrate that, given a small training set, data augmentation is effective for DNNs such as the combination of five convolution layers, two ConvLSTM layers, and one fully connected layer.
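
    A minimal Keras sketch of a network in the family just described (five convolution layers, two ConvLSTM layers, one fully connected output layer) operating on log-Mel-spectrogram patches; the filter counts, input patch size (128 Mel bands by 96 frames), and ten-class output are illustrative assumptions, not the exact configuration reported in the thesis.

        from tensorflow.keras import layers, models

        n_mels, n_frames, n_classes = 128, 96, 10   # assumed input/output sizes

        inp = layers.Input(shape=(n_mels, n_frames, 1))
        x = inp
        # Five convolution blocks; filter counts are illustrative
        for filters in (16, 32, 64, 64, 128):
            x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
            x = layers.MaxPooling2D((2, 2))(x)
        # After five poolings the map is (4, 3, 128); put the reduced time
        # axis first and add a channel axis to form the 5-D ConvLSTM2D input
        x = layers.Permute((2, 1, 3))(x)       # -> (time=3, mel=4, ch=128)
        x = layers.Reshape((3, 4, 128, 1))(x)  # -> (time, rows, cols, channels)
        x = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(x)
        x = layers.ConvLSTM2D(32, (3, 3), padding="same")(x)
        x = layers.Flatten()(x)
        out = layers.Dense(n_classes, activation="softmax")(x)  # one FC layer

        model = models.Model(inp, out)
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])

    The convolution stack extracts local time-frequency features, while the ConvLSTM layers model how those feature maps evolve over the (downsampled) time axis before the fully connected layer produces class posteriors.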

    TABLE OF CONTENTS
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Chapter 1 INTRODUCTION
    Chapter 2 DEEP NEURAL NETWORKS
      2.1 The deep neural network architecture
      2.2 Parameter estimation with error backpropagation
        2.2.1 Training criteria
        2.2.2 Training algorithms
      2.3 Convolutional neural networks
      2.4 Long short-term memory
      2.5 Convolutional LSTM
    Chapter 3 DATA SET
      3.1 AudioSet videos and features
      3.2 Recording event sounds and setting up strongly labeled audios
      3.3 Data augmentation
    Chapter 4 SIMULATION
      4.1 Simulation
      4.2 TOA localization with DNN for regression problem
      4.3 Isolated word recognition
        A. Audio preprocessing
        B. Neural network structure
        C. Performance
      4.4 SEC tasks
        A. Audio preprocessing
        B. Neural network structure
        C. Training and evaluation
    Chapter 5 CONCLUSIONS AND FUTURE WORK

