Graduate Student: 廖俊祺 (Liao, Jyun-Ci)
Thesis Title: 基於調制頻譜向量之環境聲響事件分類 (Environmental Sound Event Classification Based on Modulation Spectral Vectors)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Oral Defense Committee: 黃元豪 (Huang, Yuan-Hao), 黃朝宗 (Huang, Chao-Tsung)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2017
Graduation Academic Year: 105 (ROC calendar, 2016-17)
Language: Chinese
Pages: 48
Keywords: 調制頻譜向量 (modulation spectral vectors), 噪音訓練 (noisy training), 環境聲響事件 (environmental sound event), 高斯混合模型 (Gaussian mixture model, GMM)
The Gaussian mixture model (GMM) is well established in speech and sound recognition systems, but its performance degrades sharply under strong environmental background noise. This thesis proposes combining short-term and long-term feature vectors to improve recognition accuracy in heavy background noise. The short-term features are Mel-frequency cepstral coefficients (MFCCs); the long-term features are modulation spectral vectors (MSVs), which capture the energy envelope of the signal in the frequency domain. Because this envelope is largely preserved when noise is added, MSVs remain informative where short-term spectra are corrupted.
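To make the long-term feature concrete, below is a minimal sketch of the general modulation-spectrum idea, assuming NumPy/SciPy; the window length, modulation resolution, and band pooling are illustrative assumptions, not the thesis's exact MSV definition:

```python
import numpy as np
from scipy.signal import stft

def modulation_spectral_vector(x, fs, n_mod=64):
    """Long-term feature sketch: a second FFT over the spectrogram envelope."""
    # STFT magnitude: rows are acoustic-frequency bands, columns are time frames.
    # nperseg/n_mod are illustrative; the thesis's exact settings may differ.
    _, _, Z = stft(x, fs=fs, nperseg=512, noverlap=256)
    env = np.abs(Z)
    # Remove each band's mean so slow envelope fluctuations dominate the result.
    env = env - env.mean(axis=1, keepdims=True)
    # Second FFT along time: the modulation spectrum of each band's envelope.
    mod = np.abs(np.fft.rfft(env, n=n_mod, axis=1))
    # Average across acoustic bands to obtain one fixed-length vector per clip.
    return mod.mean(axis=0)
```

Averaging across bands yields a fixed-length vector regardless of clip duration, which makes the feature convenient as GMM input alongside frame-level MFCCs.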
To make the system more robust to noise, this thesis also proposes a noisy-training scheme: the GMMs see noise-corrupted data during training, which raises recognition accuracy at low signal-to-noise ratios (SNRs). Evaluated on a database of eight indoor environmental sound event classes, the method achieves over 80% accuracy at 0 dB SNR.
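A minimal sketch of the noisy-training and classification pipeline, assuming NumPy and scikit-learn; the mix_at_snr helper, the mixture size, and the one-GMM-per-class layout are illustrative assumptions rather than the thesis's exact configuration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) == snr_db, then add it."""
    noise = noise[:len(clean)]
    gain = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10**(snr_db / 10.0)))
    return clean + gain * noise

def train_gmms(features_by_class, n_components=8):
    """Fit one GMM per sound-event class on feature rows from noisy training clips."""
    # n_components=8 is an illustrative choice, not taken from the thesis.
    return {label: GaussianMixture(n_components=n_components).fit(np.vstack(feats))
            for label, feats in features_by_class.items()}

def classify(gmms, feats):
    """Assign the class whose GMM gives the highest mean log-likelihood."""
    return max(gmms, key=lambda label: gmms[label].score(feats))
```

A clip mixed at, say, 0 dB SNR contributes feature rows to its class's training set, so each class model already accounts for noise by the time it scores test data.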