
Student: 廖俊祺 (Liao, Jyun-Ci)
Thesis Title: Environmental Sound Event Classification Based on Modulation Spectral Vectors (基於調制頻譜向量之環境聲響事件分類)
Advisor: 劉奕汶 (Liu, Yi-Wen)
Committee Members: 黃元豪 (Huang, Yuan-Hao); 黃朝宗 (Huang, Chao-Tsung)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Publication Year: 2017
Graduation Academic Year: 105
Language: Chinese
Pages: 48
Keywords: modulation spectral vectors, noisy training, environmental sound events, Gaussian mixture model (GMM)
  • The Gaussian mixture model (GMM) is well established in speech and sound recognition systems, but its recognition performance drops sharply under heavy environmental background noise. This thesis proposes combining short-term and long-term feature vectors to raise the recognition rate in highly noisy environments. The short-term features are Mel-frequency cepstral coefficients (MFCCs); the long-term features are modulation spectral vectors (MSVs). MSVs extract the energy envelope of a signal in the frequency domain, and this envelope characteristic effectively resists interference from environmental noise.
    To make the system more robust to noise, this thesis also proposes a training scheme in which the Gaussian mixture models are exposed to noise-corrupted data during training; this helps raise the recognition rate for low-SNR signals. The system was evaluated on eight classes of indoor environmental sound events and achieves over 80% recognition accuracy at 0 dB SNR.


    The Gaussian mixture model (GMM) is well developed for both speech and sound recognition, but it performs poorly in environments with high background noise. This thesis proposes a method that combines short-term and long-term features to overcome this issue. The short-term features are Mel-frequency cepstral coefficients (MFCCs), and the long-term features are modulation spectral vectors (MSVs) computed in the frequency domain. The MSVs capture the envelope information of signals, which makes them robust against heavy noise.
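The long-term features can be pictured with the generic modulation-spectrum recipe: take a short-time spectrogram, track each frequency band's energy envelope over time, and Fourier-transform those envelopes to measure how fast the energy fluctuates. The NumPy sketch below illustrates only this generic idea; the frame sizes, linear band pooling, and the function name are illustrative assumptions, not the exact pipeline or parameters of the thesis.

```python
import numpy as np

def modulation_spectral_vector(x, frame_len=512, hop=256, n_bands=8, n_mod=16):
    """Generic modulation-spectrum recipe:
    1) short-time power spectrum per frame,
    2) pool spectra into coarse frequency bands -> per-band energy
       envelopes sampled at the frame rate,
    3) FFT each envelope along time and keep the low modulation-
       frequency bins, where slowly varying event structure lives."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # (frames, bins)
    bands = np.array_split(spec, n_bands, axis=1)             # coarse bands
    env = np.stack([b.sum(axis=1) for b in bands])            # (bands, frames)
    env = env - env.mean(axis=1, keepdims=True)               # drop the DC term
    mod = np.abs(np.fft.rfft(env, axis=1))                    # modulation spectrum
    return mod[:, :n_mod].ravel()

sr = 16000
t = np.arange(sr) / sr
# A 1.5 kHz carrier amplitude-modulated at 4 Hz: strong low-frequency
# envelope structure, which the modulation spectrum should pick up.
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1500 * t)
msv = modulation_spectral_vector(x)
```

With the 62.5 Hz frame rate used here, the band containing the carrier concentrates its modulation energy near the 4 Hz bin, whereas stationary broadband noise spreads its modulation energy thinly; that contrast is the intuition behind the noise robustness of envelope-based features.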
    For robustness against noise, this thesis also proposes training the GMMs on noise-corrupted data. This raises recognition accuracy in low signal-to-noise ratio (SNR) conditions. The method was evaluated on a database of 8 indoor sound event classes and achieves over 80% accuracy at 0 dB SNR.
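The noisy-training scheme requires mixing clean recordings with noise at controlled SNRs before extracting features and fitting the GMMs. A minimal sketch of that mixing step, assuming 1-D float arrays (the function name `mix_at_snr` and the white-noise example are illustrative, not from the thesis):

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio of `signal + noise`
    equals `snr_db`, then return the mixture (equal-length 1-D arrays)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR = 10 * log10(Ps / Pn)  =>  required noise power:
    target_p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + noise * np.sqrt(target_p_noise / p_noise)

rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # clean "event"
noise = rng.standard_normal(16000)                        # white noise
mix = mix_at_snr(sig, noise, snr_db=0.0)                  # equal powers
```

Each training clip can then be mixed at several SNR levels so that the models already see noise-corrupted feature distributions during EM training.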

    Abstract (Chinese) [iii]
    Abstract [v]
    Acknowledgements [vii]
    1 Introduction [1]
      1.1 Research Motivation [1]
      1.2 Literature Review [2]
      1.3 Research Direction [4]
    2 System Architecture and Methods [5]
      2.1 Signal Preprocessing and Noisy Training [6]
      2.2 Mel-frequency Cepstral Coefficients (MFCCs) [7]
      2.3 Modulation Spectral Vectors (MSVs) [12]
        2.3.1 Unit of Modulation Spectral Vectors [14]
        2.3.2 Long-term Feature Extraction [14]
      2.4 Feature Vector Analysis [16]
        2.4.1 Autocorrelation [16]
        2.4.2 Sum of Modulation Spectral Vectors [17]
      2.5 Gaussian Mixture Models (GMMs) [20]
        2.5.1 Model Description [20]
        2.5.2 Initialization of Model Parameters [20]
        2.5.3 Expectation Maximization (EM) Algorithm [21]
        2.5.4 GMM Training Procedure [26]
    3 Analysis and Discussion [29]
      3.1 Single Sound Event Database [29]
      3.2 Performance Evaluation [30]
      3.3 Comparison of Short-term and Long-term Features [31]
      3.4 Recognition Results with Noisy Training [33]
      3.5 Recognition Results with Sum of MSVs [37]
      3.6 Discussion of Noise Types [40]
    4 Conclusions and Future Work [45]
    References [47]

    [1] B. E. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
    [2] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
    [3] F. Morchen, A. Ultsch, M. Thies, and I. Lohken, “Modeling timbre distance with temporal statistics from polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 81–90, 2006.
    [4] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events,” IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733–1746, 2015.
    [5] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic scene classification: Classifying environments from the sounds they produce,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 16–34, 2015.
    [6] M. H. Moattar and M. M. Homayounpour, “A review on speaker diarization systems and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
    [7] M. Mckinney and J. Breebaart, “Features for audio and music classification,” in Proceedings of the International Symposium on Music Information Retrieval, pp. 151–158, 2003.
    [8] M. A. Hossan, S. Memon, and M. A. Gregory, “A novel approach for MFCC feature extraction,” in Signal Processing and Communication Systems (ICSPCS), 2010 4th International Conference on, pp. 1–5, IEEE, 2010.
    [9] S. Greenberg and B. E. Kingsbury, “The modulation spectrogram: In pursuit of an invariant representation of speech,” in Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 3, pp. 1647–1650, IEEE, 1997.
    [10] S. Sukittanon, L. E. Atlas, and J. W. Pitton, “Modulation-scale analysis for content identification,” IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 3023–3035, 2004.
    [11] 何育澤, “Mixed sound recognition based on support vector machines” (基於支持向量機之混合聲響辨認, in Chinese), National Tsing Hua University, 2014.
    [12] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
    [13] A. Oppenheim and R. Schafer, Discrete-time Signal Processing. Prentice-Hall signal processing series, Pearson, 2010.
    [14] H. Hermansky, “Modulation spectrum in speech processing,” in Signal Analysis and Prediction, pp. 395–406, Springer, 1998.
    [15] M. Markaki and Y. Stylianou, “Discrimination of speech from nonspeech in broadcast news based on modulation frequency features,” Speech Communication, vol. 53, no. 5, pp. 726–735, 2011.
    [16] L. Atlas and S. A. Shamma, “Joint acoustic and modulation frequency,” EURASIP Journal on Advances in Signal Processing, vol. 2003, no. 7, p. 310290, 2003.
    [17] G. Evangelopoulos and P. Maragos, “Multiband modulation energy tracking for noisy speech detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 2024–2038, 2006.
    [18] J.-H. Bach, B. Kollmeier, and J. Anemüller, “Modulation-based detection of speech in real background noise: Generalization to novel background classes,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 41–44, IEEE, 2010.
    [19] N. H. Sephus, A. D. Lanterman, and D. V. Anderson, “Exploring frequency modulation features and resolution in the modulation spectrum,” in Digital Signal Processing and Signal Processing Education Meeting (DSP/SPE), 2013 IEEE, pp. 169–174, IEEE, 2013.
    [20] J.-M. Ren and J.-S. R. Jang, “Discovering time-constrained sequential patterns for music genre classification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1134–1144, 2012.
    [21] C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin, “Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features,” IEEE Transactions on Multimedia, vol. 11, no. 4, pp. 670–682, 2009.
    [22] M. Markaki and Y. Stylianou, “Voice pathology detection and discrimination based on modulation spectral features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 1938–1948, 2011.
    [23] S.-C. Lim, S.-J. Jang, S.-P. Lee, and M. Y. Kim, “Music genre/mood classification using a feature-based modulation spectrum,” in Mobile IT Convergence (ICMIC), 2011 International Conference on, pp. 133–136, IEEE, 2011.
    [24] D. Reynolds, “Gaussian mixture models,” Encyclopedia of biometrics, pp. 827–832, 2015.
    [25] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
    [26] K. B. Petersen, M. S. Pedersen, et al., “The matrix cookbook,” Technical University of Denmark, vol. 7, p. 15, 2008.
    [27] A. B. Downey, Think complexity: complexity science and computational modeling, ch. 9, p. 91. O’Reilly Media, Inc., 2012.
