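| Graduate student: | 陳柏旻 Chen, Bo Min |
|---|---|
| Thesis title: | 基於聲音辨識特徵之聲音重建 Sound reconstruction based on features for sound recognition |
| Advisor: | 劉奕汶 Liu, Yi Wen |
| Oral defense committee: | 冀泰石, 曹昱 |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2015 |
| Academic year of graduation: | 104 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 51 |
| Keywords (Chinese): | 聲音重建, 聲音辨認, 梅爾 |
| Keywords (English): | sound reconstruction, sound recognition, mel |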
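In everyday life, sound is the medium through which people communicate with one another and learn what events are taking place. Extracting features from a sound preserves its important information so that the sound can be recognized. If the transmitted features are also used to reconstruct the sound, the sound itself is effectively transmitted, which integrates sound recognition and sound transmission.
The features used in this thesis are the mel-frequency cepstral coefficients (MFCC), which are widely used in sound recognition. Because MFCC represent only the spectral envelope of a sound and discard its fine detail, and because pitch is exactly such detail in speech, pitch is added as part of the feature set to make the reconstructed sound more complete. The reconstruction is based on the source-filter model: the frequency response recovered from the MFCC serves as the spectral envelope of the original sound, and the pitch determines the source signal. For voiced speech, which has a pitch, the frequency ranges occupied by harmonics and by noise in the reconstructed source are determined according to the human speech production mechanism. The spectral envelope and the source are then combined through a modified source-filter model to reconstruct the sound.
Both speech and non-speech sounds are analyzed and reconstructed. By analyzing the reconstruction process and its results, the thesis examines the factors that may affect the quality of the reconstructed sound. The reconstructed sounds are scored with a subjective listening test on human listeners and with the objective Perceptual Evaluation of Audio Quality (PEAQ), on a scale from 1 (very bad) to 5 (very good). The listening test gives scores of about 3 to 4 for both non-speech and speech, a level at which the sound is clearly intelligible. PEAQ gives scores of about 2 to 3.5 for non-speech, whereas the score for speech is only slightly above 1.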
Abstract
Sounds play an important role in daily life: by listening, we communicate with each other and learn what is happening around us. Extracting features from a sound preserves the information needed to recognize it. If a sound can also be reconstructed from the transmitted features, the same features serve to transmit the sound. In this research, we attempt to reconstruct sounds from features that are typically transmitted for recognition purposes.
In this thesis, we take the mel-frequency cepstral coefficients (MFCC), a feature set commonly used for sound recognition, as the basic features for reconstruction. Because MFCC do not encode the fine detail of a sound, we add the pitch as a further feature to make the feature set more complete. The reconstruction is based on a source-filter model, which takes the frequency response reconstructed from the MFCC as the spectral envelope and determines the sound source from the pitch. The critical properties of the reconstructed source are the frequency ranges of its noise and of its harmonics, which are determined according to the human speech production mechanism. We then combine the spectral envelope with the sound source through a modified source-filter model to reconstruct the sound.
We test the method by analyzing and reconstructing speech and non-speech material, and we look for the factors that may affect the quality of the reconstructed sounds. The reconstructed sounds are evaluated with a subjective listening test and with the objective Perceptual Evaluation of Audio Quality (PEAQ), on a grade scale from 1 (very bad) to 5 (very good). In the listening test, both speech and non-speech reconstructions receive grades of about 3 to 4. PEAQ gives non-speech reconstructions grades of about 2 to 3.5 and speech reconstructions grades only slightly higher than 1.
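The pipeline summarized in the abstract (MFCC to spectral envelope, pitch to source, then source-filter synthesis) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names, the 20 mel filters, the 13 coefficients, the 16 kHz sampling rate, the 4 kHz boundary between harmonics and noise, and the toy input spectrum are all assumptions chosen for illustration. The noise is approximated crudely by random-phase sinusoids, and the relative level of harmonics and noise is not calibrated.

```python
import math
import random

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters, equally spaced in mel, over bins spanning 0..Nyquist."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            f = nyquist * k / (n_bins - 1)
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct_scale(n, size):
    # Scaling that makes the DCT-II orthonormal, so its transpose inverts it.
    return math.sqrt((1.0 if n == 0 else 2.0) / size)

def mfcc_from_power(power, bank, n_ceps):
    """Log mel energies followed by an orthonormal DCT-II, truncated to n_ceps."""
    size = len(bank)
    log_mel = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-12) for row in bank]
    return [dct_scale(n, size) * sum(x * math.cos(math.pi * n * (m + 0.5) / size)
                                     for m, x in enumerate(log_mel))
            for n in range(n_ceps)]

def envelope_from_mfcc(ceps, bank):
    """Invert the truncated DCT, then spread mel energies back over linear bins."""
    size, n_bins = len(bank), len(bank[0])
    mel = [math.exp(sum(dct_scale(n, size) * c * math.cos(math.pi * n * (m + 0.5) / size)
                        for n, c in enumerate(ceps)))
           for m in range(size)]
    # Energy per unit filter weight; assumes every filter covers at least one bin.
    density = [e / sum(row) for e, row in zip(mel, bank)]
    envelope = []
    for k in range(n_bins):
        weight = sum(row[k] for row in bank)
        if weight > 0.0:
            envelope.append(sum(row[k] * d for row, d in zip(bank, density)) / weight)
        else:  # bins that no filter covers (DC, Nyquist) copy the nearest filter
            envelope.append(density[0] if k < n_bins // 2 else density[-1])
    return envelope

def synthesize(envelope, f0, sample_rate, n_samples, cutoff_hz, seed=0):
    """Harmonics of f0 below cutoff_hz, random-phase noise above, shaped by envelope."""
    rng = random.Random(seed)
    nyquist = sample_rate / 2.0
    n_bins = len(envelope)

    def gain(f):  # amplitude is the square root of the power envelope
        return math.sqrt(envelope[min(n_bins - 1, round(f / nyquist * (n_bins - 1)))])

    parts = []  # (frequency, amplitude, phase) of every sinusoid in the output
    h = 1
    while f0 > 0 and h * f0 < min(cutoff_hz, nyquist):
        parts.append((h * f0, gain(h * f0), 0.0))
        h += 1
    start = cutoff_hz if f0 > 0 else 0.0  # an unvoiced frame (f0 == 0) is all noise
    for k in range(n_bins):
        f = nyquist * k / (n_bins - 1)
        if start <= f < nyquist:
            parts.append((f, gain(f), rng.uniform(0.0, 2.0 * math.pi)))
    return [sum(a * math.sin(2.0 * math.pi * f * t / sample_rate + ph) for f, a, ph in parts)
            for t in range(n_samples)]

# Toy usage: a smooth, falling power spectrum stands in for one analysis frame.
bank = mel_filterbank(20, 129, 16000)
power = [1.0 / (1.0 + (k / 16.0) ** 2) for k in range(129)]
ceps = mfcc_from_power(power, bank, 13)
envelope = envelope_from_mfcc(ceps, bank)
samples = synthesize(envelope, 120.0, 16000, 256, 4000.0)
```

Truncating the cepstrum to 13 coefficients smooths the log mel spectrum, which is why the recovered envelope carries no harmonic structure and the pitch has to be supplied separately to build the source.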