
Student: 林奇嶽 (Lin, Chi-Yueh)
Thesis Title: 基於隨機森林法之爆發起始偵測及其在嗓音起始時間預估之應用
Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate
Advisor: 王小川 (Wang, Hsiao-Chuan)
Oral Defense Committee:
Degree: Doctor (博士)
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2011
Academic Year of Graduation: 99 (ROC calendar)
Language: English
Pages: 92
Chinese Keywords: 隨機森林、嗓音起始時間、爆發起始、語音辨識
English Keywords: random forest, voice onset time, burst onset, speech recognition
    Reliably detecting salient acoustic events in the speech signal plays an important role in event-based speech recognition. Labeling these events not only aids phone recognition but also makes it convenient to extract important phonetic information. This dissertation focuses on the detection of the burst onset, the most representative acoustic event in stop and affricate consonants; its presence or absence can be determined by detecting closure-burst transitions in the time-frequency plane. We adopt two-dimensional cepstral coefficients as features and apply the random forest technique to detect burst onsets in continuous speech. Constructing a random forest runs into the problem of imbalanced training data, which severely degrades detection performance, so we propose an asymmetric bootstrap to overcome it. A series of experiments on the English TIMIT corpus shows that the proposed burst onset detector is both efficient and accurate, and that partial information from the detection process can supplement Mel-frequency cepstral coefficients to improve the phone recognition rate of stops and affricates.

    Voice onset time (VOT) is the interval between the burst onset and the voicing onset of a stop. Among the many studies concerning VOT, how to measure it efficiently has long been a topic of interest. Manual annotation is feasible, but for a large corpus it is inevitably time-consuming. The second part of this dissertation addresses this problem by combining HMM-based state-level forced alignment with random-forest-based onset detection to label VOT automatically. Forced alignment roughly locates stop consonants in continuous speech, and the onset detector then searches each aligned stop segment for more precise burst-onset and voicing-onset times. The evaluation data also come from the TIMIT corpus, comprising 2,344 word-initial and 1,440 word-medial stops. On average, the proposed method achieves cumulative accuracies of 57%, 83%, 93%, and 96% under error tolerances of 5 ms, 10 ms, 15 ms, and 20 ms, respectively. The results also indicate that the VOTs of word-initial stops are easier to estimate than those of word-medial stops. Besides the accuracy of the VOT estimates, factors that may affect estimation, such as a stop's place of articulation, its voicing status, and the quality of the following vowel, are also discussed.


    The reliable detection of salient acoustic-phonetic cues in speech signals plays an important role in landmark-based speech recognition. Locating speech landmarks not only assists phone recognition but also helps extract phonetic information. This dissertation focuses on detecting the burst onset, the most prominent landmark in stop and affricate consonants. The chosen feature representation is the two-dimensional cepstral coefficients (TDCCs) computed from a spectro-temporal patch, which highlight the closure-burst transitions that indicate the presence of burst onsets. The random forest technique, an ensemble of tree-structured classifiers, then employs these feature vectors to detect burst onsets in continuous speech. During random forest construction, we also propose an asymmetric bootstrap to deal with imbalanced training data, which may otherwise deteriorate the performance of the resulting forest. A series of experiments conducted on the English TIMIT corpus demonstrates that the proposed detector provides an efficient and accurate means of detecting burst onsets. When the detection results are appended to MFCC vectors, the augmented feature vectors improve the recognition correctness of stop and affricate consonants.
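    The asymmetric bootstrap described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the dissertation's exact sampling scheme: each tree's bootstrap sample oversamples the rare burst-onset class so that both classes appear in equal numbers.

```python
import random

def asymmetric_bootstrap(features, labels, rng):
    """Rebalance a rare positive class when drawing a bootstrap sample.

    Non-onset frames vastly outnumber burst-onset frames, so an ordinary
    bootstrap would leave most trees nearly blind to the positive class.
    Here the minority class is drawn with replacement as many times as the
    majority class, yielding a balanced sample per tree.
    (Illustrative sketch of the asymmetric bootstrap idea.)
    """
    pos = [i for i, lab in enumerate(labels) if lab == 1]  # burst-onset frames (rare)
    neg = [i for i, lab in enumerate(labels) if lab == 0]  # all other frames
    n = len(neg)
    idx = [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]
    return [features[i] for i in idx], [labels[i] for i in idx]

rng = random.Random(0)
feats = [[float(i)] for i in range(1000)]
labs = [1] * 50 + [0] * 950          # 5% positive class, as in onset detection
fb, lb = asymmetric_bootstrap(feats, labs, rng)
print(sum(lb) / len(lb))             # exactly 0.5 after rebalancing
```

    Each tree in the forest would then be grown on its own rebalanced sample, so the ensemble as a whole sees the minority class often enough to learn it.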

    The voice onset time (VOT) of a stop consonant is the interval between its burst onset and voicing onset. Among the many research topics on VOT, one long-standing concern is how to measure VOT efficiently. Manual annotation is feasible, but it becomes time-consuming when the corpus is large. The second part of this dissertation proposes an automatic VOT estimation method that combines HMM-based state-level forced alignment with RF-based onset detection. The forced alignment roughly locates stop consonants in continuous speech; the onset detector then searches each aligned stop segment for the precise locations of its burst and voicing onsets to estimate the VOT. The proposed method can detect these onsets efficiently and accurately with only a small amount of training data. The evaluation data were extracted from the TIMIT corpus, comprising 2,344 word-initial and 1,440 word-medial stops in total. The experimental results showed that, on average, 57%, 83%, 93%, and 96% of the estimates deviate by less than 5 ms, 10 ms, 15 ms, and 20 ms, respectively, from their manually labeled values. The results also revealed that the VOTs of word-initial stops are more accurately estimated than those of word-medial stops. In addition to the accuracy of the VOT estimates, factors that may influence it, i.e., the place of articulation of the stop, the voicing status of the stop, and the quality of the succeeding vowel, were also investigated.
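    The cumulative-accuracy figures quoted above (57%/83%/93%/96% within 5/10/15/20 ms) follow from a simple tolerance count. A sketch with made-up error values; the function name and data are illustrative, not from the dissertation:

```python
def cumulative_accuracy(errors_ms, tolerances_ms):
    """Fraction of VOT estimates whose absolute deviation from the
    manually labeled value falls within each error tolerance."""
    return [sum(1 for e in errors_ms if abs(e) <= tol) / len(errors_ms)
            for tol in tolerances_ms]

# Illustrative estimation errors in milliseconds (not real data);
# negative values mean the estimate preceded the manual label.
errors = [2.0, -4.5, 7.1, 12.0, 18.5, 3.3, -9.9, 25.0]
print(cumulative_accuracy(errors, [5, 10, 15, 20]))   # [0.375, 0.625, 0.75, 0.875]
```

    By construction the values are non-decreasing across tolerances, which matches the monotone 57% → 96% progression reported for the real evaluation set.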

    Chinese Abstract i
    English Abstract ii
    Acknowledgment iii
    Table of Contents iv
    List of Figures vii
    List of Tables x
    1 Introduction 1
    2 Feature Representation 5
      2.1 Linear Predictive Coefficient (LPC) 5
      2.2 Mel-frequency Cepstral Coefficient (MFCC) 8
      2.3 Two-dimensional Cepstral Coefficient (TDCC) 10
    3 Learning Machines 14
      3.1 Random Forest (RF) 14
        3.1.1 Background 14
        3.1.2 Bagging 15
        3.1.3 Bootstrap 16
        3.1.4 Randomization 16
        3.1.5 Dealing with imbalanced training data 17
      3.2 Support Vector Machine (SVM) 19
      3.3 Gaussian Mixture Model (GMM) 22
        3.3.1 Background 22
        3.3.2 Expectation-Maximization algorithm 23
      3.4 Hidden Markov Model (HMM) 24
        3.4.1 Background 24
        3.4.2 Forced alignment 25
    4 Detection of Burst Onsets 27
      4.1 Overview 27
        4.1.1 Stop and affricate consonants in English 27
        4.1.2 Burst onset 29
      4.2 Random forest based detector 32
        4.2.1 Broad phonetic class 32
        4.2.2 Feature representation 33
        4.2.3 Detector construction 36
      4.3 Experiments 37
        4.3.1 Speech materials 37
        4.3.2 Burst onset detection 39
        4.3.3 General detection results 42
        4.3.4 Various random forest detectors 46
        4.3.5 Estimation accuracy of burst onset locations 48
        4.3.6 Comparison with other learning machines 50
        4.3.7 Stop-like dental fricatives 53
        4.3.8 Application to phone recognition 54
      4.4 Summary 57
    5 Voice Onset Time Estimate in Continuous Speech 59
      5.1 Voice onset time (VOT) 59
      5.2 Cascade VOT estimate system 63
        5.2.1 Overview of proposed system 63
        5.2.2 HMM-based forced alignment 65
        5.2.3 RF-based onset detection 66
      5.3 Experiments 69
        5.3.1 Speech corpus 69
        5.3.2 Accuracy of burst onset estimate 70
        5.3.3 Accuracy of voicing onset estimate 71
        5.3.4 Accuracy of voice onset time estimate 74
        5.3.5 Performance comparison with and without onset detection 75
      5.4 Discussion 78
      5.5 Summary 80
    6 Conclusion 81
    Appendix 83
      A.1 VOT Experiments 83
    Bibliography 85
    Publication List 91


    Full-text availability: not authorized for public access (campus network)
    Full-text availability: not authorized for public access (off-campus network)
