
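Graduate student: 陳江村
Thesis title: 華語發音評量與聲調辨識研究
A Study on Pronunciation Assessment and Tone Recognition in Mandarin Chinese
Advisor: 張智星 (Jyh-Shing Roger Jang)
Oral examination committee:
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar, i.e. 2007-2008)
Language: English
Number of pages: 78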
Keywords: CAPT (computer-assisted pronunciation training), CALL (computer-assisted language learning), speech recognition, tone recognition, speech assessment, GMM (Gaussian mixture model), Mandarin Chinese, downhill Simplex method, phoneme, intensity, rhythm, forced alignment, continuous tone recognition, extended segments for tone recognition, HMM (hidden Markov model), context-dependent tone modeling, supratone modeling
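    Abstract (translated from the Chinese): This thesis first proposes a set of algorithms for automated Mandarin pronunciation learning, together with a prototype demonstration system. The system uses forced alignment with hidden Markov models (HMMs) to segment each phoneme and computes the log probability under the corresponding acoustic model, which feeds a ranking-based confidence measure. The pitch data of each syllable are then used to train Gaussian mixture models (GMMs) for tone recognition. We also compute similarity scores for intensity and rhythm between the reference utterance and the test utterance. The four scoring functions for phoneme, tone, intensity, and rhythm are each expressed as parametric functions, and the overall score is a linear combination of the four. Because the overall score involves both linear and nonlinear parameters, we use the downhill Simplex search to fine-tune them so that the scores approximate subjective human ratings. Experimental results show that the system's scores are highly consistent with subjective human evaluation.
    Furthermore, this pronunciation-learning study shows that tone is fundamental to the pronunciation and recognition of tonal languages, and that the accuracy of tone recognition strongly affects the quality of pronunciation assessment. We therefore also propose a new method for improving tone recognition. Most previous work on tone recognition uses two stages: acoustic models first segment an utterance into syllables by forced alignment, and tone models are then trained on the segmented syllables with classifiers such as neural networks, Gaussian mixture models, hidden Markov models, and support vector machines. However, forced alignment does not guarantee boundaries as precise as those judged by a human, so tone-model performance may degrade when the voiced region is poorly identified. To reduce this effect, we propose a robust HMM-based method for tone recognition in continuous speech, called TRUES (tone recognition using extended segments). The method extracts time-domain AMDF (average magnitude difference function) features from the whole utterance and applies dynamic-programming optimization to obtain a continuous, unbroken pitch contour for the entire utterance. The pitch contour of each syllable is extended to the left and right before training tone models that depend on both left and right context, with the aim of adding useful tone features, improving model discriminability, and reducing the impact of segmentation results on the tone models. On our self-recorded Tang poetry corpus, TRUES achieves a 49.13% relative error reduction over the supratone model proposed in 2007; in our own tests, the supratone model itself already outperformed other recent related work. This encouraging result demonstrates the robustness and effectiveness of the proposed TRUES method, as well as the advantage of the proposed dynamic-programming-based pitch tracking, which yields an unbroken contour over the whole utterance.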
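The segment-extension step described above can be pictured with a minimal Python sketch. It assumes syllable boundaries are given as frame-index pairs from forced alignment; the function names, the extension widths `left` and `right`, the `sil` padding symbol, and the `prev-cur+next` label format are illustrative assumptions, not details taken from the thesis.

```python
def extend_segments(boundaries, n_frames, left=3, right=3):
    """Widen each (start, end) frame span, end exclusive, clipped to the utterance."""
    return [(max(0, s - left), min(n_frames, e + right)) for s, e in boundaries]


def extended_contours(pitch, boundaries, left=3, right=3):
    """Slice an utterance-level pitch contour into overlapping per-syllable pieces."""
    spans = extend_segments(boundaries, len(pitch), left, right)
    return [pitch[s:e] for s, e in spans]


def tritone_labels(tones):
    """Build context-dependent labels 'prev-cur+next', padding the ends with 'sil'."""
    padded = ["sil"] + [str(t) for t in tones] + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```

Because the pitch contour is unbroken, the extended spans always contain valid pitch values, so an error of a few frames in a syllable boundary changes only which neighbouring frames are shared between adjacent segments.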


    This dissertation first presents the algorithms used in a prototype software system for automatic pronunciation assessment of Mandarin Chinese. The system uses HMM (hidden Markov model) forced alignment to identify each syllable and the corresponding log probability, which is used for phoneme assessment through a ranking-based confidence measure. The pitch vector of each syllable is then sent to a GMM (Gaussian mixture model) for tone recognition and assessment. We also compute similarity scores for intensity and rhythm between the target and test utterances. The four scores for phoneme, tone, intensity, and rhythm are parametric functions with free parameters, and the overall scoring function is formulated as a linear combination of these four scoring functions. Since the overall scoring function involves both linear and nonlinear parameters, we employ the downhill Simplex search to fine-tune them so that the output approximates the scores given by a human expert. The experimental results demonstrate that the system gives consistent scores that are close to a human's subjective evaluation.
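As a rough illustration of the scoring pipeline in this paragraph, the Python sketch below ranks the target phoneme among competing models by log probability, maps the rank to a score, and combines four scores linearly. The abstract does not give the functional forms, so the mapping in `rank_to_score`, its parameters `a` and `b`, and the weight names are hypothetical.

```python
def rank_of_target(log_probs, target):
    """1-based rank of the target phoneme when all models are sorted by log probability."""
    target_lp = log_probs[target]
    return 1 + sum(1 for lp in log_probs.values() if lp > target_lp)


def rank_to_score(rank, a=3.0, b=2.0):
    """Hypothetical parametric mapping from rank to a 0-100 score (rank 1 gives 100)."""
    return 100.0 / (1.0 + ((rank - 1) / a) ** b)


def overall_score(scores, weights):
    """Weighted linear combination of the phoneme, tone, intensity, and rhythm scores."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total
```

In this sketch the weights are linear parameters and `a`, `b` are nonlinear ones; these are the kind of free parameters that a derivative-free search such as the downhill Simplex method can tune against human ratings.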
    The pronunciation assessment experiments also show that tone recognition is a basic but important component of speech recognition and assessment for tonal languages such as Mandarin Chinese. Most previous work adopts a two-step approach: syllables within an utterance are first identified via forced alignment, and a classifier such as a neural network, GMM, HMM, or SVM (support vector machine) is then applied to each segmented syllable to predict its tone. However, forced alignment does not always generate accurate syllable boundaries, which leads to unstable voiced-unvoiced detection and degraded tone recognition performance. To alleviate this problem, we propose a robust approach called TRUES (tone recognition using extended segments) for HMM-based continuous tone recognition. The proposed approach extracts an unbroken pitch contour from a given utterance by dynamic programming over time-domain AMDF (average magnitude difference function) features. The pitch contour of each syllable is then extended for tri-tone HMM modeling, so that the influence of inaccurate syllable boundaries is lessened. Our experimental results demonstrate that TRUES achieves a 49.13% relative error rate reduction over the recently proposed supratone modeling, which is regarded as the state of the art in tone recognition and outperforms several previously proposed approaches. This encouraging improvement demonstrates the effectiveness and robustness of TRUES, as well as of the corresponding pitch determination algorithm, which produces unbroken pitch contours.
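The idea of an unbroken pitch contour obtained by dynamic programming over AMDF features can be sketched in Python as follows. The AMDF itself is the standard definition; the path cost (AMDF value plus a penalty proportional to the lag jump between frames), the parameter `jump_penalty`, and the pitch search range are assumptions made for illustration, since the abstract does not specify the cost used in the thesis.

```python
import math


def amdf(frame, min_lag, max_lag):
    """Average magnitude difference of a frame for each candidate lag."""
    n = len(frame)
    values = []
    for lag in range(min_lag, max_lag + 1):
        diffs = [abs(frame[i] - frame[i + lag]) for i in range(n - lag)]
        values.append(sum(diffs) / len(diffs))
    return values


def track_pitch(frames, fs, min_f0=80.0, max_f0=400.0, jump_penalty=0.01):
    """Pick one lag per frame by minimising AMDF cost plus a lag-jump penalty."""
    min_lag = int(fs / max_f0)
    max_lag = int(fs / min_f0)  # every frame must be longer than max_lag samples
    costs = [amdf(f, min_lag, max_lag) for f in frames]
    n_lags = max_lag - min_lag + 1
    best = list(costs[0])
    back = []
    for t in range(1, len(frames)):
        new_best, ptr = [], []
        for j in range(n_lags):
            # Cheapest predecessor for lag index j in the previous frame.
            k = min(range(n_lags), key=lambda i: best[i] + jump_penalty * abs(i - j))
            new_best.append(best[k] + jump_penalty * abs(k - j) + costs[t][j])
            ptr.append(k)
        best = new_best
        back.append(ptr)
    # Backtrack from the cheapest final state to recover the whole path.
    j = min(range(n_lags), key=lambda i: best[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [fs / (min_lag + j) for j in path]
```

Every frame receives a pitch value, including frames that a voiced-unvoiced detector would discard, which is what lets extended segments reach across syllable boundaries.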

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    Lists of Figures
    Lists of Tables
    Chapter 1. Automatic Pronunciation Assessment for Mandarin Chinese
      1.1. Introduction
      1.2. Related Work
      1.3. The Proposed Approach
        1.3.1. Syllable/Phone Segmentation Using HMM-based Forced Alignment
          1.3.1.1. Acoustic Model Training
          1.3.1.2. Pronunciation Confusion Network (PCN)
          1.3.1.3. Syllable Segmentation via Forced Alignment
        1.3.2. Ranking-based Confidence Measure for Phoneme Assessment
        1.3.3. Tone Recognition Using GMM
        1.3.4. Intensity
        1.3.5. Rhythm
        1.3.6. Parametric Scoring Function
      1.4. Experimental Results
      1.5. Overview of the Software System
    Chapter 2. Robust Tone Recognition
      2.1. Introduction
      2.2. Related Work
      2.3. Tone Recognition Using Extended Segments
        2.3.1. Related Work of Pitch Extraction
        2.3.2. Unbroken Pitch Determination Using Dynamic Programming
        2.3.3. Tone Modeling Using Extended Segments
        2.3.4. Dynamic-Programming Optimization over Tri-tone Lattice
      2.4. Experimental Results
        2.4.1. Speech Corpus
        2.4.2. Performance Comparison and Discussion
        2.4.3. Parameters Tuning
    Chapter 3. Conclusions and Future Work
    Bibliography
    List of Publications


    Full text: not authorized for public release (on-campus or off-campus network)
