Graduate student: | 林志晃 Chi-Huang Lin |
---|---|
Thesis title: | 基於語音評分技術之發音唇形提示之研究—以基礎華語學習為例 The prompt of lip shape modification of cacology based on the speech evaluation techniques –a case of basic Chinese learning |
Advisor: | 鐘太郎 Tai-Lang Jong |
Oral defense committee: | |
Degree: | 碩士 Master |
Department: | 電機資訊學院 - 電機工程學系 College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
Year of publication: | 2007 |
Graduation academic year: | 95 |
Language: | Chinese |
Number of pages: | 80 |
Keywords (Chinese): | 中文學習 |
Keywords (English): | Chinese learning |
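In this thesis, our Chinese-learning method trains the learner step by step, using speech as the primary cue and lip-shape images as a secondary cue. We first capture speech and lip-shape images synchronously to build two databases, each subdivided into three groups: standard, fair, and poor speech. On the speech side, linear prediction analysis and cepstral analysis are used to extract linear prediction coefficients (LPC), line spectrum pair coefficients (LSP), and mel-frequency cepstral coefficients (MFCC) as three representations of the voiceprint; in addition, the pitch contour and the energy curve are extracted to represent the tone and the intensity of the speech, respectively. On the image side, region growing, morphology, RGB vector-space segmentation, and ellipse curve fitting are used to extract the height and width of the lips as image parameters. Dynamic time warping (DTW) is then used to compute the differences between the standard speech and the other speech in LPC, LSP, MFCC, pitch contour, and energy curve; combined with fuzzy theory, a radial basis function neural network (RBFNN), and a probabilistic neural network (PNN), these differences yield a set of rules for judging the learner's level. The learner's lip-shape images are likewise compared with the standard lip-shape images using DTW to obtain the differences in lip height and width, which are used to tell the user what needs to be improved so that the learning interaction is as effective as possible.
The experiments show that, among the three voiceprint parameters LPC, LSP, and MFCC, MFCC is the best at distinguishing good speech from poor speech, with an accuracy of about 84%. Adding the pitch contour and the energy curve raises this accuracy noticeably: using MFCC, pitch contour, and energy curve as parameters with DTW and a PNN gives the best discrimination of speech quality, with an accuracy of up to 90%. Finally, ROC curves are used to evaluate the feasibility of the overall method in two stages.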
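The PNN-based grading step described above can be illustrated with a minimal sketch. It assumes each utterance has already been reduced to a vector of three DTW distances (MFCC, pitch contour, energy curve) measured against the standard recording. The function name `pnn_classify`, the smoothing parameter `sigma`, and the toy training values are hypothetical and are not taken from the thesis.

```python
import math


def pnn_classify(x, training, sigma=1.0):
    """Classify feature vector x with a Gaussian-kernel probabilistic neural network.

    training maps each class label to a list of stored pattern vectors.
    Returns (best_label, scores), where scores holds the averaged kernel
    response of each class (a Parzen-window density estimate up to a constant).
    """
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for p in patterns:
            # Squared Euclidean distance between the input and a stored pattern.
            sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(patterns)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical training data: [MFCC distance, pitch distance, energy distance].
training = {
    "good": [[0.5, 0.4, 0.3], [0.6, 0.5, 0.2]],
    "poor": [[3.0, 2.5, 2.8], [2.7, 3.1, 2.4]],
}
label, scores = pnn_classify([0.7, 0.6, 0.4], training)
```

In this sketch a small `sigma` makes the decision depend mostly on the nearest stored patterns, while a large `sigma` smooths the class densities; the value actually used in the thesis is not stated in the abstract.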
A Chinese-learning assistance system based on speech features and lip-shape image processing is proposed in this thesis. A test database of synchronized speech signals and lip-shape images was built. The database contains three groups of audio and video pairs: good, fair, and unqualified speech and lip shape. During the learning process, the system first plays a demonstration speech and video, then acquires the learner's repeated speech and the video sequence of the mouth, analyzes and evaluates the learner's utterance, and, if the evaluation is poor, shows the user the correct lip movement and utterance and prompts for repeated practice. For speech analysis, the linear prediction coefficients (LPC), line spectrum pairs (LSP), and mel-frequency cepstral coefficients (MFCC) were examined as voiceprint parameters. In addition, the pitch contour and the energy curve were adopted as the parameters for the tone and the magnitude of the speech signal, respectively. The height and width of the lips were used as the parameters for lip-shape analysis. In the scoring stage, the dynamic time warping (DTW) algorithm was combined with fuzzy theory, a radial basis function neural network (RBFNN), and a probabilistic neural network (PNN) to determine whether the test speech was qualified during the Chinese-learning process. A DTW comparison between the standard database and the unqualified recordings was used to give users quantitative prompts on how to modify their lip shape.
In our experiments, we found that MFCC is the best of the three voiceprint parameters: using MFCC parameters with DTW processing and PNN classification, the correct rate reached 84%. We also found that combining the MFCC, pitch contour, and energy curve parameters of the speech signal further improved classification accuracy, to as much as 90%. Finally, the receiver operating characteristic (ROC) curve was used to quantitatively evaluate the sensitivity and specificity of the proposed algorithm.
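The DTW comparison mentioned above can be sketched as follows. This is a textbook dynamic-programming formulation for one-dimensional sequences such as a pitch contour or a lip-height track, with absolute difference as the local cost. The function name `dtw_distance` and the choice of local cost are assumptions; the thesis may use a different cost or path constraints, and frame-based features such as MFCC would need a vector distance instead.

```python
def dtw_distance(a, b):
    """Return the accumulated DTW cost between two 1-D sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a only, in b only, or in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical lip-height tracks (arbitrary units): the learner holds the
# open-mouth position one frame longer than the reference.
reference = [1, 2, 3]
learner = [1, 2, 2, 3]
difference = dtw_distance(reference, learner)
```

Because DTW stretches the time axis, the two tracks above align with zero accumulated cost even though they differ in length, which is why it suits comparing utterances spoken at different speeds.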
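The sensitivity and specificity underlying an ROC evaluation can be computed as in the sketch below. It assumes each test utterance has a classifier score (higher meaning more likely qualified) and a ground-truth label, and that both classes are present in the test set. The function names and the sample scores are hypothetical and are not results from the thesis.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    tp = fn = tn = fp = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        points.append((1.0 - spec, sens))
    return points


# Hypothetical scores for four utterances; True marks a qualified utterance.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [True, True, False, True]
sens, spec = sensitivity_specificity(scores, labels, 0.5)
curve = roc_points(scores, labels)
```

Sweeping the threshold over every observed score traces the ROC curve from the strictest operating point to the most permissive one.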