
Graduate student: Chi-Huang Lin (林志晃)
Thesis title: 基於語音評分技術之發音唇形提示之研究—以基礎華語學習為例
(The prompt of lip shape modification of cacology based on the speech evaluation techniques – a case of basic Chinese learning)
Advisor: Tai-Lang Jong (鐘太郎)
Committee members:
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Pages: 80
Chinese keywords: 中文學習 (Chinese learning)
Foreign keywords: Chinese learning
  • In this thesis, our Mandarin learning method trains the learner step by step, using speech as the primary cue and lip-shape images as a supplement. We first capture speech and lip-shape images synchronously to build two databases, each subdivided into three groups: standard, fair, and poor speech. On the speech side, linear prediction analysis and cepstral analysis are used to extract the linear prediction coefficients (LPC), line spectrum pair coefficients (LSP), and mel-frequency cepstral coefficients (MFCC) as voiceprint parameters; in addition, the pitch contour and the energy curve are extracted to represent the tone and the intensity of the speech, respectively. On the image side, region growing, morphological operations, RGB vector-space segmentation, and ellipse fitting are used to extract the height and width of the lips as image parameters. The dynamic time warping (DTW) algorithm then measures the differences between the standard speech and the other speech groups in LPC, LSP, MFCC, pitch contour, and energy curve, and fuzzy theory, radial basis function neural networks (RBFNN), and probabilistic neural networks (PNN) are applied to these distances to define rules for judging the learner's proficiency. At the same time, DTW is applied to the learner's and the standard lip-shape sequences to quantify their differences in height and width, which are used to point out what the learner should improve and make the learning interaction as effective as possible.
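The DTW comparison described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the feature sequences are assumed to be per-frame vectors (such as MFCCs), frame distance is assumed Euclidean, and none of the slope constraints or flexible endpoints discussed in the thesis are applied.

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two feature sequences.

    ref, test: arrays of shape (n_frames, n_dims), e.g. per-frame MFCC
    vectors. Returns the accumulated Euclidean frame distance along the
    optimal warping path.
    """
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    n, m = len(ref), len(test)
    # Pairwise Euclidean distances between every reference and test frame.
    cost = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    # Accumulated-cost matrix with an extra border row/column of infinity.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # step that repeats a test frame
                acc[i, j - 1],      # step that repeats a reference frame
                acc[i - 1, j - 1],  # diagonal match
            )
    return acc[n, m]
```

Because the warping path may repeat frames, a sequence aligned against a time-stretched copy of itself still yields distance zero.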

      The experiments show that, among the three voiceprint parameters LPC, LSP, and MFCC, MFCC discriminates speech quality best, with an accuracy of about 84%. Adding the pitch contour and energy curve raises the accuracy markedly: using MFCC, pitch contour, and energy curve as parameters with DTW and a PNN gives the best discrimination of speech quality, reaching 90%. Finally, a two-stage ROC-curve analysis is used to assess the feasibility of the overall method.
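The PNN used for the quality decision can be sketched as a standard Parzen-window classifier; this is a generic textbook form, not the thesis's exact network, and the kernel width `sigma` is an assumed smoothing parameter that would need tuning on the database.

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=1.0):
    """Probabilistic neural network (Parzen-window) classifier.

    Each training sample contributes a Gaussian kernel centered on itself
    (the pattern layer); kernel responses are averaged per class (the
    summation layer); the class with the largest response wins.
    x: (n_dims,) query vector; train_X: (n_samples, n_dims);
    train_y: (n_samples,) integer class labels.
    """
    x = np.asarray(x, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    scores = {}
    for c in np.unique(train_y):
        diffs = train_X[train_y == c] - x
        # Pattern layer: Gaussian activation of each training sample.
        acts = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * sigma ** 2))
        # Summation layer: average activation for this class.
        scores[c] = acts.mean()
    # Decision layer: pick the class with the highest averaged response.
    return max(scores, key=scores.get)
```

In the thesis's setting, `x` would be a vector of DTW distances (e.g. MFCC, pitch-contour, and energy-curve distances to the standard template) and the classes would be the good/fair/unqualified groups.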


    A Chinese learning assistance system based on speech features and lip-shape image processing is proposed in this thesis. A test database of synchronized speech signals and lip-shape images was built; it contains three types of audio-video pairs: good, fair, and unqualified speech and lip shapes. During the learning process, the system first plays a demonstration speech and video, then records the learner's repeated speech and mouth video sequence, analyzes and scores the utterance, and, if the score is poor, shows the learner the correct lip movement and pronunciation and prompts for repeated practice. For speech analysis, the linear prediction coefficients (LPC), line spectrum pair (LSP), and mel-frequency cepstral coefficients (MFCC) were examined as voiceprint parameters; in addition, the pitch contour and energy curve were adopted as the tone and magnitude parameters of the speech signal, respectively. On the image side, the height and width of the lips were used as the lip-shape parameters. In the scoring stage, the dynamic time warping (DTW) algorithm combined with fuzzy theory, radial basis function networks (RBFNN), and probabilistic neural networks (PNN) was applied to determine whether a test utterance was qualified. A DTW comparison of the standard database with an unqualified speech signal was then used to quantitatively prompt the user's lip-shape correction.
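The lip-shape measurement can be illustrated with a simplified stand-in for the RGB vector-space segmentation step: pixels within a Euclidean distance of a sampled lip color are taken as lip pixels, and the bounding box of that region gives the height and width. The sampled color and the distance threshold are assumptions for illustration; the thesis additionally uses region growing, morphological closing, and ellipse fitting, which are omitted here.

```python
import numpy as np

def lip_height_width(rgb_image, lip_color, threshold=40.0):
    """Estimate lip height and width from an RGB frame.

    Segments lip pixels by Euclidean distance in RGB space to a sampled
    lip color, then returns the (height, width) of the lip region's
    bounding box in pixels.
    rgb_image: (H, W, 3) uint8 array; lip_color: length-3 color sample.
    """
    diff = rgb_image.astype(float) - np.asarray(lip_color, dtype=float)
    # Binary mask of pixels close to the sampled lip color.
    mask = np.linalg.norm(diff, axis=2) <= threshold
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return 0, 0  # no lip pixels found
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height, width
```

Per-frame (height, width) pairs collected over a video sequence would then be the inputs to the DTW comparison against the standard lip-shape sequence.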

    In the experiments, we found that MFCC is the best of the three voiceprint parameters: with DTW processing and PNN classification, the classification accuracy reached 84%. Combining the MFCC, pitch contour, and energy curve parameters of the speech signal improved the accuracy further, up to 90%. Finally, the receiver operating characteristic (ROC) curve was introduced to quantitatively evaluate the sensitivity and specificity of the proposed algorithm.
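The ROC evaluation can be sketched as a threshold sweep over per-sample scores. This is a generic construction, not the thesis's two-stage procedure; it assumes a lower score (e.g. a smaller DTW distance to the standard template) indicates a qualified utterance.

```python
import numpy as np

def roc_points(scores_pos, scores_neg):
    """Build an ROC curve from scores of positive and negative samples.

    scores_pos: scores of qualified ("positive") samples; scores_neg:
    scores of unqualified ones. A sample is accepted when its score is
    <= the threshold, so lower scores mean "more likely qualified".
    Returns (fpr, tpr, auc).
    """
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    # Sweep every observed score as a threshold, plus "accept everything".
    thresholds = np.sort(np.concatenate([scores_pos, scores_neg, [np.inf]]))
    tpr = np.array([(scores_pos <= t).mean() for t in thresholds])  # sensitivity
    fpr = np.array([(scores_neg <= t).mean() for t in thresholds])  # 1 - specificity
    # Trapezoidal area under the curve.
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

Perfectly separated score distributions give an area of 1.0, while fully overlapping ones approach 0.5, which is what the area-proportion comparison in the thesis's evaluation measures.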

    Table of Contents

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    1. Introduction
       1.1 Background
       1.2 Review of speech technologies
       1.3 Research approach
       1.4 Chapter overview
    2. Method
       2.1 Synchronous acquisition of speech and image signals
           2.1.1 Operating environment
           2.1.2 Signal acquisition
           2.1.3 Windowing and framing
           2.1.4 Frame energy and zero-crossing rate
           2.1.5 Speech endpoint detection
           2.1.6 Image endpoint detection
       2.2 Speech parameters
           2.2.1 Speech production model
           2.2.2 Linear prediction coefficients (LPC)
           2.2.3 Levinson-Durbin recursion
           2.2.4 Line spectrum pair (LSP)
           2.2.5 Mel-scale cepstrum
           2.2.6 Pitch contour
       2.3 Image parameters
           2.3.1 Sampling the lip color
           2.3.2 RGB vector-space image segmentation
           2.3.3 Extracting the lip boundary
           2.3.4 Ellipse fitting
       2.4 Speech scoring tools
           2.4.1 Fuzzy theory
                 2.4.1.1 Membership functions
                 2.4.1.2 Fuzzy systems
           2.4.2 Radial basis function network (RBFNN)
           2.4.3 Probabilistic neural network (PNN)
    3. Experimental results
       3.1 Database construction
           3.1.1 Speech parameter database
                 3.1.1.1 Dynamic time warping (DTW)
                 3.1.1.2 Building the standard parameters
           3.1.2 Lip-shape image parameter database
       3.2 Voiceprint parameter selection
           3.2.1 Scoring with fuzzy theory
           3.2.2 Classification with RBFNN
           3.2.3 Classification with PNN
       3.3 Speech parameter evaluation
           3.3.1 Classification with RBFNN
           3.3.2 Classification with PNN
       3.4 Lip-shape image assistance
           3.4.1 Extracting the image segments to be corrected
       3.5 ROC curve evaluation
           3.5.1 The ROC curve
           3.5.2 Using the ROC curve to evaluate the feasibility of lip-assisted language learning
    4. Conclusions and future work
       4.1 Conclusions
       4.2 Future work
    References

    List of Figures

    Fig 1.1 (a) Traditional learning method; (b) learning method assisted by lip-shape images
    Fig 1.2 Basic architecture of speech analysis and scoring
    Fig 2.1 Architecture of the Chinese learning method
    Fig 2.2 Shifting of the Hamming window
    Fig 2.3 (a) Speech signal of the Mandarin word 「請」; (b) energy curve; (c) zero-crossing rate
    Fig 2.4 (a) Block diagram of the speech production model; (b) block diagram of the simplified speech production model
    Fig 2.5 Roots of the LSP on the unit circle of the z-plane
    Fig 2.6 (a) Mel-scale filter bank; (b) 13th-order mel-frequency cepstrum of the Mandarin word 「請」
    Fig 2.7 Pitch contour estimation
    Fig 2.8 Pitch contours of the Mandarin tones; the horizontal axis is frame number (n) and the vertical axis is frequency; (a) to (d) are tones one to four (the tones of 「安」, 「程」, 「早」, 「飯」, respectively)
    Fig 2.9 Steps for obtaining the lip-shape parameters
    Fig 2.10 Approximating the lip shape with an ellipse
    Fig 2.11 Lip-shape parameter extraction: (a) original closed-lip image; (b) region growing; (c) image closing; (d) thinning; (e) blurred lip-color sample; (f) RGB vector-space segmentation; (g) closing and filling; (h) ellipse fitting; (i) extracted lip ellipse overlaid on the original image for comparison
    Fig 2.12 Basic architecture of a fuzzy system
    Fig 2.13 Learning of a radial basis function network
    Fig 2.14 Basic architecture of a PNN
    Fig 2.15 PNN construction process
    Fig 3.1 DTW alignment
    Fig 3.2 DTW with flexible start and end points
    Fig 3.3 Common DTW constraints
    Fig 3.4 Flow of obtaining the standard parameters
    Fig 3.5 Flow of obtaining the standard parameters
    Fig 3.6 Flow of computing the distances between the database voiceprint parameters and the standard parameters
    Fig 3.7 Distribution (mean and standard deviation) of the distances between the database LPC parameters and the standard LPC parameters
    Fig 3.8 Distribution (mean and standard deviation) of the distances between the database LSP parameters and the standard LSP parameters
    Fig 3.9 Distribution (mean and standard deviation) of the distances between the database MFCC parameters and the standard MFCC parameters
    Fig 3.10 Membership functions of the LPC parameters
    Fig 3.11 Membership functions of the LSP parameters
    Fig 3.12 Membership functions of the MFCC parameters
    Fig 3.13 Learning of a radial basis function network
    Fig 3.14 Classification architecture of the PNN
    Fig 3.15 Distribution (mean and standard deviation) of the distances between the database pitch-contour parameters and the standard pitch-contour parameters
    Fig 3.16 Distribution (mean and standard deviation) of the distances between the database energy-curve parameters and the standard energy-curve parameters
    Fig 3.17 Decision flow for whether the pronounced lip shape needs correction
    Fig 3.18 Architecture for extracting the lip-shape images to be corrected
    Fig 3.19 Image-based illustration of extracting the lip-shape images to be corrected
    Fig 3.20 Image-based illustration of extracting the lip-shape images to be corrected
    Fig 3.21 Distribution of the data samples
    Fig 3.22 ROC curve
    Fig 3.23 (a) First-stage ROC evaluation flow; (b) second-stage ROC evaluation flow
    Fig 3.24 (a) ROC curve of the Mandarin word 「站」; (b) proportion of area under the ROC curve

    List of Tables

    Table 3.1 Results and accuracy of discriminating the quality of the database speech with fuzzy theory, using LPC
    Table 3.2 Results and accuracy of discriminating the quality of the database speech with fuzzy theory, using LSP
    Table 3.3 Results and accuracy of discriminating the quality of the database speech with fuzzy theory, using MFCC
    Table 3.4 Results and accuracy of discriminating the quality of the database speech with RBFNN, using LPC
    Table 3.5 Results and accuracy of discriminating the quality of the database speech with RBFNN, using LSP
    Table 3.6 Results and accuracy of discriminating the quality of the database speech with RBFNN, using MFCC
    Table 3.7 Results and accuracy of discriminating the quality of the database speech with PNN, using LPC
    Table 3.8 Results and accuracy of discriminating the quality of the database speech with PNN, using LSP
    Table 3.9 Results and accuracy of discriminating the quality of the database speech with PNN, using MFCC
    Table 3.10 Results and accuracy of discriminating the quality of the database speech using MFCC, pitch, and energy as input parameters with an RBF scoring tool
    Table 3.11 Results and accuracy of discriminating the quality of the database speech using MFCC, pitch, and energy as input parameters with a PNN scoring tool
    Table 3.12 Possible outcomes of 2-class classification
    Table 3.13 2-class classification of Mandarin speech
    Table 3.14 Aggregate 2-class classification of Mandarin speech
    Table 3.15 Statistics for the Mandarin word 「站」 versus variation of the parameter threshold
    Table 3.16 Aggregate 2-class classification of Mandarin speech


    Full-text release date: full text not authorized for public release (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)
