Graduate student: | 黃得勝 Hoang, Dac Thang |
---|---|
Thesis title: | 基於頻譜變化偵測的盲音素分割 Blind Phone Segmentation Based on Spectral Change Detection |
Advisors: | 王小川 Wang, Hsiao Chuan; 鐘太郎 Jong, Tai Lang |
Oral defense committee: | 李琳山, 劉奕汶, 李夢麟, 陳信宏, 王新民 |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
Year of publication: | 2015 |
Academic year of graduation: | 103 |
Language: | English |
Number of pages: | 90 |
Keywords: | Phone Segmentation, Boundary Detection |
Phone segmentation partitions a continuous speech signal into discrete phone units.
It is required in several areas of speech processing, such as acoustic-phonetic analysis,
speech recognition, speaker recognition, speech synthesis, and the annotation of speech corpora.
Manual phone segmentation is time-consuming, and its results may be inconsistent because
different transcribers apply subjective criteria. An automatic method of phone segmentation
is therefore desirable. A typical approach aligns the speech signal of an utterance with its
phone transcript. When the phone transcript of the target utterance is available, forced
alignment based on hidden Markov models can locate the phone boundaries. This supervised
method usually achieves high accuracy. In some applications, however, neither training speech
nor transcripts are available, so unsupervised methods must be used. If no linguistic
knowledge of the speech data (such as orthographic or phonetic transcripts) is given, phone
segmentation must be performed blind. Achieving high accuracy with a blind method is
challenging.
This dissertation addresses the problem of blind phone segmentation. The band energies
of the speech signal are computed as features. Four methods for blind phone segmentation
are proposed, based on (1) a delta spectral function, (2) a band-energy (BE) tracing
technique, (3) a Gaussian function, and (4) Legendre polynomial approximation. The methods
were evaluated on the English speech corpus TIMIT, and the experimental results showed that
they were more accurate than previous methods. The BE tracing method was also evaluated on
the Chinese speech corpus TCC300 to reveal problems of language independence. The influence
of noise was investigated for the methods using the Gaussian function and the
Legendre polynomial approximation.
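
The general idea of detecting phone boundaries from changes in band energy can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the abstract does not specify the band layout, the distance measure, or the peak-picking rule, so the equal-width bands, the 25 ms frames with a 10 ms hop, the Euclidean frame-to-frame distance, the mean-value threshold, and the function names `band_energies`, `spectral_change`, and `pick_boundaries` are all assumptions made for illustration.

```python
import numpy as np

def band_energies(signal, sample_rate, frame_len=0.025, hop=0.010, n_bands=8):
    """Log energy per frame in n_bands equal-width frequency bands (assumed layout)."""
    n = int(round(frame_len * sample_rate))
    step = int(round(hop * sample_rate))
    window = np.hamming(n)
    n_frames = 1 + (len(signal) - n) // step
    feats = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * step : i * step + n] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(power, n_bands)
        # A small floor avoids log(0) on silent frames.
        feats[i] = np.log([b.sum() + 1e-10 for b in bands])
    return feats

def spectral_change(feats):
    """Euclidean distance between the band energies of consecutive frames."""
    return np.linalg.norm(np.diff(feats, axis=0), axis=1)

def pick_boundaries(change, min_gap=3, threshold=None):
    """Indices of local maxima above threshold, kept at least min_gap frames apart."""
    if threshold is None:
        threshold = change.mean()  # assumed default, not taken from the source
    peaks = []
    for t in range(1, len(change) - 1):
        is_peak = change[t] >= change[t - 1] and change[t] > change[t + 1]
        if change[t] > threshold and is_peak:
            if peaks and t - peaks[-1] < min_gap:
                # Two candidates too close together: keep the stronger one.
                if change[t] > change[peaks[-1]]:
                    peaks[-1] = t
            else:
                peaks.append(t)
    return peaks
```

Calling `pick_boundaries(spectral_change(band_energies(signal, sample_rate)))` returns frame indices. An index `t` marks a change between frames `t` and `t + 1`, so under the assumed 10 ms hop the boundary lies roughly `(t + 1) * 0.010` seconds from the start of the signal.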