
Author: 黃得勝
Hoang, Dac Thang
Thesis title: 基於頻譜變化偵測的盲音素分割
Blind Phone Segmentation Based on Spectral Change Detection
Advisors: 王小川
Wang, Hsiao Chuan
鐘太郎
Jong, Tai Lang
Oral defense committee: 李琳山
劉奕汶
李夢麟
陳信宏
王新民
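Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: English
Number of pages: 90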
Keywords: Phone Segmentation, Boundary Detection
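  • Abstract (translated from Chinese):
    Phone segmentation partitions a continuous speech signal into individual phone units. It is
    commonly performed in speech processing tasks such as acoustic-phonetic analysis, speech
    recognition, speaker recognition, speech synthesis, and speech corpus annotation. Manual
    phone segmentation is time-consuming, and different transcribers produce inconsistent
    results, so an automatic phone segmentation method is needed. The typical approach aligns
    the speech signal with its phone labels: if a text transcript of the utterance is available,
    forced alignment based on hidden Markov models can locate the time points of the phone
    boundaries. This is a supervised method and usually achieves high accuracy. Some
    applications, however, have neither training data nor prior transcripts and therefore need
    an unsupervised method. If the speech data comes with no text or transcript in its
    language, phone segmentation must be done blindly. High accuracy is hard to achieve this
    way, and improving it is a challenge.
    This dissertation studies the problem of blind phone segmentation. Band energies are used to
    extract speech features, and four blind phone segmentation methods are proposed: (1) the
    delta spectral function method, (2) the band-energy curve tracing method, (3) the Gaussian
    function method, and (4) the Legendre polynomial approximation method. In tests on the
    English speech corpus TIMIT, the proposed methods outperformed earlier methods. The
    band-energy curve tracing method was also tested on the Mandarin speech corpus TCC300,
    which revealed some issues concerning language independence. The influence of noise was
    examined for the Gaussian function method and the Legendre polynomial approximation method.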


    Phone segmentation partitions a continuous speech signal into discrete phone units. It is
    required in several areas of speech processing, such as acoustic-phonetic analysis, speech
    recognition, speaker recognition, speech synthesis, and the annotation of speech corpora.
    Manual phone segmentation is time-consuming, and its results may be inconsistent because
    different transcribers apply subjective criteria. An automatic method of phone segmentation
    is therefore desirable. A typical approach aligns the speech signal of an utterance with its
    phone transcript. Forced alignment based on hidden Markov models locates the phone
    boundaries when the phone transcript of the target utterance is available. This supervised
    method usually achieves high accuracy. In some applications, however, neither training
    speech nor its transcripts are available, so unsupervised methods are used. If no linguistic
    knowledge of the speech data (such as orthographic or phonetic transcripts) is given, phone
    segmentation must be performed blindly. Achieving high accuracy with a blind method is
    challenging.
    This dissertation addresses the problem of blind phone segmentation. Features are extracted
    by calculating the band energies (BE) of the speech signal. Four methods for blind phone
    segmentation are proposed, based on (1) a delta spectral function, (2) a band-energy tracing
    technique, (3) a Gaussian function, and (4) Legendre polynomial approximation. The methods
    were evaluated on the English speech corpus TIMIT. Experimental results showed that the
    proposed methods were more accurate than previous methods. The method using the BE tracing
    technique was also evaluated on the Chinese speech corpus TCC300 to reveal problems of
    language independence. The influence of noise was investigated for the methods using the
    Gaussian function and Legendre polynomial approximation.
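
    The general idea of detecting phone boundaries from spectral change in band energies can be
    illustrated with a minimal sketch. It is not the dissertation's algorithm: the function
    names, the equal-width frequency bands (standing in for the mel-scaled filter bank listed in
    the table of contents), the frame sizes, and the threshold are all assumptions chosen for
    illustration. The sketch computes log band energies per frame, sums the absolute band-energy
    differences between frames a few steps apart, and picks peaks of that change curve as
    candidate boundaries.

```python
import numpy as np


def band_energies(signal, sr, n_bands=8, frame_len=0.02, hop=0.01):
    """Log energy per frame in n_bands equal-width bands (assumed layout)."""
    n = int(round(frame_len * sr))   # samples per frame
    h = int(round(hop * sr))         # samples between frame starts
    window = np.hanning(n)
    n_frames = 1 + (len(signal) - n) // h
    # Band edges in FFT-bin indices; rfft yields n // 2 + 1 bins.
    edges = np.linspace(0, n // 2 + 1, n_bands + 1).astype(int)
    be = np.empty((n_frames, n_bands))
    for t in range(n_frames):
        frame = signal[t * h:t * h + n] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        for b in range(n_bands):
            be[t, b] = np.log(power[edges[b]:edges[b + 1]].sum() + 1e-10)
    return be


def spectral_change(be, k=2):
    """Sum over bands of |BE[t + k] - BE[t - k]|, with edge frames repeated."""
    padded = np.pad(be, ((k, k), (0, 0)), mode="edge")
    delta = padded[2 * k:] - padded[:-2 * k]
    return np.abs(delta).sum(axis=1)


def pick_boundaries(change, min_gap=3, rel_threshold=0.3):
    """Frame indices of local maxima above a relative threshold."""
    threshold = rel_threshold * change.max()
    peaks = []
    for t in range(1, len(change) - 1):
        is_peak = change[t] > change[t - 1] and change[t] >= change[t + 1]
        if is_peak and change[t] >= threshold:
            # Keep only the first peak within any min_gap-frame stretch.
            if not peaks or t - peaks[-1] >= min_gap:
                peaks.append(t)
    return peaks


if __name__ == "__main__":
    # Toy input: a 500 Hz tone followed by a 3000 Hz tone, switching at 0.5 s.
    sr = 16000
    time = np.arange(sr) / sr
    toy = np.where(time < 0.5,
                   np.sin(2 * np.pi * 500 * time),
                   np.sin(2 * np.pi * 3000 * time))
    frames = pick_boundaries(spectral_change(band_energies(toy, sr)), min_gap=5)
    print("candidate boundary frames:", frames)
```

    Because no transcript or trained model is involved, a detector of this kind is blind in the
    sense used above; its accuracy depends heavily on the threshold and smoothing choices,
    which in this sketch are arbitrary.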

    1. Introduction
    2. Automatic Phone Segmentation
        2.1 Phonemes, Phones, Allophones and Letters of Alphabets
        2.2 Approaches for Automatic Phone Segmentation
        2.3 Performance Evaluation
        2.4 Speech Databases
            2.4.1 TIMIT Corpus
            2.4.2 TCC300 Corpus
        2.5 Landmark Detection
    3. Previous Works of Blind Phone Segmentation
        3.1 Model-free Methods
        3.2 Model-tied Methods
        3.3 Combined Methods
    4. Proposed Methods
        4.1 Band-energy Representation of Speech Spectra
            4.1.1 Mel-scaled Filter Bank
            4.1.2 Feature Extraction of Speech Signal
        4.2 Method Using Delta Spectral Function
            4.2.1 Spectral Change Representation
            4.2.2 Post Processing
            4.2.3 Experiments and Results
        4.3 Method Using Band-energy (BE) Tracing Technique
            4.3.1 The BE Curve
            4.3.2 BE Transition
            4.3.3 Detection of Phone Boundaries
            4.3.4 Experiment on TIMIT
            4.3.5 Evaluation on Mandarin Chinese Corpus
                4.3.5.1 Mandarin Phonetics
                4.3.5.2 Experimental Results
                4.3.5.3 Statistics of Phone Boundary Detection Errors
                4.3.5.4 Discussions
        4.4 Method Using Gaussian Function
            4.4.1 Gaussian Function
            4.4.2 Calculation of Gaussian Function Parameter
            4.4.3 Phone Segmentation
            4.4.4 Experiments and Results
            4.4.5 Noise Influence
        4.5 Method Using Legendre Polynomial Approximation
            4.5.1 Legendre Polynomials
            4.5.2 Legendre Polynomial Approximation of BE Curve
            4.5.3 Step 1: Detection of Phone Boundaries
                4.5.3.1 Detection of BE Changes
                4.5.3.2 Integrated View of BE Changes
            4.5.4 Step 2: Recovery of Missed Phone Boundaries
            4.5.5 Experiment Setup and Results
            4.5.6 Analysis of Detection Errors
            4.5.7 Noise Influence
            4.5.8 Phone Boundary Types Related to AC and A Landmarks
        4.6 Comparison of Phone Segmentation Methods
    5. Conclusion
    Bibliography
    Publication List


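    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)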
