
Student: Tai, Chiang-Lin (戴強麟)
Thesis Title: Improving Children's Speech Recognition through Autoencoder-Based Acoustic Modeling and Semi-Supervised Learning
Advisor: Tsay, Ren-Song (蔡仁松)
Committee Members: Wang, Hsin-Min (王新民); Chang, Chun-Sheng (張俊盛); Su, Yi-Ching (蘇宜青); Liu, Yi-Wen (劉奕汶)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 57
Chinese Keywords: 語音辨識, 孩童, 自編碼器, 半監督式學習, 模型調適
English Keywords: Speech recognition, Children, Autoencoder, Semi-supervised learning, Model adaptation
Automatic speech recognition (ASR) is the technology of converting human speech into text. Children's speech recognition has been an active research topic for the past decade: although ASR technology is close to mature and error rates for the general population have fallen to very low levels, error rates for special populations such as children remain high.

The main causes are the scarcity of transcribed children's speech, the large variability of children's acoustic features, and inaccurate pronunciation. The last of these is handled by statistically modifying the lexicon, while the first two have been the focus of research over the past decade. Previous work has concentrated on normalizing or augmenting children's speech features and on changing model training methods or model architectures; this thesis likewise proposes improvements targeting these two factors.

To handle the variability of children's speech features, we introduce an autoencoder architecture into the acoustic model, named the Filter-based Discriminative Autoencoder (f-DcAE), whose purpose is to strengthen the filtering of non-phoneme-related information from the features. This architecture reduces the error rate on the children's test set by a relative 7.8%.

To handle the lack of model robustness caused by the scarcity of children's speech, we mix abundant adult speech with a small amount of transcribed children's speech (in-domain), and then use a deep-learning algorithm to bring in the remaining untranscribed children's speech (out-of-domain), strengthening the model's ability to recognize children's speech. This not only lowers the error rate on the in-domain children's test set, but also yields a relative error-rate reduction of more than 20% on the out-of-domain children's test set.
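Note that "relative" error-rate reduction here means the drop in error rate expressed as a fraction of the baseline rate, not an absolute difference in percentage points. A minimal illustration of the arithmetic, using made-up word error rates (the thesis reports only the relative figures):

    # Hypothetical WER values for illustration only; not numbers from the thesis.
    baseline_wer = 20.0   # baseline word error rate (%)
    improved_wer = 18.44  # WER after the proposed change (%)
    relative_reduction = (baseline_wer - improved_wer) / baseline_wer * 100
    print(f"relative WER reduction: {relative_reduction:.1f}%")  # prints 7.8%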


Automatic speech recognition is the technology that converts human speech into text. Children's speech recognition has been a hot topic for nearly a decade. Even though the technology is mature and the error rate of speech recognition has become satisfactorily low for the general population, the error rate remains high for some groups, e.g., children.

The high error rate of children's speech recognition may be attributed to the scarcity of transcribed children's speech, the large variation in the features of children's speech, and incorrect pronunciation. The last factor requires a statistical approach to modify the dictionary, while the first two have been the focus of much research in the last decade. Past work has concentrated on feature normalization and augmentation for children's speech, or on changes to model training methods and model architectures. We also propose improvements targeting these two factors.
To address the variation in children's speech characteristics, we shape our acoustic model as an autoencoder and name it the 'Filter-based Discriminative Autoencoder' ('f-DcAE' for short). By enhancing the filtering of non-phoneme-related information from the features, this modeling framework reduces the error rate on the children's test set by a relative 7.8%.
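The abstract does not detail the network itself. One plausible reading of a discriminative autoencoder that filters non-phoneme information is an encoder whose latent code is split into a phoneme-related part used for senone classification and a residual part that absorbs the rest, with a decoder reconstructing the input from both. The PyTorch-style toy below sketches that reading; all layer sizes, names, and the loss weighting are assumptions for illustration, not the thesis' actual f-DcAE configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Assumed toy dimensions; the real feature and senone inventories differ.
    FEAT_DIM, HID_DIM, PHONE_DIM, RES_DIM, NUM_SENONES = 40, 256, 128, 64, 2000

    class DcAESketch(nn.Module):
        """Toy discriminative autoencoder: the latent code is split into a
        phoneme-related part (classified into senones) and a residual part that
        is meant to soak up non-phoneme information filtered out of the features."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(FEAT_DIM, HID_DIM), nn.ReLU(),
                                         nn.Linear(HID_DIM, PHONE_DIM + RES_DIM))
            self.classifier = nn.Linear(PHONE_DIM, NUM_SENONES)  # senone logits
            self.decoder = nn.Sequential(nn.Linear(PHONE_DIM + RES_DIM, HID_DIM), nn.ReLU(),
                                         nn.Linear(HID_DIM, FEAT_DIM))

        def forward(self, feats):
            code = self.encoder(feats)
            phone_code = code[:, :PHONE_DIM]      # phoneme-related subspace only
            logits = self.classifier(phone_code)
            recon = self.decoder(code)            # reconstruct from the full code
            return logits, recon

    def dcae_loss(logits, recon, feats, senone_targets, alpha=0.1):
        # Classification loss plus a weighted reconstruction term; alpha is an
        # arbitrary illustrative weighting.
        return F.cross_entropy(logits, senone_targets) + alpha * F.mse_loss(recon, feats)

    # Toy usage on a batch of random "frames".
    model = DcAESketch()
    x = torch.randn(8, FEAT_DIM)
    y = torch.randint(0, NUM_SENONES, (8,))
    logits, recon = model(x)
    dcae_loss(logits, recon, x, y).backward()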
To address the lack of robustness caused by the scarcity of children's speech, we mix abundant adult speech with a small amount of transcribed children's speech (in-domain), and then use deep-learning algorithms to bring in additional untranscribed children's speech (out-of-domain), enhancing the model's ability to recognize children's speech. Not only does this help reduce the error rate on the in-domain children's test set, but the relative reduction in the error rate on the out-of-domain children's test set exceeds 20%.
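As a schematic of this semi-supervised flow (train a seed model on transcribed adult plus in-domain children's speech, pseudo-label the untranscribed out-of-domain children's speech with that model, keep confident hypotheses, and retrain on the union), the self-contained toy below substitutes a trivial nearest-centroid classifier on synthetic 2-D points for the acoustic model. The data, "model", and confidence threshold are stand-ins for illustration, not the Kaldi-based recipe used in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def train(feats, labels, num_classes=2):
        # "Training" here is just the per-class mean of the features.
        return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

    def decode(model, feats):
        # Assign each sample to the nearest class mean; use the distance margin
        # between best and second-best class as a crude confidence score.
        dists = np.linalg.norm(feats[:, None, :] - model[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        sorted_d = np.sort(dists, axis=1)
        confidence = (sorted_d[:, 1] - sorted_d[:, 0]) / (sorted_d[:, 1] + 1e-8)
        return labels, confidence

    # 1. Seed model: abundant "adult" data plus a little transcribed "child" data (in-domain).
    adult_x = rng.normal([0.0, 0.0], 1.0, (200, 2)); adult_y = (adult_x[:, 0] > 0).astype(int)
    child_x = rng.normal([0.5, 0.5], 1.2, (20, 2));  child_y = (child_x[:, 0] > 0.5).astype(int)
    seed = train(np.vstack([adult_x, child_x]), np.concatenate([adult_y, child_y]))

    # 2. Pseudo-label the untranscribed (out-of-domain) child data with the seed model.
    unlabeled_x = rng.normal([0.5, 0.5], 1.2, (500, 2))
    pseudo_y, conf = decode(seed, unlabeled_x)
    keep = conf > 0.2   # arbitrary confidence threshold

    # 3. Retrain on the supervised data plus the confident pseudo-labeled data.
    final = train(np.vstack([adult_x, child_x, unlabeled_x[keep]]),
                  np.concatenate([adult_y, child_y, pseudo_y[keep]]))
    print("kept", int(keep.sum()), "of", len(unlabeled_x), "pseudo-labeled samples")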

Table of Contents:
1 Introduction
  1.1 Problem Description
  1.2 Literature Review
  1.3 Proposed Improvements
  1.4 Thesis Organization
2 Overview of Speech Recognition
  2.1 Speech Recognition Architecture
  2.2 Acoustic Model
  2.3 Language Model
  2.4 Acoustic Model Training
    2.4.1 Training Procedure
    2.4.2 Neural Network Models
    2.4.3 Training Objectives
    2.4.4 Speaker Vectors
  2.5 Integration of Acoustic and Language Models, Tools, and Performance Evaluation
    2.5.1 Decoder
    2.5.2 Kaldi
    2.5.3 Performance Evaluation
3 Preliminary System Design Flow
  3.1 Material Preparation
    3.1.1 Corpus Description
    3.1.2 Lexicon and Language Model Description
  3.2 Preliminary Work
    3.2.1 Scenario Description
    3.2.2 Generation of Labels and Lattices
    3.2.3 Baseline Construction
4 Model Design and Algorithm Flow
  4.1 Corpus Characteristics
  4.2 Proposed Model
  4.3 Semi-supervised Training
  4.4 Semi-supervised Training and Model Adaptation
5 Experimental Results and Discussion
  5.1 Recognition Results and Discussion of the TDNN-based f-DcAE
    5.1.1 Test Set Augmentation and Model Construction
    5.1.2 Recognition Results and Discussion
  5.2 Recognition Results and Discussion of the TDNN-f-based f-DcAE
    5.2.1 Model Construction
    5.2.2 Recognition Results and Discussion
  5.3 Semi-supervised Training of the TDNN-f-based f-DcAE
    5.3.1 Procedure Details
    5.3.2 Recognition Results and Discussion
  5.4 Semi-supervised Training and Model Adaptation of the TDNN-f-based f-DcAE
    5.4.1 Procedure Details
    5.4.2 Recognition Results and Discussion
  5.5 Error Rate Component Analysis
    5.5.1 Error Rate Analysis on the In-domain Test Set
    5.5.2 Error Rate Analysis on the Out-of-domain Test Set
6 Conclusion
References
Appendix: Questions and Suggestions from the Oral Examination Committee
  A.1 Prof. Chang, Chun-Sheng
  A.2 Prof. Su, Yi-Ching
  A.3 Prof. Wang, Hsin-Min
  A.4 Prof. Tsay, Ren-Song
  A.5 Prof. Liu, Yi-Wen

