
Student: Tai, Chiang-Lin (戴強麟)
Thesis Title: Improving Children's Speech Recognition through Autoencoder-Based Acoustic Modeling and Semi-Supervised Learning
Advisor: Tsay, Ren-Song (蔡仁松)
Committee Members: Wang, Hsin-Min (王新民); Chang, Chun-Sheng (張俊盛); Su, Yi-Ching (蘇宜青); Liu, Yi-Wen (劉奕汶)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 57
Chinese Keywords: 語音辨識, 孩童, 自編碼器, 半監督式學習, 模型調適
English Keywords: Speech recognition, Children, Autoencoder, Semi-supervised learning, Model adaptation
Automatic speech recognition (ASR) is the technology of converting human speech into text. Children's speech recognition has been an active research topic for the past decade: although ASR technology is close to mature and error rates for the general population have fallen to very low levels, error rates for special populations such as children remain high.

The main causes are the scarcity of transcribed children's speech, the large variability of children's acoustic features, and inaccurate pronunciation. The last of these is handled by statistically modifying the lexicon, while the first two have been the focus of research over the past decade. Previous work has concentrated on normalizing or augmenting children's speech features and on changing model training methods or model architectures; this thesis likewise proposes improvements targeting these two factors.

To handle the variability of children's speech features, we introduce an autoencoder architecture into the acoustic model, named the Filter-based Discriminative Autoencoder (f-DcAE), whose purpose is to strengthen the filtering of non-phoneme-related information from the features. This architecture reduces the error rate on the children's test set by a relative 7.8%.

To handle the lack of model robustness caused by the scarcity of children's speech, we mix abundant adult speech with a small amount of transcribed children's speech (in-domain), and then use a deep-learning algorithm to bring in the remaining untranscribed children's speech (out-of-domain), strengthening the model's ability to recognize children's speech. This not only lowers the error rate on the in-domain children's test set, but also yields a relative error-rate reduction of more than 20% on the out-of-domain children's test set.
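Note that "relative" error-rate reduction here means the drop in error rate expressed as a fraction of the baseline rate, not an absolute difference in percentage points. A minimal illustration of the arithmetic, using made-up word error rates (the thesis reports only the relative figures):

    # Hypothetical WER values for illustration only; not numbers from the thesis.
    baseline_wer = 20.0   # baseline word error rate (%)
    improved_wer = 18.44  # WER after the proposed change (%)
    relative_reduction = (baseline_wer - improved_wer) / baseline_wer * 100
    print(f"relative WER reduction: {relative_reduction:.1f}%")  # prints 7.8%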


Automatic speech recognition is the technology that converts human speech into text. Children's speech recognition has been a hot topic for nearly a decade. Even though the technology is mature and the error rate of speech recognition has become satisfactorily low for the general population, the error rate remains high for some groups, e.g., children.

The high error rate of children's speech recognition may be attributed to the scarcity of transcribed children's speech, the large variation in the features of children's speech, and incorrect pronunciation. The last factor requires a statistical approach to modify the dictionary, while the first two have been the focus of much research in the last decade. Past work has concentrated on feature normalization and augmentation for children's speech, or on changes to model training methods and model architectures. We also propose improvements targeting these two factors.
To address the variation in children's speech characteristics, we shape our acoustic model as an autoencoder and name it the 'Filter-based Discriminative Autoencoder' ('f-DcAE' for short). By enhancing the filtering of non-phoneme-related information from the features, this modeling framework reduces the error rate on the children's test set by a relative 7.8%.
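The abstract does not detail the network itself. One plausible reading of a discriminative autoencoder that filters non-phoneme information is an encoder whose latent code is split into a phoneme-related part used for senone classification and a residual part that absorbs the rest, with a decoder reconstructing the input from both. The PyTorch-style toy below sketches that reading; all layer sizes, names, and the loss weighting are assumptions for illustration, not the thesis' actual f-DcAE configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Assumed toy dimensions; the real feature and senone inventories differ.
    FEAT_DIM, HID_DIM, PHONE_DIM, RES_DIM, NUM_SENONES = 40, 256, 128, 64, 2000

    class DcAESketch(nn.Module):
        """Toy discriminative autoencoder: the latent code is split into a
        phoneme-related part (classified into senones) and a residual part that
        is meant to soak up non-phoneme information filtered out of the features."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(FEAT_DIM, HID_DIM), nn.ReLU(),
                                         nn.Linear(HID_DIM, PHONE_DIM + RES_DIM))
            self.classifier = nn.Linear(PHONE_DIM, NUM_SENONES)  # senone logits
            self.decoder = nn.Sequential(nn.Linear(PHONE_DIM + RES_DIM, HID_DIM), nn.ReLU(),
                                         nn.Linear(HID_DIM, FEAT_DIM))

        def forward(self, feats):
            code = self.encoder(feats)
            phone_code = code[:, :PHONE_DIM]      # phoneme-related subspace only
            logits = self.classifier(phone_code)
            recon = self.decoder(code)            # reconstruct from the full code
            return logits, recon

    def dcae_loss(logits, recon, feats, senone_targets, alpha=0.1):
        # Classification loss plus a weighted reconstruction term; alpha is an
        # arbitrary illustrative weighting.
        return F.cross_entropy(logits, senone_targets) + alpha * F.mse_loss(recon, feats)

    # Toy usage on a batch of random "frames".
    model = DcAESketch()
    x = torch.randn(8, FEAT_DIM)
    y = torch.randint(0, NUM_SENONES, (8,))
    logits, recon = model(x)
    dcae_loss(logits, recon, x, y).backward()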
To address the lack of robustness caused by the scarcity of children's speech, we mix abundant adult speech with a small amount of transcribed children's speech (in-domain), and then use deep-learning algorithms to bring in additional untranscribed children's speech (out-of-domain), enhancing the model's ability to recognize children's speech. Not only does this help reduce the error rate on the in-domain children's test set, but the relative reduction in the error rate on the out-of-domain children's test set exceeds 20%.
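As a schematic of this semi-supervised flow (train a seed model on transcribed adult plus in-domain children's speech, pseudo-label the untranscribed out-of-domain children's speech with that model, keep confident hypotheses, and retrain on the union), the self-contained toy below substitutes a trivial nearest-centroid classifier on synthetic 2-D points for the acoustic model. The data, "model", and confidence threshold are stand-ins for illustration, not the Kaldi-based recipe used in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def train(feats, labels, num_classes=2):
        # "Training" here is just the per-class mean of the features.
        return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])

    def decode(model, feats):
        # Assign each sample to the nearest class mean; use the distance margin
        # between best and second-best class as a crude confidence score.
        dists = np.linalg.norm(feats[:, None, :] - model[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        sorted_d = np.sort(dists, axis=1)
        confidence = (sorted_d[:, 1] - sorted_d[:, 0]) / (sorted_d[:, 1] + 1e-8)
        return labels, confidence

    # 1. Seed model: abundant "adult" data plus a little transcribed "child" data (in-domain).
    adult_x = rng.normal([0.0, 0.0], 1.0, (200, 2)); adult_y = (adult_x[:, 0] > 0).astype(int)
    child_x = rng.normal([0.5, 0.5], 1.2, (20, 2));  child_y = (child_x[:, 0] > 0.5).astype(int)
    seed = train(np.vstack([adult_x, child_x]), np.concatenate([adult_y, child_y]))

    # 2. Pseudo-label the untranscribed (out-of-domain) child data with the seed model.
    unlabeled_x = rng.normal([0.5, 0.5], 1.2, (500, 2))
    pseudo_y, conf = decode(seed, unlabeled_x)
    keep = conf > 0.2   # arbitrary confidence threshold

    # 3. Retrain on the supervised data plus the confident pseudo-labeled data.
    final = train(np.vstack([adult_x, child_x, unlabeled_x[keep]]),
                  np.concatenate([adult_y, child_y, pseudo_y[keep]]))
    print("kept", int(keep.sum()), "of", len(unlabeled_x), "pseudo-labeled samples")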

Table of Contents:
1 Introduction
  1.1 Problem Description
  1.2 Literature Review
  1.3 Proposed Improvements
  1.4 Thesis Organization
2 Overview of Speech Recognition
  2.1 Speech Recognition Architecture
  2.2 Acoustic Model
  2.3 Language Model
  2.4 Acoustic Model Training
    2.4.1 Training Procedure
    2.4.2 Neural Network Models
    2.4.3 Training Objectives
    2.4.4 Speaker Vectors
  2.5 Integration of Acoustic and Language Models, Tools, and Performance Evaluation
    2.5.1 Decoder
    2.5.2 Kaldi
    2.5.3 Performance Evaluation
3 Preliminary System Design Flow
  3.1 Material Preparation
    3.1.1 Corpus Description
    3.1.2 Lexicon and Language Model Description
  3.2 Preliminary Work
    3.2.1 Scenario Description
    3.2.2 Generation of Labels and Lattices
    3.2.3 Baseline Construction
4 Model Design and Algorithm Flow
  4.1 Corpus Characteristics
  4.2 Proposed Model
  4.3 Semi-supervised Training
  4.4 Semi-supervised Training and Model Adaptation
5 Experimental Results and Discussion
  5.1 Recognition Results and Discussion of the TDNN-based f-DcAE
    5.1.1 Test Set Augmentation and Model Construction
    5.1.2 Recognition Results and Discussion
  5.2 Recognition Results and Discussion of the TDNN-f-based f-DcAE
    5.2.1 Model Construction
    5.2.2 Recognition Results and Discussion
  5.3 Semi-supervised Training of the TDNN-f-based f-DcAE
    5.3.1 Procedure Details
    5.3.2 Recognition Results and Discussion
  5.4 Semi-supervised Training and Model Adaptation of the TDNN-f-based f-DcAE
    5.4.1 Procedure Details
    5.4.2 Recognition Results and Discussion
  5.5 Error Rate Component Analysis
    5.5.1 Error Rate Analysis on the In-domain Test Set
    5.5.2 Error Rate Analysis on the Out-of-domain Test Set
6 Conclusion
References
Appendix: Questions and Suggestions from the Oral Examination Committee
  A.1 Prof. Chang, Chun-Sheng
  A.2 Prof. Su, Yi-Ching
  A.3 Prof. Wang, Hsin-Min
  A.4 Prof. Tsay, Ren-Song
  A.5 Prof. Liu, Yi-Wen

