Graduate Student: Lin, Tzu-Han (林子瀚)
Thesis Title: Toward Automatic Generation of Transcript from Spoken Lectures: the "Dream of the Red Chamber" Series (口語講座文本之自動生成:《紅樓夢》系列)
Advisor: Liu, Yi-Wen (劉奕汶)
Committee Members: Bai, Ming-Sian (白明憲); Liao, Yuan-Fu (廖元甫); Lo, Shih-Lung (羅仕龍)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 111
Language: English
Number of Pages: 46
Keywords: Speech recognition, Automatic transcription, Transfer learning, Lecture corpus, Dream of the Red Chamber
Abstract: In recent years, owing to the COVID-19 pandemic, more and more university lectures have been delivered through distance learning and posted online in video format. When students search for or watch these open courses, having the corresponding transcripts is very helpful for learning and comprehension. However, manually transcribing spoken lectures into text requires repeated proofreading and is time-consuming and labor-intensive. With the maturity of speech recognition technology, many researchers have attempted to automate subtitle generation, but most applications are limited by the difficulty of obtaining localized training data, or by the mismatch between the training data of open-source pre-trained models and the actual test data, which leads to unsatisfactory results. For lectures given with Taiwanese Mandarin accents in particular, data scarcity poses a challenge; one strategy is to pre-train a model on a larger Mandarin corpus and then fine-tune it on a Taiwanese-accented corpus. We chose the AISHELL-1 corpus released by Beijing AISHELL Technology for pre-training, and then adapted the model to Taiwanese Mandarin through transfer learning, thereby developing an automatic transcription system for the lecture series on Dream of the Red Chamber. Recognition accuracy was improved by varying the number of transferred layers and fine-tuning the learning rate, and the character error rate (CER) obtained with different language model combinations was measured. For reference, we also compared the recognition results of our model with those of Google's speech recognition API. The results show that transferring the whole network yields a CER of 16.44%; by further adjusting the training parameters of the acoustic model, the lowest CER of 15.83% was achieved. In addition, by reducing the perplexity of the language model and the number of out-of-vocabulary words, the CER was further reduced by 0.29%.
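As a reference for the error figures quoted above, the sketch below shows how the character error rate (CER) is conventionally computed: the Levenshtein distance between the reference and recognized character sequences, divided by the reference length. It is a minimal, generic illustration in Python with hypothetical example strings; it is not the Kaldi scoring pipeline actually used in the thesis.

```python
# Minimal sketch of character error rate (CER) computation.
# Generic Levenshtein-distance implementation for illustration only;
# the example strings below are hypothetical.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    ref = ref.replace(" ", "")
    hyp = hyp.replace(" ", "")
    return edit_distance(ref, hyp) / max(len(ref), 1)

if __name__ == "__main__":
    reference = "紅樓夢是清代章回小說"
    hypothesis = "紅樓夢是清代章回小說家"  # one inserted character
    print(f"CER = {cer(reference, hypothesis):.2%}")  # -> CER = 10.00%
```

For Chinese, whitespace is stripped so that scoring is done purely at the character level, which is why CER rather than word error rate is the metric reported in the abstract.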