Graduate Student: 莊裕嵐 Chuang, Yu-Lan
Thesis Title: 泰雅語電腦輔助發音訓練系統 (A Computer-assisted Pronunciation Training System for Atayal)
Advisor: 劉奕汶 Liu, Yi-Wen
Committee Members: 蘇宜青 Su, Yi-Ching; 辛靜婷 Hsin, Ching-Ting; 白明憲 Bai, Ming-Sian; 賴穎暉 Lai, Ying-Hui
Degree: Master (碩士)
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science (電機資訊學院 - 電機工程學系)
Year of Publication: 2024
Academic Year of Graduation: 113
Language: Chinese
Number of Pages: 45
Keywords (Chinese): 泰雅語, 電腦輔助發音訓練, 少數語言建模
Keywords (English): Atayal, Computer-assisted pronunciation training, Low-resourced language modelling
This research introduces a computer-assisted pronunciation training (CAPT) system for Squliq Atayal, an Indigenous language unique to Taiwan that belongs to the Austronesian language family. A CAPT system uses speech recognition technology to detect a learner's pronunciation errors automatically, thereby supporting pronunciation learning. Since Atayal is a low-resource language, the system is implemented with speech recognition modelling techniques for low-resource languages. It comprises two parts: a phonetic metric that diagnoses the segmental quality of the learner's pronunciation, and an intonational metric that assesses the learner's pitch inflection. For the phonetic metric, an open-source cross-lingual acoustic model, XLSR-53, recognizes the phonemes in the learner's speech; the recognized phoneme sequence is then compared against the ground-truth sequence with a pairwise sequence alignment algorithm, and the mismatches locate the pronunciation errors.
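As a concrete illustration of the phonetic metric, the sketch below runs CTC phoneme recognition and then aligns the result against the answer sequence. This is a minimal sketch under stated assumptions, not the thesis's code: the public checkpoint facebook/wav2vec2-xlsr-53-espeak-cv-ft stands in for the Atayal-fine-tuned XLSR-53 model, the alignment is a plain Needleman-Wunsch implementation of the pairwise alignment step, and the file name learner.wav and the example word (qutux, Atayal for "one") are hypothetical.

```python
# Sketch of the phonetic metric: CTC phoneme recognition + pairwise alignment.
# The checkpoint below is a stand-in for the thesis's fine-tuned XLSR-53 model.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # public phoneme recognizer

def recognize_phonemes(wav_path):
    """CTC phoneme recognition; assumes a 16 kHz mono recording."""
    processor = Wav2Vec2Processor.from_pretrained(CKPT)
    model = Wav2Vec2ForCTC.from_pretrained(CKPT)
    speech, sr = sf.read(wav_path)
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)[0]
    return processor.decode(ids).split()  # space-separated phoneme symbols

def align(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment; '-' marks an insertion/deletion gap."""
    n, m = len(ref), len(hyp)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # DP table of path scores
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == hyp[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m  # trace back the best-scoring path
    while i > 0 or j > 0:
        sub = match if i and j and ref[i - 1] == hyp[j - 1] else mismatch
        if i and j and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", hyp[j - 1])); j -= 1
    return pairs[::-1]

# Mismatched pairs mark where the learner's pronunciation deviates.
answer = ["q", "u", "t", "u", "x"]  # hypothetical target word "qutux"
errors = [(k, r, h) for k, (r, h) in
          enumerate(align(answer, recognize_phonemes("learner.wav"))) if r != h]
```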
For the intonational metric, a pitch estimation algorithm first extracts the pitch contours of the learner's and the instructor's recordings, and the frame-to-frame difference (delta) of each contour is computed. The two delta pitch sequences are aligned by dynamic time warping, and the root-mean-square error (RMSE) between the aligned sequences serves as the intonation score. To interact with users, two simple analysis interfaces were also developed: one only displays the positions of pronunciation errors, while the other additionally delivers a diagnostic report based on the error type. Finally, 14 participants were invited to test the system, and all of them preferred the interface with the diagnostic feedback.
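The intonational metric can likewise be sketched in a few lines. The sketch below assumes pyworld's Harvest as the pitch estimator and librosa's DTW implementation; the thesis's actual pitch estimation algorithm and DTW code may differ, the voiced-frame filtering is a simplification, and the file names learner.wav and teacher.wav are hypothetical.

```python
# Sketch of the intonation score: delta pitch contours, DTW alignment, RMSE.
import numpy as np
import pyworld
import soundfile as sf
import librosa

def delta_pitch(wav_path):
    """Extract an F0 contour and return its frame-to-frame differences."""
    x, fs = sf.read(wav_path)  # assumes a mono recording
    f0, _ = pyworld.harvest(x.astype(np.float64), fs)
    f0 = f0[f0 > 0]            # keep voiced frames only (simplification)
    return np.diff(f0)

def intonation_rmse(learner_wav, teacher_wav):
    """DTW-align the two delta pitch contours and score them by RMSE."""
    a, b = delta_pitch(learner_wav), delta_pitch(teacher_wav)
    _, path = librosa.sequence.dtw(a.reshape(1, -1), b.reshape(1, -1))
    diffs = np.array([a[i] - b[j] for i, j in path])
    return float(np.sqrt(np.mean(diffs ** 2)))

print(intonation_rmse("learner.wav", "teacher.wav"))
```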