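Author: 邱絢紋
Chiu, Hsun-Wen
Thesis title: Chinese Spell Checking Based on Noisy Channel Model
Advisor: 張俊盛
Chang, Jason S.
Committee members: 張嘉惠
陳信希
柯淑津
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar)
Language: English
Number of pages: 41
Keywords (Chinese): 雜訊通道模型 (noisy channel model), 語言模型 (language model), 網路語料 (Web corpus), 混淆字集 (confusion set)
Keywords (English): Noisy Channel Model, Character-based Language Model, Web Corpus, Confusion Set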
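  • Automatically correcting Chinese spelling and typing errors is an important problem in word processing, Web search, and automated essay scoring. Chinese spelling correction differs from spelling correction for alphabetic languages, however: Chinese has no delimiters between words, and different Chinese input methods can produce different types of errors, which makes correction harder. This thesis proposes a method based on the noisy channel model for correcting errors that involve characters similar in sound or shape. We first use a Web corpus to generate confusion sets and their associated probabilities, which form the channel model, and then correct errors using the channel model and a language model within the noisy channel framework. The system consists of a training phase and a run-time phase. In the training phase, we use the frequencies of n-grams in the Web corpus to estimate, for each character, the probability of each of its confusable characters. At run time, the system generates multiple candidate characters for the input sentence and uses the channel model and the language model to select the most suitable one. Experimental results show that a prototype system built with the proposed method achieves good precision and recall in error correction.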


    Chinese spell checking is an important component of many Chinese NLP applications, including word processors, search engines, and automatic essay rating. Unlike English, Chinese has no word boundaries, and the various Chinese input methods cause different kinds of typos, so developing a spell checker for Chinese is more difficult. In this thesis, we introduce a novel method for correcting Chinese errors that involve characters similar in sound or shape. In our approach, confusion sets and their probabilities are first derived from a Web corpus to form a channel model; potential typos in a given sentence are then corrected using this channel model and a character-based language model within the noisy channel framework. In the training phase, we estimate the channel probabilities for each character from n-gram frequencies in the Web corpus. At run time, the system generates correction candidates for each character in the given sentence and selects the most appropriate correction using the channel model and the language model. The experimental results show that the proposed method achieves significantly better accuracy and recall than more complicated methods in previous work.
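    The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (train_channel, make_bigram_lm, correct), the toy counts, the tiny corpus, the add-alpha smoothing, and the greedy left-to-right search are all assumptions made for illustration. The sketch only shows the general noisy channel idea of choosing, for each typed character, the candidate c that maximizes P(typed | c) multiplied by the language model probability of the resulting sentence.

```python
import math
from collections import Counter


def train_channel(pair_counts):
    """Training phase (sketch): turn counts of (intended, typed) character
    pairs into channel probabilities P(typed | intended)."""
    totals = Counter()
    for (intended, _typed), n in pair_counts.items():
        totals[intended] += n
    return {pair: n / totals[pair[0]] for pair, n in pair_counts.items()}


def make_bigram_lm(corpus, alpha=1.0):
    """Character bigram language model with add-alpha smoothing (an assumed,
    simplified stand-in for a real character-based language model).
    Returns a function mapping a sentence to its log-probability."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        chars = ["<s>"] + list(sent)
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    v = len(vocab)

    def logprob(sentence):
        chars = ["<s>"] + list(sentence)
        return sum(
            math.log((bigrams[(p, c)] + alpha) / (contexts[p] + alpha * v))
            for p, c in zip(chars, chars[1:])
        )

    return logprob


def correct(sentence, channel, lm_logprob):
    """Run-time phase (sketch): greedy left-to-right correction. For each
    typed character, score every candidate c by
    log P(typed | c) + log P_LM(sentence with c) and keep the best one."""
    chars = list(sentence)
    for i, typed in enumerate(chars):
        # Candidates are the intended characters that the channel says can
        # surface as `typed`.
        candidates = {c: p for (c, t), p in channel.items() if t == typed}
        # Assumption: a character unknown to the channel is kept with
        # probability 1.
        if typed not in candidates:
            candidates[typed] = 1.0
        chars[i] = max(
            candidates,
            key=lambda c: math.log(candidates[c])
            + lm_logprob("".join(chars[:i] + [c] + chars[i + 1:])),
        )
    return "".join(chars)


# Toy data, invented for this example only.
pair_counts = {
    ("再", "再"): 95, ("再", "在"): 5,
    ("在", "在"): 97, ("在", "再"): 3,
}
channel = train_channel(pair_counts)
lm = make_bigram_lm(["再見", "明天再見", "我在家", "再見了"], alpha=0.1)

print(correct("明天在見", channel, lm))
print(correct("我在家", channel, lm))
```

    A greedy search keeps the sketch short; a decoder that considers several positions jointly (for example Viterbi or beam search) would be a natural alternative when a sentence contains more than one error.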

    Contents
    Chinese Abstract
    Abstract
    Acknowledgments
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related Work
    3 Method
      3.1 Problem Statement
      3.2 Training Channel Model
        3.2.1 Limiting Confusable Characters
        3.2.2 Retrieving Ngrams
        3.2.3 Correcting Ngrams and Training Channel Model
      3.3 Run-time Typo Correction
    4 Experiment and Discussion
      4.1 Experiment Setting
        4.1.1 Confusion Set
        4.1.2 Google Chinese Web 5-gram
        4.1.3 Existing Chinese Spell Checker
        4.1.4 Sinica Corpus
        4.1.5 Test Data
        4.1.6 Systems Compared
        4.1.7 Evaluation Metrics
      4.2 Evaluation
    5 Conclusion and Future Work


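    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)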
