
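Author: Yen, Ho-ching (顏合淨)
Thesis title: Using Sublexical Translations to Handle the OOV Problem in Machine Translation (組合逐字翻譯以解決機器翻譯中的未知詞問題)
Advisor: Chang, Jason S. (張俊盛)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2010
Academic year of graduation: 98
Language: English
Number of pages: 58
Keywords (Chinese): 機器翻譯 (machine translation), 未知詞 (unknown word), 組成字翻譯 (sublexical translation)
Keywords (English): machine translation, out-of-vocabulary word, sublexical translation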
  • In this thesis, we introduce a method for translating out-of-vocabulary (OOV) words by combining their sublexical translations. We formulate wildcard searches over the phrase translation table to find sublexical translations, which are then combined into translation candidates so as to maximize the probability of translating the OOV word. The candidates are ranked and filtered using bilingual and monolingual information. We incorporated the proposed OOV model into a state-of-the-art phrase-based machine translation system for Chinese-to-English translation. Experimental results show that the OOV module improves translation quality, especially for sentences containing more OOV words. The translations it generates can be incorporated into machine translation systems to ease the negative impact of OOV words on translation quality.


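    This thesis proposes a method for translating unknown (out-of-vocabulary) words, which can be used to address the unknown-word problem in machine translation. Translations are produced by taking the translations of the component characters of an unknown word and combining them in different ways. The main step is to query the machine translation system's existing phrase translation table with the component characters of the unknown word paired with wildcard symbols, obtain translations of each component, and then assemble translations of the unknown word from them. We use bilingual and monolingual data to filter and rank the translations generated by the method. For testing, we integrate the generated translations into an existing machine translation system and run Chinese-to-English translation experiments, scored with the BLEU metric. The results show that when a Chinese sentence contains more unknown words, the translations provided by our system improve overall translation quality. The main contribution of this thesis is a method that uses the machine translation system's existing phrase translation table to translate the components of an unknown word character by character and compose a translation of the unknown word from them, thereby helping to solve the unknown-word problem in machine translation.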
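    The pipeline described in the abstract (wildcard lookup, combination, ranking) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy phrase table, the monolingual counts, the function names, and the scoring formula are all hypothetical, and the word-mass voting used to pick sublexical translations is a crude stand-in for the extraction and pruning steps of the actual model. Candidates are combined in source order only.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from itertools import product

# Hypothetical toy phrase table: source phrase -> {target phrase: probability}.
PHRASE_TABLE = {
    "電腦": {"computer": 0.9, "PC": 0.1},
    "電腦公司": {"computer company": 1.0},
    "球迷": {"fan": 0.7, "ball fan": 0.3},
    "歌迷": {"fan": 0.8, "music fan": 0.2},
}

# Hypothetical monolingual evidence: counts of target phrases in an English corpus.
MONO_COUNTS = {"computer fan": 12, "computer company": 30}


def sublexical_translations(unit, top_k=2):
    """Wildcard-query the table for entries containing `unit` and return the
    top-k target words with their normalized probability mass (an assumption)."""
    mass = defaultdict(float)
    for source, targets in PHRASE_TABLE.items():
        if fnmatchcase(source, f"*{unit}*"):
            for target, prob in targets.items():
                for word in target.split():
                    mass[word] += prob
    total = sum(mass.values())
    ranked = sorted(mass.items(), key=lambda kv: -kv[1])[:top_k]
    return [(word, m / total) for word, m in ranked]


def segmentations(word):
    """Yield every split of `word` into contiguous character n-grams."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest


def translate_oov(oov, top_k=2):
    """Combine sublexical translations in source order and rank the candidates
    by bilingual score times a smoothed monolingual count."""
    best = {}
    for units in segmentations(oov):
        options = [sublexical_translations(u, top_k) for u in units]
        if not all(options):  # some unit has no translation; skip this split
            continue
        for combo in product(*options):
            candidate = " ".join(word for word, _ in combo)
            score = 1.0
            for _, prob in combo:
                score *= prob
            score *= 1 + MONO_COUNTS.get(candidate, 0)
            best[candidate] = max(score, best.get(candidate, 0.0))
    return sorted(best.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    for candidate, score in translate_oov("電腦迷")[:3]:
        print(f"{candidate}\t{score:.3f}")
```

    In this toy setting the word 電腦迷 is absent from the table, yet its parts 電腦 and 迷 each match existing entries, so a candidate can be assembled from them. The monolingual count then separates a plausible combination from the noisier ones produced by the other splits.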

    CHAPTER 1 Introduction
    CHAPTER 2 Related Work
    CHAPTER 3 The OOV Model
      3.1 Problem Statement
      3.2 Finding Sublexical Translations
        3.2.1 Retrieving Translations
        3.2.2 Extracting and Restraining Sublexical Translations
        3.2.3 Pruning Less Probable Sublexical N-gram Translations
      3.3 Run-Time Translation Candidates Ranking
    CHAPTER 4 Experimental Setting
      4.1 Underlying SMT System
      4.2 Data Sets
      4.3 Query Types and Bilingual Resources
      4.4 Parameter Setting
    CHAPTER 5 Evaluation Results and Discussion
      5.1 Experimental Results
      5.2 Discussion
    CHAPTER 6 Future Work and Summary
    References
    Appendix A - OOV Types and Their Examples in the NIST MT-08 Test Set
    Appendix B - Example Translations with OOV Words Correctly Translated
    Appendix C - Example Translations with OOV Words Partially Translated
    Appendix D - Example Translations with OOV Words Not in Combination Form


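    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)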
