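Author: Chung-Chi Huang (黃仲淇)
Thesis title: Learning Bilingual Parsing from Parallel Corpus and Monolingual Treebank (利用平行語料庫與單語樹庫之雙語剖析研究)
Advisor: Jason S. Chang (張俊盛)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 62
Chinese keywords: 倒置轉移文法, 剖析樹, 字序, 文法結構
English keywords: Inversion Transduction Grammar, Parse Tree, Word Order, Syntactic Structure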
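  • In this thesis, we propose a new algorithm for learning the Inversion Transduction Grammar (ITG) introduced by Wu (1997) and apply it to parsing bilingual sentences. Using the learned bilingual grammar rules, our model produces a parse tree with nested syntactic structure for a bilingual sentence pair; the tree shows a single syntactic structure together with the word-order relationship between the two languages. Wu's experiments with Bracketing Transduction Grammar, a simplified version of ITG, did not consider how constituent categories affect the ordering of counterpart structures in the two languages. In our training stage, by contrast, we use a large parallel corpus and a monolingual treebank to identify syntactic structure and to model the syntactic relationship between the two languages. In essence, our method uses the word alignments of the parallel corpus to project the grammar rules of the monolingual treebank onto the other language. During projection, we count how often each rule occurs in straight order and in inverted order, and derive the probabilities of the ITG rules from these counts. At runtime, a bottom-up parser builds the most probable bilingual parse tree for a sentence pair.
    We implemented the method and trained it on the news portion of the Hong Kong parallel corpus and the production rules provided by Andrew B. Clegg. We evaluated the model with the evaluation method of Och and Ney (2000). The experimental results show that the word alignments produced by our method achieve a better alignment error rate than those of Giza++, a state-of-the-art system. This indicates that the learned bilingual grammar rules effectively help bilingual parsing and provide a more reasonable reordering penalty. The bilingual parse trees produced for the parallel corpus can be used to refine the ITG rules and to help train a decoder for statistical machine translation.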
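
    The counting step of the training stage (tallying how often each projected rule appears in straight or inverted order, then turning the tallies into probabilities) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `orientation` and `estimate_orientation_probs`, the representation of a rule as a tuple such as ("VP", "VP", "PP"), the toy alignment positions, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def orientation(left_targets, right_targets):
    """Classify how two sibling constituents are ordered on the target side.

    Each argument is the set of target-language word positions aligned to
    one child (left, right) of a binary source-side rule. Hypothetical helper.
    """
    if not left_targets or not right_targets:
        return None  # an unaligned child gives no ordering evidence
    if max(left_targets) < min(right_targets):
        return "straight"  # same order in both languages
    if max(right_targets) < min(left_targets):
        return "inverted"  # order is swapped in the target language
    return None  # interleaved spans: neither straight nor inverted


def estimate_orientation_probs(observations, smoothing=1.0):
    """Turn (rule, orientation) observations into P(orientation | rule).

    The additive smoothing is an assumption of this sketch, not a detail
    taken from the thesis.
    """
    counts = defaultdict(Counter)
    for rule, label in observations:
        if label in ("straight", "inverted"):
            counts[rule][label] += 1
    probs = {}
    for rule, c in counts.items():
        total = c["straight"] + c["inverted"] + 2 * smoothing
        probs[rule] = {
            "straight": (c["straight"] + smoothing) / total,
            "inverted": (c["inverted"] + smoothing) / total,
        }
    return probs


# Toy data: a VP -> VP PP rule observed four times, inverted in three of them.
rule = ("VP", "VP", "PP")
observations = [
    (rule, orientation({3, 4}, {1, 2})),
    (rule, orientation({2}, {0, 1})),
    (rule, orientation({4, 5}, {2, 3})),
    (rule, orientation({0, 1}, {2, 3})),
]
probs = estimate_orientation_probs(observations)
print(probs[rule])
```

    Keeping the orientation test separate from the counting makes it easy to skip rule instances whose children are unaligned or interleaved, which carry no clear straight-or-inverted evidence.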


    We present a new method for learning to parse a bilingual sentence using Inversion Transduction Grammar trained on a parallel corpus and a monolingual treebank. The method produces a parse tree for a bilingual sentence, showing the syntactic structure shared by the individual sentences and the differences in word order within each syntactic structure. The method involves estimating lexical translation probabilities based on an existing word alignment system, and inferring the probabilities of ITG rules. At runtime, a CYK-style bottom-up parser is employed to construct the most probable bilingual parse tree for any given sentence pair. We also describe an implementation of the proposed method. The experimental results indicate that the proposed model produces better word alignments than Giza++, a state-of-the-art word alignment system, in terms of alignment error rate and F-measure. The bilingual parse trees produced for the parallel corpus can be exploited to refine the initial ITG rules and to train a decoder for statistical machine translation.
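
    For readers unfamiliar with the evaluation metrics named in the abstract, the following Python sketch computes precision, recall, F-measure, and alignment error rate for one set of predicted alignment links. It assumes the commonly used definitions associated with Och and Ney (2000), with "sure" and "possible" gold links; the function name `alignment_scores` and the toy links are hypothetical, and the thesis's exact evaluation setup may differ.

```python
def alignment_scores(predicted, sure, possible):
    """Score predicted word-alignment links against gold annotations.

    Links are (source_index, target_index) pairs. Sure links are treated
    as a subset of possible links, as in the usual AER definition.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s  # every sure link is also a possible link
    if not a or not s:
        raise ValueError("need at least one predicted and one sure link")

    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "aer": aer,
    }


# Toy example: three predicted links, two sure gold links, one possible link.
scores = alignment_scores(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
print(scores)
```

    A lower alignment error rate is better, while higher precision, recall, and F-measure are better.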

    Abstract (in Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Bilingual Parsing
    Chapter 2 Related Works
    Chapter 3 The Model
      3.1 Problem Statement
      3.2 Proposed Training Process
        3.2.1 Tagging and Segmenting
        3.2.2 Initial Word Alignments
        3.2.3 Algorithm for Probability Estimation
      3.3 Bottom-up Parsing
        3.3.1 Implementation
        3.3.2 Example Parse
    Chapter 4 Experiments
      4.1 Training Setting
      4.2 Evaluation Metrics
      4.3 Evaluation Result
    Chapter 5 Conclusion and Future Work
    References
    Appendix A: Grammar Rules Trained and Associated Probabilities
    Appendix B: Some Tree Structures of the Test Set Produced by Giza++ with ITG
    Appendix C: Sentence Pairs of the Test Set

    Andrew B. Clegg and Adrian Shepherd. 2005. Evaluating and integrating Treebank parsers on a biomedical corpus. In Proceedings of the Association for Computational Linguistics Workshop on Software.
    Colin Cherry and Dekang Lin. 2003. A probability model to improve word alignment. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, volume 1, pages 88-95.
    David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263-270.
    Yuan Ding and Martha Palmer. 2005. Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of the 43rd Annual Meeting of the ACL, pages 541-548.
    Hua Wu, Haifeng Wang, and Zhanyi Liu. 2005. Alignment model adaptation for domain-specific word alignment. In Proceedings of the 43rd Annual Meeting of the ACL, pages 467-474.
    I. Dan Melamed. 2003. Multitext grammars and synchronous parsers. In Proceedings of the 2003 Meeting of the North American chapter of the Association for Computational Linguistics.
    Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Conference of the Association for Computational Linguistics (ACL-00), pages 440-447.
    Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. 2002. Extensions to HMM-based statistical word alignment models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
    Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, volume 2, pages 836-841.
    Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403.
    Wei Wang, Ming Zhou, Jin-Xia Huang, and Chang-Ning Huang. 2002. Structure alignment using bilingual chunking. In Proceedings of the 19th international conference on Computational linguistics, volume 1, pages 1-7.
    Wei Wang and Ming Zhou. 2004. Improving word alignment models using structured monolingual corpora. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 198-205.
    Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Conference of the Association for Computational Linguistics (ACL-01).
    Hao Zhang and Daniel Gildea. 2004. Syntax-based alignment: supervised or unsupervised? In Proceedings of the 20th International Conference on Computational Linguistics.
    Hao Zhang and Daniel Gildea. 2005. Stochastic lexicalized inversion transduction grammar for alignment. In Proceedings of the 43rd Annual Meeting of the ACL, pages 475-482.

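    Full-text release date: the full text is not authorized for public release (on-campus network)
    Full-text release date: the full text is not authorized for public release (off-campus network)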
