Extraction of Bilingual Multiword Expressions with Application to Bilingual Concordancer

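Graduate Student:	白明弘 Bai, Ming-Hong
Thesis Title:	Extraction of Bilingual Multiword Expressions with Application to Bilingual Concordancer
Advisors:	張俊盛 Chang, Jason S.; 陳克健 Chen, Keh-Jiann
Oral Defense Committee:	陳信希, 張俊盛, 蔡宗翰, 高照明, 陳克健
Degree:	Doctor
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of Publication:	2013
Academic Year of Graduation:	101 (ROC calendar)
Language:	English
Number of Pages:	97
Keywords (Chinese):	機器翻譯、電腦輔助翻譯、詞語對齊、多詞表達
Keywords (English):	Machine Translation, Computer-Assisted Translation, Word Alignment, Multiword Expression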

A bilingual concordancer is a computer-assisted translation tool that uses a parallel corpus as its knowledge base. Given a word or phrase, the bilingual concordancer retrieves from the parallel corpus the aligned sentence pairs whose source sentences contain that word or phrase. It then identifies the translation equivalents in the target sentences and reorders the sentence pairs according to the correlation between the query string and the translation equivalents. The output helps users not only find translation equivalents of the query but also study how those translations are used in context. As a result, it is a highly useful tool for bilingual lexicographers, professional translators, and second-language learners.
Extraction of bilingual multiword expressions is the most important component of a bilingual concordancer. For example, highlighting translation equivalents in the target sentence and generating a list of translation equivalents both depend heavily on a high-quality extraction model. However, existing models for extracting translation equivalents still leave considerable room for improvement.
In this thesis, we discuss problems of existing models for extracting bilingual multiword expressions, including the over-alignment problem and the under-alignment problem. We then propose a novel model that addresses these problems in order to improve the quality of the extracted translation equivalents. Further, we implement a bilingual concordancer that employs the proposed extraction model. To measure the performance of the bilingual concordancer, we test it on three types of multiword expressions and compare the results with those of existing statistical machine translation models.
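The retrieve, spot, and rank pipeline described in the abstract can be illustrated with a minimal sketch. This is not the extraction model proposed in the thesis: the sentence-level Dice coefficient is swapped in as a simple stand-in association measure, only a single target token is spotted per sentence, and the function names (`contains`, `concordance`) and the toy corpus are hypothetical.

```python
from collections import Counter

# Toy pre-tokenized parallel corpus (illustrative data, not from the thesis).
CORPUS = [
    ("he kicked the bucket last year".split(), "他 去年 去世 了".split()),
    ("the old man kicked the bucket".split(), "那 位 老人 去世 了".split()),
    ("she filled the bucket with water".split(), "她 把 水桶 裝滿 水".split()),
    ("he kicked the ball".split(), "他 踢 了 球".split()),
]


def contains(tokens, phrase):
    """Return True if `phrase` occurs as a contiguous token sequence in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def concordance(query, corpus):
    """Retrieve sentence pairs whose source side contains `query`, spot the
    target token most strongly associated with the query in each pair, and
    rank the pairs by that association score (highest first)."""
    phrase = query.split()
    hits = [(src, tgt) for src, tgt in corpus if contains(src, phrase)]

    # Sentence-level counts: how many sentences contain each target token
    # overall, and how many of the retrieved sentences contain it.
    tgt_freq = Counter(t for _, tgt in corpus for t in set(tgt))
    cooc = Counter(t for _, tgt in hits for t in set(tgt))

    def dice(token):
        # Dice coefficient between the query and a target token (assumed
        # stand-in for the correlation measure used in the thesis).
        return 2 * cooc[token] / (len(hits) + tgt_freq[token])

    results = []
    for src, tgt in hits:
        best = max(tgt, key=dice)  # spotted translation equivalent
        results.append((dice(best), best, src, tgt))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


for score, equivalent, src, tgt in concordance("kicked the bucket", CORPUS):
    print(f"{score:.2f}  [{equivalent}]  {' '.join(src)}  |  {' '.join(tgt)}")
```

Because this sketch scores one token at a time, it cannot recover multiword translations and is prone to the kind of under-alignment the thesis sets out to address; it is meant only to make the three stages concrete.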

Contents
Abstract (in Chinese)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1    Introduction
1    Bilingual Concordancer
2    Extraction of Bilingual Multiword Expressions
3    Thesis Goals
Chapter 2    Extraction of Translation Equivalents for Multiword Expressions
1    Problem Statement
2    Extracting Translation Equivalences
2.1    Selecting Candidate Words
2.2    Local Normalized Correlation
2.3    Normalized Correlation
2.4    Generation and Ranking of Candidate Translations
2.5    Generating Possible Translations
2.6    Filtering Common Subsequences
2.7    Selection of Candidate Translations
3    Experiments
3.1    Evaluation of Word Candidates
3.2    Evaluating Extracted Translations
4    Applying MWE Translations to MT
4.1    Experimental Settings
4.2    Selection of MWEs
4.3    Extra Information
4.4    Evaluation Results
5    Summary
Chapter 3    Bilingual Concordancer
1    The System
2    Extraction of Bilingual Multi-word Expressions
3    Ranking
4    Evaluation
4.1    Experimental Setting
4.2    Evaluation of Translation Spotting
4.3    Evaluation of Ranking
Chapter 4    Chinese Word Alignment
1    Problem Statement
2    Word Segmentation Adjustment
3    Affix Rule Method
3.1    Training Data
3.2    Word-to-Morpheme Alignment
3.3    Rule Extraction
4    Impurity Measure Method
4.1    Impurity Measure of Translation
4.2    Target Word Selection
4.3    Best Breaking Point
5    Experiment
6    Summary
Chapter 5    Translation of Unknown Words
1    Problem Statement
2    The TTR Model
2.1    Definition of TTR
2.2    Translation Process
2.3    Translation Probability and Lexical Weighting
2.4    Extraction of TTRs
2.5    Classifier and Rule Fitting Probability
2.6    Synchronous Morphological Rule
3    Experimental Setting
3.1    The Baseline SMT System and Data Sets
3.2    Training
4    Experimental Results
4.1    Impact of Unknown Word Identification
4.2    OOV Classification
4.3    TTR Selection
4.4    BLEU Score
5    Summary
Chapter 6    Conclusion
Bibliography
Appendix A – Chinese Idioms for Testing Bilingual Concordancer
Appendix B – Lists of Template Rules
Publications

                                


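Full-text release date (campus network): full text not authorized for public release
Full-text release date (off-campus network): full text not authorized for public release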
