Learning to Find Translations for Terms on the Web | National Tsing Hua University Master's and Doctoral Theses Repository


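Graduate student:	Wu, Jian-Cheng (吳鑑城)
Thesis title:	Learning to Find Translations for Terms on the Web (學習於網路上尋求詞彙翻譯)
Advisor:	Chang, Jason S. (張俊盛)
Oral defense committee:
Degree:	Doctoral
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2010
Academic year of graduation:	98 (ROC calendar)
Language:	English
Number of pages:	177
Chinese keywords:	自然語言處理、機器翻譯、網路語料庫、查詢擴充、音譯
English keywords:	natural language processing, machine translation, Web as corpus, query expansion, transliteration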

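Terms and noun phrases such as compound nouns, named entities, and abbreviations often account for a large part of a document's content. Term translation therefore plays an important role in dictionary construction, machine translation (MT), cross-language information retrieval (CLIR), and other language-related applications. A term can be translated either by describing its meaning in another language (semantic translation) or by rendering its source-language pronunciation (transliteration). However, as globalization advances and technology changes rapidly, new terms appear every day, often too quickly to be recorded in dictionaries, which causes the out-of-vocabulary (OOV) problem. Moreover, the translation of a term can differ greatly from one domain to another. For these reasons, dictionary lookup alone can hardly handle term translation well, which makes term translation a thorny problem for research and applications such as machine translation and cross-language information retrieval. In this thesis, we propose a new approach to learning to find term translations on the Web. The approach has two stages. In the training stage, we use a bilingual term list to learn source-target surface patterns, morpheme relations, and domain-specific keywords, so that we can provide effective query expansion terms for all kinds of terms and extract their translations. The goal of this stage is to obtain effective expanded queries, with which we can retrieve more mixed-code data containing translation information from the Web through a search engine. At run time, we automatically convert the given term into a set of expanded queries augmented with additional words. The purpose of query expansion is to greatly increase the chance that the results returned by a Web search engine, searching a large number of mixed-code documents, contain an appropriate translation (including a transliteration or a domain-specific translation). After obtaining the snippets returned for the queries, we extract translation candidates from them and rank the candidates. We implemented the proposed approach in a system called TermMine. Experiments and evaluation on TermMine show that the approach achieves fairly high precision and recall and outperforms existing machine translation systems on term translation.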

Terms, such as compound nouns, named entities, acronyms, and other noun phrases, make up a large part of most documents. Term translation renders a term in another language either by its meaning or by its sound (the latter is called transliteration), and it plays an important role in lexicon construction, machine translation (MT), cross-language information retrieval (CLIR), and other natural language processing applications. However, with globalization and rapid advances in technology, new terms are coined faster than dictionaries can record them, so many are out of vocabulary (OOV). In addition, the translation of a term often varies across domains. Term translation is therefore difficult to handle via simple dictionary lookup, and it presents a serious problem for tasks such as MT and CLIR. In this thesis, we present novel methods for learning to find translations of a given term on the Web. The methods involve two stages. During the training stage, we use a bilingual term list to learn source-target surface patterns, morpheme relations, and domain-specific keywords, which serve as query expansion terms for collecting more mixed-code data containing relevant translations. At run time, the proposed methods automatically transform the given term into expanded queries aimed at maximizing the probability of retrieving appropriate translations, including transliterations and domain-specific translations, from a very large collection of mixed-code documents via a Web search engine. The methods then extract translation candidates from the snippets returned for those queries and rank the candidates. We present a prototype system, TermMine, which applies the methods to find appropriate translations of a given term. Experimental evaluation shows that the proposed methods achieve high precision and recall and outperform existing state-of-the-art machine translation systems on term translation.
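As a rough illustration of the run-time stage described in the abstract (expand the query, collect mixed-code snippets, extract and rank candidates), the following Python sketch may help. Everything in it is an assumption made for illustration: the function names, the expansion keywords, the sample snippets, the fixed character window, and the frequency-only ranking are hypothetical stand-ins and are not taken from TermMine. No search engine is called; the snippets are supplied by hand.

```python
import re
from collections import Counter

# Runs of two or more CJK Unified Ideographs are treated as translation candidates.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")

# Hypothetical snippets standing in for search-engine results (assumption).
DEMO_SNIPPETS = [
    "流行性感冒 (influenza) 是由病毒引起的疾病",
    "Influenza（流行性感冒）疫苗接種",
    "預防流感 influenza 的方法",
]


def expand_queries(term, expansion_terms):
    """Return the quoted term alone, then the quoted term with each expansion keyword."""
    quoted = f'"{term}"'
    return [quoted] + [f"{quoted} {keyword}" for keyword in expansion_terms]


def extract_candidates(term, snippets, window=12):
    """Count Chinese strings found within `window` characters of each term occurrence."""
    counts = Counter()
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            left = snippet[max(0, match.start() - window):match.start()]
            right = snippet[match.end():match.end() + window]
            counts.update(CJK_RUN.findall(left))
            counts.update(CJK_RUN.findall(right))
    return counts


def rank_candidates(counts):
    """Order candidates by descending count, breaking ties alphabetically."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [candidate for candidate, _ in ordered]


if __name__ == "__main__":
    # The expansion keywords below are made up; the thesis learns them from a bilingual term list.
    for query in expand_queries("influenza", ["醫學", "疾病"]):
        print(query)
    print(rank_candidates(extract_candidates("influenza", DEMO_SNIPPETS)))
```

Counting raw co-occurrence is the simplest possible ranking and lets unrelated neighbouring phrases through as low-ranked candidates; the learned surface patterns and morpheme relations mentioned in the abstract would presumably filter such noise far better.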

Chinese Abstract
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Learning Source-Target Surface Patterns for Web-based Term Translation
3.1 Problem Statement
3.2 The TermMine System
3.2.1 Acquiring Source and Target Surface Patterns
3.2.2 Locating and Extracting Translations
3.3 Experimental Setting and Results
Chapter 4 Learning to Find English to Chinese Transliterations on the Web
4.1 Problem Statement
4.2 Learning Relationships for Query Expansion
4.3 Transliteration Search and Extraction
4.4 Experimental Setting and Results
Chapter 5 Mining the Web for Domain-Specific Translations
5.1 Problem Statement
5.2 Learning Domain Keywords for Query Expansion
5.2.1 Selecting Terms for Training
5.2.2 Generating and Filtering Keywords
5.2.3 Ranking Candidate Keywords for QE
5.3 Translation Extraction
5.4 Experiment Setting and Results
5.4.1 Training the Model
5.4.2 Term Translation Systems Compared
5.4.3 Test Data and Evaluation Procedure
5.4.4 Evaluating Domain Translations
5.4.5 New Evaluation (2009) on Different Query Expansion
5.5 Summary
Chapter 6 Discussion and Future Work
Bibliography
Publications
Appendix A – Lists Used from Wikipedia
Appendix B – Keywords Generated from Wikipedia
Appendix C – Testing Results for 26 Domains
Appendix D – Contexts Used for Testing Google Translate


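Full-text release date: full text not authorized for public release (campus network)
Full-text release date: full text not authorized for public release (off-campus network)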
