
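Author: 柯明憲 (Ming Hsien Ko)
Thesis title: 雙語語料庫之多字詞語對應 (Alignment of Multi-word Expressions in Parallel Corpora)
Advisor: 張俊盛 (Jason S. Chang)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 95
Chinese keywords: 多字詞語 (multiword expression), 上下文 (context), 詞素 (morpheme), 對應 (alignment)
English keywords: multiword expression, context, morpheme, alignment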
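  • In this thesis, we propose a new method for extracting the Chinese counterparts of English multiword expressions (MWEs). The main idea extends the work of Brown et al. (1991) and Gale et al. (1991) on the translation and alignment of single words. Brown et al. observed that the meaning of a word varies with its context, so context can be used to determine the translation of a word more precisely. Gale et al. exploited the morphological variation of words to extract lower-frequency correspondences from higher-frequency bilingual word correspondences. Inspired by these two studies, we use an existing word alignment tool to estimate an initial lexical translation probability (LTP) for each word, and then revise these probabilities by taking context and morphemes into account. Once each word of a multiword expression has been aligned according to the revised lexical translation probabilities, its translation can be extracted easily.
    We implemented the method as a program and, using the 740,000 parallel sentences of the Hong Kong parallel news corpus, attempted to extract the English multiword expressions in the corpus together with their Chinese counterparts. We tested on verb phrases randomly drawn from the news corpus and from WordNet. Both experiments show that, compared with the best current word alignment tool, the proposed method raises alignment precision under both strict and lenient evaluation. The results show that our approach can make up for the shortcomings of word-by-word translation and overcome the high-precision but low-recall problem of single-word alignment tools, yielding a more reliable and more flexible translation memory that can improve word alignment and machine translation.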


    In this paper, we introduce a method for aligning an English multiword expression (MWE) with its Chinese translation equivalent (TE) in a given bilingual parallel corpus. Our approach uses the output of an existing word alignment tool to estimate context-independent lexical translation probability (LTP). However, such estimates often have unsatisfactory precision and recall because of the inherent limitations of word alignment. Taking context and morphological information into account may ease the problem. More specifically, Chinese words with related meanings usually have some characters in common. Therefore, we build on the word alignment results to estimate context-sensitive LTP at the morphological level for further alignment.
    At runtime, we align each word in an English MWE individually with the Chinese words in the target sentence according to this new version of LTP, and combine the alignments into the final TE.
    We apply the method to verbal MWEs randomly selected from WordNet and to several MWEs manually selected from the corpus. The evaluation of the experimental results shows that our procedure outperforms the underlying word alignment tool. Our methodology helps to establish a more precise translation memory, which may improve the performance of machine translation systems.
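    The training and runtime steps described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the triple format for alignment links, the function names, the averaging score, and the 0.5 threshold are all assumptions made for illustration, and the toy alignment links are invented.

```python
from collections import Counter, defaultdict


def train_char_ltp(links):
    """Estimate a context-sensitive, character-level LTP table.

    `links` is a list of (context, english, chinese) triples, a hypothetical
    format standing in for the output of a word alignment tool. The result maps
    (context, english) to {character: fraction of links whose Chinese word
    contains that character}.
    """
    totals = Counter()
    char_counts = defaultdict(Counter)
    for context, english, chinese in links:
        key = (context, english)
        totals[key] += 1
        for ch in set(chinese):
            char_counts[key][ch] += 1
    return {
        key: {ch: n / totals[key] for ch, n in chars.items()}
        for key, chars in char_counts.items()
    }


def score(word, char_probs):
    """Average character probability of a Chinese word (assumed scoring rule)."""
    return sum(char_probs.get(ch, 0.0) for ch in word) / len(word)


def align_mwe(mwe, target_words, ltp, threshold=0.5):
    """Align each English word separately, then combine into one TE."""
    picked = set()
    for i, english in enumerate(mwe):
        # The context of a word is taken to be the other words of the MWE.
        context = tuple(w for j, w in enumerate(mwe) if j != i)
        char_probs = ltp.get((context, english), {})
        best = max(
            range(len(target_words)),
            key=lambda k: score(target_words[k], char_probs),
            default=None,
        )
        if best is not None and score(target_words[best], char_probs) >= threshold:
            picked.add(best)
    # Combine the individual alignments in target-sentence order.
    return [target_words[k] for k in sorted(picked)]


# Invented toy links for the MWE "make (a) report".
links = [
    (("report",), "make", "提出"),
    (("report",), "make", "提交"),
    (("make",), "report", "報告"),
    (("make",), "report", "報告書"),
    (("make",), "report", "匯報"),
]
ltp = train_char_ltp(links)
print(align_mwe(("make", "report"), ["委員會", "已", "提出", "報告"], ltp))
# ['提出', '報告']
```

    Because the table is keyed on characters rather than whole words, a Chinese word that never appeared in the alignment links can still receive a non-zero score when it shares a character with observed translations, which is the intuition behind working at the morphological level.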

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
    Chapter 2 Related Work
    Chapter 3 Multiword Expression MT
        3.1 Problem Statement
        3.2 Alignment Extraction for Multiword Expression
        3.3 The Procedure at Training Time
        3.4 The Procedure at Runtime
    Chapter 4 Experiment and Evaluation
        4.1 Experimental Settings
        4.2 Evaluation
        4.3 Discussion
    Chapter 5 Conclusion and Future Work
    References
    Appendix A: Example of LTP at Morphological Level
        LTP of “report” in the context of “make”
    Appendix B: Experimental Result in Test Set 1
    Appendix C: Experimental Result in Test Set 2

    Altenberg, B. & Granger, S. (2001). The Grammatical and Lexical Patterning of Make in Native and Non-native Student Writing. Applied Linguistics, 22(2): 173-194.
    Brown, Peter F.; Cocke, John; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Jelinek, Frederick; Lafferty, John D.; Mercer, Robert L. & Roossin, Paul S. (1990). A Statistical Approach to Machine Translation. In Computational Linguistics, volume 16(2): 79-85.
    Brown, Peter F.; Lai, Jennifer C.; & Mercer, Robert L. (1991). Aligning Sentences in Parallel Corpora. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 169-176. Association for Computational Linguistics.
    Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; & Mercer, Robert L. (1991). Word-Sense Disambiguation Using Statistical Methods. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 264-270. Association for Computational Linguistics.
    Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; & Mercer, Robert L. (1993). The Mathematics of Statistical Machine Translation. Computational Linguistics, 19(2): 263-313.
    Butt, M. (2003). The Light Verb Jungle. In Workshop on Multiword Constructions.
    Catizone, R., Russell, G., & Warwick, S. (1989). Deriving Translation Data from Bilingual Texts. In Proceedings of the First International Lexical Acquisition Workshop, Detroit, USA.
    Chen, Stanley F. (1993). Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings, 31st Annual Meeting of the ACL, Columbus, Ohio, 9-16. Association for Computational Linguistics.
    Dagan, Ido; Church, Kenneth W.; & Gale, William A. (1993). Robust Bilingual Word Alignment for Machine-Aided Translation. In Proceedings, Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1-8. Association for Computational Linguistics.
    Dice, Lee R. (1945). Measures of the Amount of Ecologic Association between Species. Ecology, 26(3): 297-302.
    van der Eijk, Pim. (1993). Automating the Acquisition of Bilingual Terminology. In Proceedings, Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, 113-119. Association for Computational Linguistics.
    Fung, P. (1995). A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora. In Proceedings of ACL-1995, pp. 236-243.
    Fung, P.; & McKeown, Kathleen R. (1994). Aligning Noisy Parallel Corpora across Language Groups: Word Pair Feature Matching by Dynamic Time Warping. In Proceedings, 1st Conference of the Association for Machine Translation in the Americas (AMTA), Columbia, Maryland, 81-88.
    Gale, William A. & Church, Kenneth W. (1991). Identifying Word Correspondences in Parallel Texts. In Proceedings, DARPA Speech and Natural Language Workshop, Pacific Grove, California, 152-157. Morgan Kaufmann, San Mateo, California.
    Gale, William A. & Church, Kenneth W. (1991). A Program for Aligning Sentences in Bilingual Corpora. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 177-184. Association for Computational Linguistics.
    Gale, William A. & Church, Kenneth W. (1993). A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1): 75-102.
    Kitamura, M.; & Matsumoto, Y. (1996). Automatic Extraction of Word Sequence Correspondences in Parallel Corpora. In Proceedings of the Fourth Annual Workshop on Very Large Corpora (WVLC-4), Copenhagen.
    Kupiec, J. (1993). An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. In Proceedings, 31st Annual Meeting of the ACL, Columbus, Ohio, 17-22. Association for Computational Linguistics.
    Kumano, A.; & Hirakawa, H. (1994). Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings of COLING 1994, 76-81.
    Lambert, P.; & Castell, N. (2004). Alignment of Parallel Corpora Exploiting Asymmetrically Aligned Phrases. In Proc. of the LREC 2004 Workshop on the Amazing Utility of Parallel and Comparable Corpora, Lisbon, Portugal.
    Melamed, I. D. (1995). Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons. In Proceedings of the Third Workshop on Very Large Corpora, pp. 184-198.
    Melamed, I. D. (1997). Automatic Discovery of Non-compositional Compounds in Parallel Data. In 2nd Conference on Empirical Methods in Natural Language Processing, Providence.
    Moore, R. C. (2001). Towards a Simple and Accurate Statistical Approach to Learning Translational Relationships Among Words. In Proceedings of ACL-2001, pp. 79-86.
    Och, F. J.; & Ney, H. (2000). A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of COLING 2000, 1086-1090.
    Och, F. J.; & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51.
    Simard, Michel; Foster, George F.; & Isabelle, Pierre. (1992). Using Cognates to Align Sentences in Bilingual Corpora. In Proceedings, 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), Montréal, Canada, 67-81.
    Sörensen, Thorvald J. (1948). A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and its Application to Analysis of the Vegetation of Danish Commons. Biologiske Skrifter, 5(4): 1-34.
    Smadja, Frank. (1992). How to Compile a Bilingual Collocation Lexicon Automatically. In Proceedings, AAAI-92 Workshop on Statistically-Based NLP Techniques, San Jose, California, 65-71. American Association for Artificial Intelligence.
    Smadja, Frank. (1993). Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1): 143-177.
    Smadja, Frank; McKeown, Kathleen R.; & Hatzivassiloglou, Vasileios. (1996). Translating Collocations for Bilingual Lexicons: A Statistical Approach. In Computational Linguistics, Vol. 22 No. 1.
    Stevenson, Suzanne; Fazly, Afsaneh; & North, Ryan. (2004). Statistical Measures of the Semi-Productivity of Light Verb Constructions. In 2nd ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 1-8.
    Wu, D.; & Xia, X. (1994). Learning an English-Chinese Lexicon from a Parallel Corpus. In Proceedings of AMTA-94, pp. 206-213.
    Yarowsky, D. (1993). One Sense per Collocation. In Proceedings of the ARPA Human Language Technology Workshop.

    Full text: not authorized for public release (on-campus and off-campus networks).
