Alignment of Multi-word Expressions in Parallel Corpora | National Tsing Hua University Theses and Dissertations Repository


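Author:	柯明憲 Ming Hsien Ko
Thesis title:	雙語語料庫之多字詞語對應 Alignment of Multi-word Expressions in Parallel Corpora
Advisor:	張俊盛 Jason S. Chang
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2006
Academic year of graduation:	94
Language:	English
Number of pages:	95
Chinese keywords:	多字詞語, 上下文, 詞素, 對應
English keywords:	multiword expression, context, morpheme, alignment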

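In this thesis, we propose a new method for extracting the Chinese equivalents of English multiword expressions. The main idea extends the work of Brown et al. (1991) and Gale et al. (1991) on word translation and alignment. Brown et al. observed that word meaning varies with context, so context can be used to determine a word's translation more precisely. Gale et al. exploited the morphological variants of words to extract lower-frequency correspondences from higher-frequency bilingual word correspondences. Inspired by these two studies, we use an existing word alignment tool to estimate initial lexical translation probabilities, and then revise those probabilities by taking context and morphemes into account. Once a multiword expression has been aligned word by word according to the revised lexical translation probabilities, its translation can be extracted easily.
We implemented the method as a program and, using 740,000 parallel sentences from the Hong Kong parallel news corpus, attempted to extract the English multiword expressions in the corpus together with their Chinese equivalents. For testing, we randomly sampled verb phrases from the news corpus and from WordNet. Both experiments show that, compared with the best current word alignment tool, the proposed method improves alignment precision under both strict and lenient evaluation. The results show that our approach can make up for the shortcomings of word-by-word translation and can overcome the high-precision but low-recall problem of word alignment tools. It yields a more reliable and more flexible translation memory, which can improve word alignment and machine translation.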

In this paper, we introduce a method for aligning an English multiword expression (MWE) with its Chinese translation equivalent (TE) in a given bilingual parallel corpus. Our approach uses the output of an existing word alignment tool to estimate context-independent lexical translation probability (LTP). However, such estimates often have unsatisfactory precision and recall because of the inherent limitations of word alignment. Taking context and morphological information into account may ease the problem. More specifically, words with related meanings usually have some characters in common. We therefore build on the word-alignment results to estimate context-sensitive LTP at the morphological level for further alignment.
At runtime, we align each word in an English MWE individually with Chinese words in the target sentence according to this new version of LTP, and combine the alignments into the final TE.
We apply the method to verbal MWEs randomly selected from WordNet and to several manually examined MWEs from the corpus. Evaluation of the experimental results shows that our procedure outperforms the underlying word alignment tool. Our methodology helps to establish a more precise translation memory, which may improve the performance of machine translation systems.
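
The two steps described in the abstract (estimating LTP at the character level, then aligning each MWE word individually and merging the results) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the toy probability table, the best-character scoring rule, and the `threshold` value are all assumptions made for illustration, and conditioning on the context word is left out for brevity.

```python
from collections import defaultdict


def char_level_ltp(word_ltp):
    """Spread word-level LTP over the characters of each Chinese word.

    word_ltp maps (english_word, chinese_word) -> probability.
    Returns (english_word, chinese_char) -> score, normalized per English word.
    Hypothetical helper; the thesis may estimate this differently.
    """
    scores = defaultdict(float)
    for (en, zh), p in word_ltp.items():
        for ch in set(zh):
            scores[(en, ch)] += p
    totals = defaultdict(float)
    for (en, _), s in scores.items():
        totals[en] += s
    return {(en, ch): s / totals[en] for (en, ch), s in scores.items()}


def score(en, zh_word, char_ltp):
    """Score a Chinese word by its best-matching character (an assumed rule)."""
    return max((char_ltp.get((en, ch), 0.0) for ch in zh_word), default=0.0)


def align_mwe(mwe, zh_words, char_ltp, threshold=0.1):
    """Align each English word on its own, then merge into one TE."""
    chosen = set()
    for en in mwe:
        best = max(
            range(len(zh_words)),
            key=lambda i: score(en, zh_words[i], char_ltp),
            default=None,
        )
        if best is not None and score(en, zh_words[best], char_ltp) >= threshold:
            chosen.add(best)
    # Keep the target-sentence order when combining the word alignments.
    return [zh_words[i] for i in sorted(chosen)]


# Toy word-level table (made-up numbers, not taken from the thesis).
toy_ltp = {
    ("report", "報告"): 0.6,
    ("report", "報導"): 0.4,
    ("make", "作出"): 0.7,
    ("make", "製造"): 0.3,
}
char_ltp = char_level_ltp(toy_ltp)
sentence = ["政府", "已", "作出", "詳細", "匯報"]
print(align_mwe(["make", "a", "report"], sentence, char_ltp))
```

In this toy run, 匯報 never appears in the word-level table, yet it is still selected for "report" because it shares the character 報 with the higher-frequency translations. That is the kind of low-frequency correspondence the character-level estimate is meant to recover.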

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
Chapter 2  Related Work
Chapter 3  Multiword Expression MT
3.1  Problem Statement
3.2  Alignment Extraction for Multiword Expression
3.3  The Procedure at Training Time
3.4  The Procedure at Runtime
Chapter 4  Experiment and Evaluation
4.1  Experimental Settings
4.2  Evaluation
4.3  Discussion
Chapter 5  Conclusion and Future Work
References
Appendix A – Example of LTP at Morphological Level
LTP of “report” in the context of “make”
Appendix B – Experimental Result in Test Set 1
Appendix C – Experimental Result in Test Set 2

                                

Altenberg, B.; & Granger, S. (2001). The Grammatical and Lexical Patterning of Make in Native and Non-native Student Writing. Applied Linguistics, 22(2): 173-194.
Brown, Peter F.; Cocke, John; Della Pietra, Stephen A.; Della Pietra, Vincent J.; Jelinek, Frederick; Lafferty, John D.; Mercer, Robert L.; & Roossin, Paul S. (1990). A Statistical Approach to Machine Translation. Computational Linguistics, 16(2): 79-85.
Brown, Peter F.; Lai, Jennifer C.; & Mercer, Robert L. (1991). Aligning Sentences in Parallel Corpora. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 169-176. Association for Computational Linguistics.
Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; & Mercer, Robert L. (1991). Word-Sense Disambiguation Using Statistical Methods. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 264-270. Association for Computational Linguistics.
Brown, Peter F.; Della Pietra, Stephen A.; Della Pietra, Vincent J.; & Mercer, Robert L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2): 263-311.
Butt, M. (2003). The Light Verb Jungle. In Workshop on Multiword Constructions.
Catizone, R., Russell, G., & Warwick, S. (1989). Deriving Translation Data from Bilingual Texts. In Proceedings of the First International Lexical Acquisition Workshop, Detroit, USA.
Chen, Stanley F. (1993). Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings, 31st Annual Meeting of the ACL, Columbus, Ohio, 9-16. Association for Computational Linguistics.
Dagan, Ido; Church, Kenneth W.; & Gale, William A. (1993). Robust Bilingual Word Alignment for Machine-Aided Translation. In Proceedings, Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1-8. Association for Computational Linguistics.
Dice, Lee R. (1945). Measures of the Amount of Ecologic Association between Species. Ecology, 26(3): 297-302.
van der Eijk, Pim. (1993). Automating the Acquisition of Bilingual Terminology. In Proceedings, Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, 113-119. Association for Computational Linguistics.
Fung, P. (1995). A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora. In Proceedings of ACL-1995, pp. 236-243.
Fung, P.; & McKeown, Kathleen R. (1994). Aligning Noisy Parallel Corpora across Language Groups: Word Pair Feature Matching by Dynamic Time Warping. In Proceedings, 1st Conference of the Association for Machine Translation in the Americas (AMTA), Columbia, Maryland, 81-88.
Gale, William A.; & Church, Kenneth W. (1991). Identifying Word Correspondences in Parallel Texts. In Proceedings, DARPA Speech and Natural Language Workshop, Pacific Grove, California, 152-157. Morgan Kaufmann, San Mateo, California.
Gale, William A. & Church, Kenneth W. (1991). A Program for Aligning Sentences in Bilingual Corpora. In Proceedings, 29th Annual Meeting of the ACL, Berkeley, California, 177-184. Association for Computational Linguistics.
Gale, William A. & Church, Kenneth W. (1993). A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1): 75-102.
Kitamura, M.; & Matsumoto, Y. (1996). Automatic Extraction of Word Sequence Correspondences in Parallel Corpora. In Proceedings of the Fourth Annual Workshop on Very Large Corpora (WVLC-4), Copenhagen.
Kupiec, J. (1993). An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora. In Proceedings, 31st Annual Meeting of the ACL, Columbus, Ohio, 17-22. Association for Computational Linguistics.
Kumano, A.; & Hirakawa, H. (1994). Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings of COLING 1994, 76-81.
Lambert, P.; & Castell, N. (2004). Alignment of Parallel Corpora Exploiting Asymmetrically Aligned Phrases. In Proc. of the LREC 2004 Workshop on the Amazing Utility of Parallel and Comparable Corpora, Lisbon, Portugal.
Melamed, I. D. (1995). Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons. In Proceedings of the Third Workshop on Very Large Corpora, pp. 184-198.
Melamed, I. D. (1997). Automatic Discovery of Non-compositional Compounds in Parallel Data. In 2nd Conference on Empirical Methods in Natural Language Processing, Providence.
Moore, R. C. (2001). Towards a Simple and Accurate Statistical Approach to Learning Translational Relationships Among Words. In Proceedings of ACL-2001, pp. 79-86.
Och, F. J.; & Ney, H. (2000). A Comparison of Alignment Models for Statistical Machine Translation. In Proceedings of COLING 2000, 1086-1090.
Och, F. J.; & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51.
Simard, Michel; Foster, George F.; & Isabelle, Pierre. (1992). Using Cognates to Align Sentences in Bilingual Corpora. In Proceedings, 4th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-92), Montréal, Canada, 67-81.
Sörensen, Thorvald J. (1948). A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and its Application to Analysis of the Vegetation of Danish Commons. Biologiske Skrifter, 5(4): 1-34.
Smadja, Frank. (1992). How to Compile a Bilingual Collocation Lexicon Automatically. In Proceedings, AAAI-92 Workshop on Statistically-Based NLP Techniques, San Jose, California, 65-71. American Association for Artificial Intelligence.
Smadja, Frank. (1993). Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1): 143-177.
Smadja, Frank; McKeown, Kathleen R.; & Hatzivassiloglou, Vasileios. (1996). Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22(1).
Stevenson, Suzanne; Fazly, Afsaneh; & North, Ryan. (2004). Statistical Measures of the Semi-Productivity of Light Verb Constructions. In 2nd ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 1-8.
Wu, D.; & Xia, X. (1994). Learning an English-Chinese Lexicon from a Parallel Corpus. In Proceedings of AMTA-94, pp. 206-213.
Yarowsky, D. (1993). One Sense per Collocation. In Proceedings of the ARPA Human Language Technology Workshop.

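Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)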
