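A New Approach to Statistical Translation Model for Phrases (統計式片語對應與翻譯模型) | National Tsing Hua University Theses and Dissertations Repository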


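Graduate student:	Ta-wei Yu (游大緯)
Thesis title:	A New Approach to Statistical Translation Model for Phrases (統計式片語對應與翻譯模型)
Advisor:	Jason S. Chang (張俊盛)
Oral examination committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2002
Academic year of graduation:	90 (ROC calendar)
Language:	Chinese
Number of pages:	66
Keywords (Chinese):	統計式機器翻譯、片語翻譯、跨語言檢索
Keywords (English):	Statistical Machine Translation, Phrase Translation, Cross-language Information Retrieval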

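Machine translation (MT) is one of the most important topics in natural language processing research. In the past, the more successful applications of MT were mostly translations of documents in specific domains. Recently, with the spread of the Internet and search engines, the role of MT in Cross-Language Information Retrieval (CLIR) has begun to receive attention. In CLIR, it is usually the query words or phrases that are translated (Query Translation), and the translation results strongly affect retrieval effectiveness. We aim to translate query keywords with a Statistical Phrase Translation Model (SPTM) in order to achieve good cross-language retrieval performance.
Previous related work falls into two broad categories: statistics-based approaches and lexicon-based approaches. Among the statistics-based approaches, the statistical machine translation method proposed by Brown et al. (1988, 1990, 1993) at the IBM Watson Research Center is the more rigorous in theory, and its framework and procedure are the more clearly defined and feasible.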

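We propose a new model that improves on Brown's approach. The new model converts the three probability functions that make up the translation probability in Brown's model (lexical translation probability, fertility probability, and distortion probability) into two probability functions: lexical translation probability and assignment probability.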

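Using 65,078 noun phrase pairs from the BDC Chinese-English dictionary as the corpus, we ran a series of experiments with the EM algorithm and evaluated the results with the method of Och et al. (2000), obtaining a recall of 92.0%, a precision of 91.3%, and an error rate of 8.4%. We also studied how Chinese word segmentation and the initial model of the EM algorithm affect the results. We found that Chinese word segmentation helps the training result slightly, and that a better initial model yields better training results.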
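To make the EM training loop concrete, here is a minimal sketch. It is not the thesis's model: it estimates only a lexical translation probability in the style of IBM Model 1 (Brown et al., 1993) and leaves out the assignment probability, whose form the abstract does not give. The toy phrase pairs and the names `train_lexical_model` and `best_alignment` are assumptions made for illustration, not BDC data or code from the thesis.

```python
from collections import defaultdict

# Toy English/Chinese noun-phrase pairs (illustrative, not from the BDC dictionary).
PAIRS = [
    (["computer", "science"], ["電腦", "科學"]),
    (["computer", "network"], ["電腦", "網路"]),
    (["social", "science"], ["社會", "科學"]),
]

def train_lexical_model(pairs, iterations=5):
    """Estimate t(zh | en) with IBM Model 1 style EM from a uniform start."""
    zh_vocab = {z for _, zh in pairs for z in zh}
    uniform = 1.0 / len(zh_vocab)
    t = defaultdict(lambda: uniform)  # initial model: every pairing equally likely
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (zh, en) link
        total = defaultdict(float)  # expected count of links per English word
        for en, zh in pairs:
            for z in zh:
                # E-step: spread one unit of count for z over the English words.
                norm = sum(t[(z, e)] for e in en)
                for e in en:
                    frac = t[(z, e)] / norm
                    count[(z, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(z, e): c / total[e] for (z, e), c in count.items()})
    return t

def best_alignment(en, zh, t):
    """Link each Chinese word to its most probable English word (by index)."""
    return [(j, max(range(len(en)), key=lambda i: t[(zh[j], en[i])]))
            for j in range(len(zh))]

t = train_lexical_model(PAIRS)
alignment = best_alignment(["computer", "science"], ["電腦", "科學"], t)
print(alignment)  # [(0, 0), (1, 1)]
```

Starting from a uniform model mirrors the simplest possible initial model. The abstract reports that a better initial model gives better training results, so in practice the uniform start would be replaced by a more informed one.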

Machine translation (MT) is one of the most difficult problems in natural language processing. In the past, MT has been applied to professional communication, translating technical and corporate documents in specific domains. Recently, because of the rapid development of the Internet and the need to access information across languages, researchers have begun to examine the role that MT can play in Cross-Language Information Retrieval (CLIR). The prevalent approach to CLIR is based on translating query phrases. We propose a novel approach based on a Statistical Phrase Translation Model (SPTM), aimed at achieving a tighter estimate of phrase translation probability.
Experiments were conducted using bilingual phrases from the BDC Electronic Chinese-English Dictionary. The alignment model was trained with the EM algorithm. For evaluation, we adapted the methodology of Och et al. (2000). We obtained a recall of 92.0%, a precision of 91.3%, and an error rate of 8.4%.

The effects of Chinese segmentation and of the initial model of the EM algorithm were also studied. We found that Chinese segmentation improves the training result slightly, and that a better initial model improves the performance of the EM algorithm significantly.
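The evaluation measures can be illustrated with a short sketch based on the definitions in Och and Ney (2000), where a hypothesized alignment is compared against "sure" and "possible" reference links. The thesis says only that it adapted this methodology, so its exact computation may differ. The function name `alignment_scores` and the example links are assumptions made for illustration.

```python
def alignment_scores(hypothesis, sure, possible=None):
    """Return (precision, recall, error_rate) for a set of alignment links."""
    a = set(hypothesis)
    s = set(sure)
    p = s | set(possible or ())  # possible links always include the sure ones
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    error_rate = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, error_rate

# Links are (source_index, target_index) pairs; values are made up.
hypothesis = {(0, 0), (1, 1), (2, 1)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
precision, recall, error_rate = alignment_scores(hypothesis, sure, possible)
print(precision, recall, error_rate)
```

When the sure and possible sets coincide, the error rate reduces to one minus the harmonic mean of precision and recall. With the reported recall of 92.0% and precision of 91.3%, that gives 1 - 2(0.920)(0.913)/(0.920 + 0.913), or about 8.35%, in line with the reported 8.4%. This is consistent with a single reference set having been used, though the abstract does not confirm it.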

Chapter 1 Introduction
Chapter 2 Statistical Machine Translation Models
Chapter 3 Statistical Phrase Translation Model
Chapter 4 Experiments
    4.1 Bilingual Corpus
    4.2 Experimental Design and Initial Probability Settings
    4.3 First Iteration of the EM Algorithm
        4.3.1 First Alignment Optimization
        4.3.2 Re-estimation of the Assignment Probability Function
        4.3.3 Re-estimation of the Lexical Translation Probabilities
    4.4 Second Iteration of the EM Algorithm
    4.5 Third to Fifth Iterations of the EM Algorithm
Chapter 5 Evaluation and Discussion of Experimental Results
    5.1 Observations
    5.2 Evaluation Method
    5.3 Evaluation and Comparison of Experimental Results
    5.4 Effect of Chinese Word Segmentation
    5.5 Effect of the Initial Model
Chapter 6 Conclusions and Future Work
    6.1 Conclusions
    6.2 Future Research Directions
Appendix 1 Bilingual Entries and Reference Answers Used for Performance Evaluation
    a. 250 bilingual entries with two English words
    b. 250 bilingual entries with three English words
Appendix 2 Selected Entries with Alignment Errors in the Second Iteration of Experiment 1
    a. Two English words
    b. Three English words
References

1. BDC. 1992. The BDC Chinese-English Electronic Dictionary (version 2.0), Behavior Design Corporation, Taiwan.
2. Brown, P. F., Cocke J., Della Pietra S. A., Della Pietra V. J., Jelinek F., Mercer R. L., and Roosin P. S. 1988. A Statistical Approach to Language Translation, In Proceedings of the 12th International Conference on Computational Linguistics, Budapest, Hungary, pp. 71-76.
3. Brown, P. F., Cocke J., Della Pietra S. A., Della Pietra V. J., Jelinek F., Lafferty J. D., Mercer R. L., and Roosin P. S. 1990. A Statistical Approach to Machine Translation, Computational Linguistics, 16/2, pp. 79-85.
4. Brown, P. F., Della Pietra S. A., Della Pietra V. J., and Mercer R. L. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19/2, pp. 263-311.
5. Chang, J. S. et al. 2001. Nathu IR System at NTCIR-2, In Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, pp. (5) 49-52, National Institute of Informatics, Japan.
6. Chang, J. S., Ker S. J., and Chen M. H. 1998. Taxonomy and Lexical Semantics – From the Perspective of Machine Readable Dictionary, In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA), pp. 199-212.
7. Chen, H. H., G. W. Bian and W. C. Lin. 1999. Resolving Translation Ambiguity and Target Polysemy in Cross-Language Information Retrieval, In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 215-222.
8. Dagan, I., Church K. W. and Gale W. A. 1993. Robust Bilingual Word Alignment for Machine Aided Translation, In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, pp. 1-8.
9. Fung, P. and McKeown K. 1994. Aligning Noisy Parallel Corpora across Language Groups: Word Pair Feature Matching by Dynamic Time Warping, In Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA), pp. 81-88, Columbia, Maryland, USA.
10. Gale, W. A. and Church K. W. 1991. Identifying Word Correspondences in Parallel Texts, In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, pp. 152-157.
11. Gey, F. C. and A. Chen. 1997. Phrase Discovery for English and Cross-Language Retrieval at TREC-6, In Proceedings of the 6th Text Retrieval Evaluation Conference, pp. 637-648.
12. Ide, N. and J. Veronis, editors. 1998. Special Issue on Word Sense Disambiguation, Computational Linguistics, 24/1.
13. Isabelle, P. 1987. Machine Translation at the TAUM Group, In M. King, editor, Machine Translation Today: The State of the Art, Proceedings of the Third Lugano Tutorial, pp. 247-277.
14. Kando, Noriko, Kenro Aihara, Koji Eguchi and Hiroyuki Kato. 2001. Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, National Institute of Informatics, Japan.
15. Kay, M. and Röscheisen M. 1988. Text-Translation Alignment, Technical Report P90-00143, Xerox Palo Alto Research Center.
16. Ker, S. J. and Chang J. S. 1997. A Class-based Approach to Word Alignment, Computational Linguistics, 23/2, pp. 313-343.
17. Knight, K. and J. Graehl. 1997. Machine Transliteration, In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of ACL European Chapter, pp. 128-135.
18. Kupiec, Julian. 1993. An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora, In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 17-22.
19. Kwok, K. L. 2001. NTCIR-2 Chinese, Cross-Language Retrieval Experiments Using PIRCS, In Proceedings of the Second NTCIR Workshop Meeting on Evaluation of Chinese and Japanese Text Retrieval and Text Summarization, pp. (5) 14-20, National Institute of Informatics, Japan.
20. Longman Group. 1992. Longman English-Chinese Dictionary of Contemporary English, Published by Longman Group (Far East) Ltd., Hong Kong.
21. McCarley, J. Scott. 1999. Should we Translate the Documents or the Queries in Cross-Language Information Retrieval? In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 208-214.
22. Melamed, I. D. 1996. Automatic Construction of Clean Broad-Coverage Translation Lexicons, In Proceedings of the Second Conference of the Association for Machine Translation in the Americas (AMTA), pp. 125-134.
23. Nagao, M. 1986. Machine Translation: How Far Can it Go? Oxford University Press, Oxford.
24. Oard, D. W. and J. Wang. 1999. Effect of Term Segmentation on Chinese/English Cross-Language Information Retrieval, In Proceedings of the Symposium on String Processing and Information Retrieval. http://www.glue.umd.edu/~oard/research.html.
25. Och, Franz Josef and Hermann Ney. 2000. Improved Statistical Alignment Models, In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
26. Pirkola, A. 1998. The Effect of Query Structure and Dictionary Setups in Dictionary-based Cross-Language Retrieval, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55-63.
27. Shei, C. C. 2000. Computational Approach to the Teaching of Translation Skills: Some Initial Ideas on the Project of TransFree, 21世紀口筆譯教學的趨勢與展望研討會 [Conference on Trends and Prospects in Interpreting and Translation Teaching for the 21st Century], Taipei, National Taiwan Normal University.
28. Shei, C. C. and M. Pain. 2001. An ESL Writer’s Collocational Aid, Computer Assisted Language Learning (CALL) 13(2): 167-182.
29. Smadja, F., McKeown K., and Hatzivassiloglou V. 1996. Translating Collocations for Bilingual Lexicons: A Statistical Approach, Computational Linguistics, 22/1, pp. 1-38.
30. Sun Le, Jin Youbing, Du Lin and Sun Yufang. 2000. Word Alignment of English-Chinese Bilingual Corpus Based on Chunks,
31. Utsuro, T., Ikeda H., Yamane M., Matsumoto M., and Nagao M. 1994. Bilingual Text Matching Using Bilingual Dictionary and Statistics, In Proceedings of the 15th International Conference on Computational Linguistics, pp. 1076-1082.
32. Wu, D. and Xia X. 1994. Learning an English-Chinese Lexicon from a Parallel Corpus, In Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA), pp. 206-213.

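Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)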
