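Automatic Alignment of Bilingual Web Pages Using Machine Translation Systems | National Tsing Hua University Master's and Doctoral Theses Repository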


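Graduate student:	余鍵亨 Yu Jian-Heng
Thesis title:	Automatic Alignment of Bilingual Web Pages Using Machine Translation Systems (機器翻譯系統為本之雙語網頁對應)
Advisor:	張俊盛 Jason S. Chang
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2005
Academic year of graduation:	93 (ROC calendar; 2004-2005)
Language:	Chinese
Number of pages:	45
Keywords (Chinese):	平行語料 (parallel corpora), 機器翻譯系統 (machine translation system), 相似度 (similarity)
Keywords (English):	parallel corpora, machine translation system, similarity measure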

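In this thesis, we propose a method that uses machine translation systems to automatically align bilingual web pages on the Web. We first use a program to automatically collect a large number of bilingual web pages, then translate the source-language pages with a machine translation system. Next, we extract Chinese and English words that are translations of each other from the source and translated pages, record the positions of the text, and compile statistics on the positional relationships between mutually translated words. We then compute the similarity between the machine translation and the actual translation, that is, the proportion of overlapping identical words, to estimate how likely a source page and a target page are to be translations of each other. Before this comparison, we apply the proposed position-based model to filter out matches whose words are identical but whose positional relationship deviates from the norm. Combining positional relationships with similarity comparison efficiently filters out incorrect matches and thereby raises the precision of web page alignment.
We also describe how the proposed approach is implemented as a program, using the Chinese-English bilingual news pages of Voice of America (美國之音) as the experimental corpus and testing with three machine translation systems. The experimental results show that the method performs satisfactorily: the aligned Chinese-English bilingual web pages reach about 99% precision and 99% recall.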
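As a rough illustration of how precision and recall figures like these are usually computed for page alignment, the following Python sketch compares a set of predicted page pairs against a gold-standard set. The function name `precision_recall` and the toy page identifiers are hypothetical and are not taken from the thesis; the thesis's own evaluation procedure may differ in detail.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for collections of (source, target) page pairs.

    Illustrative helper only; not the evaluation code used in the thesis.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # pairs that are both predicted and correct
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data (not from the thesis experiments).
gold = {("en1", "zh1"), ("en2", "zh2"), ("en3", "zh3"), ("en4", "zh4")}
predicted = {("en1", "zh1"), ("en2", "zh2"), ("en3", "zh9")}
p, r = precision_recall(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the four gold pairs are found, so precision and recall differ; both measures are needed to judge an aligner.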

We introduce a method that automatically aligns bilingual pages on the Web by using a machine translation (MT) system. In our approach, a source page is translated using an MT system and aligned with the most similar candidate pages. The method involves collecting bilingual pages on the Web, translating each source page, and aligning and evaluating each source page against the possible candidate pages based on a similarity measure. The similarity measure takes a simple approach: counting the ratio of characters shared between a machine-translated page and an actual target page. To achieve the best performance, shared characters are checked for their positions in both pages. Position pairs that deviate from the norm are excluded from consideration when calculating page-to-page similarity.
We describe an implementation of the method that uses three MT systems to align online English and Chinese news pages from the Voice of America (VOA) website. Experimental results indicate that English and Chinese Web news pages can be aligned with an average precision of 99%. The results suggest that the method can be applied to collect online parallel corpora easily and efficiently, thereby helping to meet the increasing demand for parallel corpora as a linguistic resource in various natural language processing tasks.
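The shared-character similarity with a position check can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the thesis's implementation: the abstract does not give the exact formula, so the Dice coefficient is taken from the chapter title in the table of contents, and the fixed relative-position `tolerance` is a hypothetical stand-in for the statistically derived position norm. The function name `position_filtered_dice` is likewise invented for illustration.

```python
from collections import defaultdict


def position_filtered_dice(mt_text, target_text, tolerance=0.2):
    """Dice coefficient over shared characters, ignoring position outliers.

    Assumption: a shared character counts only if its relative positions in
    the two pages differ by at most `tolerance` (a hypothetical threshold).
    """
    mt = [c for c in mt_text if not c.isspace()]
    tgt = [c for c in target_text if not c.isspace()]
    if not mt or not tgt:
        return 0.0

    # Index the target page: character -> list of positions where it occurs.
    positions = defaultdict(list)
    for j, ch in enumerate(tgt):
        positions[ch].append(j)

    used = set()  # target positions already matched (each is used once)
    shared = 0
    for i, ch in enumerate(mt):
        rel_i = i / len(mt)
        for j in positions.get(ch, ()):
            if j not in used and abs(rel_i - j / len(tgt)) <= tolerance:
                used.add(j)
                shared += 1
                break

    # Dice coefficient: 2 * |shared| / (|MT page| + |target page|)
    return 2 * shared / (len(mt) + len(tgt))


same = position_filtered_dice("abc", "abc")
far = position_filtered_dice("abxxxxxxxx", "yyyyyyyyab")
loose = position_filtered_dice("abxxxxxxxx", "yyyyyyyyab", tolerance=1.0)
```

In the second call the characters `a` and `b` occur in both strings but at opposite ends, so the position check discards them; with the check effectively disabled (`tolerance=1.0`) they count as shared. The greedy one-to-one matching keeps the score between 0 and 1.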

Abstract (Chinese)
Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures
Chapter 1    Introduction
1.1    Background
1.2    Components of the proposed method
1.3    Motivation
Chapter 2    Related Work
Chapter 3    Aligning Bilingual Web Pages
3.1    Gathering and Preprocessing Web Pages
3.2    Using MT systems to translate the Source pages
3.3    Use Position-based Model to Rule Out Unlikely Word Pairs
3.4    Dice Coefficient as Similarity Measure for Source and Target Pages
3.5    Using Bi-direction Ranking to Align Web Pages
Chapter 4    Experiment
4.1    Features of Voice of America News
4.2    Experimental Settings
4.2.1    Preprocessing
4.2.2    Machine Translation
4.2.3    Position-based Model
Chapter 5    Evaluation and Discussion
5.1    Test Data and Evaluation Metrics
5.2    Experiment and Evaluation Results
5.3    Discussion
Chapter 6    Conclusion and Future Work

                                


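Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)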
