Graduate student: |
張至 Chang, Chee |
---|---|
Thesis title: |
Learning to Find Translations and Transliterations on the Web based on Conditional Random Fields |
Advisor: | 陳煥宗 |
Oral defense committee: |
張智星
蔡宗翰 陳煥宗 |
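Degree: |
Master |
Department: |
College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2013 |
Academic year of graduation: | 101 (ROC calendar) |
Language: | English |
Number of pages: | 51 |
Keywords: | machine translation, cross-language information extraction, Wikipedia, conditional random fields |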
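In recent research on natural language processing and information retrieval, cross-lingual systems such as statistical machine translation, cross-language retrieval, cross-lingual knowledge linking, and multilingual knowledge frameworks have mostly been built on large-scale parallel corpora. However, even a very large parallel corpus used for training often fails to cover enough proper nouns and technical terms. In this study, we propose a new machine-learning method for automatically extracting translations and transliterations of proper nouns and technical terms from the Web. In our approach, we obtain a small Chinese-English list of proper nouns and terms from the cross-language article links between the Chinese and English Wikipedia, and use a web search engine to retrieve mixed Chinese-English web page snippets. We then use this list to annotate the snippets automatically. Drawing on several freely available external knowledge sources, including the bilingual WordNet knowledge framework of the Academia Sinica Institute of Information Science, the Chinese-English terminology translation tables of the National Institute for Compilation and Translation, and transliteration tables of person and place names extracted from Wikipedia, we automatically generate four kinds of features: translation features, transliteration features, surface pattern features, and distance features. We then use the CRF++ toolkit to train a conditional random field (CRF) model automatically. At runtime, given an English term or proper noun supplied by the user, we retrieve mixed Chinese-English snippets through a web search engine, apply the trained CRF model to extract candidate translations or transliterations, and output the most frequent candidate. Preliminary experiments show that, under a similar evaluation procedure, the proposed method effectively combines the features listed above, and that its accuracy and coverage both exceed those of previous work by about 10 percent.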
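The automatic annotation step described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the B/I/O tag set, the tokenizer, the helper names (`tokenize`, `find_all`, `annotate`, `to_crfpp`), and the exact definition of the distance feature are all assumptions made for illustration. The only format detail taken as given is that CRF++ reads one token per line with whitespace-separated columns, the last column being the label.

```python
import re

# Assumed tokenizer: a run of Latin letters/digits is one token,
# each CJK character is one token, any other non-space character is one token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]|\S")


def tokenize(text):
    return TOKEN_RE.findall(text)


def find_all(tokens, pattern):
    """Return every start index where `pattern` occurs as a sub-list of `tokens`."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


def annotate(snippet, term, translation):
    """Tag a snippet using one known (term, translation) pair.

    Returns (token, distance, tag) rows, where tag is B/I for tokens of the
    known translation and O otherwise, and distance is the number of tokens
    to the nearest token of the English term (-1 if the term is absent).
    """
    tokens = tokenize(snippet)
    lowered = [t.lower() for t in tokens]
    term_pat = [t.lower() for t in tokenize(term)]
    trans_pat = tokenize(translation)

    tags = ["O"] * len(tokens)
    for start in find_all(tokens, trans_pat):
        tags[start] = "B"
        for k in range(start + 1, start + len(trans_pat)):
            tags[k] = "I"

    # Positions of every token that belongs to an occurrence of the term.
    term_positions = [
        start + k
        for start in find_all(lowered, term_pat)
        for k in range(len(term_pat))
    ]

    rows = []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        dist = min((abs(i - p) for p in term_positions), default=-1)
        rows.append((tok, dist, tag))
    return rows


def to_crfpp(rows):
    """Serialize rows as tab-separated columns, label last, one token per line."""
    return "\n".join(f"{tok}\t{dist}\t{tag}" for tok, dist, tag in rows) + "\n"


if __name__ == "__main__":
    rows = annotate(
        "條件隨機域 (conditional random field) 是一種模型",
        "conditional random field",
        "條件隨機域",
    )
    print(to_crfpp(rows))
```

In a fuller pipeline, the translation, transliteration, and surface pattern features would be added as further columns before the label; they are left out here because the abstract does not specify how they are computed.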
In recent years, state-of-the-art cross-lingual systems have been based on parallel corpora. However, it is often difficult to find translations of a given technical term or named entity, even with a very large parallel corpus. In this paper, we present a new method for learning to find translations on the Web for a given term. In our approach, we use a small set of terms and translations to obtain mixed-code snippets returned by a search engine. We then automatically annotate the data with translation tags, automatically generate features to augment the tagged data, and automatically train a conditional random fields model for identifying translations. At runtime, we obtain mixed-code webpages containing the given term, and run the model to extract translations as output. Preliminary experiments and evaluation results show that our method cleanly combines various features, resulting in a system that outperforms previous work.
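The runtime step can be sketched in the same hedged spirit. The Chinese-language abstract notes that the final output is the most frequent candidate across the retrieved snippets. The sketch below assumes the trained model has already assigned B/I/O tags to each snippet's tokens; the tag scheme and the function names `extract_spans` and `best_translation` are illustrative assumptions, and the tagged input is hand-written rather than real model output.

```python
from collections import Counter


def extract_spans(tokens, tags):
    """Collect token spans tagged B followed by zero or more I."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append("".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            # An O tag (or a stray I with no preceding B) closes any open span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


def best_translation(tagged_snippets):
    """Return the most frequent extracted span, or None if nothing was tagged."""
    counts = Counter(
        span
        for tokens, tags in tagged_snippets
        for span in extract_spans(tokens, tags)
    )
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Hypothetical tagger output for three snippets about the term "Wikipedia".
    tagged = [
        (["維", "基", "百", "科", "(", "Wikipedia", ")"],
         ["B", "I", "I", "I", "O", "O", "O"]),
        (["Wikipedia", "維", "基", "百", "科"],
         ["O", "B", "I", "I", "I"]),
        (["維", "基", "的", "Wikipedia"],
         ["B", "I", "O", "O"]),
    ]
    print(best_translation(tagged))
```

Voting by frequency means that a single mis-tagged snippet is outweighed as long as most snippets yield the same span.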