Graduate student: | 吳俊鍇 Wu, Chun-Kai |
---|---|
Thesis title: | 基於弱監督式跨語言條目嵌入模型之跨百科跨語言文章鏈結 Cross-Language Cross-Encyclopedia Article Linking based on Weakly Supervised Cross-Encyclopedia Entity Embedding |
Advisors: | 許聞廉 Hsu, Wen-Lian; 蔡宗翰 Tsai, Tzong-Han |
Oral defense committee: | 宋定懿 Sung, Ting-Yi; 蘇柏齊 Su, Po-Chyi |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | English |
Number of pages: | 33 |
Keywords (Chinese): | 百科全書, 跨語言, 文章連結, 特徵學習 |
Keywords (English): | Encyclopedia, Cross-lingual, Article-linking, Feature learning |
Cross-language article linking (CLAL) is the task of finding corresponding article pairs in different languages across encyclopedias. It is a difficult disambiguation problem: the correct article must be selected from several candidates with similar titles or content. Most existing work relies on feature engineering, developing text-based or link-based features; this is time-consuming, and some of the features apply only within a single encyclopedia. In this thesis, we address these problems by proposing a cross-encyclopedia entity embedding model. Unlike previous work, the proposed method does not rely on known cross-language article links. We apply it to CLAL between English Wikipedia and Chinese Baidu Baike. Our features improve performance over the baseline by 29.62%, whereas the current best system improves on the baseline by only 26.86%. Across 30 experimental runs, our system achieves an average improvement of 2.76% over the current best system, and a statistical test shows the difference between the two methods to be significant.
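The abstract does not say how the entity embeddings are turned into features, so the following is only a minimal sketch of the disambiguation step it describes, assuming that articles from both encyclopedias are already embedded in one shared vector space and that candidates are ranked by cosine similarity. The function names, candidate titles, and vectors are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_candidate(source_vec, candidates):
    """Return the title of the candidate whose embedding is closest to source_vec.

    `candidates` maps a candidate article title to its embedding.
    Assumption: both encyclopedias share one embedding space.
    """
    return max(candidates, key=lambda title: cosine(source_vec, candidates[title]))


# Toy, made-up embeddings for one source article and three candidate articles.
source = [0.9, 0.1, 0.3]
candidates = {
    "candidate_a": [0.8, 0.2, 0.4],
    "candidate_b": [0.1, 0.9, 0.2],
    "candidate_c": [0.3, 0.3, 0.9],
}
best = pick_candidate(source, candidates)
```

In a feature-based system, a similarity score like this would more plausibly serve as one input to a classifier than as the final decision on its own.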