Author: Wu, Chun-Kai (吳俊鍇)
Thesis title: Cross-Language Cross-Encyclopedia Article Linking based on Weakly Supervised Cross-Encyclopedia Entity Embedding (基於弱監督式跨語言條目嵌入模型之跨百科跨語言文章鏈結)
Advisors: Hsu, Wen-Lian (許聞廉); Tsai, Tzong-Han (蔡宗翰)
Oral defense committee: Sung, Ting-Yi (宋定懿); Su, Po-Chyi (蘇柏齊)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: English
Number of pages: 33
Keywords (Chinese): 百科全書, 跨語言, 文章連結, 特徵學習
Keywords (English): Encyclopedia, Cross-lingual, Article-linking, Feature learning
  • Cross-language article linking (CLAL) is the task of finding corresponding article pairs written in different languages across encyclopedias. It is a difficult disambiguation problem: the correct article must be selected from several candidates with similar titles or contents. Most existing work relies on feature engineering, developing text-based or link-based features; this is time-consuming, and some of these features are applicable only within a single encyclopedia. In this thesis, we address these problems by proposing a cross-encyclopedia entity embedding model. Unlike previous work, the proposed method does not rely on known cross-language article links. We apply it to CLAL between English Wikipedia and Chinese Baidu Baike. Our features improve performance by 29.62% relative to the baseline, whereas the current best system improves on the baseline by only 26.86%. Across 30 test runs, our system achieves an average improvement of 2.76% over the current best system, and a statistical test shows that the difference is significant.
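
    As a rough illustration of the disambiguation step described in the abstract, the sketch below ranks candidate articles by cosine similarity to a source article in a shared embedding space. This record does not state how the thesis scores candidates, so the scoring function, the names `cosine` and `rank_candidates`, the candidate titles, and the toy vectors are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_vec, candidates):
    """Return (title, score) pairs sorted from most to least similar.

    `candidates` maps a candidate article title to its embedding vector.
    """
    scored = [(title, cosine(source_vec, vec)) for title, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors, invented for illustration (not from the thesis).
source = [0.9, 0.1, 0.3]  # hypothetical embedding of the source-language article
candidates = {
    "candidate_a": [0.8, 0.2, 0.4],
    "candidate_b": [0.1, 0.9, 0.2],
    "candidate_c": [0.3, 0.3, 0.9],
}

ranking = rank_candidates(source, candidates)
best_title = ranking[0][0]
print(best_title)
```

    In this sketch the top-ranked candidate is taken as the linked article; a real system would first need a candidate selection step and embeddings that place articles from both encyclopedias in one space.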

    Abstract (Chinese)
    Abstract
    1 Introduction
      1.1 Motivation
      1.2 Problem Description
        1.2.1 Articles in Online Encyclopedias
        1.2.2 Formal Definition of Cross-language article linking
      1.3 Goal
    2 Related Works
      2.1 Article linking
        2.1.1 Entity linking
        2.1.2 Cross-language article linking
        2.1.3 Cross-language encyclopedia article linking
      2.2 Representation Learning
        2.2.1 Word Embedding
        2.2.2 Entity Embedding
    3 Methodology
      3.1 Problem Definition of Cross-language Article Linking
        3.1.1 Candidate Selection
        3.1.2 Candidate Ranking
      3.2 Cross-Encyclopedia Entity Embedding Model
      3.3 Training Data Compilation for Cross-Encyclopedia Entity Embedding
      3.4 Learning Cross-Encyclopedia Entity Embedding
    4 Experiment and Results
      4.1 Evaluation Set
      4.2 Evaluation Metrics
      4.3 Cross-Language Article Linking Results
      4.4 Qualitative Analysis
    5 Discussion
    6 Conclusion
    References

