
Author: To, Wei-Jen (杜惟真)
Thesis title: A Hybrid Learning-based Method for Estimating Word Similarity in Collocation Clustering (綜合知識庫與統計的字詞相似度預測方法)
Advisor: Chang, Jason (張俊盛)
Committee members: Chen, Hao-Jan (陳浩然); Soo, Von-Wun (蘇豐文)
Degree: Master
Department:
Year of publication: 2017
Academic year of graduation: 105
Language: English
Number of pages: 43
Keywords (Chinese): 搭配詞相似度, 相似字擷取, 機器學習
Keywords (English): Collocation Similarity, Synonym Retrieval, Machine Learning
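  • We propose a machine learning method for predicting the similarity between collocates that share the same headword. The method combines knowledge-based and statistical approaches to generate features from several corpora, and these features serve as training data for a machine learning model. We derive headword-aware word similarity features from the WordNet synset structure and its related information, from Web-scale n-grams, and from Chinese-English bilingual translation data. We built a collocate similarity prediction system, ColloSim, which predicts the similarity of collocates in the Oxford Collocations Dictionary. A preliminary experimental evaluation shows that the proposed method performs better than state-of-the-art methods. The results also show that, when the headword affects collocate similarity, our method slightly improves the accuracy of similarity prediction.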


    We introduce a method for learning to estimate the similarity between collocations. In our approach, collocate pairs under a given headword are transformed into thesaurus-based and distributional similarity features drawn from multiple sources. The method automatically generates similarity features, including WordNet-based, n-gram-based, translation-based, and headword-sensitive features, for predicting the similarity between given collocate pairs. We present a similarity estimation prototype, ColloSim, that applies the method to collocations from a collocation dictionary. Evaluation on a set of collocates and their headwords shows that the method achieves reasonably good performance, comparable to the state of the art. Our method also supports estimating the similarity of headword-sensitive collocations, which further improves the accuracy of predicting semantic similarity between collocate pairs.
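The headword-sensitive feature mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the toy n-grams, the function names, and the choice of cosine similarity over raw co-occurrence counts are all assumptions made for illustration. The sketch builds a context vector for each collocate from only those n-grams that also contain the headword, then compares the two vectors.

```python
from collections import Counter
from math import sqrt

# Toy n-grams (hypothetical data, not taken from the thesis or any corpus).
NGRAMS = [
    ("play", "an", "important", "role"),
    ("play", "a", "crucial", "role"),
    ("play", "an", "important", "role"),
    ("a", "crucial", "role", "in"),
    ("an", "important", "role", "in"),
    ("a", "minor", "role", "as"),
]


def context_vector(collocate, headword, ngrams):
    """Count the words that co-occur with `collocate` in n-grams
    that also contain `headword` (assumed feature definition)."""
    counts = Counter()
    for gram in ngrams:
        if collocate in gram and headword in gram:
            counts.update(w for w in gram if w not in (collocate, headword))
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def headword_sensitive_similarity(c1, c2, headword, ngrams=NGRAMS):
    """Similarity of two collocates, restricted to contexts of `headword`."""
    return cosine(
        context_vector(c1, headword, ngrams),
        context_vector(c2, headword, ngrams),
    )


if __name__ == "__main__":
    print(headword_sensitive_similarity("important", "crucial", "role"))
    print(headword_sensitive_similarity("important", "minor", "role"))
```

In a setup like the one the abstract describes, a score of this kind would be one entry in a feature vector alongside WordNet-based, n-gram-based, and translation-based features, and the combined vector would be passed to a classifier.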

    Abstract
    Abstract (Chinese)
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Related Work
        2.1 Thesaurus-based Similarity
        2.2 Distributional Similarity
    3 Methodology
        3.1 Problem Statement
        3.2 Similarity Features based on WordNet
            3.2.1 Verbal Feature
            3.2.2 Adjectival Features
        3.3 Extract n-gram based Feature
        3.4 Similarity Features based on Translations
        3.5 Headword-Sensitive Feature based on n-gram
            3.5.1 Retrieving and Filtering N-grams
            3.5.2 Transforming N-grams to N-frames
            3.5.3 Generating Context Vector
            3.5.4 Computing Headword-sensitive Similarity
    4 Experiment and Evaluation
        4.1 Experimental Setting
            4.1.1 Dataset
            4.1.2 Feature Generation Resources and Parameters
            4.1.3 Classification Model and Parameter Settings
        4.2 System Compared
        4.3 Evaluation Results
    5 Conclusion and Future Work
    Reference
