
Graduate Student: Chang, Po-Feng (張伯豐)
Thesis Title: Exploring Sequential Pairing Similarity between Two Short Sentences (探索兩短句間之順序配對相似性)
Advisors: Lee, Yuh-Jye (李育杰); Chang, Chieh-Yu (張介玉)
Committee Members: Yeh, Yi-Ren (葉倚任); Chen, Yi-Shin (陳宜欣)
Degree: Master
Department: Department of Mathematics, College of Science (理學院 - 數學系)
Year of Publication: 2019
Graduation Academic Year: 107
Language: English
Number of Pages: 30
Keywords (Chinese): 短句相似度、動態時間校正、單詞配對、單詞相似度、模型轉移
Keywords (English): short sentence similarity, dynamic time warping, word pairing, word similarity, model transfer
    Judging the similarity of short sentences is an important branch of natural language processing, with applications in many fields such as information retrieval, chatbots, fake news detection, and social media message analysis. Approaches to the short sentence similarity problem can be roughly divided into two categories: the first represents sentences as vectors and compares the similarity between the two vectors; the second works at the word level, first reducing the sentence similarity problem to similarities between words, and then making a prediction by aggregating the word-to-word similarity scores. In this thesis, we focus on the second category. A common approach in this category uses syntactic parsing to determine sentence structure; this requires grammatical knowledge, so its performance may be less satisfactory on more colloquial datasets (such as Facebook and Twitter). We therefore propose a framework for judging short sentence similarity that requires no grammatical knowledge: considering the influence of word order on sentence semantics, we take dynamic time warping as the core, use the path found during its alignment process to reduce the problem to word-to-word similarity comparisons, and use these results to obtain a similarity score for the two sentences.
    In the experimental stage, we investigate the choice of each parameter in the framework under two scenarios. In the first scenario, the framework is used to judge the similarity between short sentences; we discuss suitable choices for each parameter and on which kinds of sentences each choice performs better. In the second scenario, the framework is used to extract the relatedness of two sentences; depending on the ensemble model chosen (SVM, XGBoost, or LightGBM), we discuss which parameter combinations can extract more representative information, and examine the consensus among the different ensemble models. According to the experimental results, the performance in the first scenario is unsatisfactory, because each of our methods is designed for a specific situation; however, since these methods are complementary, they perform well in the second scenario. In addition, since one of the datasets we use has a large number of samples and sentences from a wide range of sources, we tried applying the model trained on it to the other dataset. The results show that this kind of generalization is worth exploring, and how to improve it will be a direction of our future research.


    Short sentence similarity is an important branch of natural language processing. It has been applied in many fields, such as information retrieval, chatbots, fake news detection, and social media message analysis. Methods for short sentence similarity can be separated into two categories. One represents sentences as vectors and uses the similarity between the vectors as the similarity score. The other simplifies the problem into similarities between words, and then integrates those word similarities to make the prediction. In this thesis, we focus on the second category. One common method is to use syntactic analysis to determine the structure of sentences. However, this method requires knowledge of grammar, so its performance might not be good on colloquial data, such as Facebook and Twitter posts. Therefore, we propose an architecture that does not require grammar knowledge. Considering the influence of word order on the semantics of sentences, we use dynamic time warping (DTW) as the core of our architecture, use the path it finds to pair words, and thereby simplify the problem into similarities between words.
    In the experiments, we explore the choice of parameters in our architecture under two scenarios. The first scenario uses the proposed architecture to determine the similarity between short sentences. The second extracts the relevance of two sentences through the architecture. Results show that the performance in the first scenario is not good. This is because each choice is designed for a specific situation and does not generalize to the whole dataset. However, since those choices are complementary, the results are good in the second scenario. On the other hand, since one of our datasets contains a large number of samples spanning a wide range of categories, we tried applying the model trained on it to the other dataset. The results show that this idea is worth pursuing, and we leave its refinement as future work.
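    As a rough illustration of the word-pairing idea described above, the sketch below aligns two word sequences with basic DTW, recovers the warping path as word pairs, and averages the word-pair similarities along the path into a sentence score. The toy word vectors and the "1 minus cosine" distance are placeholder assumptions standing in for the word2vec- and WordNet-based similarity functions used in the thesis.

```python
import math

# Hypothetical 2-d word vectors for illustration only; the thesis uses
# pretrained word2vec embeddings and WordNet-based similarities instead.
VEC = {
    "the":    [0.10, 0.20],
    "cat":    [0.90, 0.10],
    "kitten": [0.85, 0.20],
    "sat":    [0.20, 0.90],
    "sits":   [0.25, 0.85],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def word_dist(w1, w2):
    # word-level distance: 1 - cosine similarity of the word vectors
    return 1.0 - cosine(VEC[w1], VEC[w2])

def dtw_pair(s1, s2):
    """Align two word sequences with basic DTW and return the warping
    path as a list of (word_from_s1, word_from_s2) pairs."""
    n, m = len(s1), len(s2)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = word_dist(s1[i - 1], s2[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # backtrack from (n, m) to recover the alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((s1[i - 1], s2[j - 1]))
        step = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
        if step == D[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(path))

def sentence_similarity(s1, s2):
    # aggregate the word-pair similarities along the DTW path (simple mean)
    path = dtw_pair(s1, s2)
    return sum(1.0 - word_dist(a, b) for a, b in path) / len(path)

pairs = dtw_pair(["the", "cat", "sat"], ["the", "kitten", "sits"])
score = sentence_similarity(["the", "cat", "sat"], ["the", "kitten", "sits"])
```

    Because the path is monotone in both sequences, similar words that appear in the same order are paired up, which is how word order influences the final score in this scheme.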

    1 Introduction 1
    2 Related Works 3
      2.1 Word Count 3
      2.2 Longest Common Subsequence 3
      2.3 Word2vec 4
      2.4 Word Mover's Distance 5
    3 Proposed Methods 6
      3.1 Algorithms 7
        3.1.1 Basic Dynamic Time Warping 7
        3.1.2 Advanced Dynamic Time Warping 8
        3.1.3 Subsentence of the Longer Sentence 8
      3.2 Distance Functions and Similarity Functions 9
        3.2.1 Word2vec Based Methods 9
        3.2.2 WordNet Based Methods 10
    4 Experimental Results 12
      4.1 Datasets 12
        4.1.1 STS Benchmark Dataset (STS) 12
        4.1.2 Question Pairs Dataset (Quora) 13
      4.2 Preprocessing 13
      4.3 Results in the First Scenario 13
      4.4 Results in the Second Scenario 16
      4.5 Feature Selection 18
        4.5.1 Feature Selection via SVM 19
        4.5.2 Feature Selection via XGBoost 21
        4.5.3 Feature Selection via LightGBM 23
      4.6 Transferability of Ensemble Models 25
    5 Conclusions and Future Works 27
    References 29

    [1] Juan Ramos et al. "Using tf-idf to determine word relevance in document queries". In: Proceedings of the First Instructional Conference on Machine Learning. Vol. 242. Piscataway, NJ, 2003, pp. 133–142.
    [2] Matt Kusner et al. "From word embeddings to document distances". In: International Conference on Machine Learning. 2015, pp. 957–966.
    [3] Lasse Bergroth, Harri Hakonen, and Timo Raita. "A survey of longest common subsequence algorithms". In: Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000). IEEE, 2000, pp. 39–48.
    [4] Rafael Ferreira et al. "A new sentence similarity method based on a three-layer sentence representation". In: Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01. IEEE Computer Society, 2014, pp. 110–117.
    [5] Tomas Mikolov et al. "Efficient estimation of word representations in vector space". In: arXiv preprint arXiv:1301.3781 (2013).
    [6] Tomas Mikolov et al. "Distributed representations of words and phrases and their compositionality". In: Advances in Neural Information Processing Systems. 2013, pp. 3111–3119.
    [7] STS benchmark dataset. URL: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark.
    [8] Question pairs dataset. URL: https://www.kaggle.com/quora/question-pairs-dataset.
    [9] Chih-Chung Chang and Chih-Jen Lin. "LIBSVM: A library for support vector machines". In: ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011), p. 27.
    [10] Tianqi Chen and Carlos Guestrin. "XGBoost: A scalable tree boosting system". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785–794.
    [11] Guolin Ke et al. "LightGBM: A highly efficient gradient boosting decision tree". In: Advances in Neural Information Processing Systems. 2017, pp. 3146–3154.
