Graduate Student: 何品翰 (Ho, Pin-Han)
Thesis Title: 中文語意搭配詞之預測 (Prediction of Chinese Meaningful Word Pairs)
Advisor: 許聞廉 (Hsu, Wen-Lian)
Committee Members: 馬偉雲 (Ma, Wei-Yun), 戴敏育 (Day, Min-Yuh)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Publication Year: 2024
Graduation Academic Year: 112
Language: Chinese
Pages: 55
Chinese Keywords: 簡化法, 中文語意搭配詞, 知識庫, 自動標註
English Keywords: Reduction Method, Chinese Meaningful Word Pairs, Knowledge Base, Auto-labeling
This thesis addresses the prediction of Chinese meaningful word pairs, applied primarily in a reduction method. The reduction method simplifies complex sentences into simple ones: a sentence containing modifiers is reduced to its core so that the main subject-verb-object (SVO) structure can be extracted. In addition, by identifying the modifying relations between words, the modifiers of specific target words in a sentence can be collected to form a dependency structure. Simplifying complex sentences in this way captures both the sentence structure and its modifying relations, which benefits subsequent parsing.
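To make the reduction concrete, the following minimal Python sketch extracts an SVO core and modifier pairs from a dependency parse. The parse is hard-coded here (in practice it would come from a parser such as HIT's LTP), and the example sentence, its parse, and the relation labels are illustrative assumptions, not taken from the thesis.

```python
# A minimal sketch of the reduction idea: given a dependency parse,
# extract the SVO core and collect (modifier, head) word pairs.
# Relation labels follow LTP's scheme (SBV = subject, VOB = object,
# ATT = attribute, HED = root); the parse itself is hypothetical.
parse = [
    # (index, word, head index, relation); head 0 marks the root verb
    (1, "聰明", 2, "ATT"),   # "clever"  modifies "student"
    (2, "學生", 3, "SBV"),   # "student" is the subject
    (3, "解出", 0, "HED"),   # "solves"  is the root verb
    (4, "困難", 5, "ATT"),   # "hard"    modifies "problem"
    (5, "題目", 3, "VOB"),   # "problem" is the object
]
words = {i: w for i, w, _, _ in parse}

# SVO core: the root verb plus its SBV and VOB dependents.
root = next(i for i, _, h, _ in parse if h == 0)
subj = next((i for i, _, h, r in parse if h == root and r == "SBV"), None)
obj = next((i for i, _, h, r in parse if h == root and r == "VOB"), None)
print("SVO core:", words.get(subj), words[root], words.get(obj))

# Modifier pairs: every arc whose relation is a modifying category.
MOD_RELS = {"ATT", "ADV"}  # a subset; the thesis collects 5 categories
pairs = [(w, words[h], r) for i, w, h, r in parse if r in MOD_RELS]
print("modifier pairs:", pairs)
```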
The Chinese meaningful word pairs in this thesis are word pairs within a sentence that stand in a modifying relation; five modifying categories are currently collected. The proposed method combines a neural network model with a knowledge base to predict the modifying relation. The model is a multi-class classifier consisting of a pre-trained BERT model and a classification layer; the knowledge base stores common modifier word pairs and is used to predict likely modifying relations. The outputs of the two components are combined to produce the final predicted relation and category.
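The two-component design can be sketched as follows with the Hugging Face transformers API. The model checkpoint, label set, knowledge-base contents, and the combination rule (trust the knowledge base when it records exactly one category, otherwise fall back to the model) are illustrative assumptions, not the thesis's exact configuration.

```python
# A sketch of the two-component predictor: knowledge-base lookup plus
# a BERT-based multi-class classifier.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical label set: five modifying categories plus a "no relation"
# class (the thesis's exact category names are not reproduced here).
LABELS = ["ATT", "ADV", "CMP", "COO", "POB", "NONE"]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# The classification head below is untrained; in practice it would be
# fine-tuned on the auto-labeled corpus described in the thesis.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))
model.eval()

# Hypothetical knowledge base: word pair -> set of observed categories.
kb = {("聰明", "學生"): {"ATT"}}

def predict(sentence: str, w1: str, w2: str) -> str:
    # Assumed combination rule: trust the knowledge base when it
    # records exactly one category for this pair.
    cats = kb.get((w1, w2))
    if cats is not None and len(cats) == 1:
        return next(iter(cats))
    # Otherwise fall back to the neural model: encode the word pair
    # together with the sentence as context and take the argmax class.
    inputs = tokenizer(f"{w1} [SEP] {w2}", sentence,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(predict("聰明的學生解出了困難的題目", "聰明", "學生"))  # -> "ATT"
```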
Regarding the dataset, the corpora come from the Harbin Institute of Technology (HIT) corpus and an elementary school mathematics corpus; both were constructed and annotated in this research. For annotation, word pairs with a modifying relation were labeled automatically using HIT's parser. Non-modifying pairs were collected by pairing each target word (a noun or a verb) with the words within a specific range before and after it.
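The collection of non-modifying (negative) pairs can be sketched as follows; the window size, POS tags, and example tokens are hypothetical.

```python
# A sketch of negative-pair collection: pair each target word (noun or
# verb) with its neighbors inside a fixed window, excluding pairs the
# parser already labeled as modifying.
WINDOW = 2  # hypothetical range before/after the target word

tokens = [("聰明", "ADJ"), ("學生", "N"), ("解出", "V"),
          ("困難", "ADJ"), ("題目", "N")]
positive = {("聰明", "學生"), ("困難", "題目")}  # auto-labeled modifier pairs

negatives = []
for i, (word, pos) in enumerate(tokens):
    if pos not in {"N", "V"}:              # target words: nouns and verbs
        continue
    lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        pair = (tokens[j][0], word)        # (context word, target word)
        if pair not in positive:
            negatives.append(pair)

print(negatives)
```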
The experimental results show that combining the knowledge base with the neural network model yields higher accuracy in predicting modifying relations than using either the knowledge base or the neural network alone. Within the knowledge base, filtering word pairs by category and keeping only pairs with a single modifying category significantly improves the combined prediction performance. In addition, the automatic annotation method reaches over 99% accuracy, confirming its effectiveness and reliability.
[1] "處理自然語言的簡化法," Institute of Information Science, Academia Sinica, https://iptt.sinica.edu.tw/shares/929, May 2023 (Jul. 17, 2024).
[2] "自然語言簡化法 (Reduction)," Taiwan Bioinformatics Institute, http://www.tbi.org.tw/enews/TMBD/Vol38.html, October 2021 (Jul. 17, 2024).
[3] F. Alva-Manchego, C. Scarton, and L. Specia, "Data-Driven Sentence Simplification: Survey and Benchmark," Computational Linguistics, vol. 46, no. 1, pp. 135–187, 2020.
[4] M. T. Nguyen, C. M. Bui, D. T. Le, and T. L. Le, "Sentence compression as deletion with contextual embeddings," in International Conference on Computational Collective Intelligence, Cham: Springer International Publishing, 2020, pp. 427-440.
[5] K. Filippova, "Multi-sentence compression: Finding shortest paths in word graphs," in Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 2010, pp. 322-330.
[6] A. Bordes, S. Chopra, and J. Weston, "Question answering with subgraph embeddings," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 615–620.
[7] A. Madotto, C. Wu, and P. Fung, "Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems," in Proceedings of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1468–1478.
[8] C. Xiong, R. Power, and J. Callan, "Explicit semantic ranking for academic search via knowledge graph embedding," in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1271–1279..
[9] X. Han, et al., "More data, more relations, more context and more openness: A review and outlook for relation extraction," arXiv preprint arXiv:2004.03186, 2020.
[10] H. Wang, K. Qin, R. Y. Zakari, et al., "Deep neural network-based relation extraction: an overview," Neural Comput & Applic, vol. 34, pp. 4781–4801, 2022.
[11] P. Zhou, S. Zheng, J. Xu, Z. Qi, H. Bao, and B. Xu, "Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Network," in Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2017, Lecture Notes in Computer Science, vol. 10565, M. Sun, X. Wang, B. Chang, and D. Xiong, Eds. Cham: Springer, 2017.
[12] H. Peng, et al., "Learning from context or names? an empirical study on neural relation extraction," arXiv preprint arXiv:2010.01923, 2020.
[13] Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang, "Pre-Training With Whole Word Masking for Chinese BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3504-3514, 2021.
[14] Y. Huang, et al., "D‐BERT: Incorporating dependency‐based attention into BERT for relation extraction," CAAI Transactions on Intelligence Technology, vol. 6, no. 4, pp. 417-425, 2021..
[15] J. Devlin, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] S. Wang, "The Survey of Joint Entity and Relation Extraction," in Computing and Data Science. CONF-CDS 2021. Communications in Computer and Information Science, vol. 1513, W. Cao, A. Ozcan, H. Xie, and B. Guan, Eds. Singapore: Springer, 2021.
[17] M. Shardlow, "A survey of automated text simplification," International Journal of Advanced Computer Science and Applications, vol. 4, no. 1, pp. 58-70, 2014.
[18] S. Štajner, K. C. Sheang, and H. Saggion, "Sentence simplification capabilities of transfer-based models," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 12172-12180.
[19] J. Clarke and M. Lapata, "Global inference for sentence compression: An integer linear programming approach," J. Artif. Intell. Res., vol. 31, pp. 399-429, 2008.
[20] K. Filippova and Y. Altun, "Overcoming the lack of parallel data in sentence compression," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1481-1491.
[21] M. E. Califf and R. J. Mooney, "Relational learning of pattern-match rules for information extraction," in Proceedings of CoNLL, 1997.
[22] K. Fundel, R. Küffner, and R. Zimmer, "RelEx—Relation extraction using dependency parse trees," Bioinformatics, vol. 23, no. 3, pp. 365-371, 2007.
[23] M. Cui, et al., "A survey on relation extraction," in Knowledge Graph and Semantic Computing. Language, Knowledge, and Intelligence: Second China Conference, CCKS 2017, Chengdu, China, August 26–29, 2017, Revised Selected Papers 2, Springer Singapore, 2017, pp. 50-58.
[24] G. Bekoulis, J. Deleu, T. Demeester, et al., "Joint entity recognition and relation extraction as a multi-head selection problem," Expert Syst. Appl., vol. 114, pp. 34–45, 2018.
[25] J. Lee, S. Seo, and Y. S. Choi, "Semantic relation classification via bidirectional LSTM networks with entity-aware attention using latent entity typing," Symmetry, vol. 11, no. 6, p. 785, 2019.
[26] S. Wu and Y. He, "Enriching pre-trained language model with entity information for relation classification," arXiv preprint arXiv:1905.08284, 2019.
[27] M. Eberts and A. Ulges, "Span-based joint entity and relation extraction with transformer pretraining," arXiv preprint arXiv:1909.07755, 2019.
[28] C. Dong, et al., "A survey of natural language generation," ACM Computing Surveys, vol. 55, no. 8, pp. 1-38, 2022.
[29] Y. Safovich and A. Azaria, "Fiction sentence expansion and enhancement via focused objective and novelty curve sampling," in 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2020, pp. 835-843.
[30] T. Iqbal and S. Qureshi, "The survey: Text generation models in deep learning," Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 6, pp. 2515-2528, 2022.
[31] Y. Zhang, Y. Wang, and J. Yang, "Lattice LSTM for Chinese sentence representation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1506-1519, 2020.
[32] J. Devlin, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.