Graduate Student: 何品翰 (Ho, Pin-Han)
Thesis Title: 中文語意搭配詞之預測 (Prediction of Chinese Meaningful Word Pairs)
Advisor: 許聞廉 (Hsu, Wen-Lian)
Committee Members: 馬偉雲 (Ma, Wei-Yun), 戴敏育 (Day, Min-Yuh)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications
Publication Year: 2024
Graduation Academic Year: 112
Language: Chinese
Pages: 55
Chinese Keywords: 簡化法, 中文語意搭配詞, 知識庫, 自動標註
English Keywords: Reduction Method, Chinese Meaningful Word Pairs, Knowledge Base, Auto-labeling
This thesis addresses the prediction of Chinese meaningful word pairs, applied primarily in a reduction method. The reduction method simplifies complex sentences into simple ones: a sentence containing modifiers is reduced to its core so that the main subject-verb-object (SVO) structure can be extracted. In addition, by identifying the modifying relations between words, the modifiers of specific target words in a sentence can be collected to form a dependency structure. Simplifying complex sentences in this way captures both the sentence structure and its modifying relations, which benefits subsequent parsing.
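To make the reduction concrete, the following minimal Python sketch extracts an SVO core and modifier pairs from a dependency parse. The parse is hard-coded here (in practice it would come from a parser such as HIT's LTP), and the example sentence, its parse, and the relation labels are illustrative assumptions, not taken from the thesis.

```python
# A minimal sketch of the reduction idea: given a dependency parse,
# extract the SVO core and collect (modifier, head) word pairs.
# Relation labels follow LTP's scheme (SBV = subject, VOB = object,
# ATT = attribute, HED = root); the parse itself is hypothetical.
parse = [
    # (index, word, head index, relation); head 0 marks the root verb
    (1, "聰明", 2, "ATT"),   # "clever"  modifies "student"
    (2, "學生", 3, "SBV"),   # "student" is the subject
    (3, "解出", 0, "HED"),   # "solves"  is the root verb
    (4, "困難", 5, "ATT"),   # "hard"    modifies "problem"
    (5, "題目", 3, "VOB"),   # "problem" is the object
]
words = {i: w for i, w, _, _ in parse}

# SVO core: the root verb plus its SBV and VOB dependents.
root = next(i for i, _, h, _ in parse if h == 0)
subj = next((i for i, _, h, r in parse if h == root and r == "SBV"), None)
obj = next((i for i, _, h, r in parse if h == root and r == "VOB"), None)
print("SVO core:", words.get(subj), words[root], words.get(obj))

# Modifier pairs: every arc whose relation is a modifying category.
MOD_RELS = {"ATT", "ADV"}  # a subset; the thesis collects 5 categories
pairs = [(w, words[h], r) for i, w, h, r in parse if r in MOD_RELS]
print("modifier pairs:", pairs)
```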
The Chinese meaningful word pairs in this thesis are word pairs within a sentence that stand in a modifying relation; five modifying categories are currently collected. The proposed method combines a neural network model with a knowledge base to predict the modifying relation. The model is a multi-class classifier consisting of a pre-trained BERT model and a classification layer; the knowledge base stores common modifier word pairs and is used to predict likely modifying relations. The outputs of the two components are combined to produce the final predicted relation and category.
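The two-component design can be sketched as follows with the Hugging Face transformers API. The model checkpoint, label set, knowledge-base contents, and the combination rule (trust the knowledge base when it records exactly one category, otherwise fall back to the model) are illustrative assumptions, not the thesis's exact configuration.

```python
# A sketch of the two-component predictor: knowledge-base lookup plus
# a BERT-based multi-class classifier.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical label set: five modifying categories plus a "no relation"
# class (the thesis's exact category names are not reproduced here).
LABELS = ["ATT", "ADV", "CMP", "COO", "POB", "NONE"]

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# The classification head below is untrained; in practice it would be
# fine-tuned on the auto-labeled corpus described in the thesis.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))
model.eval()

# Hypothetical knowledge base: word pair -> set of observed categories.
kb = {("聰明", "學生"): {"ATT"}}

def predict(sentence: str, w1: str, w2: str) -> str:
    # Assumed combination rule: trust the knowledge base when it
    # records exactly one category for this pair.
    cats = kb.get((w1, w2))
    if cats is not None and len(cats) == 1:
        return next(iter(cats))
    # Otherwise fall back to the neural model: encode the word pair
    # together with the sentence as context and take the argmax class.
    inputs = tokenizer(f"{w1} [SEP] {w2}", sentence,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(predict("聰明的學生解出了困難的題目", "聰明", "學生"))  # -> "ATT"
```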
Regarding the dataset, the corpora come from the Harbin Institute of Technology (HIT) corpus and an elementary school mathematics corpus; both were constructed and annotated in this research. For annotation, word pairs with a modifying relation were labeled automatically using HIT's parser. Non-modifying pairs were collected by pairing each target word (a noun or a verb) with the words within a specific range before and after it.
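The collection of non-modifying (negative) pairs can be sketched as follows; the window size, POS tags, and example tokens are hypothetical.

```python
# A sketch of negative-pair collection: pair each target word (noun or
# verb) with its neighbors inside a fixed window, excluding pairs the
# parser already labeled as modifying.
WINDOW = 2  # hypothetical range before/after the target word

tokens = [("聰明", "ADJ"), ("學生", "N"), ("解出", "V"),
          ("困難", "ADJ"), ("題目", "N")]
positive = {("聰明", "學生"), ("困難", "題目")}  # auto-labeled modifier pairs

negatives = []
for i, (word, pos) in enumerate(tokens):
    if pos not in {"N", "V"}:              # target words: nouns and verbs
        continue
    lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        pair = (tokens[j][0], word)        # (context word, target word)
        if pair not in positive:
            negatives.append(pair)

print(negatives)
```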
The experimental results show that combining the knowledge base with the neural network model yields higher accuracy in predicting modifying relations than using either the knowledge base or the neural network alone. Within the knowledge base, filtering word pairs by category and keeping only pairs with a single modifying category significantly improves the combined prediction performance. In addition, the automatic annotation method reaches over 99% accuracy, confirming its effectiveness and reliability.
[1] "處理自然語言的簡化法," Institute of Information Science, Academia Sinica, https://iptt.sinica.edu.tw/shares/929, May 2023 (Jul. 17, 2024).
[2] "自然語言簡化法 (Reduction)," Taiwan Bioinformatics Institute, http://www.tbi.org.tw/enews/TMBD/Vol38.html, October 2021 (Jul. 17, 2024).
[3] F. Alva-Manchego, C. Scarton, and L. Specia, "Data-Driven Sentence Simplification: Survey and Benchmark," Computational Linguistics, vol. 46, no. 1, pp. 135–187, 2020.
[4] M. T. Nguyen, C. M. Bui, D. T. Le, and T. L. Le, "Sentence compression as deletion with contextual embeddings," in International Conference on Computational Collective Intelligence, Cham: Springer International Publishing, 2020, pp. 427-440.
[5] K. Filippova, "Multi-sentence compression: Finding shortest paths in word graphs," in Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 2010, pp. 322-330.
[6] A. Bordes, S. Chopra, and J. Weston, "Question answering with subgraph embeddings," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 615–620.
[7] A. Madotto, C. Wu, and P. Fung, "Mem2Seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems," in Proceedings of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1468–1478.
[8] C. Xiong, R. Power, and J. Callan, "Explicit semantic ranking for academic search via knowledge graph embedding," in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1271–1279..
[9] X. Han, et al., "More data, more relations, more context and more openness: A review and outlook for relation extraction," arXiv preprint arXiv:2004.03186, 2020.
[10] H. Wang, K. Qin, R. Y. Zakari, et al., "Deep neural network-based relation extraction: an overview," Neural Comput & Applic, vol. 34, pp. 4781–4801, 2022.
[11] P. Zhou, S. Zheng, J. Xu, Z. Qi, H. Bao, and B. Xu, "Joint Extraction of Multiple Relations and Entities by Using a Hybrid Neural Network," in Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2017, Lecture Notes in Computer Science, vol. 10565, M. Sun, X. Wang, B. Chang, and D. Xiong, Eds. Cham: Springer, 2017.
[12] H. Peng, et al., "Learning from context or names? an empirical study on neural relation extraction," arXiv preprint arXiv:2010.01923, 2020.
[13] Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang, "Pre-Training With Whole Word Masking for Chinese BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3504-3514, 2021.
[14] Y. Huang, et al., "D‐BERT: Incorporating dependency‐based attention into BERT for relation extraction," CAAI Transactions on Intelligence Technology, vol. 6, no. 4, pp. 417-425, 2021..
[15] J. Devlin, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] S. Wang, "The Survey of Joint Entity and Relation Extraction," in Computing and Data Science. CONF-CDS 2021. Communications in Computer and Information Science, vol. 1513, W. Cao, A. Ozcan, H. Xie, and B. Guan, Eds. Singapore: Springer, 2021.
[17] M. Shardlow, "A survey of automated text simplification," International Journal of Advanced Computer Science and Applications, vol. 4, no. 1, pp. 58-70, 2014.
[18] S. Štajner, K. C. Sheang, and H. Saggion, "Sentence simplification capabilities of transfer-based models," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, pp. 12172-12180.
[19] J. Clarke and M. Lapata, "Global inference for sentence compression: An integer linear programming approach," J. Artif. Intell. Res., vol. 31, pp. 399-429, 2008.
[20] K. Filippova and Y. Altun, "Overcoming the lack of parallel data in sentence compression," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1481-1491.
[21] M. E. Califf and R. J. Mooney, "Relational learning of pattern-match rules for information extraction," in Proceedings of CoNLL, 1997.
[22] K. Fundel, R. Küffner, and R. Zimmer, "RelEx—Relation extraction using dependency parse trees," Bioinformatics, vol. 23, no. 3, pp. 365-371, 2007.
[23] M. Cui, et al., "A survey on relation extraction," in Knowledge Graph and Semantic Computing. Language, Knowledge, and Intelligence: Second China Conference, CCKS 2017, Chengdu, China, August 26–29, 2017, Revised Selected Papers 2, Springer Singapore, 2017, pp. 50-58.
[24] G. Bekoulis, J. Deleu, T. Demeester, et al., "Joint entity recognition and relation extraction as a multi-head selection problem," Expert Syst. Appl., vol. 114, pp. 34–45, 2018.
[25] J. Lee, S. Seo, and Y. S. Choi, "Semantic relation classification via bidirectional LSTM networks with entity-aware attention using latent entity typing," Symmetry, vol. 11, no. 6, p. 785, 2019.
[26] S. Wu and Y. He, "Enriching pre-trained language model with entity information for relation classification," arXiv preprint arXiv:1905.08284, 2019.
[27] M. Eberts and A. Ulges, "Span-based joint entity and relation extraction with transformer pretraining," arXiv preprint arXiv:1909.07755, 2019.
[28] C. Dong, et al., "A survey of natural language generation," ACM Computing Surveys, vol. 55, no. 8, pp. 1-38, 2022.
[29] Y. Safovich and A. Azaria, "Fiction sentence expansion and enhancement via focused objective and novelty curve sampling," in 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2020, pp. 835-843.
[30] T. Iqbal and S. Qureshi, "The survey: Text generation models in deep learning," Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 6, pp. 2515-2528, 2022.
[31] Y. Zhang, Y. Wang, and J. Yang, "Lattice LSTM for Chinese sentence representation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1506-1519, 2020.
[32] J. Devlin, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.