Author: Huang, Shen-Hua (黃甚華)
Thesis title: The Application of Semantic Reduction in Biomedical Literature - A Case Study of Relation Extraction Between Two Genes (語意簡化法在生醫文獻上的應用-以兩個基因的關係抽取為例)
Advisor: Hsu, Wen-Lian (許聞廉)
Committee members: Day, Min-Yuh (戴敏育); Chang, Yung-Chun (張詠淳)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2024
Academic year of graduation: 113 (ROC calendar)
Language: Chinese
Number of pages: 39
Keywords: Reduction Method (簡化法), Interpretable System (可解釋系統), Protein-Protein Interaction (蛋白質交互作用), Biomedical Relation Extraction (生醫關聯擷取)
    This study builds an interpretable system for extracting relations between genes. Its core technique is the reduction method: complex sentences are simplified by merging modifiers, which yields simpler structures and lets the system cover a wider range of sentences with fewer patterns. When the system misjudges a sentence, the error can be corrected quickly by adding knowledge; when it identifies a relation, it can explain the reasoning behind its decision.

    For data, we use the recent PEDD dataset [21], which is comparatively large and provides more samples for training and testing. PEDD was annotated by experts with biomedical backgrounds, which supports the accuracy and consistency of the data. It was built for the AI CUP Biomedical Paper Analysis Competition, giving it a practical application context that fits current research needs. Its source literature was published mainly between 2015 and 2018, so it is more current than many older datasets that draw on earlier literature.

    Preliminary experimental results show that the reduction method achieves higher accuracy and can also state clearly why each target matches one of our patterns. This interpretability makes the system's decision logic easier to understand and improves the transparency and reliability of its results.
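
    The idea of reducing a sentence and then matching it against a small pattern set can be illustrated with a minimal sketch. This is not the thesis implementation: the modifier list, the trigger verbs, the single pattern, the placeholder gene names, and the function names `reduce_sentence` and `match_pattern` are all assumptions made for illustration, and the thesis's own resources (for example its Positive List, Negative List, and Trigger Verb List) are not reproduced in this record.

```python
import re

# Hypothetical toy resources, assumed for illustration only.
MODIFIERS = {"directly", "strongly", "significantly", "novel", "nuclear"}
TRIGGER_VERBS = {"activates", "inhibits", "binds", "phosphorylates"}


def reduce_sentence(sentence: str) -> str:
    """Toy reduction: drop parenthetical asides and known modifier words."""
    without_parens = re.sub(r"\s*\([^)]*\)", "", sentence)
    tokens = without_parens.rstrip(".").split()
    kept = [tok for tok in tokens if tok.lower() not in MODIFIERS]
    return " ".join(kept)


def match_pattern(reduced: str, gene_a: str, gene_b: str) -> dict:
    """Match the single pattern '<gene_a> <trigger verb> <gene_b>'.

    Returns the matched trigger and text span so the decision can be explained.
    """
    tokens = reduced.split()
    for i in range(len(tokens) - 2):
        first, verb, second = tokens[i:i + 3]
        if first == gene_a and verb.lower() in TRIGGER_VERBS and second == gene_b:
            return {
                "relation": True,
                "trigger": verb.lower(),
                "evidence": " ".join(tokens[i:i + 3]),
            }
    return {"relation": False, "trigger": None, "evidence": None}


# Placeholder gene names; no biological claim is intended.
sentence = "GeneA (a kinase) directly binds nuclear GeneB."
reduced = reduce_sentence(sentence)
print(reduced)
print(match_pattern(reduced, "GeneA", "GeneB"))
```

    In this sketch one three-token pattern covers the original sentence only because the parenthetical and the modifiers were removed first, and the returned `trigger` and `evidence` fields stand in for the kind of explanation the abstract describes. A realistic system would rely on syntactic analysis rather than a fixed word list.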

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Chapter 1 Introduction
        1.1 Research Motivation and Objectives
        1.2 Research Contributions
        1.3 Thesis Organization
    Chapter 2 Related Work
        2.1 Sentence Simplification
        2.2 Extraction Systems
    Chapter 3 Method
        3.1 Dataset
        3.2 Purpose and Method of Reduction
            3.2.1 Rationale for Reduction
            3.2.2 Definition of Modification Relations
            3.2.3 Suffix Preprocessing
            3.2.4 Bag of Genes (BOG)
            3.2.5 Modifier Reduction
            3.2.6 Modifier Clause Reduction
            3.2.7 Positive List and Negative List
            3.2.8 Pattern Matching
            3.2.9 Overall Workflow
        3.3 How to Find Patterns, the Positive List, and the Negative List
        3.4 Workflow of Ensemble
    Chapter 4 Experiments and Discussion of Results
        4.1 Performance of the BIORIRE Model
        4.2 Error Analysis
        4.3 Comparison with Pre-trained Neural Network Models
        4.4 Comparison of Results Across Models
    Chapter 5 Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    References
    Appendix
        A. Positive List
        B. Negative List
        C. Trigger Verb List

    [1] "處理自然語言的簡化法 [A reduction method for processing natural language]," Institute of Information Science, Academia Sinica, https://iptt.sinica.edu.tw/shares/929, May 2023 (accessed Jul. 17, 2024).

    [2] "自然語言簡化法 (Reduction) [Natural language reduction method]," Taiwan Bioinformatics Institute, http://www.tbi.org.tw/enews/TMBD/Vol38.html, Oct. 2021 (accessed Jul. 17, 2024).

    [3] L. Wong, "PIES, a protein interaction extraction system," in Biocomputing 2001, pp. 520–531, 2000.

    [4] C. Blaschke, M. A. Andrade, C. A. Ouzounis, and A. Valencia, "Automatic extraction of biological information from scientific text: protein-protein interactions," in ISMB, vol. 7, pp. 60–67, Aug. 1999.

    [5] B. J. Stapley and G. Benoit, "Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts," in Biocomputing 2000, pp. 529–540, 1999.

    [6] J. Ding, D. Berleant, D. Nettleton, and E. Wurtele, "Mining MEDLINE: abstracts, sentences, or phrases?" in Biocomputing 2002, pp. 326–337, 2001.

    [7] S. Raychaudhuri, J. T. Chang, P. D. Sutphin, and R. B. Altman, "Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature," Genome Research, vol. 12, no. 1, pp. 203–214, 2002.

    [8] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.

    [9] B. Cui, H. Lin, and Z. Yang, "SVM-based protein-protein interaction extraction from Medline abstracts," in 2007 Second International Conference on Bio-Inspired Computing: Theories and Applications, Sept. 2007, pp. 182–185.

    [10] S. Kim, H. Liu, L. Yeganova, and W. J. Wilbur, "Extracting drug–drug interactions from literature using a rich feature-based linear kernel approach," Journal of Biomedical Informatics, vol. 55, pp. 23–30, 2015.

    [11] L. Li, P. Zhang, T. Zheng, H. Zhang, Z. Jiang, and D. Huang, "Integrating semantic information into multiple kernels for protein-protein interaction extraction from biomedical literatures," PloS One, vol. 9, no. 3, p. e91898, 2014.

    [12] T. Polajnar, T. Damoulas, and M. Girolami, "Protein interaction sentence detection using multiple semantic kernels," Journal of Biomedical Semantics, vol. 2, pp. 1–18, 2011.

    [13] Z. Zhao, Z. Yang, H. Lin, J. Wang, and S. Gao, "A protein-protein interaction extraction approach based on deep neural network," International Journal of Data Mining and Bioinformatics, vol. 15, no. 2, pp. 145–164, 2016.

    [14] Y. Peng and Z. Lu, "Deep learning for extracting protein-protein interactions from biomedical literature," in BioNLP 2017, p. 29, 2017.

    [15] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.

    [16] S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, "Don't stop pretraining: Adapt language models to domains and tasks," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360, 2020.

    [17] R. Evans and C. Orăsan, "Identifying signs of syntactic complexity for rule-based sentence simplification," Natural Language Engineering, vol. 25, no. 1, pp. 69–119, 2019.

    [18] O. Cetinoglu, S. Zarrieß, J. Kuhn, M. Butt, and T. H. King, "Dependency-based sentence simplification for increasing deep LFG parsing coverage," in Proceedings of the LFG13 Conference, July 2013, pp. 191–211.

    [19] K. Omelianchuk, V. Raheja, and O. Skurzhanskyi, "Text simplification by tagging," in Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, Apr. 2021, pp. 11–25.

    [20] D. Kumar, L. Mou, L. Golab, and O. Vechtomova, "Iterative edit-based unsupervised sentence simplification," arXiv preprint arXiv:2006.09639, 2020.

    [21] M. S. Huang, J. C. Han, P. Y. Lin, Y. T. You, R. T. H. Tsai, and W. L. Hsu, "Surveying biomedical relation extraction: A critical examination of current datasets and the proposal of a new resource," Briefings in Bioinformatics, vol. 25, no. 3, p. bbae132, 2024.

    [22] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, "Stanza: A Python natural language processing toolkit for many human languages," in Proceedings of the Association for Computational Linguistics (ACL) System Demonstrations, 2020.

    [23] H. Alachram, H. Chereda, T. Beißbarth, E. Wingender, and P. Stegmaier, "Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks," PloS One, vol. 16, no. 10, p. e0258623, 2021.

    [24] J. Bastings, I. Titov, W. Aziz, D. Marcheggiani, and K. Sima’an, "Graph convolutional encoders for syntax-aware neural machine translation," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1957–1967, 2017.

    [25] C. Yang, J. Deng, X. Chen, and Y. An, "SPBERE: Boosting span-based pipeline biomedical entity and relation extraction via entity information," Journal of Biomedical Informatics, vol. 145, p. 104456, 2023.

    [26] S. S. Roy and R. E. Mercer, "Extracting drug-drug and protein-protein interactions from text using a continuous update of tree-transformers," in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, July 2023, pp. 280–291.
