Graduate student: | 蘇宇辰 Su, Yu-Chen |
---|---|
Thesis title: | 以生物文獻探勘技術辨識蛋白質互動片段之研究 A Study of Biomedical Text Mining for Protein-protein Interaction Passage Extraction |
Advisor: | 許聞廉 Hsu, Wen-Lian |
Oral defense committee: | 張詠淳 Chang, Yung-Chun; 戴鴻傑 Dai, Hong-Jie |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2017 |
Academic year of graduation: | 105 |
Language: | English |
Number of pages: | 52 |
Keywords (Chinese): | 文字探勘、蛋白質交互作用、交互作用模式、卷積樹核 |
Keywords (English): | Text Mining, Protein-Protein Interaction, Interaction Pattern Generation, Convolution Tree Kernel |
The volume of biomedical literature has grown rapidly in recent years, making automated relation extraction methods increasingly necessary. Among the types of entity relations, protein-protein interactions (PPIs) offer insight into the structural and functional organization of cells and can shed light on the molecular mechanisms of biological processes. Identifying interactions between proteins mentioned in biomedical literature is therefore a frequently studied text mining problem in the life sciences. In this study we propose PIPE, an interaction pattern generation module, previously used in the BioCreative 2015 competition, that captures frequent PPI patterns in text. We also present an interaction pattern tree kernel method that integrates these PPI patterns with a convolution tree kernel to extract protein-protein interactions. The interaction pattern tree is constructed through three operations (branching, pruning, and ornamenting) and incorporates syntactic, content, and semantic information from the text. The methods were evaluated on the LLL, IEPA, HPRD50, AIMed, and BioInfer corpora using cross-validation, cross-learning, and cross-corpus evaluation. The empirical results show that our method is effective and outperforms several well-known PPI extraction methods. We also discuss several features that proved effective and suggest directions that may be useful for future research.
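As a rough illustration of the convolution tree kernel idea mentioned in the abstract, the sketch below counts the tree fragments shared by two toy parse trees, in the style of the Collins and Duffy subset-tree kernel. It is not the interaction pattern tree kernel of this study: the nested-tuple tree encoding, the function names, the `PROT1`/`PROT2` placeholders, and the `decay` parameter are all assumptions made for illustration, and the branching, pruning, and ornamenting operations are not modeled.

```python
from itertools import product

# Assumed toy encoding: a tree node is a tuple (label, child, child, ...);
# a leaf (word) is a plain string. None of this comes from the thesis.

def label(node):
    """Return the label of a node, or the word itself for a leaf."""
    return node[0] if isinstance(node, tuple) else node

def production(node):
    """Grammar production at a node: its label plus the labels of its children."""
    return (node[0], tuple(label(child) for child in node[1:]))

def internal_nodes(tree):
    """Yield every non-leaf node of the tree, root first."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)

def common_fragments(n1, n2, decay):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        # Leaf children already matched via the production; only recurse
        # into children that are internal nodes in both trees.
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + common_fragments(c1, c2, decay)
    return score

def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared-fragment counts over all pairs of internal nodes."""
    return sum(
        common_fragments(a, b, decay)
        for a, b in product(internal_nodes(t1), internal_nodes(t2))
    )

# Hypothetical sentences with protein mentions replaced by placeholders.
tree_a = ("S", ("NP", "PROT1"), ("VP", ("V", "binds"), ("NP", "PROT2")))
tree_b = ("S", ("NP", "PROT1"), ("VP", ("V", "inhibits"), ("NP", "PROT2")))

if __name__ == "__main__":
    print(tree_kernel(tree_a, tree_b))
    print(tree_kernel(tree_a, tree_a))
```

A kernel value of this kind could, for instance, be supplied to a support vector machine as a precomputed similarity between sentence pairs; how the study combines it with the PPI patterns is described only at the level of the abstract above.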