Author: Huang, Chuan-Ching (黃筌敬)
Title: Applying Structural Domain Information for Enzyme Reaction Annotation and Protein-Protein Interaction Inference
Title (Chinese): 應用蛋白質片段資訊進行酵素功能標註及推論蛋白質間交互作用關係
Advisor: Tang, Chuan Yi (唐傳義)
Committee members: 葉信宏, 唐傳義, 盧錦隆, 林俊淵, 韓永楷
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2013
Academic year of graduation: 102 (ROC calendar)
Language: English
Number of pages: 84
Keywords: protein-protein interaction, domain-domain interaction, data mining, association rule, domain architecture, k-fold cross validation
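  • Proteins perform a wide variety of functions in living organisms, including catalysis, signal transduction, and nutrient transport, and domains are the basic building blocks of proteins. Most proteins consist of two or more structural domains, and a protein interacts with other proteins when its domains recognize and bind domains on those proteins. This dissertation uses the properties of protein domains to address two topics: predicting enzyme function from the domain architecture of an enzyme, and inferring protein-protein interactions from domain-domain interactions.
    Since the invention of high-throughput gene sequencing, the gap between the number of newly sequenced protein sequences and the number of sequences with known function has grown steadily. This makes manual annotation of protein function arduous, so automated protein function prediction has become the trend. For defining protein function, the International Union of Biochemistry and Molecular Biology has established a comprehensive classification standard for enzymes based on four numbers. The first three numbers describe the overall category of the enzymatic reaction, and the last number is assigned according to the properties of the substrate. This dissertation therefore divides the collected proteins into two sets for analysis: one using only the first three numbers of the enzyme code and one using all four. Because some proteins have more than one function, each set is further divided into two groups: proteins with a single catalytic reaction and proteins with multiple catalytic reactions. In the single-reaction group, the well-known association rule algorithm reaches prediction accuracies of 96% and 91% on the three-number and four-number sets, respectively, and the enzyme function prediction method proposed in this dissertation reaches slightly higher accuracies of 99% and 92%. Predicting multiple catalytic reactions is much harder than predicting a single one. On the multiple-reaction group, the association rule method achieves 17% on the three-number set and 8% on the four-number set, whereas the proposed method reaches about 49% and 42%.
    Reactions in living organisms can be driven by protein-protein interactions in which one protein recognizes and binds a structural element on another protein, so it is feasible to study protein function at the domain level. Noroviruses, which circulate worldwide in winter, cause severe diarrhea and food poisoning. Because norovirus genome sequences are highly variable, no effective vaccine has yet been developed. People with weaker immunity often become ill enough after infection to require hospitalization, and the infection may even be fatal. This dissertation attempts to build, at the domain level, the protein interaction network at work during norovirus infection, with a view to further clinical application and drug design.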


    Domains are the fundamental building blocks of proteins, which perform a variety of functions within living organisms, including catalysis, signal transduction, and nutrient transport. Most proteins are composed of two or more domains, and a protein interacts with other proteins when its domains recognize and bind structural units in those proteins. This dissertation uses the properties of protein domains to investigate two main topics: enzyme reaction prediction based on the domain architecture of an enzyme, and inference of protein-protein interactions (PPIs) from domain-domain interactions (DDIs).
    The gap between newly sequenced proteins and proteins with characterized functions has widened with the advent of high-throughput genome sequencing in the post-genomics era. Identifying protein functions through manual curation of sequence annotations is a challenging task; automated protein function prediction techniques are therefore necessary. The enzyme nomenclature of the International Union of Biochemistry and Molecular Biology provides a well-defined four-field Enzyme Commission (EC) number for enzyme classification. The first three fields describe the overall type of the enzymatic reaction, and the last field denotes the substrate specificity of the reaction. Proteins were grouped into two data sets: the 3-numerical-block set, labeled with the first three EC fields, and the 4-numerical-block set, labeled with all four. Depending on whether a protein performs more than one enzymatic reaction, each data set was further divided into single-EC cases and multiple-EC cases. For the single-EC cases, the well-known association rule method correctly classified 96% and 91% of entries in the 3-numerical-block set and the 4-numerical-block set, respectively. The proposed enzyme reaction prediction (ERP) method showed marginally higher accuracy, at 99% and 92%, respectively. Predicting multiple enzymatic activities for a single protein is more difficult. For the multiple-EC cases, the fractions of entries correctly predicted for the 3-numerical-block set and the 4-numerical-block set were 17% and 8%, respectively, for the association rule method, and 49% and 42%, respectively, for the ERP method.
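    The data-set split described above can be illustrated with a short Python sketch. It is an illustration under assumptions, not the dissertation's code: the input format (a protein identifier mapped to a list of EC numbers), the function names truncate_ec and group_proteins, and the sample proteins P1 to P3 are all hypothetical. The abstract does not say whether a protein is counted as multiple-EC before or after truncation to three fields; this sketch counts after truncation.

```python
def truncate_ec(ec: str, blocks: int) -> str:
    """Keep only the first `blocks` dot-separated fields of an EC number."""
    return ".".join(ec.split(".")[:blocks])


def group_proteins(annotations, blocks):
    """Split proteins into single-EC and multiple-EC cases at a given EC depth.

    annotations: dict mapping a protein ID to a list of four-field EC numbers.
    blocks: 3 for the 3-numerical-block set, 4 for the 4-numerical-block set.
    """
    groups = {"single": {}, "multiple": {}}
    for protein, ecs in annotations.items():
        labels = {truncate_ec(ec, blocks) for ec in ecs}
        if not labels:
            continue  # proteins without any EC annotation are skipped
        key = "single" if len(labels) == 1 else "multiple"
        groups[key][protein] = labels
    return groups


# Hypothetical annotations (the protein IDs are placeholders).
annotations = {
    "P1": ["1.1.1.1"],
    "P2": ["2.7.1.1", "2.7.1.2"],
    "P3": ["3.4.21.4", "1.1.1.1"],
}

three_block = group_proteins(annotations, 3)
four_block = group_proteins(annotations, 4)
```

    In this toy input, P2 carries two EC numbers that share their first three fields, so it is a single-EC case in the 3-numerical-block set and a multiple-EC case in the 4-numerical-block set.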
    Biological processes can be carried out when one protein recognizes and binds certain structural elements in another protein through PPIs. It is therefore possible to explore protein functions from protein interactions at the domain level. Noroviruses cause severe gastroenteritis and foodborne illness worldwide during the winter. There is no effective vaccine against noroviruses because of their highly variable genome sequences. Vulnerable patients infected with noroviruses often require hospitalization and may die. We attempted to build, at the domain level, the protein interaction network involved in norovirus infection, with a view to future clinical applications and drug design.
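    One common way to infer candidate PPIs from DDIs is to propose an interaction between two proteins whenever a domain of one and a domain of the other form a known interacting pair. The Python sketch below shows only that general idea and is not the dissertation's procedure. The function name infer_ppis, the protein names, and the domain identifiers are made-up placeholders; in practice the DDI pairs would come from a curated resource.

```python
from itertools import combinations


def infer_ppis(protein_domains, ddis):
    """Propose protein pairs that share at least one known domain-domain pair.

    protein_domains: dict mapping a protein name to a set of domain IDs.
    ddis: iterable of (domain_a, domain_b) pairs, treated as unordered.
    Returns a dict mapping (protein_p, protein_q) to its supporting DDI pairs.
    """
    known = {frozenset(pair) for pair in ddis}
    edges = {}
    # Sorting the names makes the order of each reported pair deterministic.
    for p, q in combinations(sorted(protein_domains), 2):
        support = {
            (a, b)
            for a in protein_domains[p]
            for b in protein_domains[q]
            if frozenset((a, b)) in known
        }
        if support:
            edges[(p, q)] = support
    return edges


# Hypothetical input: one viral protein and two host proteins.
protein_domains = {
    "viral_protein": {"DOM_A", "DOM_B"},
    "host_X": {"DOM_C"},
    "host_Y": {"DOM_D"},
}
ddis = [("DOM_A", "DOM_C")]

edges = infer_ppis(protein_domains, ddis)
```

    Keeping the supporting domain pairs for every edge makes it possible to trace each predicted interaction back to the DDI evidence behind it when the network is inspected or visualized.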

    Abstract
    Chinese Abstract
    Acknowledgements
    Chapter 1 Introduction
    Chapter 2 Materials
      2.1 The protein repository
      2.2 The domain database
      2.3 The protein function nomenclature
      2.4 The integrated data set
    Chapter 3 Methods
      3.1 Data analysis and method review
        3.1.1 Predictive modeling
        3.1.2 Association analysis
        3.1.3 Cluster analysis
        3.1.4 Anomaly detection
      3.2 The AR method
        3.2.1 Frequent itemset generation
        3.2.2 Rule generation
      3.3 The ERP method
        3.3.1 Domain architecture enumeration
        3.3.2 Enzyme reaction ranking
      3.4 Five-fold cross-validation
      3.5 Inferring protein-protein interaction network from domain-domain interactions
    Chapter 4 Results
    Chapter 5 Discussion
      5.1 Enzyme reaction annotation
      5.2 Inferring PPIs from DDIs
    Chapter 6 Conclusion
    Bibliography

