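| Graduate student: | 邱世浩 Shih-Hau Chiu |
|---|---|
| Thesis title: | 由蛋白質的功能區域組成預測酵素類別並探討其類源意義之研究 A study of enzyme class prediction - from functional domain composition and phylogenetic relationships of protein sequence |
| Advisor: | 林志侯 Dr. Thy-Hou Lin |
| Oral examination committee: | |
| Degree: | Doctor |
| Department: | College of Life Sciences and Medicine - Institute of Molecular Medicine |
| Year of publication: | 2007 |
| Academic year of graduation: | 96 |
| Language: | English |
| Number of pages: | 85 |
| Keywords (translated from Chinese): | association algorithm, support vector machine algorithm, functional domain composition, physicochemical properties, phylogenetic classifier, phylogenetic groups |
| Keywords (English): | association algorithm, Apriori, support vector machines, functional domain composition, enzyme class, InterPro entries, physicochemical features, phylogenetics |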
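Predicting the physiological function of a protein directly from sequence data remains a major challenge in computational molecular biology. Traditionally, predicting function from sequence similarity has been the approach most widely adopted by researchers. However, when prediction relies on alignment similarity alone, its accuracy often hinges on the difficult question of how to choose the similarity threshold; in particular, when the similarity is low or no known protein can be matched, accurate annotation becomes impossible. Developing non-alignment-based methods that compensate for the shortcomings of sequence alignment has therefore been a focus of research in recent years. Among protein function prediction tasks, the annotation that best captures the function of enzyme proteins is the prediction of enzyme class. Once the enzyme classes of the enzymes in a genome are known, the metabolic pathways of that organism may be understood. Existing prediction systems, however, predict enzyme class accurately only to the first three levels, or even only to the first two; four-level prediction systems are still incomplete. In the first part of this thesis, we attempt to mine known protein databases for associations between enzyme class and the functional domain composition of proteins, with the aim of building prediction rules from those associations. The method we use is an association algorithm. After mining the association rules, we use known data as a test set to evaluate their accuracy. The results show that the prediction accuracy of the system reaches up to 88%. The second part of this thesis uses a support vector machine algorithm and the physicochemical properties of the amino acids in protein sequences to construct a simple phylogenetic classifier. This part complements the association rules that were built per phylogenetic group in the first part; we hope that the classifier will make the mined association rules more widely usable. The results of the second part are consistent with our original hypothesis, namely that the physicochemical properties of the amino acids in a protein sequence are indeed related to the phylogeny of the species. The results confirm that a simple phylogenetic classifier can be built from amino acid physicochemical properties alone.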
Identifying the function of a protein sequence is still a challenging task in computational and informational biology. Traditionally, functional prediction has relied mostly on detecting similarity between a functionally annotated protein and the query protein, then transferring the annotations across. However, sequence composition bias influences the results of similarity searches, and such searches do not reveal the exact relationship between biological function and domain composition at a given similarity threshold. Moreover, for enzyme class prediction, previous studies reach only the top level or sublevel of the EC classification system, and no research findings are yet available on predicting exact (four-digit) EC numbers. In this work, we attempt to construct a workflow that automatically maps protein sequences to their corresponding enzyme classes based on the functional domain composition of the proteins. The association algorithm Apriori is used to mine the relationship between enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a rate of about 80% for the eukaryote datasets. Furthermore, we found that the rules differed among the five taxonomic datasets studied. Consequently, to use these rules, one has to know the phylogeny of the protein sequences beforehand. Here, we provide a straightforward method to predict the phylogeny of protein sequences with a support vector machine (SVM) classifier based on the biochemical features of the amino acid sequences of the genomes. The classification accuracies of the SVM classifiers trained on the Enzymatic and All protein datasets are 84% and 79%, respectively. The results show that some compositional or biochemical features of the amino acid sequences of the genomes can be used to cluster proteins of different taxonomic origins.
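The rule-mining step described above can be illustrated with a minimal sketch of level-wise Apriori followed by class-rule extraction. This is not the thesis implementation: the function names, the thresholds, and the toy data (the placeholder domain IDs `IPR_A`, `IPR_B`, `IPR_C` and the placeholder class labels `EC_1`, `EC_2`) are assumptions made purely for illustration, and no real InterPro entries or EC numbers are implied.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: return {itemset: count} for all frequent itemsets."""
    min_count = min_support * len(transactions)
    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join: unite pairs of frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def class_rules(domain_sets, labels, min_support, min_confidence):
    """Return rules (domains, label, rule_support, confidence).

    The antecedent (domain set) must be frequent; the rule is kept when the
    fraction of matching proteins carrying the label reaches min_confidence.
    """
    domain_sets = [frozenset(d) for d in domain_sets]
    n = len(domain_sets)
    rules = []
    for itemset, count in frequent_itemsets(domain_sets, min_support).items():
        hits_by_label = {}
        for domains, label in zip(domain_sets, labels):
            if itemset <= domains:
                hits_by_label[label] = hits_by_label.get(label, 0) + 1
        for label, hits in hits_by_label.items():
            confidence = hits / count
            if confidence >= min_confidence:
                rules.append((itemset, label, hits / n, confidence))
    return rules


# Hypothetical toy data: each protein is a set of placeholder domain IDs
# paired with a placeholder enzyme class label.
proteins = [
    ({"IPR_A", "IPR_B"}, "EC_1"),
    ({"IPR_A", "IPR_B"}, "EC_1"),
    ({"IPR_A", "IPR_C"}, "EC_2"),
    ({"IPR_A", "IPR_B", "IPR_C"}, "EC_1"),
    ({"IPR_C"}, "EC_2"),
]
rules = class_rules([d for d, _ in proteins], [y for _, y in proteins],
                    min_support=0.4, min_confidence=0.9)
for domains, label, support, confidence in rules:
    print(sorted(domains), "->", label, support, confidence)
```

In this sketch the support threshold is applied to the domain set alone, and the reported rule support is the fraction of all proteins that match both the domain set and the label; other conventions are possible.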
The sequence compositions of the proteins analyzed reflect characteristics specific to their taxonomic clades. We demonstrate that the phylogenetic class of a protein sequence can be predicted from amino acid physicochemical properties alone.
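As a rough illustration of what sequence-derived physicochemical features can look like, the sketch below turns a protein sequence into a fixed-length numeric vector of the kind that could be passed to an SVM library. It is not the feature set used in the thesis: the function name `composition_features`, the three-way hydrophobicity grouping, and the example sequence are all assumptions chosen for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed three-way hydrophobicity grouping, used here only as an example
# of a physicochemical property; other groupings or indices could be used.
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def composition_features(sequence):
    """Return 20 amino acid fractions followed by 3 property-group fractions."""
    # Ignore characters that are not one of the 20 standard residues.
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    aa_fractions = [residues.count(aa) / n for aa in AMINO_ACIDS]
    group_fractions = [
        sum(residues.count(aa) for aa in members) / n
        for members in HYDROPHOBICITY_GROUPS.values()
    ]
    return aa_fractions + group_fractions


# Hypothetical short sequence, not taken from any dataset.
features = composition_features("ACDKLG")
print(len(features))
```

Because every sequence maps to the same 23 values regardless of its length, vectors like this can be compared across proteins without any alignment.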