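Graduate student: | 謝淑茹 Shu-Ju Hsieh |
---|---|
Thesis title: | 應用物種演化特性以預測基因位置 Coding Exon Prediction Based on Phylogenetical Comparisons |
Advisor: | 唐傳義 Chuan-Yi Tang |
Oral defense committee: | |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2006 |
Academic year of graduation: | 94 |
Language: | English |
Number of pages: | 80 |
Chinese keywords: | sequence alignment (序列比對), comparative genomics (比較基因體) |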
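Identifying gene locations is a highly challenging problem in computational biology. In recent years, the sequencing of many species has rapidly produced large amounts of sequence data, making it possible to identify gene locations through genome comparison.
Sequence analysis is the foundation of bioinformatics, and sequence alignment is its most important tool. Because gene structures and sequences are highly conserved in evolution among closely related species, gene locations can be predicted through sequence alignment. Existing alignment tools provide optimal or near-optimal solutions, but they are limited by their computation speed. In this thesis we propose a fast alignment method and combine it with signal detection to develop gene prediction tools, with the aim of predicting gene locations quickly and accurately.
The method can be applied to the alignment of two homologous sequences that both lack gene annotation. In the experiments we verified its performance on human and mouse DNA sequences and then explored potential unknown exons in a database of human-mouse homologous genes. The predicted exons were organized into a queryable database according to their sequence similarity and sequence length, and their KA/KS ratios were evaluated.
As more species are sequenced, the number of experimentally verified gene annotations grows steadily, so known gene annotations can be used to identify gene locations in newly sequenced species. The method can therefore also be applied to the alignment of a pair of homologous sequences in which one has known gene annotation and the other is unannotated. In addition, we developed an alignment method for short exons, which further improves the accuracy of gene structure prediction. In our experiments, prediction accuracy reaches 81% at the gene level and as high as 96% at the exon level, with fewer than 1% of exons wrongly predicted.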
Identifying protein-coding genes is a challenging task in computational biology. With the rapid accumulation of genomic sequences for various organisms, it is now feasible to identify novel genes and exons by genomic comparison. Sequence analysis is the foundation of bioinformatics, and sequence alignment is one of its fundamental tasks. Comparing coding regions among organisms can pinpoint functionally important parts of proteins, which are more conserved than parts bearing no functional significance. However, most currently available alignment programs, though they provide optimal or near-optimal results, are limited by their computation speed. In this thesis, we propose a new method for coding region alignment. CORAL (COding Region ALignment) is a linear-time alignment tool based on a probabilistic filtration approach. Integrating CORAL with signal detectors, we developed two programs for coding exon prediction, EXONALIGN and GeneAlign.
EXONALIGN simultaneously aligns and predicts exons between homologous genes or syntenic regions. To reduce computation time and improve prediction accuracy, EXONALIGN calculates the strengths of intrinsic splice signals and applies CORAL to measure sequence homology between regions flanked by pairs of candidate splice acceptor and donor sites in homologous genes. The performance of EXONALIGN was evaluated on the ROSETTA and the Projector data sets. The predictions obtained by EXONALIGN are comparable with those obtained by widely used gene prediction programs, confirming the benefit of incorporating the conservation of exon-intron structures into an exon/gene prediction tool. Finally, EXONALIGN was employed to explore novel human exons within annotated human-mouse homologous genes. More than one hundred novel human and mouse exon pairs were predicted within annotated genes. These putative human exons are longer than 100 bp and show greater than 70% sequence conservation with the corresponding mouse exons. The KA/KS ratios of 75% of the predicted exons are smaller than 1, further supporting the likelihood that the majority of the newly predicted human exons code for proteins.
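The idea of comparing regions flanked by candidate acceptor and donor sites can be illustrated with a small sketch. This is not EXONALIGN or CORAL: the function names are hypothetical, splice sites are detected only by the canonical AG and GT dinucleotides (no signal strength is computed), and a simple ungapped identity stands in for the CORAL homology measure. The exhaustive pairing is also quadratic rather than linear-time. The default thresholds borrow the 100 bp and 70% figures quoted above.

```python
from typing import Iterator, List, Tuple

Region = Tuple[int, int]  # half-open [start, end) coordinates


def candidate_exons(seq: str, min_len: int = 100, max_len: int = 600) -> Iterator[Region]:
    """Yield regions that start right after an AG and end right before a GT."""
    seq = seq.upper()
    # Assumption: only canonical AG acceptors and GT donors are considered.
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    for start in starts:
        for end in ends:
            if min_len <= end - start <= max_len:
                yield (start, end)


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the longer sequence.

    A deliberately crude stand-in for a real alignment score.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / max(len(a), len(b))


def conserved_pairs(
    human: str, mouse: str, min_len: int = 100, min_identity: float = 0.70
) -> List[Tuple[Region, Region, float]]:
    """Pair candidate regions from both sequences whose identity exceeds the threshold."""
    pairs = []
    for hs, he in candidate_exons(human, min_len):
        for ms, me in candidate_exons(mouse, min_len):
            identity = ungapped_identity(human[hs:he], mouse[ms:me])
            if identity > min_identity:
                pairs.append(((hs, he), (ms, me), identity))
    return pairs
```

Calling `conserved_pairs(human_seq, mouse_seq)` on two homologous genomic strings returns the coordinate pairs and their identity values. A practical tool would rank the candidates by splice-signal strength and use a gapped, frame-aware alignment instead of the positional comparison used here.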
Furthermore, with increasing numbers of gene annotations verified by experiments, it is feasible to identify genes in newly sequenced genomes by comparing them with the annotated genes of phylogenetically close organisms. GeneAlign predicts protein-coding genes by measuring the homology between a sequence from a newly sequenced genome and its homologue in a related genome. GeneAlign was tested on the Projector data set of 491 human-mouse homologous sequence pairs. At the gene level, the average sensitivity and the average specificity of GeneAlign are both 81%; at the exon level, both exceed 96%. The rates of missing exons and wrong exons are below 1%.
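The exon-level figures above can be made concrete with a short sketch of how such measures are commonly computed in gene-prediction evaluation. It assumes the usual conventions, which the abstract does not spell out: a predicted exon counts as correct only when both boundaries match an annotated exon exactly, "specificity" means the fraction of predicted exons that are correct, a missing exon is an annotated exon with no overlapping prediction, and a wrong exon is a predicted exon with no overlapping annotation. The function name and the half-open coordinate convention are illustrative choices.

```python
from typing import Dict, Iterable, Set, Tuple

Exon = Tuple[int, int]  # half-open [start, end) coordinates


def _overlaps_any(exon: Exon, others: Iterable[Exon]) -> bool:
    """True if the exon shares at least one base with any exon in others."""
    return any(exon[0] < o[1] and o[0] < exon[1] for o in others)


def exon_level_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Dict[str, float]:
    """Compute exon-level sensitivity, specificity, missing and wrong exon rates."""
    if not annotated or not predicted:
        raise ValueError("both exon sets must be non-empty")
    correct = annotated & predicted  # exact match on both boundaries
    missing = [e for e in annotated if not _overlaps_any(e, predicted)]
    wrong = [e for e in predicted if not _overlaps_any(e, annotated)]
    return {
        "sensitivity": len(correct) / len(annotated),
        "specificity": len(correct) / len(predicted),
        "missing_exons": len(missing) / len(annotated),
        "wrong_exons": len(wrong) / len(predicted),
    }
```

With four annotated exons, of which two are predicted exactly, one is predicted with a shifted boundary, and one is not predicted at all, sensitivity is 0.5 and the missing-exon rate is 0.25. The shifted exon lowers sensitivity but is not counted as missing, which is why the two rates do not sum to one.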