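Graduate student: | 鐘文鈺 Wen-Yu Chung |
---|---|
Thesis title: | 利用相似性比較法找出表現序列 Protein-Coding Exon Identification by Comparative Genomic Approach |
Advisor: | 唐傳義 Chuan-Yi Tang |
Oral defense committee: | |
Degree: | 碩士 Master |
Department: | 電機資訊學院 - 資訊工程學系 (College of Electrical Engineering and Computer Science - Department of Computer Science) |
Year of publication: | 2002 |
Academic year of graduation: | 90 |
Language: | English |
Number of pages: | 19 |
Chinese keywords: | 序列比對 (sequence alignment)、預測基因 (gene prediction)、演化 (evolution) |
Foreign-language keywords: | gene identification, Ka/Ks ratio, exon/intron boundary, sequence comparison |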
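With so much sequence data now publicly available online, everyone hopes to be the first to extract useful information from these databases, including precise gene locations, gene structures, and the structure, function, and interactions of the resulting proteins. The main goal of this thesis is to provide an effective method for finding likely gene locations from two homologous DNA sequences. We integrate widely used existing software with programs we wrote ourselves, and load the data into an SQL database that we built. The procedure has three main steps. First, the homologous DNA sequences are aligned, for which we use the MegaBlast program. Second, we use a feature observed in molecular evolution, namely that sequences that code for protein (protein-coding exons) accumulate fewer changes than sequences that do not (introns), to estimate how likely a sequence is to be part of a gene. Third, we detect splice site signals to separate exons from introns and find the correct exons. Most existing software uses statistical models: the model is first trained on multiple sequences to find its best parameters and is then applied to the input sequence. Other programs work by comparison, searching the unknown input sequence against databases of known proteins to find the most likely combination. Both approaches extract information from a single input sequence. However, genome databases for many animals and plants are now complete or in progress, so analysis is no longer limited to a single species. From an evolutionary standpoint, important genetic information is preserved, so considering multiple sequences together can yield more complete gene data. The advantage of our method is that it is modular: the program used in each of the three steps can be changed or replaced by another. In the future it will also be easy to add more tests and information, or to compare three or more sequences. The method performs well in preliminary results; further testing of its accuracy, and its extension into a complete sequence analysis system, remain future work.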
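The third step, separating exons from introns by splice site signals, can be illustrated with a minimal sketch. It is written in Python for readability and is not the thesis's own code. It assumes the canonical GT-AG rule (an intron starts with GT and ends with AG) as the only signal, which is a simplification; the abstract does not say which signal model the thesis uses. The function name `candidate_exon_boundaries`, the `window` parameter, and the toy sequence are hypothetical.

```python
def candidate_exon_boundaries(seq, start, end, window=6):
    """List exon candidates near a conserved block [start, end).

    Coordinates are 0-based and end-exclusive. A candidate (s, e) is kept
    when the two bases just before s are "AG" (end of the upstream intron)
    and the two bases starting at e are "GT" (start of the downstream intron).
    """
    seq = seq.upper()
    # Possible exon starts: positions immediately preceded by an AG acceptor.
    starts = [s for s in range(max(2, start - window),
                               min(len(seq), start + window) + 1)
              if seq[s - 2:s] == "AG"]
    # Possible exon ends: positions immediately followed by a GT donor.
    ends = [e for e in range(max(0, end - window),
                             min(len(seq) - 2, end + window) + 1)
            if seq[e:e + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if s < e]


if __name__ == "__main__":
    # Toy sequence: lower case marks the flanking intron parts.
    toy = "ccttttcag" + "ATGGCTAAAGGTCTG" + "gtaagtcc"
    # Suppose the alignment step reported a conserved block at [11, 22).
    for s, e in candidate_exon_boundaries(toy, 11, 22, window=3):
        print(s, e, toy[s:e])
```

On real sequences GT and AG dinucleotides also occur by chance inside exons, so a search like this usually returns several candidates. Choosing among them needs additional evidence, such as the conservation test from the second step.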
One fascinating problem in bioinformatics research is determining gene structure from genomic sequences. Several methods have been published and proven useful, but they all consider information from only one sequence. The newest development is cross-species sequence comparison, which takes two or more sequences into consideration. We assume that important functional elements tend to be more strongly conserved than intergenic sequences under evolutionary pressure. Hence, we introduce a method for detecting protein-coding regions that combines useful existing software and automates it with Perl scripts. This strategy has three key parts: sequence alignment, the Ka/Ks ratio test, and boundary determination. It is simple to implement, powerful, and easy to extend in the future. The performance test uses a dataset of selected orthologous genes. The results show good performance: the method identifies most exon boundaries correctly. The method will be further developed into an automated data analysis system. An initial web page was constructed at http://nekrut.uchicago.edu/eev/ as the evolutionary exon validation tool.
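The idea behind the Ka/Ks ratio test can be sketched as follows. A protein-coding region under purifying selection is expected to show fewer amino-acid-changing (nonsynonymous) substitutions per nonsynonymous site than silent (synonymous) substitutions per synonymous site, giving a ratio below 1. The Python sketch below is a crude proxy under stated assumptions, not the estimator used in the thesis: it takes two gap-free, in-frame aligned sequences, counts sites in the Nei-Gojobori style, skips codons that differ at more than one position, and applies no correction for multiple substitutions. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon: for each position, the
    fraction of the three possible base changes that keep the amino acid."""
    aa = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    total += 1.0 / 3.0
    return total


def crude_ka_ks(seq1, seq2):
    """Return (pN, pS, pN/pS) for two in-frame aligned sequences.

    The ratio is None when pS is zero. Codons with gaps, ambiguous bases,
    or more than one difference are skipped (a simplification)."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, equal length")
    syn_sites = nonsyn_sites = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        diffs = sum(a != b for a, b in zip(c1, c2))
        if diffs > 1:
            continue
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2.0
        syn_sites += s
        nonsyn_sites += 3.0 - s
        if diffs == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    p_s = syn_diffs / syn_sites if syn_sites else 0.0
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else 0.0
    return p_n, p_s, (p_n / p_s if p_s > 0 else None)


if __name__ == "__main__":
    # One silent change (GCT -> GCC) and one amino acid change (TTT -> TAT).
    print(crude_ka_ks("ATGGCTAAAGGTCTGTTT", "ATGGCCAAAGGTCTGTAT"))
```

A ratio well below 1 for an aligned region supports the hypothesis that it is protein-coding, while a ratio near 1 is what neutrally evolving sequence would be expected to show. Deciding whether an observed ratio is significantly below 1 requires a statistical test, which this sketch omits.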