Graduate student: | 王天慶 Tien-Ching Wang |
---|---|
Thesis title: | 利用對數線性關聯圖模型預測剪裁點位置 Prediction of Splice Sites with Log-Linear Dependency Graphical Models |
Advisor: | 呂忠津 Chung-Chin Lu |
Oral defense committee: | |
Degree: | 碩士 Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2006 |
Academic year of graduation: | 94 (ROC calendar) |
Language: | Chinese |
Number of pages: | 32 |
Keywords (Chinese): | 基因結構預測、關聯圖、對數線性模型 |
Keywords (English): | gene structure prediction, dependency graph, log-linear model |
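Deoxyribonucleic acid (DNA) is the principal genetic material of living organisms. DNA sequences store the information needed to form proteins and determine an organism's characteristics and traits. Thanks to advances in molecular biology, the volume of known DNA sequence is growing rapidly. Knowing a DNA sequence, however, does not by itself reveal where the genes are or which protein sequences they produce, so analyzing these sequences has become a major research focus. Gene identification is one of the main problems: its goal is to detect the various signals in a DNA sequence and to locate where gene transcription and translation take place, and splice sites are among the most important of these signals. In eukaryotes, splicing occurs in the post-transcriptional phase and removes the non-coding regions from messenger RNA (mRNA); only the spliced mRNA can be translated correctly into protein. Accurate prediction of splice sites therefore helps greatly in determining the coding regions of genes. A recent approach to detecting splice site signals builds models from dependency graphs and their expanded Bayesian networks; although it achieves very accurate predictions, it still lacks a complete theoretical foundation. In this thesis, we build log-linear dependency graphical models from known splice site sequences and use them to predict splice site locations. We first use the chi-square test to construct a dependency graph over the relative positions. Then, using the iterative proportional fitting procedure together with results from graph theory, we obtain the maximum likelihood estimate under the log-linear model. We apply the method to predict splice sites in known human and fruit fly DNA sequences. Cross-validation results show that, at 90% sensitivity, our model reaches 95% to 97% specificity for donor site prediction and 85% to 93% specificity for acceptor site prediction.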
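The chi-square step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the training data are equal-length aligned windows around known splice sites and that an edge is drawn between two positions whenever the Pearson chi-square statistic of their 4x4 contingency table exceeds a chosen threshold. The function names `chi_square_stat` and `dependency_edges`, and the thresholding rule itself, are illustrative assumptions and not necessarily the procedure used in the thesis.

```python
from collections import Counter
from itertools import combinations


def chi_square_stat(seqs, i, j):
    """Pearson chi-square statistic for independence of positions i and j.

    Returns (statistic, degrees_of_freedom). Only symbols that actually
    occur at each position are counted, so no expected count is zero.
    """
    n = len(seqs)
    joint = Counter((s[i], s[j]) for s in seqs)
    row = Counter(s[i] for s in seqs)
    col = Counter(s[j] for s in seqs)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


def dependency_edges(seqs, threshold):
    """Hypothetical edge rule: connect positions whose statistic exceeds threshold."""
    length = len(seqs[0])
    return [
        (i, j)
        for i, j in combinations(range(length), 2)
        if chi_square_stat(seqs, i, j)[0] > threshold
    ]


if __name__ == "__main__":
    toy = ["AAG", "AAT", "CCG", "CCT"]  # made-up windows, not real splice sites
    print(dependency_edges(toy, threshold=3.841))
```

The threshold should be taken from the chi-square distribution with the returned degrees of freedom; 3.841 is the 5% critical value for one degree of freedom, which fits the two-symbol toy data above.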
The amount of available genomic DNA sequence data is growing at an enormous rate, and the analysis of these sequences has become an active research topic. One major problem is the prediction of splice site locations, which is closely related to gene identification. Splicing is an important process that occurs in the post-transcriptional phase and is required to remove introns. In this thesis, we employ a log-linear graphical model to predict splice sites. With the help of the iterative proportional scaling algorithm, we are able to find the maximum likelihood estimate under the log-linear model. We then apply our method to predict splice sites in DNA coding sequences from two species, human and fly. Results obtained through 5-fold cross-validation show that, at a 10% false negative rate, our model reaches about 95% to 97% specificity for donor site prediction and 85% to 93% for acceptor site prediction.
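To make the estimation step concrete, here is a minimal sketch of iterative proportional scaling (also called iterative proportional fitting) for a log-linear graphical model. It assumes the cliques of the dependency graph are given as tuples of position indices and that the window is short enough to enumerate every sequence explicitly. The function names `marginal` and `ipf` and the toy data are illustrative assumptions, not the thesis implementation.

```python
from collections import defaultdict
from itertools import product


def marginal(p, clique):
    """Marginal of the joint table p over the positions listed in clique."""
    m = defaultdict(float)
    for x, px in p.items():
        m[tuple(x[k] for k in clique)] += px
    return m


def ipf(seqs, cliques, alphabet="ACGT", sweeps=50):
    """Fit a joint distribution whose clique marginals match the data.

    Starts from the uniform distribution and rescales it once per clique
    per sweep. The full state space is enumerated, so this is only
    practical for short windows.
    """
    n = len(seqs)
    length = len(seqs[0])
    p = {x: 1.0 / len(alphabet) ** length
         for x in product(alphabet, repeat=length)}
    # Empirical clique marginals are the targets of the fit.
    targets = []
    for c in cliques:
        counts = defaultdict(float)
        for s in seqs:
            counts[tuple(s[k] for k in c)] += 1.0 / n
        targets.append(counts)
    for _ in range(sweeps):
        for c, target in zip(cliques, targets):
            current = marginal(p, c)
            for x in p:
                key = tuple(x[k] for k in c)
                if current[key] > 0:
                    p[x] *= target.get(key, 0.0) / current[key]
    return p


if __name__ == "__main__":
    toy = ["AAG", "AAT", "CCG", "CCT", "ACG"]  # made-up windows
    fitted = ipf(toy, cliques=[(0, 1), (1, 2)])
    print(sum(fitted.values()))
```

A candidate window could then be scored with the fitted table, for example by a log-likelihood ratio between a model trained on true sites and one trained on decoys; that scoring rule is likewise an assumption here and is not stated in the abstract.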