Author: Chen, Te-Ming (陳德銘)
Thesis title: Modeling Gene Structure with Stochastic Context-Free Grammars (利用隨機上下文無關文法建立基因結構模型)
Advisor: Lu, Chung-Chin (呂忠津)
Committee members: 呂忠津, 陳博現, 張翔, 蘇賜麟, 蘇育德, 林茂昭
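Degree: Doctoral (博士)
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: English
Number of pages: 107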
Keywords: gene structure prediction (基因結構預測), stochastic context-free grammar (隨機上下文無關文法), dependency graph (關聯圖), expanded Bayesian network (展開貝氏網路), trellis-based algorithm (以網格為基礎演算法)

    Gene structure prediction aims to determine the complete structure of a gene, in particular its precise exon-intron structure, in a eukaryotic genomic DNA sequence. A fairly large number of human genes may still remain to be identified. In response to this challenge, computational approaches to gene structure prediction have proliferated; however, their performance is still far from satisfactory.
    This study investigates previous ab initio gene prediction approaches and reexamines gene expression, the biological process by which information from a gene is used to synthesize a functional gene product. Based on this process, a stochastic context-free grammar (SCFG) is proposed to model the gene structure of genomic DNA sequences. To reduce computational complexity, the SCFG is reduced to a weakly equivalent grammar associated with a hidden semi-Markov model (HSMM).
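    As a rough illustration of the HSMM side of this reduction, the sketch below runs a generic explicit-duration Viterbi decoding over a toy two-state model. It is not the thesis's grammar or its trellis-based parser. The state names (E, I), all probabilities, and the function name hsmm_viterbi are assumptions made for the example.

```python
import math

NEG_INF = float("-inf")


def hsmm_viterbi(obs, states, log_init, log_trans, log_dur, log_emit, max_dur):
    """Most probable segmentation of obs as a list of (state, start, end).

    Each state emits a whole segment whose length d is scored by
    log_dur[state][d]; self-transitions are excluded because the duration
    model already accounts for how long a state persists.
    """
    n = len(obs)
    if n == 0:
        return []
    # best[t][s]: best log-probability of explaining obs[:t] with the last
    # segment in state s and ending exactly at position t.
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_dur, end) + 1):
                start = end - d
                seg = log_dur[s][d] + sum(log_emit[s][x] for x in obs[start:end])
                if start == 0:
                    candidates = [(log_init[s] + seg, None)]
                else:
                    candidates = [
                        (best[start][r] + log_trans[r][s] + seg, r)
                        for r in states
                        if r != s
                    ]
                for score, prev in candidates:
                    if score > best[end][s]:
                        best[end][s] = score
                        back[end][s] = (start, prev)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    segments, end = [], n
    while state is not None:
        start, prev = back[end][state]
        segments.append((state, start, end))
        state, end = prev, start
    return segments[::-1]


# Hypothetical toy model: "E" prefers G/C, "I" prefers A/T,
# and segment lengths are uniform on 1..MAX_DUR.
STATES = ["E", "I"]
MAX_DUR = 8
LOG_INIT = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {"E": {"I": 0.0}, "I": {"E": 0.0}}
LOG_DUR = {s: {d: math.log(1.0 / MAX_DUR) for d in range(1, MAX_DUR + 1)}
           for s in STATES}
LOG_EMIT = {
    "E": {"G": math.log(0.4), "C": math.log(0.4),
          "A": math.log(0.1), "T": math.log(0.1)},
    "I": {"A": math.log(0.4), "T": math.log(0.4),
          "G": math.log(0.1), "C": math.log(0.1)},
}
```

    Calling hsmm_viterbi("GGGGAAAA", STATES, LOG_INIT, LOG_TRANS, LOG_DUR, LOG_EMIT, MAX_DUR) returns the segment boundaries together with the state labels. A real gene model would replace the toy states with exon, intron, and signal states and use duration distributions estimated from data.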
    To improve the submodels of the gene structure prediction approach, signal sensors for donor sites and acceptor sites are developed using a dependency graph model that fully captures the intrinsic cyclic inter-dependency between base positions in a splice site. To facilitate statistical inference, the dependency graph (which usually contains cycles that make probabilistic reasoning very difficult, if not impossible) is expanded into a Bayesian network (a directed acyclic graph that supports statistical reasoning). In addition, first-order and second-order Markov chain models of amino acids translated from codons are investigated as content sensors for exons, in place of the widely used 3-periodic fifth-order inhomogeneous Markov chain model of DNA nucleotides. Finally, a modified trellis-based parsing algorithm for stochastic context-free grammars with state-duration nonterminal symbols is introduced and used to predict gene structure on two genomic sequence test sets.
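    The amino-acid content sensor can be pictured with a minimal sketch: translate an in-frame coding sequence with the standard genetic code, then score the resulting amino-acid string under a first-order Markov chain. The training data, the add-one pseudocount, and the function names (translate, train_first_order, score) are assumptions for illustration; the thesis's actual estimation procedure is not described in this record.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
ALPHABET = sorted(set(AMINO) - {"*"})  # the 20 amino acids


def translate(dna):
    """Translate an in-frame, uppercase ACGT sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def train_first_order(proteins, pseudocount=1.0):
    """Estimate log P(next amino acid | current amino acid) with smoothing."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for p in proteins:
        for a, b in zip(p, p[1:]):
            counts[a][b] += 1
    log_p = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        log_p[a] = {b: math.log(counts[a][b] / total) for b in ALPHABET}
    return log_p


def score(dna, log_p):
    """Log-likelihood of the translated sequence's amino-acid transitions."""
    p = translate(dna)
    return sum(log_p[a][b] for a, b in zip(p, p[1:]))
```

    A second-order chain would condition each amino acid on the previous two instead of one. In a full gene finder, such a score would typically be compared against a non-coding background model rather than used on its own.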

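    Table of Contents
    Chapter 1 Introduction
    Chapter 2 Background
    Chapter 3 Languages and Grammars
    Chapter 4 Model Architecture and Algorithms
    Chapter 5 Signal Sensors and Content Sensors
    Chapter 6 Results and Discussion
    Chapter 7 Conclusion
    Appendix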

