
Graduate student: 梁博程 (Bor-Cherng Liang)
Thesis title: 大尺度基因型資料之單體型解構與重建
Haplotype Decomposition and Reconstruction from Large Scale Genotype Data
Advisors: 劉庭祿 (Tyng-Luh Liu)
陳朝欽 (Chaur-Chin Chen)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: English
Number of pages: 59
Keywords: Haplotype, tag SNPs, perfect phylogeny tree, tiling block
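  • This thesis addresses the problem of haplotype decomposition and reconstruction. We propose a method for large-scale genotype data that determines the haplotype block partition and reconstructs the pair of haplotypes for each genotype. For haplotype decomposition, we use a dynamic programming algorithm to determine the optimal block partition. For haplotype reconstruction, we propose a model in which each block contains at least one perfect phylogeny tree, together with tiling blocks composed of tag SNPs that span the blocks, to reconstruct the whole haplotypes. Combining these two main components yields an efficient haplotype reconstruction system.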

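    Our algorithm takes as its starting point the paper that Eskin et al. published at RECOMB 2003. However, whereas Eskin et al. adopt a model with a single perfect phylogeny tree within a block, we argue that a block should contain at least one, and usually more than one, perfect phylogeny tree. Moreover, for the difficult problem of reconstructing whole haplotypes across blocks, Eskin et al. consider only the relation between two adjacent blocks, whereas we go further and consider the relations among all blocks. This thesis makes four main contributions: (1) an at-least-one perfect-phylogeny-tree model, which fits real genotype data better and improves the accuracy of haplotype reconstruction; (2) an informative score function, which resolves a genotype into the most likely pair of haplotypes; (3) tiling blocks consisting of tag SNPs, which make the relations among all blocks resolvable; (4) the mutual relation among blocks, which is used to reconstruct whole haplotypes and to reduce the effect of a few erroneous decisions.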

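    To verify the efficiency and accuracy of the proposed algorithm, we carried out several kinds of tests. The system we built provides accurate and efficient haplotype decomposition and reconstruction. For accuracy, we experimented on the chromosome 5q31 genotype data set of Daly et al. (129 genotypes, each containing 103 SNPs) and obtained an accuracy of 97.9%. On a Pentium-4 3.06 GHz PC, determining the block partition and the haplotypes takes only one minute.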


    In this thesis, we address the problem of haplotype decomposition and reconstruction. Focusing on large-scale genotype data, we propose a new framework that determines the haplotype block partition and resolves the haplotype pair of each genotype. For decomposition, we formulate a dynamic programming algorithm that minimizes the total number of tag SNPs. For reconstruction, we introduce an at-least-one perfect-phylogeny-tree model within each block and use tiling blocks consisting of tag SNPs to link the blocks. The two components are well coupled and lead to an accurate and efficient haplotype reconstruction system.
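
    The decomposition step can be illustrated with a small sketch. This is not the thesis's algorithm: it assumes phased haplotypes given as 0/1 strings, allows any contiguous block up to a hypothetical `max_block_len`, and counts tag SNPs by brute force as the fewest SNP columns that tell apart all distinct haplotype patterns in a block. The function names and the block criterion are illustrative assumptions.

```python
from itertools import combinations


def min_tag_snps(haplotypes, start, end):
    """Fewest SNP columns in [start, end) that distinguish every distinct
    haplotype pattern observed in that block (brute-force search)."""
    patterns = {tuple(h[start:end]) for h in haplotypes}
    width = end - start
    for k in range(width + 1):
        for cols in combinations(range(width), k):
            projected = {tuple(p[c] for c in cols) for p in patterns}
            if len(projected) == len(patterns):
                return k
    return width


def partition_blocks(haplotypes, max_block_len=4):
    """Dynamic program over SNP positions.

    best[j] is the fewest tag SNPs needed to cover SNPs 0..j-1;
    cut[j] is the start of the last block in that optimal cover.
    max_block_len is a hypothetical limit, not a criterion from the thesis.
    """
    n = len(haplotypes[0])
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_block_len), j):
            cost = best[i] + min_tag_snps(haplotypes, i, j)
            if cost < best[j]:
                best[j], cut[j] = cost, i
    # Walk the cut points backwards to recover the blocks.
    blocks, j = [], n
    while j > 0:
        blocks.append((cut[j], j))
        j = cut[j]
    return best[n], blocks[::-1]


if __name__ == "__main__":
    sample = ["0011", "0011", "1100", "1100"]
    print(partition_blocks(sample, max_block_len=4))
    print(partition_blocks(sample, max_block_len=2))
```

    With the block-length limit relaxed, the four SNPs in the sample form a single block that one tag SNP describes; with a tighter limit, the recursion has to pay for one tag SNP per block. A realistic block criterion would replace the length limit.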

    Our approach is closely related to the work of Eskin et al. However, the perfect phylogeny model in their scheme allows only one perfect phylogeny tree within a block. We instead adopt a more flexible criterion that requires at least one perfect phylogeny tree. Furthermore, in resolving whole haplotypes across blocks, we take all blocks into account, whereas their work considers only two adjacent blocks. Specifically, our contributions are: (i) an at-least-one perfect-phylogeny-tree model, which fits real genotype data and improves the accuracy of haplotype resolving within a block; (ii) an informative score function, which resolves a genotype into the most likely pair of haplotypes; (iii) tiling blocks consisting of tag SNPs, which make all of the choices resolvable; and (iv) the mutual relation among blocks, which resolves whole haplotypes by considering all blocks and reduces the effect of a few erratic choices. We also include various experimental results to illustrate the advantages of the proposed method.
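
    As background for the within-block model, the sketch below shows the standard single-tree setting only; it does not reproduce the thesis's at-least-one-tree model or its score function. It assumes genotypes are strings over '0', '1' (homozygous) and '2' (heterozygous), enumerates the haplotype pairs consistent with a genotype, and keeps the pairs that pass the four-gamete test together with a set of already known haplotypes. For binary sites, passing the four-gamete test on every pair of sites is equivalent to admitting a perfect phylogeny. All function names are illustrative.

```python
from itertools import combinations, product


def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype.

    Assumed encoding: '0' or '1' for homozygous sites, '2' for
    heterozygous sites, where the two haplotypes must differ.
    """
    het = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, b in zip(het, bits):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test: no pair of sites may show all of 00, 01, 10, 11."""
    n = len(haplotypes[0])
    for a, b in combinations(range(n), 2):
        if len({(h[a], h[b]) for h in haplotypes}) == 4:
            return False
    return True


def compatible_pairs(genotype, known_haplotypes):
    """Pairs that keep the haplotype set consistent with a single tree."""
    return [
        pair
        for pair in haplotype_pairs(genotype)
        if admits_perfect_phylogeny(list(known_haplotypes) + list(pair))
    ]


if __name__ == "__main__":
    print(haplotype_pairs("202"))
    print(compatible_pairs("202", ["000", "101"]))
```

    When several pairs remain compatible, some further criterion is needed to pick one; the informative score function named in contribution (ii) plays that role in the thesis, and it is not sketched here.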

    Keywords: Haplotype, tag SNPs, perfect phylogeny tree, tiling block

    Contents
    1 Introduction
      1.1 Related Work
        1.1.1 Haplotype Decomposition
        1.1.2 Haplotype Reconstruction
        1.1.3 Simultaneous Haplotype Reconstruction and Decomposition
      1.2 Our Approach
        1.2.1 Haplotype Resolving within a Block
        1.2.2 Haplotype Resolving among Blocks
    2 Haplotype Decomposition
      2.1 Haplotype Decomposition and Dynamic Programming
      2.2 Why Minimizing the Number of Tag SNPs?
      2.3 Dynamic Programming for Minimizing the Number of Tag SNPs
        2.3.1 Useful Criteria
        2.3.2 Some Definitions
        2.3.3 The Dynamic Programming Argument
    3 Haplotype Reconstruction with Large Scale Genotype Data
      3.1 Introduction
      3.2 Haplotype Reconstruction within a Block
        3.2.1 The Perfect Phylogeny Haplotype Problem
        3.2.2 Useful Definitions and Lemma
        3.2.3 Build-Tree Algorithm
        3.2.4 Maximum Likelihood Model
        3.2.5 Resolving Missing Data
      3.3 Finding the Block Partitions from Genotype Data
      3.4 Haplotype Reconstruction between Adjacent Blocks
    4 Our Framework
      4.1 At-Least-One Perfect-Phylogeny-Tree Model
        4.1.1 Our Model and the Key Ideas
        4.1.2 At-Least-One Perfect-Phylogeny-Tree Model
      4.2 Informative Score Function
      4.3 Tiling Blocks Consisting of Tag SNPs
        4.3.1 Unresolvable Choices from Tiling Blocks of Eskin et al.
        4.3.2 Reducing and Transferring the Unresolvable Choices
        4.3.3 Tiling Blocks with Tag SNPs
      4.4 Mutual Relation among Blocks
    5 Implementation and Experiments
      5.1 Details of the Implementation
        5.1.1 Definitions and Criteria
        5.1.2 The Steps of Our Approach
        5.1.3 Two Tricks
      5.2 Experimental Results
        5.2.1 The Measures for Comparing Results
        5.2.2 Experiments on Genotype Data in Daly et al.
        5.2.3 Experiments on Simulation Genotype Data
    6 Conclusions
      6.1 Concluding Remarks
      6.2 Future Work

    List of Figures
      1-1 SNP
      1-2 Haplotype
      1-3 Haplotype decomposition
      1-4 Two-level haplotype resolving
      2-1 Tag SNPs
      2-2 The diagram of the recursion for dynamic programming theory
      3-1 Genotypes A ⇒ haplotypes B
      3-2 Perfect phylogeny tree
      3-3 The relation of two sites on the perfect phylogeny tree
      3-4 Equally / unequally resolving
      3-5 Finding the block partitions from genotype data
      3-6 Blocks tiling
      4-1 Solving Gc by the triangle rule
      4-2 The flow chart of addressing at-least-one perfect-phylogeny-tree model
      4-3 Tiling blocks consisting of tag SNPs

    List of Tables
      4.1 Example of the mutual relation table
      5.1 Example of measures
      5.2 The result of haplotype reconstruction by the given block partitions from Daly et al.
      5.3 The result of haplotype reconstruction and decomposition from the genotype data
      5.4 Comparing with Eskin et al. on genotype data in Daly et al.
      5.5 Results of simulation genotype data from NT 001035
      5.6 Results of simulation genotype data from NT 003545

    [1] V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph, “Haplotyping as Perfect Phylogeny:
    A Direct Approach,” Technical Report UCDavis CSE-2002-21, July 2002.
    [2] V. Bafna, B. V. Halldorsson, R. Schwartz, A. G. Clark, and S. Istrail, “Haplotypes
    and Informative SNP Selection Algorithms: Don’t Block out Information,” In Proceedings
    of The 7th Annual International Conference on Research in Computational
    Molecular Biology (RECOMB), pp. 19–27, 2003.
    [3] A. Clark, “Inference of Haplotypes from PCR-amplified Samples of Diploid Populations,”
    Molecular Biology and Evolution, vol. 7, no. 2, pp. 111–22, March 1990.
    [4] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, “High-resolution
    Haplotype Structure in the Human Genome,” Nature Genetics, vol. 29, no. 2,
    pp. 229–232, October 2001.
    [5] E. Eskin, E. Halperin, and R. M. Karp, “Large Scale Reconstruction of Haplotypes
    from Genotype Data,” In Proceedings of The 7th Annual International Conference on
    Research in Computational Molecular Biology (RECOMB), pp. 104–113, 2003.
    [6] L. Excoffier and M. Slatkin, “Maximum-likelihood Estimation of Molecular Haplotype
    Frequencies in a Diploid Population,” Molecular Biology and Evolution, vol. 12,
    no. 5, pp. 921–927, September 1995.
    [7] G. Greenspan and D. Geiger, “Model-based Inference of Haplotype Block Variation,”
    In Proceedings of The 7th Annual International Conference on Research in Computational
    Molecular Biology (RECOMB), pp. 131–137, 2003.
    [8] D. Gusfield, “Haplotyping as Perfect Phylogeny: Conceptual Framework and
    Efficient Solutions,” In Proceedings of The 6th Annual International Conference on
    Research in Computational Molecular Biology (RECOMB), pp. 166–175, 2002.
    [9] E. Halperin and E. Eskin, “Haplotype Reconstruction from Genotype Data using
    Imperfect Phylogeny,” To appear in Bioinformatics, 2004.
    [10] G. Kimmel and R. Shamir, “Maximum Likelihood Resolution of Multi-block Genotypes,”
    In Proceedings of The 8th Annual International Conference on Research in
    Computational Molecular Biology (RECOMB), pp. 2–9, 2004.
    [11] M. Koivisto, M. Perola, T. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen,
    E. Ukkonen, and H. Mannila, “An MDL Method for Finding Haplotype Blocks and
    for Estimating the Strength of Haplotype Block Boundaries,” In Proceedings of the
    Pacific Symposium on Biocomputing (PSB), vol. 8, pp. 502–513, 2003.
    [12] J. Long, R. Williams, and M. Urbanek, “An EM Algorithm and Testing Strategy for
    Multiple-locus Haplotypes,” American Journal of Human Genetics, vol. 56, no. 3,
    pp. 799–810, March 1995.
    [13] NHGRI, “http://www.genome.gov/10005336,” October 2002.
    [14] NHGRI, “http://www.genome.gov/10001772,” February 2004.
    [15] N. Patil, A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi, C. R. Hacker, C. R.
    Kautzer, D. H. Lee, C. Marjoribanks, D. P. McDonough, B. T. N. Nguyen, M. C.
    Norris, J. B. Sheehan, N. Shen, D. Stern, R. P. Stokowski, D. J. Thomas, M. O.
    Trulson, K. R. Vyas, K. A. Frazer, S. P. A. Fodor, and D. R. Cox, “Blocks of limited
    haplotype diversity revealed by high-resolution scanning of human chromosome 21,”
    Science, vol. 294, no. 5547, pp. 1719–1723, November 2001.
    [16] R. Schwartz, B. V. Halldorsson, V. Bafna, A. G. Clark, and S. Istrail,
    “Robustness of Inference of Haplotype Block Structure,” Journal of Computational
    Biology, vol. 10, no. 1, pp. 13–19, 2003.
    [17] M. Stephens, N. Smith, and P. Donnelly, “A New Statistical Method for Haplotype
    Reconstruction from Population Data,” American Journal of Human Genetics, vol.
    68, no. 4, pp. 978–989, October 2001.
    [18] K. Zhang, M. Deng, T. Chen, M. S. Waterman, and F. Sun, “A Dynamic Programming
    Algorithm for Haplotype Block Partitioning,” Proceedings of the National Academy
    of Sciences (PNAS), vol. 99, no. 11, pp. 7335–7339, May 2002.

