
Graduate student: Liao, Ki-Hok (廖基復)
Thesis title: On the Genealogies of A Set of Reads from Evolving Microbial Populations (野生細菌族群基因讀序片段譜系網路之研究)
Advisor: Tang, Chuan-Yi (唐傳義)
Oral defense committee: Hon, Wing Kai (韓永楷)
Hsieh, Wen-Ping (謝文萍)
Ting, Chau-Ti (丁照棣)
Chao, Kun-Mao (趙坤茂)
Degree: Doctor
Department:
Year of publication: 2019
Academic year of graduation: 107
Language: English
Number of pages: 130
Chinese keywords: 溯祖理論, 微生物, 模擬, 散彈槍定序
Foreign-language keywords: coalescent, microbial, simulator, shotgun sequencing
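    High-throughput sequencing technology has substantially changed the study of metagenomics and cancer evolution. When a wild microbial community contains only a few species, whole-genome sequencing yields a random sample from those species, and the single nucleotide polymorphisms in that sample can be used to infer the past evolutionary history of the bacterial populations. As read lengths increase, more information will become available for studying the evolution of wild microbial populations. The relationships among the reads obtained by whole-genome sequencing can be represented as a genealogical network. Understanding the probability distribution of this network helps explain how evolutionary history shapes the properties of sequencing data, and it can guide the design of new analysis methods. In this study, we propose a new probabilistic model of the genealogical network of sequencing data. The model can be described by two stochastic processes: the first follows how the genealogical tree changes along positions in the genome, and the second follows the reads backward in time from the present to the past. We prove that the model can be regarded as an approximation of the standard model with far lower complexity, so it can be used to design more efficient methods. Based on this model we implemented a simulator, MetaSMC, which can simulate sequencing data under arbitrary evolutionary processes. In addition to the classical infinite-site model, the simulator supports substitution models such as Jukes-Cantor, HKY85 and the generalised time-reversible model. The simulator also allows different populations to have different mutation rates, which makes it possible to simulate the effect of a mutator phenotype.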


    High-throughput sequencing technology has revolutionized the study of metagenomics and cancer evolution. In a relatively simple environment, a metagenomic sequencing dataset is dominated by a few species. By analyzing the alignment of reads from microbial species, single nucleotide polymorphisms can be discovered and the evolutionary history of the populations can be reconstructed. Ever-increasing read lengths will allow more detailed analysis of the evolutionary history of microbial or tumor cell populations. All reads in a sample from a microbial community or tumor can be related by a sequence of genealogies. Understanding the distribution of these genealogies provides insight into how evolutionary processes affect the mutation pattern in the reads. In this thesis, we propose a new model of the genealogies of a set of reads from evolving microbial populations. We propose a spatial process that generates the sequence of genealogies of a set of reads. Our model ignores unnecessary chromosomal segments and is thus far simpler than the standard coalescent when recombination is frequent. We show that the process behind our spatial process is equivalent to the Sequentially Markov Coalescent with an incomplete sample. The accuracy of our model was evaluated using summary statistics and likelihood curves derived from Monte Carlo integration over a large number of random genealogies. Based on our model, we implemented an efficient simulator, MetaSMC. Because it is built on coalescent theory, the simulator supports all evolutionary scenarios supported by other coalescent simulators. In addition, it supports various substitution models, including the Jukes-Cantor, HKY85 and generalised time-reversible (GTR) models. The simulator also supports mutator phenotypes by allowing different mutation rates and substitution models in different subpopulations.
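    As background for the coalescent framework the abstract builds on, the following is a minimal sketch of the waiting times in Kingman's coalescent for a single non-recombining locus. It is not the MetaSMC algorithm, and the function names `kingman_coalescence_times` and `expected_tmrca` are hypothetical, introduced only for illustration. It assumes time is measured in coalescent units, so that while k lineages remain the waiting time to the next merger is exponential with rate k(k-1)/2.

```python
import random


def kingman_coalescence_times(n, rng):
    """Return cumulative times of the n - 1 coalescence events for n lineages.

    Assumption: time is in coalescent units, so with k lineages remaining the
    waiting time to the next merger is exponential with rate k * (k - 1) / 2.
    """
    if n < 2:
        raise ValueError("need at least two lineages")
    times = []
    t = 0.0
    for k in range(n, 1, -1):          # k = n, n - 1, ..., 2 lineages remain
        rate = k * (k - 1) / 2.0       # number of pairs that could merge
        t += rng.expovariate(rate)     # exponential waiting time
        times.append(t)
    return times


def expected_tmrca(n):
    """Expected time to the most recent common ancestor: 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


if __name__ == "__main__":
    rng = random.Random(2019)          # fixed seed, illustration only
    replicates = 20000
    mean_tmrca = sum(
        kingman_coalescence_times(10, rng)[-1] for _ in range(replicates)
    ) / replicates
    print("simulated mean TMRCA:", mean_tmrca)
    print("theoretical mean TMRCA:", expected_tmrca(10))
```

    The last entry of the returned list is the time to the most recent common ancestor of the sample; averaging it over many replicates should approach 2(1 - 1/n).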

    1 Introduction
    2 Background
    2.1 Kingman’s Coalescent
    2.2 Coalescent with Recombination
    2.3 Applications of coalescent theory
    3 Spatial Process that Simulates Genealogies of Reads
    3.1 Basic Algorithm
    3.2 Recombination
    4 An Equivalent Temporal Process
    4.0.1 Proof of Theorem 1
    4.0.2 Equivalence of Coalescent Parts
    4.0.3 Equivalence of Recombination Parts
    5 Likelihood of A Set of Reads
    5.0.1 Proof of Theorem 2
    6 A Shotgun-Sequence Simulator
    6.1 Detailed Algorithm
    7 Result
    7.1 Performance of the Spatial Process
    7.2 Accuracy
    7.2.1 Correlation between Local Genealogies
    7.2.2 Likelihood
    8 Conclusion
    9 Future Works
    9.1 Exact Model of Gene Conversion
    9.2 Further Simplification of Coalescent with Recombination
    10 Figures and Tables
    Appendix
    A1 Algorithm A1 and A2
    A2 Data Structures
    A2.1 Binary Indexed Tree
    A2.2 Event Index
    A3 Properties of Subtrees Embedded in a Global Genealogy
    A3.1 Polya Urn Analog of Random Genealogies
    A3.2 Properties of Genealogy of Individual Region
    A3.3 Correlation Between Genealogies of Different Regions
    A4 Tutorial
    A5 Figures and Tables

