Author: Hsieh, Ming-Feng (謝明峰)
Thesis title: Clover: a clustering-oriented de novo assembler for Illumina sequences
Advisor: Tang, Chuan-Yi (唐傳義)
Committee members: Lu, Chin Lung (盧錦隆); Lee, Che-Rung (李哲榮); Tsai, Yin-Te (蔡英德); Lin, Yen-Jen (林沿妊)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2021
Academic year of graduation: 109
Language: English
Number of pages: 31
Keywords: De novo genome assembly, DNA sequencing, de Bruijn graph

    Next-generation sequencing technologies have revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, sequencing errors introduced by the sequencer complicate de novo assembly and affect the quality of downstream genomic research.
    In this thesis, we develop a de Bruijn graph assembler, called Clover (clustering-oriented de novo assembler), that uses a novel k-mer clustering approach derived from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We evaluate Clover's performance on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14) against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA). The results show that Clover achieves superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with every assembler except SOAPdenovo.
    Clover's novel clustering-based approach integrates the flexibility of the overlap-layout-consensus approach with the efficiency of the de Bruijn graph method and shows high potential for de novo assembly. Clover is freely available at http://oz.nthu.edu.tw/~d9562563/.
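
    The abstract reports assembly quality as N50 and E-size. As a rough illustration of what these two contiguity metrics measure, the following minimal Python sketch computes both from a list of contig lengths. It is not taken from the thesis: the function names are hypothetical, the E-size formula is assumed to be the sum of squared contig lengths divided by the genome size, and the "corrected" variants (which first break contigs at detected misassemblies) are not reproduced here.

```python
def n50(lengths):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def e_size(lengths, genome_size=None):
    """Return sum(L^2) / G, the expected length of the contig that
    contains a randomly chosen base.

    Assumption: when no genome size is supplied, G is estimated as the
    sum of all contig lengths.
    """
    g = genome_size if genome_size is not None else sum(lengths)
    if g == 0:
        return 0.0
    return sum(length * length for length in lengths) / g


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # made-up contig lengths
    print("N50:", n50(contigs))
    print("E-size:", round(e_size(contigs), 2))
```

    Both metrics reward long contigs; E-size is less sensitive to a single length threshold because every contig contributes in proportion to the square of its length.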

    Abstract (in Chinese)
    Abstract
    Contents
    List of Tables
    List of Figures
    List of Abbreviations
    1 Introduction
      1.1 Traditional methods
        1.1.1 Overlap-layout-consensus approach
        1.1.2 De Bruijn graph approach
    2 Rationale
      2.1 Installation of Clover
      2.2 Run Leptospira shermani assembly
    3 Results and discussion
      3.1 Datasets
      3.2 Running Clover assembler
      3.3 Assemblers
      3.4 Comparison
      3.5 Run times and memory requirements
      3.6 Future works
        3.6.1 Parallelization of Clover
        3.6.2 Exploration of other possible clustering algorithms
    4 Conclusions
    5 Methods
      5.1 Construction and clustering of k-mers
      5.2 Consensus computing and splitting of nodes
      5.3 De Bruijn graph construction
      5.4 Graph cleaning and extension with shorter k-mers
      5.5 Scaffolding
    Bibliography
    Appendix
      A.1 Leptospira shermani assembly statistics results
      A.2 Clover assembly statistics results
