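Graduate student: Chen, Yu-Han (陳昱翰)
Thesis title: The Study of Solving Scaffolding Problem Based on Exemplar Model (根據最少配對模式解決 Scaffolding 問題之研究)
Advisor: Lu, Chin-Lung (盧錦隆)
Committee members: Chiu, Hsien-Tai (邱顯泰); Lin, Tiao-Yin (林苕吟)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2019
Academic year of graduation: 107 (ROC academic year)
Language: Chinese
Number of pages: 72
Chinese keywords: 演算法 (algorithm), 基因體組裝 (genome assembly), 最少配對模型 (exemplar model), 整數線性規劃 (integer linear programming), 生物資訊 (bioinformatics), 次世代定序 (next-generation sequencing)
English keywords: algorithm, scaffolding problem, exemplar model, integer linear programming, bioinformatics, next generation sequencing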
    Scaffolding is an important step in DNA sequencing: it orders and orients the contigs of a draft genome. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that scaffolds a target draft genome using a complete or incomplete reference genome. CSAR requires both the target and the reference genome to be represented by singleton (non-duplicated) sequence markers. Duplicate sequence markers, however, are common in both prokaryotic and eukaryotic genomes. In this thesis, we therefore use the concept of exemplar breakpoint distance (EBD) to define an EBD-based scaffolding problem: determine scaffolds of the target and reference genomes such that the exemplar breakpoint distance between the two resulting scaffolds is minimized. We then design an exact algorithm for this problem using integer linear programming (ILP). In experiments on simulated and real datasets, our ILP scaffolding algorithm is more accurate when it takes duplicate markers into account than when it does not. It is also slightly more accurate than CSAR, but CSAR runs much faster.
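To make the notion of exemplar breakpoint distance concrete, the following is a minimal brute-force sketch in Python. It is not the thesis's ILP algorithm and it does not perform scaffolding; it only evaluates the distance between two fixed genomes. The assumptions are ours: each genome is a single linear chromosome given as a list of signed integer markers, both genomes contain the same marker families, and a breakpoint is an adjacency of the first genome that is absent from the second (the thesis may define adjacencies and breakpoints differently). All function names are hypothetical.

```python
from itertools import product


def adjacencies(genome):
    """Return the set of signed adjacencies of a linear genome.

    Each adjacency (a, b) is stored in a canonical form so that reading
    the same pair from the opposite strand, (-b, -a), gives the same key.
    """
    adjs = set()
    for a, b in zip(genome, genome[1:]):
        adjs.add(max((a, b), (-b, -a)))
    return adjs


def breakpoint_distance(g1, g2):
    """Count the adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


def exemplar_choices(genome):
    """Yield every genome that keeps exactly one copy per marker family."""
    positions = {}
    for i, marker in enumerate(genome):
        positions.setdefault(abs(marker), []).append(i)
    for kept in product(*positions.values()):
        yield [genome[i] for i in sorted(kept)]


def exemplar_breakpoint_distance(g1, g2):
    """Minimum breakpoint distance over all pairs of exemplar choices.

    Exhaustive search: the running time grows exponentially with the
    number of duplicated markers, so this is for illustration only.
    """
    return min(
        breakpoint_distance(e1, e2)
        for e1 in exemplar_choices(g1)
        for e2 in exemplar_choices(g2)
    )
```

For example, with `g1 = [1, 2, 1, 3]` and `g2 = [1, 2, 3]`, keeping the first copy of marker 1 in `g1` gives `[1, 2, 3]`, which has no breakpoints against `g2`, whereas keeping the second copy gives `[2, 1, 3]`, which has two. The exemplar breakpoint distance is the smaller of these values.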

    Chinese Abstract
    Abstract
    Acknowledgement
    Contents
    List of figures
    List of tables
    Chapter 1 Introduction
    Chapter 2 Methods
        2.1 Preliminaries
            2.1.1 Genome, Contig and Marker
            2.1.2 Shared Adjacency and Breakpoint
            2.1.3 Shared Potential Adjacency
            2.1.4 Extended Shared Potential Adjacency
        2.2 ILP Formulation
            2.2.1 ILP Variables and Objective Function
            2.2.2 ILP Constraints
    Chapter 3 Experiment Results and Discussion
        3.1 Quality Metrics
        3.2 Experiments of Simulation
            3.2.1 Flowchart of Simulation
            3.2.2 Settings of Simulation
            3.2.3 Results of Simulation
        3.3 Experiments of Real Datasets
            3.3.1 Settings of Used Tools
            3.3.2 Real Datasets
            3.3.3 Results of Real Datasets
    Chapter 4 Conclusion
    References

