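Graduate student: | Kao, I-Hsuan (高逸軒) |
---|---|
Thesis title: | The Study of Solving Scaffolding Problem Based on Maximum-matching Model (根據最多配對模式解決Scaffolding問題之研究) |
Advisor: | Lu, Chin-Lung (盧錦隆) |
Oral defense committee: | Chiu, Hsien-Tai (邱顯泰); Lin, Tiao-Yin (林苕吟) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2019 |
Academic year of graduation: | 107 |
Language: | Chinese |
Number of pages: | 78 |
Keywords (translated from Chinese): | algorithm, genome assembly, maximum-matching model, integer linear programming, bioinformatics, next-generation sequencing |
Keywords (English): | algorithm, scaffolding problem, maximum-matching model, integer linear programming, bioinformatics, next generation sequencing |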
Scaffolding is an important step in DNA sequencing. Its purpose is to determine the order and orientation of the contigs in a target draft genome. Accurate scaffolding helps to obtain a more complete genome sequence in subsequent steps. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that can scaffold a target draft genome based on a complete or incomplete reference genome. However, the main limitation of CSAR is that the conserved sequence markers between the target and reference genomes must be singletons, that is, no marker may be duplicated. In fact, duplicate sequence markers are very common in genomes. In this thesis, we therefore use the concept of the maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem, whose goal is to determine scaffolds of the target and reference genomes such that the MBD between the resulting scaffolds is minimized. We also use integer linear programming (ILP) to design an exact algorithm for the MBD-based scaffolding problem. Our experimental results on simulated and real datasets show that our MBD-based scaffolding algorithm is more accurate when it considers duplicate markers than when it does not. The scaffolding algorithm based on the exemplar breakpoint distance (EBD) is more accurate than our MBD-based scaffolding algorithm on simulated datasets, but our MBD-based scaffolding algorithm outperforms the EBD-based scaffolding algorithm on real datasets. Moreover, our MBD-based scaffolding algorithm is slightly more accurate than CSAR, but CSAR is much faster than our MBD-based scaffolding algorithm.
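To make the distance in the abstract concrete, the following is a minimal brute-force sketch of a maximum-matching breakpoint distance between two small genomes with duplicate markers. It is not the ILP-based algorithm of the thesis and does not solve the scaffolding problem itself; it only evaluates the distance for two fixed marker orders. Several points are assumptions made for illustration: each genome is a single linear chromosome given as a list of signed integers, the distance is counted as the number of adjacencies of the first genome that are absent from the second after matching, and the function names `adjacencies`, `family_matchings`, and `mbd` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, permutations, product


def adjacencies(chrom):
    """Orientation-independent adjacencies of a linear chromosome.

    chrom is a list of (marker_id, sign) pairs with unique marker_ids.
    The adjacency (a, b) is treated as equal to (-b, -a).
    """
    adj = set()
    for (a, sa), (b, sb) in zip(chrom, chrom[1:]):
        forward = ((a, sa), (b, sb))
        backward = ((b, -sb), (a, -sa))
        adj.add(min(forward, backward))
    return adj


def family_matchings(pos1, pos2):
    """Yield every maximum one-to-one matching between two position lists."""
    k = min(len(pos1), len(pos2))
    for chosen1 in combinations(pos1, k):
        for chosen2 in permutations(pos2, k):
            yield list(zip(chosen1, chosen2))


def mbd(g1, g2):
    """Brute-force breakpoint distance minimized over all maximum matchings.

    Assumption: exponential enumeration, intended for tiny inputs only.
    """
    pos1, pos2 = defaultdict(list), defaultdict(list)
    for i, m in enumerate(g1):
        pos1[abs(m)].append(i)
    for j, m in enumerate(g2):
        pos2[abs(m)].append(j)
    families = sorted(set(pos1) & set(pos2))
    per_family = [list(family_matchings(pos1[f], pos2[f])) for f in families]

    best = None
    for choice in product(*per_family):
        # Give each matched pair a unique label; unmatched copies are dropped.
        label1, label2 = {}, {}
        for f, pairs in zip(families, choice):
            for n, (i, j) in enumerate(pairs):
                label1[i] = (f, n)
                label2[j] = (f, n)
        c1 = [(label1[i], 1 if g1[i] > 0 else -1) for i in sorted(label1)]
        c2 = [(label2[j], 1 if g2[j] > 0 else -1) for j in sorted(label2)]
        breakpoints = len(adjacencies(c1) - adjacencies(c2))
        if best is None or breakpoints < best:
            best = breakpoints
    return best


# Marker 1 occurs twice in the first genome and once in the second.
print(mbd([1, 2, 1, 3], [1, 3, 2]))
```

In this example, matching the second copy of marker 1 preserves the adjacency between markers 1 and 3, which gives a smaller distance than matching the first copy. The enumeration grows exponentially with the number of duplicates, which is one reason an exact method would typically rely on a formulation such as ILP instead.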