Graduate student: | 陳昱翰 Chen, Yu-Han |
---|---|
Thesis title: | 根據最少配對模式解決 Scaffolding 問題之研究 The Study of Solving Scaffolding Problem Based on Exemplar Model |
Advisor: | 盧錦隆 Lu, Chin-Lung |
Oral defense committee: | 邱顯泰 Chiu, Hsien-Tai; 林苕吟 Lin, Tiao-Yin |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2019 |
Academic year of graduation: | 107 (ROC calendar) |
Language: | Chinese |
Number of pages: | 72 |
Keywords (translated from Chinese): | algorithm, genome assembly, exemplar model, integer linear programming, bioinformatics, next generation sequencing |
Keywords (English): | algorithm, scaffolding problem, exemplar model, integer linear programming, bioinformatics, next generation sequencing |
Scaffolding is an important step in DNA sequencing: its purpose is to order and orient the contigs in a draft genome. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that scaffolds a target draft genome using a complete or incomplete reference genome. CSAR requires the target and reference genomes to be represented by singleton (non-duplicated) sequence markers. However, duplicate sequence markers are common in both prokaryotic and eukaryotic genomes. In this thesis, we therefore use the concept of the exemplar breakpoint distance (EBD) to define an EBD-based scaffolding problem: determine scaffolds of the target and reference genomes such that the exemplar breakpoint distance between the resulting scaffolds is minimized. We then use integer linear programming (ILP) to design an exact algorithm for this problem. Experimental results on simulated and real datasets show that our ILP scaffolding algorithm is more accurate when duplicate markers are considered than when they are not. It is also slightly more accurate than CSAR, but CSAR is much faster in terms of running time.
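To make the exemplar breakpoint distance concrete, the following is a minimal brute-force sketch in Python. It is not the ILP formulation of the thesis and does not solve the scaffolding problem itself; it only computes the EBD between two small genomes by enumerating every exemplar. It assumes each genome is a single linear chromosome written as a list of signed integers (the sign gives the marker orientation), that both genomes contain the same marker families, and that chromosome ends are not counted as adjacencies. All function names are hypothetical.

```python
from itertools import product


def exemplars(genome):
    """Yield every exemplar of a signed genome (one copy kept per marker family)."""
    positions = {}
    for index, marker in enumerate(genome):
        positions.setdefault(abs(marker), []).append(index)
    families = sorted(positions)
    # Choose one position per family, then restore the original order.
    for choice in product(*(positions[family] for family in families)):
        yield tuple(genome[index] for index in sorted(choice))


def adjacencies(genome):
    """Return the adjacency set, normalised so that (a, b) equals (-b, -a)."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2 (duplicate-free genomes)."""
    return len(adjacencies(g1) - adjacencies(g2))


def exemplar_breakpoint_distance(g1, g2):
    """Minimum breakpoint distance over all pairs of exemplars (exponential time)."""
    if {abs(m) for m in g1} != {abs(m) for m in g2}:
        raise ValueError("both genomes must contain the same marker families")
    return min(
        breakpoint_distance(e1, e2)
        for e1 in exemplars(g1)
        for e2 in exemplars(g2)
    )


if __name__ == "__main__":
    target = [1, 2, 3, 2, 4]   # marker 2 is duplicated
    reference = [1, 3, 2, 4]
    print(exemplar_breakpoint_distance(target, reference))
```

In this toy input, keeping the second copy of marker 2 in `target` yields the exemplar `(1, 3, 2, 4)`, which matches `reference` exactly, whereas keeping the first copy yields `(1, 2, 3, 4)`, which shares no adjacency with it. The enumeration grows exponentially with the number of duplicated families, which is one reason an exact method such as ILP is of interest for realistic inputs.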