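The Study of Solving Scaffolding Problem Based on Maximum-matching Model | National Tsing Hua University Theses and Dissertations Repository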

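Author:	高逸軒 Kao, I-Hsuan
Thesis title:	The Study of Solving Scaffolding Problem Based on Maximum-matching Model
Advisor:	盧錦隆 Lu, Chin-Lung
Committee members:	邱顯泰 Chiu, Hsien-Tai; 林苕吟 Lin, Tiao-Yin
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2019
Academic year of graduation:	107 (ROC calendar)
Language:	Chinese
Number of pages:	78
Keywords (translated from Chinese):	algorithm, genome assembly, maximum-matching model, integer linear programming, bioinformatics, next-generation sequencing
Keywords (English):	algorithm, scaffolding problem, maximum-matching model, integer linear programming, bioinformatics, next generation sequencing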

Scaffolding is an important step in DNA sequencing. Its purpose is to determine the order and orientation of the contigs in a target draft genome. Accurate scaffolding helps in obtaining a more complete genome sequence in subsequent steps. Our laboratory previously developed CSAR, a rearrangement-based scaffolding tool that can scaffold a target draft genome based on a complete or incomplete reference genome. However, the main limitation of CSAR is that the conserved sequence markers shared by the target and reference genomes must be singletons, that is, no marker may be duplicated. In fact, duplicate sequence markers are very common in genomes. In this thesis, we therefore use the concept of the maximum-matching breakpoint distance (MBD) to define an MBD-based scaffolding problem: determine scaffolds of the target and reference genomes such that the MBD between the resulting scaffolds is minimized. We then use integer linear programming (ILP) to design an exact algorithm for this problem. Our experimental results on simulated and real datasets show that our MBD-based scaffolding algorithm is more accurate when duplicate markers are considered than when they are not. The scaffolding algorithm based on the exemplar breakpoint distance (EBD) is more accurate than our MBD-based algorithm on simulated datasets, whereas our MBD-based algorithm outperforms the EBD-based algorithm on real datasets. Moreover, our MBD-based algorithm is slightly more accurate than CSAR, but CSAR is much faster in terms of running time.
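
To make the distance in the abstract concrete, the following is a minimal brute-force sketch of a maximum-matching breakpoint distance between two fixed marker sequences. It is not the ILP formulation of the thesis, and it does not solve the scaffolding problem itself (it does not choose contig orders or orientations). It rests on simplifying assumptions that are not taken from the thesis: each genome is a single linear sequence of signed integers, markers with the same absolute value belong to the same family, unmatched copies are discarded, and only internal adjacencies are counted (no telomeres). The function names `breakpoints`, `family_matchings`, and `mbd_bruteforce` are hypothetical.

```python
from itertools import combinations, permutations, product


def breakpoints(a, b):
    """Count adjacencies of signed permutation a that are not conserved in b.

    An adjacency (x, y) is treated as conserved if b contains x, y or -y, -x
    consecutively (simplified definition, assumed for this sketch).
    """
    conserved = set()
    for x, y in zip(b, b[1:]):
        conserved.add((x, y))
        conserved.add((-y, -x))
    return sum((x, y) not in conserved for x, y in zip(a, a[1:]))


def family_matchings(pos_a, pos_b):
    """Yield every maximum matching between the copies of one marker family."""
    k = min(len(pos_a), len(pos_b))
    for chosen_a in combinations(pos_a, k):
        for chosen_b in permutations(pos_b, k):
            yield list(zip(chosen_a, chosen_b))


def mbd_bruteforce(a, b):
    """Minimum breakpoint count over all maximum matchings (exponential time)."""
    families = sorted({abs(m) for m in a} & {abs(m) for m in b})
    per_family = []
    for f in families:
        pos_a = [i for i, m in enumerate(a) if abs(m) == f]
        pos_b = [j for j, m in enumerate(b) if abs(m) == f]
        per_family.append(list(family_matchings(pos_a, pos_b)))

    best = None
    for choice in product(*per_family):
        # Give each matched pair a fresh identifier, keeping the original signs.
        label_a, label_b = {}, {}
        next_id = 1
        for pairs in choice:
            for i, j in pairs:
                label_a[i] = next_id if a[i] > 0 else -next_id
                label_b[j] = next_id if b[j] > 0 else -next_id
                next_id += 1
        # Unmatched copies are dropped; matched ones form signed permutations.
        perm_a = [label_a[i] for i in range(len(a)) if i in label_a]
        perm_b = [label_b[j] for j in range(len(b)) if j in label_b]
        d = breakpoints(perm_a, perm_b)
        if best is None or d < best:
            best = d
    return best


if __name__ == "__main__":
    # Family 1 occurs twice in each sequence, so the choice of matching matters.
    print(mbd_bruteforce([1, 2, 1, 3], [1, 3, 1, 2]))
```

In the example, pairing the two copies of family 1 by position leaves every adjacency broken, while the crossed pairing conserves two of the three adjacencies. The enumeration grows factorially with the number of duplicate copies, which is one reason an exact method such as ILP, or a heuristic, is used in practice.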

Chinese Abstract
Abstract
Acknowledgement
Contents
List of figures
List of tables
Chapter 1 Introduction
Chapter 2 Methods
2.1 Preliminaries
2.1.1 Genome, Contig and Marker
2.1.2 Adjacency and pair of shared adjacencies
2.1.3 Breakpoint and breakpoint distance
2.1.4 Matching, maximum-matching and maximum-matching model
2.1.5 Potential adjacency and pair of shared potential adjacencies
2.1.6 Extended potential adjacency and extended pair of shared potential adjacencies
2.2 ILP formulations
2.2.1 ILP variables
2.2.2 ILP objective function
2.2.3 ILP constraints
Chapter 3 Experiment Results and Discussion
3.1 Quality Metrics
3.2 Experiments of Simulation
3.2.1 Overview of Simulation
3.2.2 Flowchart of Simulation
3.2.3 Parameters of Simulation
3.2.4 Family ratio of Simulation
3.2.5 Results of Simulation
3.3 Experiments of Real Datasets
3.3.1 Settings of Used Tools
3.3.2 Real Datasets
3.3.3 Results of Real Datasets
3.3.4 Discussion
Chapter 4 Conclusion
References

                                
