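| Graduate student: | Chen, Yi-Ju (陳怡如) |
|---|---|
| Thesis title: | A Heuristic Algorithm for Solving Scaffolding Problem Based on Maximum-matching Model (根據最多配對模式解決Scaffolding問題的啟發式演算法) |
| Advisor: | Lu, Chin-Lung (盧錦隆) |
| Oral defense committee: | Chiu, Hsien-Tai (邱顯泰); Lin, Tiao-Yin (林苕吟) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Institute of Information Systems and Applications |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Number of pages: | 46 |
| Keywords (Chinese): | genome assembly (基因體組裝), maximum-matching model (最多配對模式), genome rearrangement (基因體重組), duplicate sequence marker (重複序列標記) |
| Keywords (English): | contig scaffolding, maximum-matching model, genome rearrangement, duplicate sequence marker |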
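Next-generation sequencing (NGS) technologies have made it possible to efficiently produce genomes for many species of interest. However, these sequenced genomes are still only collections of contigs whose order and orientation on the sequenced genome are unknown. Scaffolding is the process of determining the order and orientation of these contigs, and it is important and helpful for the subsequent process of obtaining the complete sequence of a sequenced genome. Many scaffolding tools have been developed that use a complete or incomplete reference genome to determine the order and orientation of the contigs in a target draft genome, for example OSLay, Mauve Aligner, MeDuSa, Ragout and CSAR. In particular, CSAR was developed by our laboratory, and our experimental results have shown that in most cases CSAR outperforms the other tools in sensitivity, precision, F-score and genome coverage. However, the genomes that CSAR scaffolds must contain only non-repeated sequence markers (singleton sequence markers). In fact, repeated sequence markers (duplicate sequence markers) are very common in genomes. This motivated us to look for a method that lets CSAR consider both singleton and duplicate sequence markers so that its accuracy can be improved further. In this study, we design the following heuristic method to solve this problem. First, we use the maximum-matching model proposed by Shao and Moret to convert duplicate sequence markers into singleton ones in the following steps: (1) build a maximum matching among the duplicate sequence markers, (2) discard the copies that are not in the matching, and (3) treat each matched pair as a new marker family. After that, all markers in the target and reference genomes are singletons. Next, we use CSAR to scaffold the contigs of the target genome according to the reference genome. Finally, we test our scaffolding method on two types of datasets, simulated and real. Our experimental results show that considering duplicate sequence markers in CSAR's scaffolding procedure indeed gives better results than either not considering them or letting CSAR itself handle them using MUMmer.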
Next-generation sequencing technologies have allowed us to efficiently produce genomes for many organisms of interest. However, most sequenced genomes are just collections of independent contigs whose relative positions and orientations along the genome being sequenced are unknown. Scaffolding is the process of determining the order and orientation of these contigs, and it is critical for the subsequent finishing process. Several scaffolding tools have been developed that use a complete or incomplete reference genome to order and orient the contigs of a target draft genome, such as OSLay, Mauve Aligner, MeDuSa, Ragout and CSAR. In particular, CSAR was developed by our laboratory, and our experimental results have shown that CSAR outperforms the other tools in most cases in terms of sensitivity, precision, F-score and genome coverage. However, the genomes that CSAR scaffolds must contain only singleton sequence markers, whereas duplicate sequence markers are in fact very common in genomes. This motivates us to find a method that allows CSAR to consider both singleton and duplicate sequence markers so that its accuracy can be further improved. In this study, we design the following heuristic approach to address this problem. First, we use the maximum-matching model proposed by Shao and Moret to transform duplicate sequence markers into singleton ones in three steps: (1) build a maximum matching between duplicate sequence markers, (2) discard the copies that are not in the matching, and (3) treat each matched pair as a new marker family. After that, all markers in the target and reference genomes are singletons. Next, we use CSAR to scaffold the contigs of the target genome based on the reference genome. Finally, we validate our scaffolding approach on two kinds of datasets, simulated and real. Our experimental results show that considering duplicate sequence markers in the scaffolding process of CSAR indeed performs better than both ignoring duplicate sequence markers and relying on CSAR itself, which uses MUMmer to deal with them.
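The three steps of the marker transformation can be illustrated with a minimal Python sketch. It is not the thesis implementation and not Shao and Moret's algorithm: the function name `singletonize`, the genome representation (a list of contigs, each a list of `(family, strand)` tuples) and, above all, the pairing rule (the i-th copy in the target with the i-th copy in the reference) are assumptions made only for illustration. The pairing rule does match as many copies per family as possible, but it ignores how well the resulting matching preserves marker order.

```python
from collections import defaultdict


def singletonize(target, reference):
    """Turn duplicate markers into singletons by pairing copies across genomes.

    Each genome is a list of contigs; each contig is a list of
    (family, strand) tuples with strand in {+1, -1}.
    Hypothetical helper for illustration only.
    """
    def positions(genome):
        # family -> list of (contig index, marker index), in reading order
        pos = defaultdict(list)
        for c, contig in enumerate(genome):
            for m, (family, _strand) in enumerate(contig):
                pos[family].append((c, m))
        return pos

    t_pos, r_pos = positions(target), positions(reference)

    # Step 1: match copies family by family. Pairing the i-th copy with the
    # i-th copy is an ASSUMED placeholder rule; zip stops at the shorter list,
    # so min(#target copies, #reference copies) pairs are formed per family.
    t_label, r_label = {}, {}
    for family in t_pos.keys() & r_pos.keys():
        for i, (t, r) in enumerate(zip(t_pos[family], r_pos[family]), start=1):
            # Step 3: each matched pair becomes its own new marker family.
            t_label[t] = r_label[r] = f"{family}_{i}"

    def relabel(genome, label):
        # Step 2: copies without a label were left unmatched and are dropped.
        return [
            [(label[(c, m)], strand)
             for m, (_family, strand) in enumerate(contig) if (c, m) in label]
            for c, contig in enumerate(genome)
        ]

    return relabel(target, t_label), relabel(reference, r_label)


target = [[("a", 1), ("b", 1), ("a", -1)], [("c", 1)]]
reference = [[("a", 1), ("c", -1), ("b", 1), ("b", 1)]]
new_target, new_reference = singletonize(target, reference)
```

After the call, every marker name occurs exactly once in each genome, which is the input form the abstract says CSAR requires. In this sketch a marker found in only one of the two genomes is dropped as well. A faithful implementation would choose the matching by an optimisation criterion rather than by position.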
[1] Pop, M. (2009) Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10, 354-366.
[2] Nagarajan, N., Cook, C., Di Bonaventura, M., Ge, H., Richards, A., Bishop-Lilly, K., DeSalle, R., Read, T. and Pop, M. (2010) Finishing genomes with limited resources: lessons from an ensemble of microbial genomes. BMC Genomics, 11, 242.
[3] Richter, D., Schuster, S. and Huson, D. (2007) OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics, 23, 1573–1579.
[4] Rissman, A., Mau, B., Biehl, B., Darling, A., Glasner, J. and Perna, N. (2009) Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics, 25, 2071–2073.
[5] Bosi, E., Donati, B., Galardini, M., Brunetti, S., Sagot, M., Lió, P., Crescenzi, P., Fani, R. and Fondi, M. (2015) MeDuSa: a multi-draft based scaffolder. Bioinformatics, 31, 2443-2451.
[6] Kolmogorov, M., Raney, B., Paten, B. and Pham, S. (2014) Ragout - a reference-assisted assembly tool for bacterial genomes. Bioinformatics, 30, i302-i309.
[7] Chen, K.T., Liu, C.L., Huang, S.H., Shen, H.T., Shieh, Y.K., Chiu, H.T. and Lu, C.L. (2018) CSAR: a contig scaffolding tool using algebraic rearrangement. Bioinformatics, 34, 109-111.
[8] Bailey, J. and Eichler, E. (2006) Primate segmental duplications: Crucibles of evolution, diversity and disease. Nature Reviews Genetics, 7, 552–564.
[9] Lynch, M. and Walsh, B. (2007) The Origins of Genome Architecture, vol. 98. Sinauer Associates, Sunderland, MA.
[10] Shao, M. and Moret, B. (2016) A fast and exact algorithm for the exemplar breakpoint distance. Journal of Computational Biology, 23, 309-322.
[11] Shao, M. and Moret, B. (2017) On computing breakpoint distances for genomes with duplicate genes. Journal of Computational Biology, 24, 571-580.
[12] Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O. and Salzberg, S.L. (1999) Alignment of whole genomes. Nucleic Acids Research, 27, 2369-2376.
[13] Minkin, I., Patel, A., Kolmogorov, M., Vyahhi, N. and Pham, S. (2013) Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In: Darling A., Stoye J. (eds) Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126. Springer, Berlin, Heidelberg, 215-229.
[14] Sankoff, D. (1999) Genome rearrangement with gene families. Bioinformatics, 15, 909–917.
[15] Hannenhalli, S. and Pevzner, P.A. (1999) Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. Journal of the ACM, 46, 1-27.
[16] Bergeron, A., Mixtacki, J. and Stoye, J. (2006) A unifying view of genome rearrangements. Lecture Notes in Computer Science, 4175, 163–173.
[17] Feijão, P. and Meidanis, J. (2013) Extending the algebraic formalism for genome rearrangements to include linear chromosomes. IEEE-ACM Transactions on Computational Biology and Bioinformatics, 10, 819-831.
[18] Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G. (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29, 1072-1075.
[19] Dias, Z., Dias, U. and Setubal, J.C. (2012) SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics, 13, 96.