Author: Liu, Shu-Cheng (劉書呈)
Thesis title: A Web Server of Multiple Reference Based Scaffolding Tool (一個多重參考式Scaffolding工具的網路伺服器)
Advisor: Lu, Chin-Lung (盧錦隆)
Committee members: Chiu, Hsien-Tai (邱顯泰); Lin, Tiao-Yin (林苕吟)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar, i.e., the 2018–2019 academic year)
Language: Chinese
Number of pages: 45
Keywords: algorithm, multiple reference-based scaffolding, web server, weighting scheme, bioinformatics, next-generation sequencing
    Next-generation sequencing technologies have advanced greatly in recent years. They are used to obtain large numbers of short reads, which are then assembled into a genome sequence. In genome assembly, scaffolding, that is, ordering and orienting the contigs of a draft genome, is an important step. Many reference-based scaffolding tools have been developed, but most of them scaffold the target draft genome using only one reference genome. If the target and reference genomes are evolutionarily distant, or if genome rearrangements (e.g., reversals) have occurred between them, a single reference-based scaffolder may fail to scaffold the target draft genome correctly. This has motivated the development of multiple reference-based scaffolders, which scaffold the contigs of the target draft genome by referring to several reference genomes of related organisms that may provide different but complementary scaffolding information. Multi-CSAR is a multiple reference-based scaffolder developed by our laboratory. In our previous study, we showed that Multi-CSAR is more accurate than Ragout and MeDuSa in terms of sensitivity, precision and F-score. However, the weighting scheme in Multi-CSAR considers only the sequence identity of markers between the target and reference genomes, although the similarity in marker order and orientation (order identity) is also an important measure between the target and reference genomes. In this thesis, therefore, we design a new weighting scheme that considers both the sequence identity and the order identity of markers. In addition, Multi-CSAR is a stand-alone program, which is inconvenient for users who are not familiar with Unix/Linux systems. We therefore develop a web server for Multi-CSAR that provides an easy-to-operate interface, allowing users to scaffold their target draft genomes efficiently and accurately.
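The accuracy measures named above (sensitivity, precision and F-score) can be made concrete with a minimal sketch. It assumes that a scaffold is evaluated by its contig joins, each a pair of adjacent contig ends, compared against the joins of a known reference order. The function name, the join encoding and the example data are hypothetical and are not taken from the thesis.

```python
def scaffold_accuracy(predicted_joins, reference_joins):
    """Return (sensitivity, precision, f_score) for predicted contig joins.

    Each join is a pair of contig ends such as ("c1_tail", "c2_head").
    Joins are treated as unordered pairs, so (a, b) equals (b, a).
    This encoding is an assumption made for illustration only.
    """
    predicted = {frozenset(join) for join in predicted_joins}
    reference = {frozenset(join) for join in reference_joins}

    true_positives = len(predicted & reference)

    # Sensitivity: fraction of reference joins that were recovered.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Precision: fraction of predicted joins that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # F-score: harmonic mean of sensitivity and precision.
    total = sensitivity + precision
    f_score = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f_score


# Hypothetical example: three true joins, two predicted, one of them correct.
reference = [("c1_tail", "c2_head"), ("c2_tail", "c3_head"), ("c3_tail", "c4_head")]
predicted = [("c2_head", "c1_tail"), ("c2_tail", "c4_head")]
sensitivity, precision, f_score = scaffold_accuracy(predicted, reference)
print(f"sensitivity={sensitivity:.3f} precision={precision:.3f} F={f_score:.3f}")
```

Reporting the F-score alongside the two component measures keeps a scaffolder from looking good merely by predicting very few joins (high precision, low sensitivity) or very many (the reverse).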
Multi-CSAR takes a target draft genome and at least one reference genome as input. The target draft genome must be in multi-FASTA format, and each reference genome is in either FASTA format (complete genome) or multi-FASTA format (incomplete genome). Users can select either ‘NUCmer on nucleotides’ or ‘PROmer on translated amino acids’ on the web server to identify markers between the target draft genome and each reference genome. Before running Multi-CSAR, users can also choose whether to use the weighting scheme. On the output page, the web server displays the scaffolding result in two graphical modes (dotplot and Circos), allowing users to visually validate the resulting scaffolds. In addition to the graphical modes, the web server provides a tabular mode that shows the details of the scaffolds.
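The distinction between the two input formats can be illustrated with a short sketch that counts the header lines (those starting with ">") in FASTA text. This is only an assumption about how one might pre-check an upload; the function names are hypothetical and do not describe how the Multi-CSAR web server validates its input.

```python
def count_fasta_records(text):
    """Count sequence records, i.e. header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


def classify_fasta(text):
    """Return 'FASTA' for a single record and 'multi-FASTA' for several.

    Raises ValueError when no header line is present, since such text is
    not valid FASTA input.
    """
    records = count_fasta_records(text)
    if records == 0:
        raise ValueError("no FASTA header line found")
    return "FASTA" if records == 1 else "multi-FASTA"


# Hypothetical inputs: a one-sequence reference and a two-contig draft.
complete_reference = ">chromosome\nACGTACGTAC\nGGTTAACC\n"
draft_genome = ">contig_1\nACGTAC\n>contig_2\nTTGGCC\n"

print(classify_fasta(complete_reference))
print(classify_fasta(draft_genome))
```

A record count is only a rough proxy for completeness: a finished genome with several chromosomes or plasmids is also stored as multi-FASTA, so a real check would need more information than the file format alone.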

    Chinese Abstract
    Abstract
    Acknowledgement
    Contents
    List of figures
    List of tables
    Chapter 1 Introduction
    Chapter 2 Method
      2.1 Overview of Multi-CSAR
        2.1.1 Contig adjacency graph
        2.1.2 Maximum weighted perfect matching
        2.1.3 Sequence identity-based weighting scheme
      2.2 Order and sequence identity-based weighting scheme
        2.2.1 Sequence identity-based weight
        2.2.2 Order identity-based weight
        2.2.3 Order and sequence identity-based weighting scheme
    Chapter 3 Experiment Results and Discussion
      3.1 Quality Metrics
      3.2 Experiments of Real Datasets
        3.2.1 Scaffolding results of Multi-CSAR using NUCmer
        3.2.2 Scaffolding results of Multi-CSAR using PROmer
        3.3.3 Discussion
    Chapter 4 Web Server of Multi-CSAR
      4.1 Web input interface
      4.2 Input data & parameters
      4.3 Dotplot validation
      4.4 Circos validation
      4.5 Scaffold of target
    Chapter 5 Conclusion
    References

    1. Richter DC, Schuster SC, Huson DH. OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007; 23:1573–9.
    2. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics. 2009; 25:2071–3.
    3. Husemann P, Stoye J. r2cat: synteny plots and comparative assembly. Bioinformatics. 2010; 26:570–1.
    4. Lu CL, Chen KT, Huang SY, Chiu HT. CAR: contig assembly of prokaryotic draft genomes using rearrangements. BMC Bioinformatics. 2014; 15:381.
    5. Chen KT, Liu CL, Huang SH, Shen HT, Shieh YK, Chiu HT, et al. CSAR: a contig scaffolding tool using algebraic rearrangements. Bioinformatics. 2018; 34:109–11.
    6. Li CL, Chen KT, Lu CL. Assembling contigs in draft genomes using reversals and block-interchanges. BMC Bioinformatics. 2013; 14(Suppl 5):S9.
    7. Lu CL. An efficient algorithm for the contig ordering problem under algebraic rearrangement distance. Journal of Computational Biology. 2015; 22:975–87.
    8. Kolmogorov M, Raney B, Paten B, Pham S. Ragout: a reference-assisted assembly tool for bacterial genomes. Bioinformatics. 2014; 30:i302–9.
    9. Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lio P, et al. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015; 31:2443–51.
    10. Chen KT, Chen CJ, Shen HT, Liu CL, Huang SH, Lu CL. Multi-CAR: a tool of contig scaffolding using multiple references. BMC Bioinformatics. 2016; 17:469.
    11. Chen KT, Shen HT, Lu CL. Multi-CSAR: a multiple reference-based contig scaffolder using algebraic rearrangements. BMC Systems Biology. 2018; 12(Suppl 9):139.
    12. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biology. 2004; 5:R12.
    13. Kolmogorov V. Blossom V: a new implementation of a minimum cost perfect matching algorithm. Mathematical Programming Computation. 2009; 1:43–67.
    14. Feijão P, Meidanis J. Extending the algebraic formalism for genome rearrangements to include linear chromosomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2013; 10:819–31.
    15. Dias Z, Dias U, Setubal JC. SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics. 2012; 13:96.
    16. Chen KT, Lu CL. CSAR-web: a web server of contig scaffolding using algebraic rearrangements. Nucleic Acids Research. 2018; 46(W1):W55–W59.
