
Author: Gan, Ren-Fang (甘人方)
Thesis title: Compression algorithms of the FM-index variant for just-in-time alignment of DNA sequences (適用於FM-index變體即時比對之DNA序列的壓縮演算法研究)
Advisor: Shih, Wei-Kuan (石維寬)
Oral examination committee: Hsu, Tsan-sheng (徐讚昇); Chang, Yuan-Hao (張原豪); Wei, Hsin-Wen (衛信文)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar, 2019–2020)
Language: English
Number of pages: 26
Keywords: DNA sequencing, FM-index, Run Length Encoding, Huffman coding, genomic data compression
    The FM-index, a data structure based on the Burrows–Wheeler transform, is widely used for aligning reads against DNA sequences. When a reference sequence for a genome exists, the reference is first Burrows–Wheeler transformed, after which the FM-index can efficiently align reads to it. However, the FM-index requires extra space to store the auxiliary information used for alignment. This space is proportional to the size of the reference sequence, and since most genomes consist of tens of millions or even hundreds of millions of nucleobases, the storage requirement becomes a major bottleneck.
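
To make the idea concrete, the following is a minimal Python sketch of a Burrows–Wheeler transform and FM-index backward search that counts exact occurrences of a read in a reference. It is an illustration only, not the thesis's implementation: the function names (`bwt`, `build_fm_index`, `count_matches`) are hypothetical, the suffix array is built naively, and the full `occ` table it stores is exactly the auxiliary information whose size grows with the reference.

```python
def bwt(text):
    """Return the BWT of text + '$' and its suffix array (naive construction)."""
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol of a suffix is the character that precedes it (cyclically).
    return "".join(s[i - 1] for i in sa), sa


def build_fm_index(bwt_str):
    """Build C (number of symbols smaller than ch) and the full Occ table."""
    alphabet = sorted(set(bwt_str))
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += bwt_str.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt_str[:i]
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in alphabet}
    for i, sym in enumerate(bwt_str):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (sym == ch)
    return c, occ


def count_matches(pattern, c, occ, n):
    """Backward search: number of exact occurrences of pattern in the text."""
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


reference = "ACGTACGT"
bwt_str, _ = bwt(reference)
c, occ = build_fm_index(bwt_str)
print(count_matches("ACG", c, occ, len(bwt_str)))  # prints 2
```

Each pattern character costs two `occ` lookups regardless of the reference length, which is why the full table is fast but memory-hungry.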

    Ferragina and Manzini described a variant implementation of the FM-index that reduces the required space by storing only part of the auxiliary information together with the compressed transformed reference sequence. However, aligning a single read then requires several decompression and calculation steps, which significantly increases the computational cost.
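
The space/time trade-off of storing only part of the auxiliary information can be sketched as follows. This is a simplified illustration under assumptions, not Ferragina and Manzini's actual bucket layout: the class name `SampledOcc` and the `step` parameter are hypothetical, and the transformed sequence is kept uncompressed here, whereas the variant would also have to decompress the scanned block.

```python
class SampledOcc:
    """Occurrence counts that keep one checkpoint every `step` BWT positions."""

    def __init__(self, bwt_str, step=4):
        self.bwt = bwt_str
        self.step = step
        self.checkpoints = []
        running = {ch: 0 for ch in set(bwt_str)}
        for i, sym in enumerate(bwt_str):
            if i % step == 0:
                # Snapshot of the counts for bwt_str[:i]
                self.checkpoints.append(dict(running))
            running[sym] += 1

    def occ(self, ch, i):
        """Number of ch in bwt[:i], rebuilt from the nearest checkpoint at or below i."""
        block = min(i // self.step, len(self.checkpoints) - 1)
        base = block * self.step
        # Extra scanning work replaces the table rows that were not stored.
        return self.checkpoints[block].get(ch, 0) + self.bwt.count(ch, base, i)


sampled = SampledOcc("TT$AACCGG", step=4)
print(sampled.occ("C", 7))
```

A larger `step` stores fewer checkpoints but scans more symbols per query, which mirrors the increase in computational cost described above.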

    For this reason, we propose two compression algorithms for DNA sequences and implement them on the variant implementation of the FM-index. Both algorithms achieve a good compression ratio and, furthermore, allow sequence alignment to be performed directly on the compressed sequence, which eliminates the decompression steps and thereby reduces the computation time. Beyond DNA sequences, we also applied the algorithms to books and performed string matching against them; the results show that the algorithms likewise achieve a high compression ratio and fast matching.
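
As a rough illustration of querying a compressed sequence without decompressing it, the sketch below uses plain run-length encoding. It is not the modified RLE or the modified Huffman coding proposed in the thesis, whose details this record does not give; the function names are hypothetical, and the occurrence count is a simple linear scan over the runs.

```python
from itertools import groupby


def rle_encode(seq):
    """Encode a sequence as a list of (symbol, run length) pairs."""
    return [(sym, len(list(group))) for sym, group in groupby(seq)]


def rle_decode(runs):
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(sym * length for sym, length in runs)


def occ_on_runs(runs, ch, i):
    """Count ch among the first i symbols, working on the runs directly."""
    count, pos = 0, 0
    for sym, length in runs:
        if pos >= i:
            break
        take = min(length, i - pos)  # part of this run that lies before position i
        if sym == ch:
            count += take
        pos += length
    return count


runs = rle_encode("AAAACCGGGGGT")
print(runs)
print(occ_on_runs(runs, "G", 8))
```

Because the count is accumulated run by run, the sequence never has to be expanded; a practical index would combine this with checkpoints so that a query does not scan from the first run.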

    Chapter 1. Introduction
    Chapter 2. Background and Motivation
        2.1 Burrows–Wheeler transform
        2.2 FM-index
        2.3 Variant implementation of the FM-index
        2.4 Motivation
    Chapter 3. Proposed Compression Algorithms
        3.1 Overview
        3.2 Modified Run-Length Encoding
        3.3 Modified Huffman coding with RLE
    Chapter 4. Experimental Studies
        4.1 Experimental Setup
        4.2 Detail Result
        4.3 Other Applications
    Chapter 5. Conclusion
    References

    [1]. J. Besser and H. A. Carleton. Next-Generation Sequencing Technologies and their Application to the Study and Control of Bacterial Infections. Clin Microbiol Infect, 24(4): 335–341, April 2018.
    [2]. M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. DEC SRC Research Report 124, 1994.
    [3]. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. FOCS’00, pp. 390–398, 2000.
    [4]. P. Ferragina and G. Manzini. An experimental study of an opportunistic index. In Proc. SODA '01, pp. 269–278, January 2001.
    [5]. B. Langmead, C. Trapnell, M. Pop and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, R25, March 2009.
    [6]. H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, Volume 25, Issue 14, Pages 1754–1760, July 2009.
    [7]. H. Li and N. Homer. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, Volume 11, Issue 5, Pages 473–483, September 2010.
    [8]. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov.
    [9]. J. Shendure, S. Balasubramanian and G. M. Church. DNA sequencing at 40: past, present and future. Nature, Volume 550, Pages 345–353, October 2017.
