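Author: Wu, Bor-Ru (吳柏儒)
Thesis title: Compressed Index for Approximate String Matching: From Theory to Practice
Thesis title (Chinese): 空間壓縮下搜尋近似字串的索引:從理論到實作
Advisor: Hon, Wing-Kai (韓永楷)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2009
Academic year of graduation: 97
Language: English
Number of pages: 56
Keywords (Chinese): 字尾陣列 (suffix array), 反向字尾陣列 (inverse suffix array), 近似字串比對 (approximate string matching)
Keywords (English): approximate string matching, inverse suffix array, suffix array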
  • Abstract
    Let T be a text string of length n and P be a pattern string of length
    m, where the characters of both strings are drawn from a fixed finite
    alphabet A. The k-difference approximate matching problem is to find the
    occurrences in T of strings whose edit distance from P is at most k. In
    this thesis, we propose an index for T such that, for any given P, the
    desired occurrences can be reported efficiently. Our index is based on
    the suffix array and the inverse suffix array, combined with a technique
    called suffix sampling that reduces the index space. The index occupies
    O(n log |A|) bits, and the occurrences can be reported in
    O(|A| m log2 n + occ log n) time, where occ denotes the number of
    occurrences reported. In addition, we compare this index empirically
    with two existing indexes in terms of practical performance. Our results
    demonstrate that our index is the best choice in many different
    situations.
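
    To make the problem statement concrete, the sketch below solves the
    k-difference problem without any index, using the classical dynamic
    programming approach (often attributed to Sellers). It is not the index
    proposed in the thesis; it only illustrates what "occurrences within
    edit distance k" means. The function name and the convention of
    reporting 0-based end positions are assumptions made for this example.

```python
def k_difference_end_positions(text, pattern, k):
    """Return the 0-based positions j in text such that some substring of
    text ending at j has edit distance at most k from pattern.

    Index-free baseline running in O(len(text) * len(pattern)) time.
    """
    m = len(pattern)
    # prev[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending just before the current character.
    # Before reading any text, pattern[:i] needs i edits.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra text character
                cur[i - 1] + 1,      # missing pattern character
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


if __name__ == "__main__":
    # "bcd" (ending at index 3) is one substitution away from "bxd".
    print(k_difference_end_positions("abcdefg", "bxd", 1))
```

    An index such as the one described above aims to avoid this full scan of
    T for every query, at the cost of preprocessing T once.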


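    Abstract (translated from Chinese)
    Suppose we are given a text string T of length n and a pattern string P
    of length m, where the characters of both are drawn from a fixed
    alphabet A. In the k-difference approximate matching problem, we want to
    find the positions in T where P occurs with edit distance at most k. In
    this thesis, we propose a new index with which this problem can be
    solved efficiently for any strings T and P. Our index builds on the
    suffix array and the inverse suffix array, combined with a new technique
    called suffix sampling that compresses the space. The index uses
    O(n log |A|) bits of space and answers a query in
    O(|A| m log n + occ log n) time, where occ is the number of occurrences
    of P in T. In addition, we compare our index with two previously
    proposed indexes to see how they perform in practice. The experiments
    show that our index is the best choice in many different situations.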
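
    The building blocks named in the abstract can be sketched in a few
    lines. The code below is a minimal illustration, not the thesis's
    implementation: it builds a suffix array by plain sorting, inverts it
    to obtain the inverse suffix array, and shows the simplest possible
    form of sampling (keeping only suffixes that start at multiples of a
    step). All function names and the `step` parameter are assumptions made
    for this example, and the sampling scheme actually used in the thesis
    may differ.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting; fine for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_inverse_suffix_array(sa):
    """isa[p] = rank of the suffix starting at position p, so isa[sa[r]] == r."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def sample_suffix_array(sa, step):
    """Keep only suffixes starting at multiples of step, in sorted order.

    Illustrative sampling only: the result stores about len(sa) / step entries.
    """
    return [pos for pos in sa if pos % step == 0]


def find_exact(text, sa, pattern):
    """All starting positions of pattern in text, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa, build_inverse_suffix_array(sa), sample_suffix_array(sa, 2))
    print(find_exact(text, sa, "ana"))
```

    Storing only a sample of the entries is what trades query time for
    space: positions that are not stored have to be recovered with extra
    work at query time.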

    Contents
    1 Introduction
    2 Preliminaries
      2.1 Problem Definition
        2.1.1 Edit Distance, Hamming Distance, Edit Operations
        2.1.2 Our Problem
      2.2 Suffix Array, Inverse Suffix Array, Ψ Function
      2.3 Suffix Sampling and Geometric BWT
    3 Approximate Matching Using Suffix Array
    4 Approximate Matching Using Sparse Suffix Array
    5 Experimental Results
      5.1 Experiment Setup
      5.2 CSA Time-Space Tradeoff
      5.3 Sparse SA
        5.3.1 Sparse SA Time-Space Tradeoff
        5.3.2 Varying Pattern Length
        5.3.3 Varying Number of Patterns
        5.3.4 Varying Alphabet Size
      5.4 Equal Space
        5.4.1 Varying Pattern Length (with α=3)
        5.4.2 Varying Number of Patterns (with α=3)
        5.4.3 Varying Pattern Length (with α=8)
        5.4.4 Varying Number of Patterns (with α=8)
        5.4.5 Varying Text Length
        5.4.6 Varying Alphabet Size
      5.5 Fixed Compression Rate for Sparse SA versus CSA
        5.5.1 Varying Text Length (with α=3)
        5.5.2 Varying Text Length (with α=8)
        5.5.3 Varying Alphabet Size (with α=3)
        5.5.4 Varying Alphabet Size (with α=8)
      5.6 Time-Space Tradeoff for All Three Algorithms
        5.6.1 Varying Alphabet Size (alphabet=2)
        5.6.2 Varying Alphabet Size (alphabet=4)
        5.6.3 Varying Alphabet Size (alphabet=20)
        5.6.4 Varying Alphabet Size (alphabet=26)
        5.6.5 Varying Text Length (text length=2000)
        5.6.6 Varying Text Length (text length=6000)
        5.6.7 Varying Text Length (text length=10000)
    6 Conclusions
    7 Future Work
    Bibliography

    Bibliography
    [1] A. Amir, D. Keselman, G. M. Landau, M. Lewenstein, N. Lewenstein,
    and M. Rodeh. Text Indexing and Dictionary Matching With One
    Error. volume 37, pages 309–325, 2000.
    [2] R. Baeza-Yates and G. Navarro. A Practical q-Gram Index for Text
    Retrieval Allowing Errors. CLEI Electronic Journal, 1(2), 1998.
    [3] A. L. Buchsbaum, M. T. Goodrich, and J. R. Westbrook. Range
    Searching Over Tree Cross Products. In Proceedings of European Symposium
    on Algorithms, pages 120–131, 2000.
    [4] Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Geometric
    Burrows-Wheeler Transform: Linking Range Searching and Text Indexing.
    In Proceedings of Data Compression Conference, pages 252–
    261, 2008.
    [5] A. L. Cobbs. Fast Approximate Matching Using Suffix Trees. In
    Proceedings of Symposium on Combinatorial Pattern Matching, pages
    41–54, 1995.
    [6] R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary Matching and
    Indexing With Errors and Don’t Cares. In Proceedings of Symposium
    on Theory of Computing, pages 91–100, 2004.
    [7] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-
    Compressed Text Indexes. In Proceedings of Symposium on Discrete
    Algorithms, pages 841–850, 2003.
    [8] R. Grossi and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees
    with Applications to Text Indexing and String Matching. SIAM Journal
    on Computing, 35(2):378–407, 2005.
    [9] P. Jokinen and E. Ukkonen. Two Algorithms for Approximate String
    Matching in Static Texts. In Proceedings of International Symposium
    on Mathematical Foundations of Computer Science, pages 240–248,
    1991.
    [10] T. W. Lam, W. K. Sung, and S. S. Wong. Improved Approximate
    String Matching Using Compressed Suffix Data Structures. Algorithmica,
    51(3):298–314, 2008.
    [11] G. Navarro and R. Baeza-Yates. A Hybrid Indexing Method for Approximate
    String Matching.
    [12] G. Navarro and R. A. Baeza-Yates. A New Indexing Method for Approximate
    String Matching. In Proceedings of Symposium on Combinatorial
    Pattern Matching, pages 163–185, 1999.
    [13] G. Navarro, E. Sutinen, and J. Tanninen. Indexing Text With Approximate
    q-Grams. In Proceedings of Symposium on Combinatorial
    Pattern Matching, pages 350–365, 2000.
    [14] K. Sadakane. Compressed Suffix Trees with Full Functionality. Theory
    of Computing Systems, pages 589–607, 2007.
    [15] F. Shi. Fast Approximate String Matching With q-Blocks Sequences.
    In Proceedings of South American Workshop on String Processing,
    pages 257–271, 1996.
    [16] E. Sutinen and J. Tarhio. Filtration With q-Samples in Approximate
    String Matching. In Proceedings of Symposium on Combinatorial Pattern
    Matching, pages 50–63, 1996.
    [17] H. N. D. Trinh, W. K. Hon, T. W. Lam, and W. K. Sung. Approximate
    String Matching Using Compressed Suffix Arrays. Theoretical
    Computer Science, 352(1–3):240–249, 2006.
    [18] E. Ukkonen. Approximate String Matching Over Suffix Trees. In
    Proceedings of Symposium on Combinatorial Pattern Matching, pages
    228–242, 1993.

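    Full-text availability (on-campus network): not authorized for public release
    Full-text availability (off-campus network): not authorized for public release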
