
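Graduate student: 楊庭碩 (Yang, Ting-Shuo)
Thesis title: Dynamic compressed index for approximate string matching
Thesis title (Chinese): 空間壓縮下搜尋近似字串的動態索引
Advisor: 韓永楷 (Hon, Wing-Kai)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2009
Academic year of graduation: 97
Language: English
Number of pages: 44
Keywords (Chinese): 動態空間壓縮索引近似
Keywords (English): dynamic, compressed index, approximate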
  • Abstract
    Let L = {T1, T2, ..., Tl} be a set of texts of total length n, and let
    P = P0P1...Pm−1 be a pattern of length m, both over a fixed alphabet A. The
    approximate library management problem is to report all occurrences of P in
    L while allowing some degree of error. In addition, a text may be inserted
    into or deleted from L from time to time. This thesis introduces an approach
    to the approximate library management problem, which has not been addressed
    before. Prior to this thesis, the most compact solution combined the index
    of Chan et al. [3] with the algorithm of Huynh et al. [11]. That solution is
    based on the compressed suffix array, which is complex to implement, and it
    requires O(m|A| log^3 n + occ log^3 n) query time. In this thesis I use
    indexes based on suffix sampling, which are simple and easy to implement;
    the query time is O(m|A| log^3 n + occ log^2 n) using O(n log |A|) bits,
    where occ is the number of occurrences. The indexes in this thesis also
    support text insertion and deletion with update time
    O(|T| log |A| log n + log^2 n).
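
    To make the problem statement concrete, the following is a minimal
    brute-force sketch of the approximate library management problem. It is not
    the index described in the thesis: it stores the texts verbatim and scans
    every text on each query with the classic dynamic-programming recurrence for
    approximate matching. The class name NaiveLibrary, its method names, the
    integer text ids, and the choice of "at most k edit operations" as the error
    model are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def approx_end_positions(text: str, pattern: str, k: int) -> List[int]:
    """Return every end index (exclusive) j such that some substring of
    text ending at j is within edit distance k of pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any substring
    # of the text that ends at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


class NaiveLibrary:
    """Hypothetical baseline: a dynamic text collection with no index."""

    def __init__(self) -> None:
        self._texts: Dict[int, str] = {}
        self._next_id = 0

    def insert(self, text: str) -> int:
        """Add a text and return its (assumed) integer id."""
        text_id = self._next_id
        self._texts[text_id] = text
        self._next_id += 1
        return text_id

    def delete(self, text_id: int) -> None:
        """Remove a previously inserted text."""
        del self._texts[text_id]

    def search(self, pattern: str, k: int) -> List[Tuple[int, int]]:
        """Report (text_id, end) for every approximate occurrence."""
        return [(text_id, end)
                for text_id, text in self._texts.items()
                for end in approx_end_positions(text, pattern, k)]
```

    Each query in this sketch costs time proportional to m times the total text
    length n, and the texts are stored uncompressed. A compressed index is meant
    to avoid both costs, while still supporting insert and delete.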


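    Chinese abstract (English translation)
    Let L = {T1, T2, ..., Tl} be a set of strings of total length n, and let
    P = P0P1...Pm−1 be a pattern of length m, with both L and P defined over a
    fixed alphabet A. The approximate library management problem is to find the
    positions at which P occurs in L while allowing some degree of error. This
    thesis poses this dynamic problem, which has not previously received
    attention and in which a string may be added to or deleted from L, and
    provides a solution to it. The earlier result closest to this thesis applies
    the algorithm of Huynh et al. [11] to the index of Chan et al. [3]. That
    solution is based on the compressed suffix array and is therefore relatively
    complex and hard to implement; its search time is
    O(m|A| log^3 n + occ log^3 n). The index used in this thesis is built on a
    new technique, suffix sampling, and is simple and easy to implement; its
    search time is O(m|A| log^3 n + occ log^2 n) using O(n log |A|) bits of
    space, where occ is the number of occurrences of the pattern. The index
    supports insertion and deletion of strings, with update time
    O(|T| log |A| log n + log^2 n).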

    Contents
    1 Introduction
      1.1 The string matching problem
      1.2 The library management problem
      1.3 The approximate library management problem
    2 Preliminaries
      2.1 Edit operation
      2.2 Problem definition
      2.3 Suffix array and Inverse suffix array
      2.4 Ψ Function and Compressed suffix array
      2.5 Suffix tree
      2.6 The suffix sampling
      2.7 Burrows-Wheeler transform and Wavelet tree
      2.8 Fstart and Fend
    3 Review: Static Index using “Suffix sampling”
      3.1 Overall
      3.2 Case 1: error occurs in preceding characters of MPalign
      3.3 Case 2: error occurs in MPalign
      3.4 Case 3: pattern length less than d
      3.5 The Complexity of Time and Space
    4 Dynamizing the index
      4.1 Dynamic sparse suffix tree
      4.2 Dynamic wavelet tree
    5 Conclusions and Future Work
    Bibliography

    Bibliography
    [1] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string
    matching with k mismatches. In Proceedings of the 11th Symposium on
    Discrete Algorithms, pages 794–803, 2000.
    [2] R. S. Boyer and J. S. Moore. A Fast String Searching Algorithm.
    Communications of the ACM, 20(10):762–772, 1977.
    [3] H. L. Chan, W. K. Hon, T. W. Lam, and K. Sadakane. Compressed Indexes
    for Dynamic Text Collections. ACM Transactions on Algorithms, volume 3,
    2007.
    [4] A. L. Cobbs. Fast approximate matching using suffix trees. In Proceedings
    of the 6th Annual Symposium on Combinatorial Pattern Matching, number
    937 in Lecture Notes in Computer Science, pages 41–54, 1995.
    [5] R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary Matching and
    Indexing with Errors and Don’t Cares. In Proceedings of Symposium on
    Theory of Computing, pages 91–100, 2004.
    [6] R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with
    applications to text indexing and string matching. In Proceedings ACM
    Symposium on the Theory of Computing, pages 397–406, Portland, Oregon,
    2000.
    [7] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer
    Science and Computational Biology. Cambridge University Press, New York,
    NY, USA, 1997.
    [8] W. K. Hon, K. Sadakane, and W. K. Sung. Breaking a time-and-space
    barrier in constructing full-text indices. In Proceedings of the IEEE
    Symposium on Foundations of Computer Science (FOCS), 2003.
    [9] W.-K. Hon, T.-W. Lam, R. Shah, S.-L. Tam, and J. S. Vitter. Compressed
    Index for Dictionary Matching. In Proceedings of Data Compression
    Conference, pages 23–32, 2008.
    [10] W. K. Hon, R. Shah, and J. S. Vitter. Ordered Pattern Matching:
    Towards Full-Text Retrieval. Technical Report TR-06-008, Department of
    CS, Purdue University, 2006.
    [11] T. N. D. Huynh, W. K. Hon, T. W. Lam, and W. K. Sung. Approximate
    string matching using compressed suffix arrays. In Proceedings of the 15th
    Symposium on Combinatorial Pattern Matching, 2004.
    [12] P. Jokinen and E. Ukkonen. Two algorithms for approximate string
    matching in static texts. In Proceedings of the 16th Symposium on
    Mathematical Foundations of Computer Science, number 520 in Lecture Notes
    in Computer Science, pages 240–248, 1991.
    [13] D. Kim, J. Sim, H. Park, and K. Park. Linear-Time Construction of Suffix
    Arrays. In Proceedings of Symposium on Combinatorial Pattern Matching,
    pages 186–199, 2003.
    [14] D. E. Knuth, J. H. Morris, and V. B. Pratt. Fast Pattern Matching in
    Strings. SIAM Journal on Computing, 6(2):323–350, 1977.
    [15] P. Ko and S. Aluru. Space Efficient Linear Time Construction of Suffix
    Arrays. In Proceedings of Symposium on Combinatorial Pattern Matching,
    pages 200–210, 2003.
    [16] T. W. Lam, W. K. Sung, and S. S. Wong. Improved approximate string
    matching using compressed suffix data structures. Algorithmica,
    51(3):298–314, July 2008.
    [17] G. M. Landau and U. Vishkin. Fast parallel and serial approximate string
    matching. Journal of Algorithms, 10(2):157–169, 1989.
    [18] G. Navarro. A guided tour to approximate string matching. ACM
    Computing Surveys, 33(1):31–88, 2001.
    [19] K. Sadakane and T. Shibuya. Indexing Huge Genome Sequences for Solving
    Various Problems, 2001.
    [20] B. R. Wu. Compressed index for approximate string matching: From thesis
    to experiment. M. Sc. Thesis, Department of Computer Science, National
    Tsing Hua University, Taiwan, 2009.

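    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)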
