Compressed Index for Approximate String Matching: From Theory to Practice


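Graduate student: 吳柏儒 (Wu, Bor-Ru)
Thesis title: Compressed Index for Approximate String Matching: From Theory to Practice (空間壓縮下搜尋近似字串的索引：從理論到實作)
Advisor: 韓永楷 (Hon, Wing-Kai)
Oral examination committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2009
Academic year of graduation: 97
Language: English
Number of pages: 56
Chinese keywords (translated): suffix array, inverse suffix array, approximate string matching
English keywords: approximate string matching, inverse suffix array, suffix array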

Abstract
Let T be a text string of length n and P be a pattern string of length
m, where the characters of both strings are drawn from a fixed finite
alphabet A. The k-difference approximate matching problem is to find all
occurrences of P in T whose edit distance from P is at most k. In this
thesis, we propose an index for T such that, for any given P, the desired
occurrences can be reported efficiently. Our index is based on the suffix
array and the inverse suffix array, combined with a technique called
suffix sampling that reduces the index space. The index occupies
O(n log |A|) bits, and the occurrences can be reported in
O(|A| m log^2 n + occ log n) time, where occ denotes the number of
occurrences reported. In addition, we empirically compare this index with
two existing indexes in terms of practical performance. Our results show
that our index is the best choice in many different situations.
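
To make the problem statement concrete, here is a minimal sketch of the
k-difference problem solved by a plain dynamic program over the whole text
(often attributed to Sellers). It is not the index proposed in the thesis:
it scans T in O(nm) time and serves only as a reference definition of what
an index must report. The function name k_difference_ends and the choice to
report 1-based end positions are assumptions made for this illustration.

```python
def k_difference_ends(text, pattern, k):
    """Return the 1-based end positions j in text such that some substring
    of text ending at j has edit distance at most k from pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any substring
    # of text ending at the current text position (empty substring allowed).
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value of row i-1 in the previous column
        col[0] = 0           # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_difference_ends("abxcd", "abcd", 1))
```

An index such as the one described above aims to report the same
occurrences without scanning all of T for every query.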

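Abstract (translated from the Chinese)
Suppose we have a text string T of length n and a pattern string P of
length m, where the characters of both are chosen from a fixed alphabet A.
In the k-difference approximate matching problem, we want to find the
positions in T where P occurs with edit distance at most k; that is, the
places in T whose edit distance from P is at most k. In this thesis, we
propose a new indexing method that solves this problem efficiently for any
strings T and P. Our method builds on the basic concepts of the suffix
array and the inverse suffix array, combined with a new technique called
suffix sampling to reduce space. The index uses O(n log |A|) bits of
space, and its time complexity is O(|A| m log n + occ log n), where occ is
the number of occurrences of P in T. In addition, we compare two
previously proposed indexing methods with ours to see how they perform in
practice. The experimental results show that our indexing method is the
best choice in many different situations.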
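
The three building blocks named in the abstract can be illustrated with a
small sketch. It assumes a naive construction by sorting suffixes, and it
assumes one common sampling rule (keep only suffixes whose starting
position is a multiple of a step); the thesis's actual sampling scheme and
construction may differ. The function names suffix_array,
inverse_suffix_array, and sample_suffixes are hypothetical.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order.
    Naive construction by sorting; intended for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """isa[pos] = rank of the suffix starting at pos, so isa[sa[r]] == r."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def sample_suffixes(sa, step):
    """Keep only (rank, position) pairs whose position is a multiple of
    step. This is an assumed sampling rule, not necessarily the thesis's."""
    return [(rank, pos) for rank, pos in enumerate(sa) if pos % step == 0]


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa)                        # [5, 3, 1, 0, 4, 2]
    print(inverse_suffix_array(sa))  # [3, 2, 5, 1, 4, 0]
    print(sample_suffixes(sa, 2))    # [(3, 0), (4, 4), (5, 2)]
```

Storing only every step-th suffix shrinks the array by roughly a factor of
step, which is the kind of space saving that sampling is meant to provide.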

Contents
Introduction
Preliminaries
1 Problem Definition
1.1 Edit Distance, Hamming Distance, Edit Operations
1.2 Our Problem
2 Suffix Array, Inverse Suffix Array, Ψ Function
3 Suffix Sampling and Geometric BWT
Approximate Matching Using Suffix Array
Approximate Matching Using Sparse Suffix Array
Experimental Results
1 Experiment Setup
2 CSA Time-Space Tradeoff
3 Sparse SA
3.1 Sparse SA Time-Space Tradeoff
3.2 Varying Pattern Length
3.3 Varying Number of Patterns
3.4 Varying Alphabet Size
4 Equal Space
4.1 Varying Pattern Length (with α=3)
4.2 Varying Number of Patterns (with α=3)
4.3 Varying Pattern Length (with α=8)
4.4 Varying Number of Patterns (with α=8)
4.5 Varying Text Length
4.6 Varying Alphabet Size
5 Fixed Compression Rate for Sparse SA versus CSA
5.1 Varying Text Length (with α=3)
5.2 Varying Text Length (with α=8)
5.3 Varying Alphabet Size (with α=3)
5.4 Varying Alphabet Size (with α=8)
6 Time-Space Tradeoff for All Three Algorithms
6.1 Varying Alphabet Size (alphabet=2)
6.2 Varying Alphabet Size (alphabet=4)
6.3 Varying Alphabet Size (alphabet=20)
6.4 Varying Alphabet Size (alphabet=26)
6.5 Varying Text Length (text length=2000)
6.6 Varying Text Length (text length=6000)
6.7 Varying Text Length (text length=10000)
Conclusions
Future Work
Bibliography

                                


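Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)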
