
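Graduate student: 古宗翰 (Ku, Tsung-Han)
Thesis title: 生物資訊應用上的省空間資料結構 (Space-Efficient Indexes for Some New Bioinformatics Applications)
Advisor: 韓永楷 (Hon, Wing-Kai)
Oral defense committee: 唐傳義 (Tang, Chuan Yi); 王有禮 (Wang, Yue-Li); 盧錦隆 (Lu, Chin-Lung); 謝孫源 (Hsieh, Sun-Yuan); 姚兆明 (Yiu, Siu-Ming)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2014
Academic year of graduation: 102
Language: English
Number of pages: 93
Keywords: dictionary matching, data structures, string matching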
  • This thesis mainly studies the dictionary matching problem and related problems, namely the approximate dictionary matching problem, the circular dictionary matching problem, and the text indexing with wildcards problem.

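    Given a set $\mathcal{D}$ of $d$ strings of total length $n$, the goal of the dictionary matching problem is to build a data structure for $\mathcal{D}$ such that, when it is queried with a long text $T$, it quickly reports which strings of $\mathcal{D}$ occur in $T$. The problem can be solved with the well-known Aho-Corasick automaton, but the space used by that automaton is still some way from the minimum space required.
    In 2008, Hon et al. designed a compressed-space data structure for dictionary matching. It needs only $nH_k(\mathcal{D})+o(n\log\sigma)$ bits and answers a query text $T$ in $O(|T|(\log^{\epsilon} n+ \log d)+ occ)$ time, where $\epsilon>0$. Later, in 2010, Belazzougui designed another space-efficient data structure, which needs only $n\log\sigma +O(n)$ bits and answers a query text $T$ in optimal time.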

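    In this thesis we design a data structure that needs only $nH_k(\mathcal{D})+O(n)$ bits while still answering a query text $T$ in optimal time. Our approach uses XBW compression to further reduce the space required by Belazzougui's data structure.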

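    For the approximate dictionary matching problem, we consider the case in which one error is allowed: if a substring of $T$ is within edit distance one of some string $P$ in $\mathcal{D}$, the data structure must report that $P$ occurs in $T$. The best-known data structure for this problem is due to Cole et al. (2004); it needs $O(n+d\log d)$ words and answers a query text $T$ in $O(|T|\log{d}\log{\log{d}}+occ)$ time. Another is due to Ferragina et al. (1999); it needs $O(n^{1+\epsilon})$ words and answers a query text $T$ in $O(|T|\log\log n + occ)$ time. So far, however, no compressed-space data structure has been designed for this problem. In this thesis we present the first compressed-space data structure for approximate dictionary matching with one error.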

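    A circular string is one whose every cyclic shift is also a valid string. Such strings are receiving growing attention in bioinformatics and computational geometry. In this thesis we design a succinct data structure for the circular dictionary matching problem. It needs $n\log\sigma(1+o(1))+O(n)+O(d\log n)$ bits and is built on the Burrows-Wheeler Transform.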

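    Sometimes a long text $T$ contains wildcard symbols. The goal of the text indexing with wildcards problem is to build a data structure for such a text $T$ so that, when it is queried with a shorter string $P$, it quickly answers whether $P$ occurs in $T$. The problem arises for chromosome sequences that carry single-nucleotide polymorphisms (SNPs), because an SNP can be modeled by a wildcard. Tam et al. (2009) and Thachuk (2011) each proposed succinct data structures for text indexing with wildcards. In this thesis we show how dictionary matching helps solve this problem, and we present the first data structure for it that needs only compressed space, namely $nH_h +o(n\log \sigma)+O(d\log n)$ bits.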
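    The compressed-space bounds in this abstract are stated in terms of empirical entropy. As a rough illustration only, the sketch below computes the standard zeroth-order and $k$th-order empirical entropy of a single string in bits per character. The function names `h0` and `hk` are assumptions made for this example; the thesis's indexes do not compute entropy this way, and the entropy of a pattern set $\mathcal{D}$ may be defined differently there.

```python
import math
from collections import Counter, defaultdict


def h0(symbols):
    """Zeroth-order empirical entropy (bits per symbol) of a sequence."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum(c / n * math.log2(n / c) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy (bits per symbol) of a string.

    Each symbol is charged the zeroth-order entropy of the group of
    symbols that share the same k preceding characters (its context).
    """
    if k == 0 or len(text) == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(group) * h0(group) for group in followers.values()) / len(text)


if __name__ == "__main__":
    print(h0("aabb"))          # two equally likely symbols
    print(hk("abababab", 1))   # each symbol is determined by its predecessor
```

    A text that is highly predictable from its preceding characters has a small $H_k$, which is why an $nH_k$-bit index can be much smaller than an $n\log\sigma$-bit one.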


    This thesis studies the dictionary matching problem and two of its variations, the approximate dictionary matching problem and the circular dictionary matching problem, together with an application, the text indexing with wildcards problem.

    Given a set $\mathcal{D}$ of $d$ patterns of total length $n$, the dictionary matching problem is to index $\mathcal{D}$ such that, for any query text $T$, we can efficiently locate the occurrences of every pattern within $T$. The classical Aho-Corasick automaton solves this problem in optimal $O(|T|+occ)$ time, where $occ$ denotes the number of occurrences. Its space requirement is $O(n)$ words, which is far from optimal (i.e., succinct or compressed space). When $\mathcal{D}$ contains a total of $n$ characters drawn from an alphabet $\Sigma$ of size $\sigma$, Hon et al. (2008) gave an $nH_k(\mathcal{D})+o(n\log\sigma)$-bit index that supports a query in $O(|T|(\log^{\epsilon} n+ \log d)+ occ)$ time, where $\epsilon >0$ and $H_k(\mathcal{D})$ denotes the $k$th-order empirical entropy of $\mathcal{D}$. More recently, Belazzougui (2010) proposed an elegant scheme that takes $n\log\sigma +O(n)$ bits of index space and supports a query in optimal time. In this thesis, we connect Belazzougui's index with the XBW compression of Ferragina and Manzini (2005), and show that Belazzougui's index can be slightly modified to be stored in $nH_k(\mathcal{D})+O(n)$ bits while the query time remains optimal; this improves the compressed index of Hon et al. (2008) in both space and time.
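    To make the problem statement concrete, here is a minimal sketch of the classical Aho-Corasick automaton, the uncompressed baseline mentioned in this paragraph. It stores the trie in Python dictionaries, so it illustrates the $O(n)$-word approach and not the compressed indexes of the thesis. The function names `build_automaton` and `find_occurrences` are assumptions chosen for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure, and output tables of an Aho-Corasick automaton."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognised at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first pass: set failure links and inherit outputs from them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_occurrences(patterns, text):
    """Return (start position, pattern) for every occurrence of a pattern in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits


if __name__ == "__main__":
    print(sorted(find_occurrences(["he", "she", "his", "hers"], "ushers")))
```

    The scan follows one goto or failure transition chain per text character, which gives the $O(|T|+occ)$ query bound (up to the cost of dictionary lookups in this sketch).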

    For the approximate dictionary matching problem, we consider the one-error case rather than the general $k$-error case, where $k$ is a constant greater than 0. In the one-error case, a substring $T[i..j]$ of $T$ is an occurrence of a pattern $P$ whenever the edit distance between $T[i..j]$ and $P$ is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires $O(n+ d\log{d})$ words of space and reports all occurrences in $O(|T|\log{d}\log{\log{d}}+occ)$ time, and by Ferragina et al. (1999), which requires $O(n^{1+\epsilon})$ words of space and reports all occurrences in $O(|T|\log\log n + occ)$ time. Although the dictionary matching index has been compressed successfully while keeping the query time optimal (as described above), a compressed index for the approximate dictionary matching problem has remained open. In this thesis, we propose the first such index, which requires an optimal $nH_k+O(n)+o(n\log\sigma)$ bits of index space. The query time of our index is $O(\sigma |T|\log^3{n}\log{\log{n}}+occ)$.
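    The one-error occurrence condition can be illustrated with a small brute-force sketch. It checks every candidate substring directly, so it only demonstrates the definition and has none of the space or time guarantees of the indexes discussed above. The names `within_one_edit` and `one_error_matches` are assumptions made for this example.

```python
def within_one_edit(a, b):
    """Return True if the edit distance between a and b is at most one."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # zero edits or one substitution
    return a[i:] == b[i + 1:]            # one extra character in b


def one_error_matches(patterns, text):
    """Report (i, j, P) whenever text[i:j] is within one edit of pattern P."""
    hits = set()
    for pat in patterns:
        for length in (len(pat) - 1, len(pat), len(pat) + 1):
            if length < 1:
                continue
            for i in range(len(text) - length + 1):
                if within_one_edit(text[i:i + length], pat):
                    hits.add((i, i + length, pat))
    return sorted(hits)


if __name__ == "__main__":
    print(one_error_matches(["ACGT"], "TTACTTT"))
```

    Only substrings whose length differs from $|P|$ by at most one can qualify, which is why the scan tries three window lengths per pattern.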

    Circular patterns are those patterns whose cyclic shifts are also valid patterns. They arise naturally in bioinformatics and computational geometry. In this thesis, we consider succinct indexing schemes for a set of $d$ circular patterns of total length $n$, with each character drawn from an alphabet $\Sigma$ of size $\sigma$. Our succinct index, which needs $n\log\sigma(1+o(1))+O(n)+O(d\log n)$ bits, is based on the popular Burrows-Wheeler transform (BWT) defined on circular patterns, and it supports both dictionary matching and pattern matching efficiently.
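    As a hedged illustration of what matching a circular pattern means, the sketch below reports every window of a text that equals some cyclic shift of a pattern, and also builds the textbook rotation-sorting form of the BWT for one circular string. This is a naive demonstration under stated assumptions, not the succinct index or the circular BWT construction of the thesis; the names `is_rotation`, `circular_occurrences`, and `rotation_bwt` are hypothetical.

```python
def is_rotation(window, pattern):
    """True if window is a cyclic shift of pattern."""
    return len(window) == len(pattern) and window in pattern + pattern


def circular_occurrences(patterns, text):
    """Report (start, P) whenever a window of text is a cyclic shift of P."""
    hits = []
    for pat in patterns:
        m = len(pat)
        if m == 0:
            continue
        for i in range(len(text) - m + 1):
            if is_rotation(text[i:i + m], pat):
                hits.append((i, pat))
    return sorted(hits)


def rotation_bwt(pattern):
    """Last column of the sorted cyclic rotations of a single string."""
    rotations = sorted(pattern[i:] + pattern[:i] for i in range(len(pattern)))
    return "".join(rot[-1] for rot in rotations)


if __name__ == "__main__":
    print(circular_occurrences(["abc"], "xcabx"))
    print(rotation_bwt("banana"), rotation_bwt("ananab"))
```

    Because the transform sorts all rotations, every cyclic shift of a string yields the same output, which is one reason a rotation-based BWT suits circular patterns.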

    Sometimes the text string $T$ contains wildcard characters. Suppose $T=T_1\phi^{k_1}T_2\phi^{k_2}\cdots\phi^{k_d}T_{d+1}$ has total length $n$, where the characters of each $T_i$ are chosen from an alphabet $\Sigma$ and $\phi$ denotes a wildcard symbol. The text indexing with wildcards problem is to index $T$ such that, given a query pattern $P$, we can locate the occurrences of $P$ in $T$ efficiently. This problem has been applied to indexing genomic sequences that contain single-nucleotide polymorphisms (SNPs), because SNPs can be modeled as wildcards. Recently, Tam et al. (2009) and Thachuk (2011) proposed succinct indexes for this problem. In this thesis, we show how to apply an index for the dictionary matching problem to solve this problem, and we present the first compressed index for it, which takes only $nH_h +o(n\log \sigma)+O(d\log n)$ bits of space, where $H_h$ is the $h$th-order empirical entropy of $T$ and $h=o(\log_{\sigma} n)$.
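    The matching rule for a text with wildcards can be sketched with a direct scan: a pattern occurs at position $i$ if every pattern character either equals the aligned text character or is aligned with a wildcard. The code below is a brute-force illustration only, not the compressed index of the thesis. The choice of `*` to stand for $\phi$ and the function names are assumptions made for this example.

```python
WILDCARD = "*"  # stand-in for the wildcard symbol phi (an assumption of this sketch)


def matches_at(text, pattern, i):
    """True if pattern matches text at position i, treating WILDCARD as any character."""
    window = text[i:i + len(pattern)]
    return len(window) == len(pattern) and all(
        t == WILDCARD or t == p for t, p in zip(window, pattern)
    )


def wildcard_occurrences(text, pattern):
    """Return all start positions where pattern occurs in a text with wildcards."""
    return [i for i in range(len(text) - len(pattern) + 1) if matches_at(text, pattern, i)]


def text_segments(text):
    """Split T into its wildcard-free segments T_1, T_2, ... (empty ones dropped)."""
    return [seg for seg in text.split(WILDCARD) if seg]


if __name__ == "__main__":
    print(wildcard_occurrences("AC*TAG", "CGT"))
    print(text_segments("AC*TAG"))
```

    Splitting $T$ into its wildcard-free segments is the step that connects this problem to dictionary matching: the segments can be treated as a set of patterns to be matched against the query.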

    1 Introduction
      1.1 Survey of Dictionary Matching
      1.2 Survey of Circular Dictionary Matching
      1.3 Survey of Text Indexing with Wildcards
    2 Preliminaries
      2.1 A Review of the BWT
      2.2 A Review of the XBW
      2.3 Efficient Storage Scheme for Strings
      2.4 Suffix Trees and Suffix Arrays
      2.5 Centroid Path and Centroid Path Decomposition
      2.6 Sparse Suffix Trees
      2.7 Range Minimum Query
      2.8 Three-Sided Range Query Structures
      2.9 Computation Model
      2.10 Preliminaries of Circular Patterns
        2.10.1 Suffix Array for Circular Patterns
        2.10.2 Properties of Circular Patterns
        2.10.3 Suffix Tree for Circular Patterns
    3 Compressed Dictionary Matching
      3.1 Exact Match Case
        3.1.1 Compressed prefix matching with the XBW
          3.1.1.1 Compressing Belazzougui’s Scheme
        3.1.2 Our Index for Compressed Dictionary Matching
      3.2 One Error Case
        3.2.1 Amir et al.’s Index
          3.2.1.1 Trading Index Space with Query Time
        3.2.2 Compressed Approximate Dictionary Matching with One Error
          3.2.2.1 Handling Long Patterns
          3.2.2.2 Handling Small Patterns
    4 Succinct Indexes for Circular Dictionary Matching
      4.1 Succinct Indexes for Circular Patterns
        4.1.1 Circular Burrows-Wheeler Transform
        4.1.2 Framework for Circular Dictionary Matching
        4.1.3 Succinct Encoding of Circular Suffix Tree
          4.1.3.1 Parentheses Encoding of STc
          4.1.3.2 Height Array
        4.1.4 Circular Dictionary Matching With Our Index
      4.2 Efficient Construction for Circular BWT
        4.2.1 Constructing Ψc for All Short Patterns
        4.2.2 Updating Ψc for Long Patterns
      4.3 Space-Efficient Construction Algorithm for Circular Suffix Tree
        4.3.1 Construction of the Hgt Array
          4.3.1.1 Kasai et al.’s Algorithm
        4.3.2 Construction of the Parentheses Encoding of STc
        4.3.3 Marking the Nodes in STc
      4.4 Handling the General Cases
        4.4.1 Compute Shortest Period for Circular Patterns
        4.4.2 Check Circular Shifts Between Circular Patterns
        4.4.3 Modified Dictionary Query Algorithm
    5 Compressed Text Indexing with Wildcards
      5.1 Our Indexes for The Text Indexing with Wildcards
        5.1.1 Compressed Text Indexes
        5.1.2 Compressed Indexes for Dictionary Matching
        5.1.3 Orthogonal Range Reporting
        5.1.4 Sparse Suffix Trees for Text Segments
      5.2 Matching with Wildcards in Compressed Text
        5.2.1 Type-1 Matching
        5.2.2 Type-2 Matching
        5.2.3 Type-3 Matching
    6 Concluding Remarks and Some Future Work

    [1] A. Aho and M. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, 1975.
    [2] A. Amir, M. Farach, and Y. Matias. Efficient Randomized Dictionary Matching Algorithms (Extended Abstract). In Proceedings of Symposium on Combinatorial Pattern Matching, pages 262–275, 1992.
    [3] A. Amir, D. Keselman, G. M. Landau, M. Lewenstein, N. Lewenstein, and M. Rodeh. Text Indexing and Dictionary Matching With One Error. Journal of Algorithms, 37(2): 309-325, 2000.
    [4] D. Belazzougui. Succinct Dictionary Matching With No Slowdown. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 88–100, 2010.
    [5] M. A. Bender and M. Farach-Colton. The Level Ancestor Problem Simplified. Theoretical Computer Science, 321(1):5–12, 2004.
    [6] M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin. Lowest Common Ancestors in Trees and Directed Acyclic Graphs. Journal of Algorithms, 57(2):75–94, 2005.
    [7] M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, USA, 1994.
    [8] H. L. Chan, W. K. Hon, T. W. Lam, and K. Sadakane. Compressed Indexes for Dynamic Text Collections. ACM Transactions on Algorithms, 3(2), 2007.
    [9] T. Chan, K. G. Larsen, and M. Patrascu. Orthogonal Range Searching on the RAM, revisited. In Proceedings of Symposium on Computational Geometry, pages 1–10, 2011.
    [10] Y. F. Chien, W. K. Hon, R. Shah, and J. S. Vitter. Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. In Proceedings of Data Compression Conference, pages 252–261, 2008.
    [11] R. Cole, L. A. Gottlieb, and M. Lewenstein. Dictionary Matching and Indexing with Errors and Don’t Cares. In Proceedings of Symposium on Theory of Computing, pages 91–100, 2004.
    [12] M. Crochemore and W. Rytter. Text Algorithms, Oxford University Press, New York, 1994.
    [13] J. A. Eisen. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biology, 5(3):e82, 2007.
    [14] P. Elias. Universal Codeword Sets and Representations of the Integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
    [15] P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter. On Searching Compressed String Collections Cache-Obliviously. In Proceedings of Symposium on Principles of Database Systems, pages 181–190, 2008.
    [16] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and Indexing Labeled Trees, with Applications. Journal of the ACM, 57(1), 2009. Article No. 4.
    [17] P. Ferragina, and G. Manzini. Indexing Compressed Text. Journal of the ACM, 52(4):552–581, 2005.
    [18] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms, 3(2), 2007.
    [19] P. Ferragina, S. Muthukrishnan, and M. de Berg. Multi-method Dispatching: A Geometric Approach with Applications to String Matching Problems. In Proceedings of Symposium on Theory of Computing, pages 483–491, 1999.
    [20] P. Ferragina, and R. Venturini. A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science, 372(1): 115–121, 2007.
    [21] P. Ferragina and R. Venturini. The Compressed Permuterm Index. ACM Transactions on Algorithms, 7(1), 2010.
    [22] J. Fischer and V. Heun. A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proceedings of Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pages 459–470, 2007.
    [23] R. Grossi, and J. S. Vitter. Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing, 35(2): 378–407, 2005.
    [24] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In SODA, pages 841–850, 2003.
    [25] A. Gupta, W. K. Hon, R. Shah, and J. S. Vitter. A Framework for Dynamizing Succinct Data Structures. In Proceedings of International Colloquium on Automata, Languages and Programming, pages 521–532, 2007.
    [26] Y. Han. Deterministic Sorting in O(n log log n) Time and Linear Space. In Proceedings of Symposium on Theory of Computing, pages 602–608, 2002.
    [27] W. K. Hon. On the Construction and Application of Compressed Text Indexes. PhD Thesis, Department of Computer Science, University of Hong Kong, 2004.
    [28] W. K. Hon, T. H. Ku, C. H. Lu, R. Shah, and S. V. Thankachan. Efficient Algorithm for Circular Burrows-Wheeler Transform. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 257–268, 2012.
    [29] W. K. Hon, T. H. Ku, R. Shah, S. V. Thankachan, and J. S. Vitter. Faster Compressed Dictionary Matching. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 191–200, 2010.
    [30] W. K. Hon, T. W. Lam, R. Shah, S. L. Tam, and J. S. Vitter. Compressed Index for Dictionary Matching. In Proceedings of Data Compression Conference, pages 23–32, 2008.
    [31] W. K. Hon, T. W. Lam, K. Sadakane, W. K. Sung, and S. M. Yiu. A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays. Algorithmica, 48(1):28–36, 2007.
    [32] W. K. Hon, T. W. Lam, R. Shah, S. L. Tam, and J. S. Vitter. Compressed Index for Dictionary Matching. In Proceedings of Data Compression Conference, pages 23–32, 2008.
    [33] W. K. Hon, C. H. Lu, R. Shah, and S. V. Thankachan. Succinct Indexes for Circular Patterns. In Proceedings of International Symposium on Algorithm and Computation, pages 673–682, 2011.
    [34] W. K. Hon, K. Sadakane, and W. K. Sung. Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. SIAM J. Computing, 38(6):2162–2178, 2009.
    [35] W. K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. On Entropy Compressed Text Indexing in External Memory. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 75–89, 2009.
    [36] W. K. Hon, R. Shah, and J. S. Vitter. Compression, Indexing, and Retrieval for Massive String Data. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 260–274, 2010.
    [37] C. S. Iliopoulos and M. S. Rahman. Indexing Circular Patterns. In Proceedings of Workshop on Algorithms and Computation, pages 46–57, 2008.
    [38] G. Jacobson. Space-Efficient Static Trees and Graphs. In Proceedings of Symposium on Foundations of Computer Science, pages 549–554, 1989.
    [39] J. Kärkkäinen, and E. Ukkonen. Sparse Suffix Trees. In Proceedings of International Computing and Combinatorics Conference, pages 219–230, 1996.
    [40] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In CPM, pages 181–192, 2001.
    [41] D. E. Knuth, J. H. Morris Jr., and V. R. Pratt. Fast Pattern Matching in Strings. SIAM J. Comput., 6(2): 323-350, 1977.
    [42] T. W. Lam, W. K. Sung, S. L. Tam, and S. M. Yiu. Space-Efficient Indexes for String Matching with Don’t Cares. In Proceedings of International Symposium on Algorithms and Computation, pages 846–857, 2007.
    [43] N. J. Larsson and K. Sadakane. Faster suffix sorting. Theoretical Computer Science, 387(3):258–272, 2007.
    [44] U. Manber, and G. Myers. Suffix Arrays: A New Method for On-line String Searches. SIAM Journal on Computing, 22(5): 935–948, 1993.
    [45] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An Extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387(3):298-312, 2007.
    [46] E. M. McCreight. A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2): 262–272, 1976.
    [47] E. M. McCreight. Priority Search Trees. SIAM Journal on Computing, 14(2): 257-276, 1985.
    [48] J. I. Munro and V. Raman. Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing, 31(3):762–776, 2001.
    [49] Y. Nekrich. Orthogonal Range Searching In Linear and Almost-Linear Space. Computational Geometry, 42(4): 342–351, 2009.
    [50] M. H. Overmars. Efficient Data Structures for Range Searching on a Grid. Journal of Algorithms, 9: 254–272, 1988.
    [51] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, pages 233–242, 2002.
    [52] R. Raman, V. Raman, and S. S. Rao. Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. ACM Transactions on Algorithms, 3(4), 2007. Article No. 43.
    [53] K. Sadakane. Compressed Suffix Trees with Full Functionality. Theory of Computing Systems, pages 589–607, 2007.
    [54] K. Sadakane, and G. Navarro. Fully-Functional Succinct Trees. SODA, 134–149, 2010.
    [55] C. Simon and R. Daniel. Metagenomic Analyses: Past and Future Trends. Applied and Environmental Microbiology, 77(4):1153–1161, 2011.
    [56] B. L. Strang and N. D. Stow. Circularization of the Herpes Simplex Virus Type 1 Genome upon Lytic Infection. Journal of Virology, 79(19):12487–12494, 2005.
    [57] A. Tam, E. Wu, T. W. Lam, and S. M. Yiu. Succinct Text Indexing with Wildcards. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 39–50, 2009.
    [58] C. Thachuk. Succincter Text Indexing with Wildcards. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 27–40, 2011.
    [59] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of Symposium on Switching and Automata Theory, pages 1–11, 1973.
    [60] D. E. Willard. Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Information Processing Letters, 17(2): 81–84, 1983.
    [61] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA, USA, 1999.

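    Full-text release (on-campus network): full text not authorized for public release
    Full-text release (off-campus network): full text not authorized for public release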
