Improved Algorithms for the Text Fingerprinting Problem | National Tsing Hua University Theses and Dissertations Repository


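Graduate student:	詹棨元 Chan, Chi-Yuan
Thesis title:	文件特徵化問題的改進演算法 Improved Algorithms for the Text Fingerprinting Problem
Advisor:	王炳豐 Wang, Biing-Feng
Oral examination committee:
Degree:	Doctor
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2009
Academic year of graduation:	98 (ROC calendar)
Language:	English
Number of pages:	72
Keywords (Chinese):	演算法 (algorithms), 字串比對 (string matching), 特徵化 (fingerprinting), 文字索引 (text indexing)
Keywords (English):	algorithms, string matching, fingerprinting, text indexing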

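(Translated from the Chinese abstract.) In this dissertation, we study the text fingerprinting indexing problem introduced by Amir et al. Let S be a string over a finite, ordered alphabet \Sigma. For any substring S' of S, the set of all characters that occur in S' is called its fingerprint. The text fingerprinting indexing problem asks for a data structure built on S in advance so that, given any set C \subseteq \Sigma, the following two queries can be answered efficiently: (1) decide whether C is the fingerprint of some substring of S; (2) find all maximal substrings whose fingerprint is C. The best results known so far answer these two queries in \Theta(|\Sigma|) and \Theta(|\Sigma| + K) time, respectively, where K is the number of maximal substrings. In this dissertation, we propose two improved algorithms for this problem. The first algorithm answers the two queries in O(min{|C| log n, |\Sigma|}) and O(min{|C| log n, |\Sigma|} + K) time, respectively. The second algorithm uses a more space-efficient approach and answers the two queries in O(|C| log (|\Sigma|/|C|)) and O(|C| log (|\Sigma|/|C|) + K) time, respectively. Both improved algorithms resolve an open problem posed by Amir et al.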

In this dissertation, we study the text fingerprinting indexing problem, which was introduced by Amir et al. Let S be a string over a finite, ordered alphabet \Sigma. For any substring S' of S, the set of distinct characters contained in S' is called its fingerprint. The text fingerprinting indexing problem consists of constructing a data structure for the string S in advance so that, given any input set C \subseteq \Sigma of characters, we can answer the following queries efficiently: (1) determine whether C is the fingerprint of some substring of S; (2) find all maximal substrings of S whose fingerprint is equal to C. The best results known so far answer these two queries in \Theta(|\Sigma|) and \Theta(|\Sigma| + K) time, respectively, where K is the number of maximal substrings. In this dissertation, we propose two improved algorithms for the text fingerprinting indexing problem. The first algorithm answers the two queries in O(min{|C| log n, |\Sigma|}) and O(min{|C| log n, |\Sigma|} + K) time, respectively. The second algorithm answers them in O(|C| log (|\Sigma|/|C|)) and O(|C| log (|\Sigma|/|C|) + K) time, respectively, using a new data structure that requires less storage than the existing solutions. Both results answer an open problem posed by Amir et al.
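To make the two queries concrete, here is a minimal Python sketch of a naive baseline that rescans S on every query in O(n) time, with no preprocessing. It is not one of the dissertation's algorithms and uses none of its data structures. The function names `maximal_locations` and `is_fingerprint` are hypothetical, chosen only for this illustration. The sketch relies on one observation: a maximal substring with fingerprint C is exactly a maximal run of characters drawn from C that happens to use every character of C.

```python
from typing import List, Set, Tuple


def maximal_locations(s: str, c: Set[str]) -> List[Tuple[int, int]]:
    """Query (2), naive baseline: return every maximal interval (i, j),
    0-based and inclusive, such that the fingerprint of s[i..j] equals c.
    Illustrative only; this is a linear scan, not an indexed query."""
    result: List[Tuple[int, int]] = []
    if not c:
        return result
    n = len(s)
    i = 0
    while i < n:
        if s[i] not in c:
            i += 1
            continue
        # Extend a maximal run of characters that all belong to c.
        j = i
        seen: Set[str] = set()
        while j < n and s[j] in c:
            seen.add(s[j])
            j += 1
        # The run s[i..j-1] cannot be extended on either side without
        # leaving c; it qualifies only if it uses every character of c.
        if len(seen) == len(c):
            result.append((i, j - 1))
        i = j
    return result


def is_fingerprint(s: str, c: Set[str]) -> bool:
    """Query (1), naive baseline: is c the fingerprint of some substring?"""
    return bool(maximal_locations(s, c))
```

For S = "abcabdab", the query C = {a, b, c} yields the single maximal substring at positions 0 to 4, while C = {c, d} is not the fingerprint of any substring. An index of the kind studied in the dissertation aims to avoid this per-query scan of S.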

Abstract
Acknowledgement
Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1. Related Work
1.2. Summary of Results
1.3. Organization of the Dissertation
Chapter 2. Notation and Preliminaries
Chapter 3. Kolpakov and Raffinot's Algorithm
3.1. The Preprocessing Phase
3.2. Fingerprint Tree and Its Searching Algorithm
Chapter 4. The First Improved Algorithm
4.1. The Query Algorithm of Amir et al. [5]
4.2. An O(|C| log n)-time Query Algorithm
4.3. An O(min(|C| log n, |\Sigma|))-time Query Algorithm
Chapter 5. The Second Improved Algorithm
5.1. The Data Structures
5.1.1. The Lexi-String Trie
5.1.2. The Backtracking Tree
5.2. Query Algorithms
5.2.1. Answering the Queries
5.2.2. Faster Query Algorithm for Sorted Input
5.3. Constructions of the Data Structures
5.3.1. Construction of the Backtracking Tree
5.3.2. Construction of the Lexi-String Trie
Chapter 6. Further Improvement on the Second Algorithm
6.1. Discarding Names
6.2. Constructions of the LS Trie and the Backtracking Tree
Chapter 7. Conclusion and Future Work
References

                                

[1] Abrahamson, K., "Generalized string matching," SIAM Journal on Computing, vol. 16, no. 6, 1039–1051, 1987.
[2] Aho, A. V. and Corasick, M. J., "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, 333–340, 1975.
[3] Amir, A., Aumann, Y., Benson, G., Levy, A., Lipsky, O., Porat, E., Skiena, S., and Vishne, U., "Pattern matching with address errors: Rearrangement distances," Journal of Computer and System Sciences, doi: 10.1016/j.jcss.2009.03.001, 2009.
[4] Amir, A., Aumann, Y., Landau, G., Lewenstein, M., and Lewenstein, N., "Pattern matching with swaps," Journal of Algorithms, vol. 37, 247–266, 2000.
[5] Amir, A., Apostolico, A., Landau, G. M., and Satta, G., "Efficient text fingerprinting via Parikh mapping," Journal of Discrete Algorithms, vol. 1, no. 5-6, 409–421, 2003.
[6] Amir, A., Butman, A., Crochemore, M., Landau, G. M., and Schaps, M., "Two-dimensional pattern matching with rotations," Theoretical Computer Science, vol. 314, no. 1–2, 173–187, 2004.
[7] Amir, A., Butman, A., and Lewenstein, M., "Real scaled matching," Information Processing Letters, vol. 70, 185–190, 1999.
[8] Amir, A., Butman, A., Lewenstein, M., Porat, E., and Tsur, D., "Efficient one-dimensional real scaled matching," Journal of Discrete Algorithms, vol. 5, 205–211, 2007.
[9] Amir, A. and Calinescu, G., "Alphabet-Independent and scaled dictionary matching," Journal of Algorithms, vol. 36, 34–62, 2000.
[10] Amir, A., Gasieniec, L., and Shalom, R., "Improved approximate common interval," Information Processing Letters, vol. 103, no. 4, 142–149, 2007.
[11] Amir, A., Lewenstein, M., and Porat, E., "Approximate swapped matching," Information Processing Letters, vol. 83, 33–39, 2002.
[12] Amir, A., Lewenstein, M., and Porat, E., "Faster algorithms for string matching with k mismatches," Journal of Algorithms, vol. 41, no. 2, 257–275, 2004.
[13] Amir, A., Landau, G. M., and Vishkin, U., "Efficient pattern matching with scaling," Journal of Algorithms, vol. 13, 2–32, 1992.
[14] Amir, A., Kapah, O., and Tsur, D., "Faster two dimensional pattern matching with rotations," Theoretical Computer Science, vol. 368, no. 3, 196–204, 2006.
[15] Andersson, A. and Thorup, M., "Dynamic ordered sets with exponential search trees," Journal of the ACM, vol. 54, no. 3, Article 13, 40 pages, 2007.
[16] Bender, M. A. and Farach-Colton, M., "The LCA problem revisited," in Proceedings of the 4th Latin American Theoretical Informatics Symposium, Lecture Notes in Computer Science, vol. 1776, 88–94, 2000.
[17] Béal, M.-P., Bergeron, A., Corteel, S., and Raffinot, M., "An algorithmic view of gene teams," Theoretical Computer Science, vol. 320, no. 2–3, 395–418, 2004.
[18] Böcker, S., Jahn, K., Mixtacki, J., and Stoye, J., "Computation of median gene clusters," Lecture Notes in Bioinformatics, vol. 4955, 331–345, 2008.
[19] Boyer, R. S. and Moore, J. S., "A fast string searching algorithm," Communications of the ACM, vol. 20, no. 10, 762–772, 1977.
[20] Butman, A., Eres, R., and Landau, G. M., "Scaled and permuted string matching," Information Processing Letters, vol. 92, no. 6, 293–297, 2004.
[21] Cole, R., Gottlieb, L. A., and Lewenstein, M., "Dictionary matching and indexing with errors and don't cares," in Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 91–100, 2004.
[22] Cole, R. and Hariharan, R., "Approximate string matching: a faster simpler algorithm," in Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 463–472, 1998.
[23] Cole, R. and Hariharan, R., "Verifying candidate matches in sparse and wildcard matching," in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, 592–601, 2002.
[24] Cole, R., Kopelowitz, T., and Lewenstein, M., "Suffix trays and suffix trists: structures for faster text indexing," in Proceedings of the 33rd International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, vol. 4051, 358–369, 2006.
[25] Dandekar, T., Snel, B., Huynen, M., and Bork, P., "Conservation of gene order: a fingerprint of proteins that physically interact," Trends in Biochemical Sciences, vol. 23, no. 9, 324–328, 1998.
[26] Didier, G., Schmidt, T., Stoye, J., and Tsur, D., "Character sets of strings," Journal of Discrete Algorithms, vol. 5, no. 2, 330–340, 2007.
[27] Eilam-Tzoreff, T. and Vishkin, U., "Matching patterns in a string subject to multi-linear transformation," Theoretical Computer Science, vol. 60, 231–254, 1988.
[28] Ferragina, P. and Grossi, R., "The string B-tree: a new data structure for string search in external memory and its applications," Journal of the ACM, vol. 46, no. 2, 236–280, 1999.
[29] Fredman, M., Komlós, J., and Szemerédi, E., "Storing a sparse table with O(1) worst case access time," Journal of the ACM, vol. 31, 538–544, 1984.
[30] Fredriksson, K., Navarro, G., and Ukkonen, E., "Optimal exact and fast approximate two dimensional pattern matching allowing rotations," in Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 2373, 235–248, 2002.
[31] Gupta, S. K. and Punnen, A. P., "Group center and group median of a tree," European Journal of Operational Research, vol. 65, 400–406, 1993.
[32] Hagerup, T., Miltersen, P. B., and Pagh, R., "Deterministic dictionaries," Journal of Algorithms, vol. 41, no. 1, 69–85, 2001.
[33] Hakimi, S. L., "Optimal locations of switching centers and the absolute centers and medians of a graph," Operations Research, vol. 12, 450–459, 1964.
[34] Han, Y., "Deterministic sorting in O(n log log n) time and linear space," Journal of Algorithms, vol. 50, no. 1, 96–105, 2004.
[35] Harel, D. and Tarjan, R., "Fast algorithms for finding nearest common ancestors," SIAM Journal on Computing, vol. 13, no. 2, 338–355, 1984.
[36] He, X. and Goldwasser, M. H., "Identifying conserved gene clusters in the presence of homology families," Journal of Computational Biology, vol. 12, no. 6, 638–656, 2005.
[37] Heber, S. and Savage, C. D., "Common intervals of trees," Information Processing Letters, vol. 93, 69–74, 2005.
[38] Heber, S. and Stoye, J., "Finding all common intervals of k permutations," in Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 2089, 207–218, 2001.
[39] Karloff, H., "Fast algorithms for approximately counting mismatches," Information Processing Letters, vol. 48, no. 2, 53–60, 1993.
[40] Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A., Constraint grammar: a language-independent system for parsing unrestricted text, de Gruyter, Berlin, 1995.
[41] Kariv, O. and Hakimi, S. L., "An algorithmic approach to network location problems, Part I: The p-centers," SIAM Journal on Applied Mathematics, vol. 37, 513–538, 1979.
[42] Knuth, D. E., The art of computer programming. In: sorting and searching, vol. 3, Addison-Wesley, London, UK, 1973.
[43] Knuth, D. E., Morris, J. H., and Pratt, V. R., "Fast pattern matching in strings," SIAM Journal on Computing, vol. 6, 323–350, 1977.
[44] Kolpakov, R. and Raffinot, M., "New algorithms for text fingerprinting," in Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 4009, 342–353, 2006.
[45] Kolpakov, R. and Raffinot, M., "New algorithms for text fingerprinting," Journal of Discrete Algorithms, vol. 6, no. 2, 243–255, 2008.
[46] Levenshtein, V. I., "Binary codes capable of correcting deletions, insertions and reversals," Soviet Physics-Doklady, vol. 10, 707–710, 1966.
[47] Luc, N., Risler, J.-L., Bergeron, A., and Raffinot, M., "Gene teams: a new formalization of gene clusters for comparative genomics," Computational Biology and Chemistry, vol. 27, no. 1, 59–67, 2003.
[48] Muthukrishnan, S., "Efficient algorithms for document retrieval problems," in Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, 657–666, 2002.
[49] Pinter, R. Y., "Efficient string matching with don't-cares patterns," Combinatorial Algorithms on Words, vol. 12, 11–29, 1985.
[50] Rahman, M. S., Iliopoulos, C. S., Lee, I., Mohamed, M., and Smyth, W. F., "Finding patterns with variable length gaps or don't cares," in Proceedings of the 12th Annual International Computing and Combinatorics Conference, Lecture Notes in Computer Science, vol. 4112, 146–155, 2006.
[51] Rogozin, I. B., Makarova, K. S., Murvai, J., Czabarka, E., Wolf, Y. I., Tatusov, R. L., Szekely, L. A., and Koonin, E. V., "Connected gene neighborhoods in prokaryotic genomes," Nucleic Acids Research, vol. 30, no. 10, 2212–2223, 2002.
[52] Schieber, B. and Vishkin, U., "On finding lowest common ancestors: simplification and parallelization," SIAM Journal on Computing, vol. 17, no. 6, 1253–1262, 1988.
[53] Schmidt, T. and Stoye, J., "Quadratic time algorithms for finding common intervals in two and more sequences," in Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 3109, 347–358, 2004.
[54] Uno, T. and Yagiura, M., "Fast algorithms to enumerate all common intervals of two permutations," Algorithmica, vol. 26, 290–309, 2000.
[55] Wang, B.-F., Lin, J.-J., and Ku, S.-C., "Efficient Algorithms for the scaled indexing problem," Journal of Algorithms, vol. 52, 82–100, 2004.
[56] Weiner, P., "Linear pattern matching algorithms," in Proceedings of the 14th IEEE Annual Symposium on Switching and Automata Theory, 1–11, 1973.
[57] Yanai, I. and DeLisi, C., "The society of genes: Networks of functional links between genes from comparative genomics," Genome Biology, vol. 64, no. 3, 1–12, 2002.

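Full-text release date: the full text is not authorized for public release (campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)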
