Author: Ou, Chia-Shin (歐家欣)
Thesis title: On the Bit-Parallel Approaches to String Matching Problem (位元平行方法解決字串比對問題)
Advisors: Lee, R.C.T. (李家同); Lu, Chin-Lung (盧錦隆)
Committee members: Lee, R.C.T. (李家同); Tang, Chuan-Yi (唐傳義); Wang, Yue-Li (王有禮); Wang, Biing-Feng (王炳豐); Lu, Chin-Lung (盧錦隆)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar)
Language: English
Number of pages: 57
Keywords: Approximate string matching, Bit-parallel processing, Exact string matching, Filtering algorithm, Logical operation

    In this thesis, we consider two problems: (1) developing a systematic approach to solving problems that involve special properties of bit-vectors, and (2) developing a bit-parallel approach to solving the nearest neighbor string matching problem. For problem (1), suppose that we are given a vector consisting only of 1's and 0's and we want to find certain special properties of this vector. These properties all involve "for-all" or "there-exists" quantifiers, and we are interested in bit-parallel procedures that find them. In Myers' work, a sequence of logical operations can be expressed as a logical formula, (∃k ≤ i, A(k) and ∀x ∈ [k, i − 1], B(x)) = ((((A & B) + B) ^ B) | A), which is by no means easy to obtain. The contribution of our research is a systematic method for finding such logical formulas for problems on bit-vectors with "for-all" and "there-exists" quantifiers. Using our way of thinking, the equation in Myers' work can be deduced step by step. Five logical prototype problems, "single-for-all", "single-there-exists", "multiple-for-all", "multiple-there-exists" and "multiple-there-exists-and-for-all", are defined in this thesis, and each is proved to be computable with bit-parallel operations in O(n/w) time, where w is the word size of the machine. We also consider four variants of the five problems and show that their logical formulas can be obtained systematically from those of the five prototype problems. The nearest neighbor string matching problem is defined as follows: given a text string T = t1t2…tn and a pattern string P = p1p2…pm, find all substrings of T whose edit distance to P is the smallest among all substrings of T. This problem has useful applications in bioinformatics. It can be solved straightforwardly by Sellers' algorithm and Myers' algorithm, both of which were designed for the approximate string matching problem.
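The logical formula attributed to Myers above can be checked on small inputs. The sketch below is only an illustration, not code from the thesis: it assumes the quantified condition reads "there is a k ≤ i with A(k) set and B(x) set for every x in [k, i − 1]", that bit i of an integer stores position i, and that results are truncated to w bits. The function names are hypothetical.

```python
def exists_then_forall(a: int, b: int, w: int) -> int:
    """Bit-parallel form: ((((A & B) + B) ^ B) | A), truncated to w bits."""
    mask = (1 << w) - 1
    return ((((a & b) + b) ^ b) | a) & mask


def exists_then_forall_naive(a: int, b: int, w: int) -> int:
    """Direct evaluation of the quantified condition, one position at a time."""
    out = 0
    for i in range(w):
        for k in range(i + 1):
            a_k = (a >> k) & 1
            # B(x) must hold for every x in [k, i - 1]; the range is empty when k == i.
            run_ok = all((b >> x) & 1 for x in range(k, i))
            if a_k and run_ok:
                out |= 1 << i
                break
    return out


# A has bit 1 set; B has bits 1..3 set, so positions 1..4 satisfy the condition.
print(bin(exists_then_forall(0b000010, 0b001110, 6)))
```

The addition is what propagates a set bit of A through a run of 1's in B in a single step. Python integers have arbitrary precision, so the same expression also works for vectors longer than a machine word; a fixed-width implementation would have to carry between words, which is where a bound of the form O(n/w) comes from.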
Hyyro and Navarro proposed a filtering algorithm to speed up Myers' algorithm. However, their filtering approach requires a preprocessing step based on the error bound k, which is part of the definition of the approximate string matching problem. Hence, it is not suitable for the nearest neighbor string matching problem, which has no error bound k. In this thesis, we present a modification of the Hyyro and Navarro algorithm, together with a bit-parallel algorithm that combines Myers' algorithm with the modified Hyyro and Navarro algorithm. Our experiments show that the resulting bit-parallel algorithm is efficient.
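To make the absence of an error bound k concrete, the nearest neighbor string matching problem can be sketched with a plain Sellers-style dynamic program. This is a baseline illustration only, not the bit-parallel algorithm of the thesis. It assumes unit-cost edit distance and reports the 1-based end positions in T of the best-matching substrings; the function name is hypothetical.

```python
def nearest_neighbor_ends(text: str, pattern: str) -> tuple[int, list[int]]:
    """Return (smallest edit distance, 1-based end positions in text that attain it)."""
    m = len(pattern)
    col = list(range(m + 1))   # column j = 0: D[i][0] = i
    best = m + 1               # sentinel above any reachable distance
    ends: list[int] = []
    for j, t in enumerate(text, start=1):
        prev_diag = col[0]     # D[0][j-1]
        col[0] = 0             # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]       # D[i][j-1]
            col[i] = min(prev_diag + (pattern[i - 1] != t),  # match or substitution
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = old
        # col[m] is the smallest distance between pattern and a substring ending at j.
        if col[m] < best:
            best, ends = col[m], [j]
        elif col[m] == best:
            ends.append(j)
    return best, ends


print(nearest_neighbor_ends("abdabe", "abc"))  # (1, [2, 3, 5, 6])
```

No threshold is supplied anywhere: the smallest value in the last row is discovered during the scan. This sketch takes O(mn) time.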

    ABSTRACT
    CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
        1.1 A SYSTEMATICAL APPROACH TO SOLVE PROBLEMS INVOLVING SPECIAL PROPERTIES OF BIT-VECTORS
        1.2 THE NEAREST NEIGHBOR STRING MATCHING PROBLEM
    CHAPTER 2 BIT-VECTORS WITH INVOLVING SPECIAL PROPERTIES PROBLEM
        2.1 BASIC NOTATIONS AND MOTIVATION
        2.2 SINGLE-FOR-ALL PROBLEM
        2.3 SINGLE-THERE-EXISTS PROBLEM
        2.4 MULTIPLE-FOR-ALL PROBLEM
        2.5 MULTIPLE-THERE-EXISTS (MTE) PROBLEM
        2.6 MULTIPLE-THERE-EXISTS-AND-FOR-ALL PROBLEM
        2.7 VARIANTS OF THE FIVE LOGICAL PROTOTYPE PROBLEMS
    CHAPTER 3 A BIT-PARALLEL ALGORITHM TO SOLVE THE NEAREST NEIGHBOR STRING MATCHING PROBLEM
        3.1 THE NEAREST NEIGHBOR STRING MATCHING PROBLEM
        3.2 MYERS ALGORITHM
        3.3 HYYRO AND NAVARRO ALGORITHM
        3.4 OUR BIT-PARALLEL ALGORITHM TO SOLVE THE NEAREST NEIGHBOR STRING MATCHING PROBLEM
        3.5 EXPERIMENTAL RESULTS
    CHAPTER 4 CONCLUDING REMARKS AND FUTURE RESEARCH
    BIBLIOGRAPHY

    [1] Aho, A. V. and Corasick, M. J. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6), pp.333–340, 1975.
    [2] Apostolico, A. and Crochemore, M. Optimal canonization of all substrings of a string. Information and Computation, 95(1), pp.76–95, 1991.
    [3] Apostolico, A. and Giancarlo, R. The Boyer-Moore-Galil string searching strategies revisited. SIAM Journal on Computing, 15(1), pp.98–105, 1986.
    [4] Baeza-Yates, R. A. Text-retrieval: Theory and practice. In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture – Information Processing ’92, 1, pp.465–476, Amsterdam, The Netherlands, 1992.
    [5] Baeza-Yates, R. A. and Gonnet, G. H. A new approach to text searching. Communications of the ACM, 35, pp.74–82, 1992.
    [6] Baeza-Yates, R. A. and Navarro, G. Faster approximate string matching. Algorithmica, 23, pp.127–158, 1999.
    [7] Buchner, M. and Janjarasjitt, S. Detection and visualization of tandem repeats in DNA sequences, IEEE Transactions on Signal Processing, 51, pp.2280–2287, 2003.
    [8] Boyer, R. S. and Moore, J. S. A fast string searching algorithm. Communications of the ACM, 20(10), pp.762–772, 1977.
    [9] Colussi, L. Correctness and efficiency of pattern matching algorithms. Journal of Algorithms, 16(2), pp.163–189, 1994.
    [10] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W. and Rytter, W. Speeding up two string-matching algorithms. Algorithmica, 12, pp.247–267, 1994.
    [11] Cantone, D. and Faro, S. A space efficient bit-parallel algorithm for the multiple string matching problem. Int. J. Found. Comput. Sci., 17, pp.1235–1252, 2006.
    [12] Crochemore, M., Iliopoulos, C. S., Navarro, G. and Pinzon, Y. J. A bit-parallel suffix automaton approach for (δ, γ)-matching in music retrieval. Proceedings of SPIRE 03, Manaus, Brazil, 8-10 October, pp.211–223. Springer-Verlag, Berlin, 2003.
    [13] Crochemore, M., Iliopoulos, C. S., Pinzon, Y. J. and Reid, J. A fast and practical bit-vector algorithm for the longest common subsequence problem. Inform. Process. Lett., 80, pp.279–285, 2001.
    [14] Crochemore, M. and Perrin, D. Two-way string-matching. Journal of the ACM, 38(3), pp.650–674, 1991.
    [15] Fredriksson, K. Row-wise tiling for the Myers' bit-parallel dynamic programming algorithm. Proceedings of SPIRE 03, Manaus, Brazil, 8-10 October, pp.66–79. Springer-Verlag, Berlin, 2003.
    [16] Fredriksson, K. and Navarro, G. Average-Optimal Single and Multiple Approximate String Matching, ACM Journal of Experimental Algorithmics, 9, pp.1–47, 2004.
    [17] Gusfield, D. Algorithms on strings, trees, and sequences: computer science and computational biology, Textbook, Cambridge University Press, 1997.
    [18] Grabowski, S. and Fredriksson, K. Bit-parallel string matching under Hamming distance in worst case time. Inform. Process. Lett., 105, pp.182–187, 2008.
    [19] Galil, Z. and Giancarlo, R. On the exact complexity of string matching: Upper bounds. SIAM Journal on Computing, 21(3), pp.407–437, 1992.
    [20] Galil, Z. and Seiferas, J. Time-space-optimal string matching. Journal of Computer and System Sciences, 26(3), pp.280–294, 1983.
    [21] Hirschberg, D. S. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6), pp.341–343, 1975.
    [22] Hirschberg, D. S. Algorithms for the longest common subsequence problem. Journal of the ACM, 24, pp.664–675, 1977.
    [23] Horspool, R. N. Practical fast searching in strings. Software: Practice and Experience, 21(11), pp.1221–1248, 1991.
    [24] Hyyro, H. A bit-vector algorithm for computing Levenshtein and Damerau edit distances, Nord. J. Comput., 10, pp.29–39, 2003.
    [25] Hyyro, H. Bit-parallel LCS-length computation revisited. Proceedings of AWOCA 04, Ballina, NSW, Australia, 7-9 July, pp.16–27, 2004.
    [26] Hyyro, H. Bit-parallel approximate string matching algorithms with transposition. J. Discr. Alg., 3, pp.215–229, 2005.
    [27] Hyyro, H., Fredriksson, K., and Navarro, G. Increased bit-parallelism for approximate string matching. Proceedings of WEA 04, Rio de Janeiro, Brazil, 25-28 May, pp.285–298. Springer-Verlag, Berlin, 2004.
    [28] Hyyro, H. and Navarro, G. Bit-parallel witnesses and their application to approximate string matching. Algorithmica, 41, pp.203–231, 2005.
    [29] Hyyro, H. and Navarro, G. Bit-parallel computation of local similarity score matrices with unitary weights, Int. J. Found. Comput. Sci., 17, pp.1325–1344, 2006.
    [30] Hyyro, H., Pinzon, Y. and Shinohara, A. Fast bit-vector algorithms for approximate string matching under indel distance. Proceedings of SOFSEM 05, Slovakia, 22-28 January, pp.380–384. Springer-Verlag, Heidelberg, 2005.
    [31] Hunt, J. W. and Szymanski, T. G. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5), pp.350–353, 1977.
    [32] Kimura, K., Koike, A. and Nakai, K. A bit-parallel dynamic programming algorithm suitable for DNA sequence alignment. Journal of Bioinformatics and Computational Biology, 10(4): 125002, pp.1–15, 2012.
    [33] Knuth, D. E., Morris, J. H. and Pratt, V. R. Fast pattern matching in strings. SIAM Journal on Computing, 6(2), pp.323–350, 1977.
    [34] Landau, G. and Vishkin, U. Fast parallel and serial approximate string matching, Journal of Algorithms, 10, pp.157–169, 1989.
    [35] Lee, R. C. T. (2010) String matching algorithms, textbook.
    [36] Tarhio, J. and Ukkonen, E. Approximate Boyer-Moore string matching. SIAM J. Comput., 22, pp.243–260, 1993.
    [37] Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming. Journal of the ACM, 46, pp.395–415, 1999.
    [38] Navarro, G. A guided tour to approximate string matching, ACM Computing Surveys, 33, pp.31–88, 2001.
    [39] Navarro, G. NR-grep: a fast and flexible pattern-matching tool. Software: Practice and Experience, 31(13), pp.1265–1312, 2001.
    [40] Masek, W. J. and Paterson, M. S. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20, pp.18–31, 1980.
    [41] Navarro, G. and Raffinot, M. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, 5(4), 2000.
    [42] Ou, C. S. and Lee, R. C. T. A Parallel Approach to Solve the Approximation String Matching Problem, Proceedings of the 27th Workshop on Combinatorial Mathematics and Computation Theory, April, pp.161–166, 2010.
    [43] Ou, C. S., Lu, C. L. and Lee, R. C. T. A bit-parallel approach to solve approximation string matching problem for unlimited pattern length, Asian Association for Algorithms and Computation, 2011.
    [44] Raita, T. Tuning the Boyer-Moore-Horspool string searching algorithm. Software: Practice and Experience, 22(10), pp.879–884, 1992.
    [45] Sellers, P. H. The theory and computation of evolutionary distances: pattern recognition, Journal of Algorithms, 1, pp.359–373, 1980.
    [46] Sunday, D. M. A very fast substring search algorithm. Communications of the ACM, 33(8), pp.132–142, 1990.
    [47] Ukkonen, E. Finding approximate patterns in strings. Journal of Algorithms, 6, pp.100–118, 1985.
    [48] Wright, A. Approximate string matching using within-word parallelism. Software: Practice and Experience, 24, pp.337–362, 1994.
    [49] Wagner, R. A. and Fischer, M. J. The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1), pp.168–172, 1974.
    [50] Wu, S. and Manber, U. Fast text searching: allowing errors. Communications of the ACM, 35(10), pp.83–91, 1992.
    [51] Wu, S. and Manber, U. A fast algorithm for multi-pattern searching, Technical Report TR-94-17, Department of Computer Science, University of Arizona, U.S.A., 1994.
    [52] Zhu, R. F. and Takaoka, T. On improving the average case of the Boyer-Moore string matching algorithm. Journal of Information Processing, 10(3), pp.173–177, 1987.

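    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)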