Graduate student: | 呂嘉維 Lu, Chia Wei |
---|---|
Thesis title: | 有效率的字串比對和近似字串比對演算法 Efficient Exact and Approximate String Matching Algorithms |
Advisors: | 李家同 Lee, R. C. T.; 唐傳義 Tang, Chuan Yi |
Oral defense committee: | 李家同 Lee, R. C. T.; 唐傳義 Tang, Chuan Yi; 王有禮 Wang, Yue-Li; 王炳豐 Wang, Biing-Feng; 盧錦隆 Lu, Chin Lung; 黃光璿 Huang, Guan-Shieng |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2014 |
Academic year of graduation: | 102 |
Language: | English |
Number of pages: | 83 |
Keywords (Chinese): | 字串比對、近似字串比對、分支定界法、編輯距離、過濾方法、DNA重序、新一代定序技術 |
Keywords (English): | exact string matching, approximate string matching, branch and bound, edit distance, filtering, DNA resequencing, NGS |
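In this thesis, we first propose two efficient string matching algorithms. The string matching problem is to find all positions at which a string P occurs in a longer string T. When P is compared against a substring of T, our algorithms use an optimal character comparison order, so that the occurrences of P in T can be found efficiently. To find the optimal comparison order, we use the branch and bound approach. Our algorithms can be combined with other string matching algorithms to improve search speed. According to our experimental data, our algorithms require the fewest character comparisons among the efficient string matching algorithms compared, and they are also efficient in time.
In the second part of the thesis, we propose a filtration algorithm and a hybrid filtration algorithm for the approximate string matching problem. The problem is to find all positions i in a longer string T such that some substring of T ending at position i has an edit distance from P that is less than or equal to an error bound k; it is also called the k-difference problem. According to our experimental data, for approximate string matching on DNA sequences, our filtration algorithm eliminates more positions that cannot be answers than other filtration algorithms do, and our hybrid filtration algorithm improves the filtering effect further.
In the third part of the thesis, we propose a progressive algorithm for the DNA resequencing problem. One purpose of DNA resequencing is to determine whether an unassembled DNA sequence X is similar to a complete reference sequence R, and where the two differ. Short reads of X can be obtained from next-generation sequencers and mapped back onto R to observe the similarities and differences, so this is an application of approximate string matching. Our progressive algorithm maps the short reads of X back onto R using a low error bound, and it also addresses the problem that reads cannot be mapped to regions where the difference is large.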
In this thesis, we first propose two algorithms for the exact string matching problem, which asks for all positions i in a given text at which a given pattern occurs. Our algorithms find an optimal order in which to compare the characters of the pattern, so that the searching phase performs better. To find this optimal comparison order, we adopt the branch and bound approach. Moreover, our algorithms can be combined with other existing exact string matching algorithms to improve search efficiency. The experimental results show that our algorithms use the smallest number of character comparisons and are also efficient in time when compared with other existing exact string matching algorithms.
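To illustrate why the comparison order matters, the following is a minimal sketch, not the thesis's algorithm. It slides the pattern over the text one position at a time and compares characters in a caller-supplied order, counting comparisons. The function names and the "rarest character first" heuristic are assumptions made for this example; the thesis instead derives an optimal order by branch and bound, which is not reproduced here.

```python
from collections import Counter


def search_with_order(text, pattern, order):
    """Naive window-by-window search that compares pattern characters
    in the given index order. Returns (0-based match positions,
    total number of character comparisons)."""
    n, m = len(text), len(pattern)
    positions, comparisons = [], 0
    for start in range(n - m + 1):
        for j in order:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch found, move to the next window
        else:
            positions.append(start)  # every index in `order` matched
    return positions, comparisons


def rarest_first_order(text, pattern):
    """Hypothetical heuristic: test pattern characters that are rare
    in the text first, hoping to hit a mismatch early."""
    freq = Counter(text)
    return sorted(range(len(pattern)), key=lambda j: freq[pattern[j]])


if __name__ == "__main__":
    text, pattern = "AAAAAAAGAAAAAAG", "AAG"
    left_to_right = list(range(len(pattern)))
    print(search_with_order(text, pattern, left_to_right))
    print(search_with_order(text, pattern, rarest_first_order(text, pattern)))
```

Both orders report the same match positions; only the number of comparisons differs. On this input the heuristic order checks the rare character `G` first and rejects most windows after a single comparison.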
Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which asks for all positions i in a given text such that some substring of the text ending at position i has edit distance at most k from a given pattern, where k is a given error bound. Our experimental results on simulated DNA sequence datasets show that, compared with other filtration algorithms, our filtration algorithm is more effective at filtering out the positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm.
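The k-difference problem defined above can be made concrete with the classical dynamic programming method, in which row 0 of the table is fixed at zero so that a match may start anywhere in the text. The sketch below is that baseline, not the filtration algorithm proposed in the thesis; a filtration method would first discard text regions that cannot contain a match and run a verifier such as this one only on the survivors. The function name and the choice of 1-based end positions are assumptions for the example.

```python
def k_difference_end_positions(text, pattern, k):
    """Return all 1-based positions i such that some substring of
    `text` ending at i has edit distance <= k from `pattern`.
    Runs in O(len(text) * len(pattern)) time using one column."""
    m = len(pattern)
    # prev[j] = min edit distance between pattern[:j] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for i, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # extra text character
                         cur[j - 1] + 1)      # missing pattern character
        if cur[m] <= k:
            ends.append(i)
        prev = cur
    return ends


if __name__ == "__main__":
    print(k_difference_end_positions("GATTACA", "TAC", 0))
    print(k_difference_end_positions("GATTACA", "TAC", 1))
```

With k = 0 the function reduces to exact matching and reports only the end of the single occurrence of `TAC`. With k = 1 it also reports the neighboring end positions, where a substring such as `TA` or `TACA` is one edit away from the pattern.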
Third, we propose a progressive approach to the DNA resequencing problem, which is defined as follows. We are given an unknown DNA sequence X and a known reference sequence R, and our task is to determine whether X and R are similar. The currently popular approach is to use next generation sequencing (NGS) technologies to break X up into short subsequences, called reads, and then to map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (<95%), many reads remain unmapped, and the mutations inside the unmapped regions cannot be obtained. One can use a large error bound to increase the number of reads mapped, but this is not a good solution because increasing the error bound also increases the probability of false positive mappings. Our approach instead uses a small error bound and, to increase the number of reads mapped, modifies R each time after reads are mapped, which is why we call it progressive. Compared with other available tools, our approach maps more reads to the reference sequence. Our simulated experiments also show that our mapping algorithm is highly accurate.
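The progressive idea can be sketched on a toy example. This is a heavily simplified illustration under stated assumptions, not the thesis's algorithm. It handles substitutions only (Hamming distance, no insertions or deletions), takes the leftmost best-scoring position for each read, and updates the reference by majority vote over the mapped reads. The abstract says only that R is modified after reads are mapped, so the majority-vote rule and all function names here are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, ref, k):
    """Leftmost position with the lowest Hamming distance <= k, else None."""
    best_pos, best_dist = None, k + 1
    for pos in range(len(ref) - len(read) + 1):
        d = hamming(read, ref[pos:pos + len(read)])
        if d < best_dist:
            best_pos, best_dist = pos, d
    return best_pos


def progressive_map(reads, ref, k, max_rounds=10):
    """Repeatedly map unmapped reads with a small bound k, then rewrite
    the reference by majority vote over all reads mapped so far."""
    mapped = {}  # read index -> mapped position
    for _ in range(max_rounds):
        newly = {}
        for idx, read in enumerate(reads):
            if idx not in mapped:
                pos = map_read(read, ref, k)
                if pos is not None:
                    newly[idx] = pos
        if not newly:
            break  # no progress in this round, stop
        mapped.update(newly)
        votes = [Counter() for _ in ref]
        for idx, pos in mapped.items():
            for off, ch in enumerate(reads[idx]):
                votes[pos + off][ch] += 1
        ref = "".join(v.most_common(1)[0][0] if v else r
                      for v, r in zip(votes, ref))
    return mapped, ref


if __name__ == "__main__":
    R = "GATTACAGCTCGTAGC"
    X = "GATTACATCACGTAGC"      # differs from R at positions 7 and 9
    reads = [X[2:8], X[6:12]]   # the second read spans both differences
    print(progressive_map(reads, R, k=1))
```

In this toy run, the second read has two mismatches against the original reference and is rejected with k = 1. After the first read is mapped and the reference is updated at position 7, the second read has only one mismatch and maps in the next round. Raising the bound to k = 2 instead would map the second read to position 1, which is a false positive in this example.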