Graduate student: | 謝一功 Shieh, Yi-Kung |
---|---|
Thesis title: | The Exact Multiple Pattern Matching Problem Solved by a Reference String Approach (利用參考字串樹演算法解決精確多重字串比對問題) |
Advisors: | 李家同 Lee, Richard Chia-Tung; 盧錦隆 Lu, Chin-Lung |
Committee members: | 唐傳義 Tang, Chuan-Yi; 徐熊健 Shyu, Shyong-Jian; 蔡英德 Tsai, Yin-Te |
Degree: | Doctor |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2022 |
Academic year of graduation: | 110 (ROC calendar) |
Language: | English |
Number of pages: | 55 |
Keywords: | exact multiple pattern matching, reference tree, reference string, DNA sequence, suffix tree, suffix array |
Given a text T and a set of r patterns P_1, P_2, ⋯, P_r, the exact multiple pattern matching problem asks for the ending positions of all occurrences of each P_i in T, for 1 ≤ i ≤ r. We transform all substrings of T of a fixed length into a reference tree in which every internal node stores a reference string; the problem can then be solved efficiently by searching for each pattern in the tree under the guidance of the reference strings. We design and analyze algorithms to construct the reference tree (the preprocessing phase) and to search for patterns in the tree (the searching phase), and we use bitwise (bit-parallel) operations to speed up both phases. In our experiments, fruit fly DNA sequences and the Bible (as English-language text) serve as the texts, and our approach is compared with several compressed suffix tree and compressed suffix array algorithms. The results show that the reference tree approach is faster than these algorithms. Despite its simplicity, our approach is efficient, flexible, and robust on the exact multiple pattern matching problem.
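To make the problem statement concrete, the following is a minimal Python sketch of exact multiple pattern matching over fixed-length substrings. It is not the reference tree described in the abstract: it swaps in a plain hash index keyed on 2-bit-packed substrings, purely to illustrate the input, the output, and the role of bitwise operations. The names `CODE`, `build_index`, `pack`, and `find_end_positions`, the window length `k`, and the choice of 0-based ending positions are all assumptions made for this illustration.

```python
from collections import defaultdict

# Assumed 2-bit encoding of the DNA alphabet (illustrative only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(s):
    """Pack a DNA string into an integer, 2 bits per symbol."""
    key = 0
    for ch in s:
        key = (key << 2) | CODE[ch]
    return key


def build_index(text, k):
    """Preprocessing: map every packed length-k substring to its start positions."""
    index = defaultdict(list)
    mask = (1 << (2 * k)) - 1  # keeps only the last k symbols
    key = 0
    for i, ch in enumerate(text):
        # Slide the window by one symbol using shift, OR, and mask.
        key = ((key << 2) | CODE[ch]) & mask
        if i >= k - 1:
            index[key].append(i - k + 1)
    return index


def find_end_positions(text, patterns, k):
    """Searching: report 0-based ending positions of each pattern in text."""
    index = build_index(text, k)
    result = {}
    for p in patterns:
        if len(p) < k:
            raise ValueError("this sketch requires patterns of length >= k")
        ends = []
        # Candidates share the pattern's first k symbols; verify the rest.
        for start in index.get(pack(p[:k]), []):
            if text.startswith(p, start):
                ends.append(start + len(p) - 1)
        result[p] = ends
    return result


matches = find_end_positions("ACGTACGTAC", ["ACGT", "GTAC", "TTTT"], k=3)
```

Here `matches` maps each pattern to the list of positions where an occurrence ends. The index is built once for the text and reused across all patterns, which mirrors the split between a preprocessing phase and a searching phase.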