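Graduate student: | 陳庭偉 Chen, Ting Wei |
---|---|
Thesis title: | 以BWT建立方式解決最大重複子字串問題 On the Construction of the Burrows-Wheeler Transform and the Maximal Repeating Group Finding |
Advisor: | 盧錦隆 Lu, Chin Lung |
Committee members: | 李家同 Lee, Chia Tung; 唐傳義 Tang, Chuan Yi |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2016 |
Academic year of graduation: | 104 (ROC calendar) |
Language: | English |
Number of pages: | 98 |
Chinese keywords: | 最大重複子字串 (maximal repeating substring), 字串比對 (string matching) |
Foreign-language keywords: | Maximal Repeating Groups, Exact String Matching |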
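In this thesis, we are interested in using the Burrows-Wheeler Transform (BWT) to solve the string matching problem. The difficulty with the BWT is that constructing it for a given string is very time-consuming. Our method modifies the KSS method to construct the BWT. It is simpler to understand and implement than KSS, and our experimental results show that it constructs the BWT efficiently. We are also interested in the maximal repeating substring problem, and we solve it with a slightly modified version of our BWT construction method. For example, in our experiments on a DNA sequence of 155606181 characters, finding the repeating substrings longer than 2000 took only 226 seconds, and we found 55 pairs of maximal repeating substrings.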
In this thesis, we study the Burrows-Wheeler Transform (BWT) for exact string matching. The main drawback of the BWT is that constructing it is very time-consuming. We have developed a construction method based on the KSS method. Our method is easy to comprehend and implement, and experimental results show that it is efficient. We are also interested in the maximal repeating group problem, for which we have developed an efficient method. For example, for a DNA sequence of length 155606181, it took only 226 seconds to find 55 maximal repeating groups with lengths greater than 2000.
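For readers unfamiliar with the BWT, the following minimal sketch shows how the transform can be read off a suffix array and then used for exact string matching by backward search. It is not the thesis's KSS-based construction: the suffix array here is built by naive sorting, which is far too slow for sequences of the size mentioned above. The function names `suffix_array`, `bwt_from_suffix_array`, and `count_occurrences`, and the `$` sentinel, are assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix text.

    Illustrative only; a linear-time construction would replace this step.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sentinel="$"):
    """Return the BWT of text + sentinel.

    The BWT character for each sorted suffix is the character that
    precedes that suffix in the string.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    sa = suffix_array(s)
    # For i == 0, s[i - 1] is s[-1], i.e. the sentinel (wrap-around).
    return "".join(s[i - 1] for i in sa)


def count_occurrences(bwt, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = number of characters in bwt that sort before c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt)  # current range of matching sorted suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Naive rank queries; an index would precompute these counts.
        lo = first[ch] + bwt[:lo].count(ch)
        hi = first[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt_from_suffix_array("banana")
    print(b)                            # annb$aa
    print(count_occurrences(b, "ana"))  # 2
```

The rank queries inside `count_occurrences` are computed by rescanning the BWT, which keeps the sketch short; a practical index would store precomputed occurrence counts instead.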
[ARM16] Cournac, A., Koszul, R. and Mozziconacci, J., The 3D Folding of Metazoan Genomes Correlates with the Association of Similar Repetitive Elements, Nucleic Acids Research, 2016.
[BM77] Boyer, R. S. and Moore, J. S., A Fast String Searching Algorithm, Communications of the ACM, Vol. 20, No. 10, 1977, pp. 762-772.
[BW94] Burrows, M. and Wheeler, D. J., A Block Sorting Lossless Data Compression Algorithm, Technical Report 124, Digital Equipment Corporation, 1994.
[FM2000] Ferragina, P. and Manzini, G., Opportunistic Data Structures with Applications, Proceedings of the 41st Symposium on Foundations of Computer Science, 2000.
[CCGJLPR94] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W. and Rytter, W., Speeding Up Two String-matching Algorithms, Algorithmica, Vol. 12, 1994, pp. 247-267.
[FP74] Fischer, M. J. and Paterson, M. S., String-Matching and Other Products, SIAM-AMS Proceedings, Vol. 7, 1974, pp. 113-125 (in "Complexity of Computation", R. M. Karp, ed.).
[H2012] Hou, K. W., The Discrete Convolution Method on Solving the Exact String Matching Problem, MS Thesis, 2012, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan.
[H75] Hirschberg, D. S., A Linear Space Algorithm for Computing Maximal Common Subsequences, Communications of the ACM, Vol. 18, No. 6, 1975, pp. 341-343.
[H80] Horspool, R. N., Practical Fast Searching in Strings, Software Practice and Experience, Vol. 10, 1980, pp. 501-506.
[KMP77] Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM Journal on Computing, Vol. 6, No. 2, 1977, pp. 323-350.
[K13] Kung, B. R., On the Repeating Group Finding Problem, MS Thesis, Department of Computer Science, Takming University of Science and Technology, Taipei, Taiwan, 2013.
[KSS06] Kärkkäinen, J., Sanders, P. and Burkhardt, S., Linear Work Suffix Array Construction, Journal of the ACM, Vol. 53, No. 6, 2006, pp. 918-936.
[KGAP01] de Koning, A. P. J., Gu, W., Castoe, T. A., Batzer, M. A. and Pollock, D. D., Repetitive Elements May Comprise Over Two-Thirds of the Human Genome, PLoS Genetics, Vol. 7, No. 12, 2011, e1002384.
[M56] McClintock, B., Controlling Elements and the Gene, Cold Spring Harbor Symposia on Quantitative Biology, Vol. 21, 1956, pp. 197-216.
[M76] McCreight, E. M., A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, Vol. 23, 1976, pp. 262-272.
[MM93] Manber, U. and Myers, G., Suffix Arrays: A New Method for On-line String Searches, SIAM Journal on Computing, Vol. 22, 1993, pp. 935-948.
[MP70] Morris, J. H. and Pratt, V. R., A Linear Pattern-matching Algorithm, Technical Report 40, University of California, Berkeley, 1970.
[O72] Ohno, S., So Much “Junk” DNA in Our Genome, Brookhaven Symposia in Biology, Vol. 23, 1972.
[U95] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica, Vol. 14, 1995, pp. 249-260.
[UWD08] Ussery, D. W., Wassenaar, T. and Borini, S., Word Frequencies, Repeats, and Repeat-related Structures in Bacterial Genomes, in Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists, Computational Biology, Vol. 8, 1st ed., Springer, 2008, pp. 133-144.
[R97] Raffinot, M., On the Multi Backward Dawg Matching Algorithm, Proc. The 4th South American Workshop on String Processing, 1997, pp. 149-165.
[S05] Shapiro, J. A. and von Sternberg, R., Why Repetitive DNA is Essential to Genome Function, Biological Reviews, Vol. 80, 2005, pp. 227-250.
[W2004] Wu, B. H., Convolution and Its Application to Sequence Analysis, MS Thesis, 2004, National Chi Nan University, Puli, Nantou, Taiwan.
[W73] Weiner, P., Linear Pattern Matching Algorithms, 14th Annual IEEE Symposium on Switching and Automata Theory, 1973, pp. 1-11.