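Graduate student: 張哲維 (Alan Chang)
Thesis title: Fast Similarity Search in String Databases (字串資料庫中的相似搜尋)
Advisor: 許奮輝 (Fenn-Huei Simon Sheu)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 25
Keywords (translated from Chinese): string index structure, similarity search
Keywords (English): String Index, Similarity Search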
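  • Abstract (translated from Chinese): Effective similarity search in a huge string database requires the support of an efficient index structure. When the strings to be searched may be of arbitrary length, designing a good index structure becomes a great challenge. MRS, an existing method, provides a lower-bound function with low computational cost that lets us efficiently retrieve from the database the candidates containing the more similar substrings. As a result, only a small portion of the strings in the database need the true, expensive edit distance computation to obtain the final answer, and filtering with the lower-bound function greatly reduces query processing time. Our method mainly aims to improve the performance of MRS. In fact, it merely exchanges the roles of the database strings and the query string in the original MRS. Although simple, our method reduces disk page accesses by a factor of 10, and our index structure needs only half the memory of the original MRS.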


    Efficient similarity search in large string databases requires effective index support. Because each long string has numerous substrings of arbitrary length, designing an effective index is a great challenge. The existing solution, MRS [11], employs a low-cost lower-bound function to sieve out the most similar candidates from the majority of unlikely database substrings. As a result, only a very small portion of the string database requires the expensive true edit distance computation to finalize the query, and the filtering provided by the lower-bound function yields significant savings in overall query processing cost. In this paper, we seek to improve MRS to its full potential. Specifically, we propose a very simple method that exchanges the roles of the database strings and the query string in the original MRS design. Despite its simplicity, our solution further improves query performance by a factor of 10 in terms of disk page accesses while using only half of the original index's size.
    Keywords: String Index, Similarity Search, Edit Distance, Near Neighbor Query
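
    The filter-and-refine idea described in the abstract can be illustrated with a minimal sketch. It is not the MRS or MRS+ index from the thesis: it scans the database linearly, compares only windows whose length equals the query length, and uses a character-frequency distance as the cheap lower bound on edit distance. The function names (edit_distance, frequency_lower_bound, range_query) and the sample strings are hypothetical, chosen only for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """True (expensive) edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def frequency_lower_bound(a, b):
    """Cheap lower bound on edit_distance(a, b) from character counts.

    One edit changes the count of at most one character upward and at
    most one downward, so the larger of the total surplus and the total
    deficit can never exceed the true edit distance.
    """
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())  # characters a has in excess of b
    deficit = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus, deficit)


def range_query(database, query, radius):
    """Return (matches, verified): windows within `radius` edits of `query`,
    and how many windows needed the expensive computation.

    Assumption for illustration: only windows of len(query) are examined.
    """
    matches, verified = [], 0
    w = len(query)
    for sid, s in enumerate(database):
        for start in range(len(s) - w + 1):
            window = s[start:start + w]
            if frequency_lower_bound(window, query) > radius:
                continue  # filtered out without computing edit distance
            verified += 1
            d = edit_distance(window, query)
            if d <= radius:
                matches.append((sid, start, d))
    return matches, verified


if __name__ == "__main__":
    hits, checked = range_query(["ACGTACGT", "TTTTTTTT"], "ACGA", 1)
    print(hits, checked)
```

    In this sketch the second string is rejected entirely by the lower bound, so the expensive computation runs only on windows of the first string. An index such as the one described in the abstract would additionally avoid visiting most windows at all.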

    CHAPTER 1 INTRODUCTION
    CHAPTER 2 RELATED WORK: ORIGINAL MRS (MULTI RESOLUTION STRING) INDEX STRUCTURE
    CHAPTER 3 OUR SOLUTION: MRS+
    3.1 INDEX STRUCTURE CONSTRUCTION
    3.2 RANGE QUERY PROCESSING
    3.3 NEAREST NEIGHBOR QUERY PROCESSING
    CHAPTER 4 PERFORMANCE STUDY
    4.1 RANGE QUERY PROCESSING
    4.2 NEAREST NEIGHBOR QUERY PROCESSING
    CHAPTER 5 CONCLUDING REMARKS
    BIBLIOGRAPHY

    [1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, "Basic Local Alignment Search Tool," Journal of Molecular Biology, 215(3):403-410, 1990.
    [2] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, 25(17):3389-3402, 1997.
    [3] R. Baeza-Yates and G. H. Gonnet, "A New Approach to Text Searching," Communications of the ACM, 35(10):74-82, October, 1992.
    [4] R. A. Baeza-Yates and G. Navarro, "Faster Approximate String Matching," Algorithmica, 23(2):127-158, 1999.
    [5] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.
    [6] P. Ferragina and G. Manzini, "An Experimental Study of an Opportunistic Index," in Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 269-278, 2001.
    [7] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
    [8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
    [9] E. Hunt, "The Suffix Sequoia Index for Approximate String Matching," Department of Computing Science, University of Glasgow, Glasgow, UK, TR 2003-135, March, 2003.
    [10] E. Hunt, M. P. Atkinson, and R. W. Irving, "A Database Index to Large Biological Sequences," in Proc. of VLDB, Roma, Italy, 2001.
    [11] T. Kahveci and A. Singh, "An Efficient Index Structure for String Databases," in Proc. of VLDB, Roma, Italy, pp. 351-360, Sept., 2001.
    [12] W. J. Kent, "BLAT - The BLAST-Like Alignment Tool," Genome Research, 12(4):656-664, April, 2002.
    [13] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, "An Information Based Sequence Distance and Its Application to Whole Mitochondrial Genome Phylogeny," Bioinformatics, 17(2):149-154, 2001.
    [14] U. Manber and E. Myers, "Suffix Arrays: A New Method for On-line String Searches," SIAM Journal on Computing, 22(5):935-948, 1993.
    [15] C. Meek, J. M. Patel, and S. Kasetty, "OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences," in Proc. of VLDB, Germany, Sept., 2003.
    [16] NCBI, "National Center for Biotechnology Information," http://www.ncbi.nlm.nih.gov/, 2004.
    [17] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, 11(10):1725-1729, Oct., 2001.
    [18] B. C. Ooi, H. H. Pang, H. Wang, L. Wong, and C. Yu, "Fast Filter-and-Refine Algorithms for Subsequence Selection," in Proc. of IDEAS, pp. 243-255, 2002.
    [19] O. Ozturk and H. Ferhatosmanoglu, "Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases," in Proc. of IEEE Sym. on BioInformatics and BioEngineering, Maryland, pp. 359-366, March, 2003.
    [20] G. Qian, Q. Zhu, Q. Xue, and S. Pramanik, "The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces," in Proc. of VLDB, Germany, Sept., 2003.
    [21] S. C. Sahinalp, M. Tasan, J. Macker, and Z. M. Ozsoyoglu, "Distance Based Indexing for String Proximity Search," in Proc. of IEEE Data Engineering, 2003.
    [22] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, 147:195-197, 1981.
    [23] J. K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees," Information Processing Letters, 40(4):175-179, Nov. 25, 1991.
    [24] J.-S. Varre, J.-P. Delahaye, and E. Rivals, "The Transformation Distance: A Dissimilarity Measure Based on Movements of Segments," Bioinformatics, 15(3):194-202, 1999.

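    Full text release date: full text not authorized for public release (on-campus network)
    Full text release date: full text not authorized for public release (off-campus network)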
