
Author: 陳隆琦
Chen, Long-Qi
Thesis title: 用於尋找相似字串中罕見模式之索引
τλ-Index: Locating Rare Patterns in Similar Strings
Advisor: 韓永楷
Hon, Wing-Kai
Oral defense committee: 王弘倫
Wang, Hung-Lung
蔡孟宗
Tsai, Meng-Tsung
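Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2023
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 22
Keywords (Chinese): 文本索引, 罕見模式, 藥物發現
Keywords (English): text indexing, rare patterns, drug discovery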
    Abstract: We propose the τλ-index, a space-efficient index designed for locating rare patterns among similar strings, with potential applications in drug discovery research. Experiments on real datasets verify the feasibility of the index and compare it with the state-of-the-art r-index. The results indicate superior performance of our index in both time and space for the task of locating rare patterns in similar strings.

    Table of contents:
    Abstract (Chinese)
    Acknowledgments (Chinese)
    Abstract
    Contents
    List of Figures
    1 Introduction
    2 Preliminaries
      2.1 Aho-Corasick Algorithm
      2.2 Compressed Trie
    3 The τλ-Index
      3.1 Minimal Factors
      3.2 Filtering Minimal Factors
      3.3 Data Structure for Minimal Factor Finding
      3.4 Data Structure for Location Matching
    4 Experiment Results
      4.1 Experiment Setup
      4.2 Space
      4.3 Location Query Time
    5 Conclusion
    Bibliography

