
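Graduate student: 郭俊廷 (Kuo, Chun-Ting)
Thesis title: 版本文件中對時間查詢最佳化之索引架構 (Index Framework for Efficient Time-Travel Phrase Queries on Versioned Documents)
Advisor: 韓永楷 (Hon, Wing Kai)
Oral defense committee: 李哲榮 (Lee, Cherung); 姚兆明 (Yiu, Siu-Ming)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2014
Academic year of graduation: 102
Language: English
Number of pages: 37
Chinese keywords: 文件檢索, 索引, 版本文件, 片語搜尋 (document retrieval, indexing, versioned documents, phrase search)
English keywords: Document Retrieval, Indexing, Versioned Document, Phrase Searching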
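  • Versioned documents have been growing at a tremendous rate in recent years. Searches over such documents are usually restricted to a time range. How to index these kinds of documents and answer time-travel phrase queries has been widely discussed in the information retrieval community. In this thesis, we propose a new suffix-tree-based framework for indexing such documents. The framework stores the index of versioned documents in O(n log n + n log T) bits of space, answers queries in O(p log n + k) time, and supports queries for any pattern. In the implementation, we discuss several practical issues of the framework and use succinct data structures to reduce the space to nearly O(n log T) bits. We also conduct experiments to validate our approach. Finally, we briefly present extensions of the framework.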


    The volume of versioned documents is growing very quickly nowadays. How to exploit the redundancy among the documents to index them compactly, while supporting efficient time-constrained keyword queries, has recently been an active research topic [Anand et al., SIGIR’11, SIGIR’12; He et al., CIKM’09, CIKM’10, SIGIR’11, SIGIR’12]. In this paper, we propose a new framework to index versioned documents, and extend keyword queries to the more general phrase queries. Our index is based on a suffix tree; it takes O(n) space and answers a one-sided time-constrained phrase query for any phrase P in O((|P| + k) log n) time, where n is the dataset size and k is the output size. We discuss how to tune our framework under realistic assumptions; our experiments show that, under similar space budgets, our index answers queries five times faster than the baseline inverted lists when the query length |P| is at least four.
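    To make the query type in the abstract concrete, the following is a minimal sketch and not the thesis's index. It assumes that each version carries a single integer timestamp and that "one-sided" means "return versions whose timestamp is at most t_max"; both are assumptions made only for illustration. The names `build_index` and `phrase_query` are hypothetical. It uses a naive sorted list of word-level suffixes in place of a suffix tree, and a linear timestamp filter in place of a range-query structure, so it does not achieve the stated space or time bounds.

```python
from bisect import bisect_left, bisect_right


def build_index(versions):
    """Build a naive sorted list of word-level suffixes.

    versions: list of (timestamp, text) pairs (assumed layout).
    Returns a sorted list of (suffix_tokens, version_id) entries.
    Storing every suffix explicitly is quadratic in the worst case;
    a real index would use a suffix tree or suffix array instead.
    """
    entries = []
    for vid, (_, text) in enumerate(versions):
        tokens = tuple(text.split())
        for i in range(len(tokens)):
            entries.append((tokens[i:], vid))
    entries.sort()
    return entries


def phrase_query(entries, versions, phrase, t_max):
    """Return ids of versions with timestamp <= t_max that contain the phrase."""
    p = tuple(phrase.split())
    # Truncating sorted suffixes to len(p) tokens keeps them sorted,
    # so the matching suffixes form one contiguous range.
    # Rebuilding this key list per query is O(n) and kept only for clarity.
    keys = [suffix[:len(p)] for suffix, _ in entries]
    lo = bisect_left(keys, p)
    hi = bisect_right(keys, p)
    # Linear filter on timestamps; stands in for a range-query structure.
    hits = {vid for _, vid in entries[lo:hi] if versions[vid][0] <= t_max}
    return sorted(hits)
```

    For example, with versions `[(1, "the quick brown fox"), (2, "the quick red fox"), (3, "a quick brown fox jumps")]`, querying the phrase "quick brown" with `t_max = 2` returns only version 0, while `t_max = 3` returns versions 0 and 2.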

    1 Introduction
      1.1 Organization
    2 Preliminaries
      2.1 Generalized Suffix Tree
      2.2 2-Dimensional Orthogonal Range Query
    3 The Framework
      3.1 Problem Definition
      3.2 Index Method
      3.3 Query Method
      3.4 An Example Instantiation
    4 Practical Optimization
      4.1 Data Preprocessing
      4.2 Framework Minimization
      4.3 Query Enhancement
    5 Experiment
      5.1 Experimenting the Index Space
      5.2 Experimenting the Query Performance
    6 Further Discussion
    7 Conclusion

    [1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Efficient temporal keyword queries over versioned text. In Proc. of ACM CIKM Conf, 2010.
    [2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Temporal index sharding for space-time efficiency in archive search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 545–554. ACM, 2011.
    [3] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Index maintenance for time-travel text search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 235–244. ACM, 2012.
    [4] Peter G Anick and Rex A Flynn. Versioning a full-text information retrieval system. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 98–111. ACM, 1992.
    [5] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, Luo Si, et al. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
    [6] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 519–526. ACM, 2007.
    [7] Gobinda Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
    [8] J Shane Culpepper, Matthias Petri, and Falk Scholer. Efficient in-memory top-k document retrieval. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 225–234. ACM, 2012.
    [9] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and searching XML data via two zips. In Proceedings of the 15th international conference on World Wide Web, pages 751–760. ACM, 2006.
    [10] Paolo Ferragina and Rossano Venturini. System and method for string processing and searching using a compressed permuterm index, August 29 2007. US Patent App. 11/897,427.
    [11] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
    [12] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.
    [13] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
    [14] Jinru He and Torsten Suel. Faster temporal range queries over versioned text. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 565–574. ACM, 2011.
    [15] Jinru He and Torsten Suel. Optimizing positional index structures for versioned document collections. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 245–254. ACM, 2012.
    [16] Jinru He, Hao Yan, and Torsten Suel. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 415–424. ACM, 2009.
    [17] Jinru He, Junyuan Zeng, and Torsten Suel. Improved index compression techniques for versioned document collections. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1239–1248. ACM, 2010.
    [18] Hitwise. Hitwise: Search queries are getting longer.
    [19] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 713–722. IEEE, 2009.
    [20] http://www.keyworddiscovery.com/. Keyword and search engines statistics.
    [21] Stefan Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149–71, 1999.
    [22] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.
    [23] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
    [24] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. Soap2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
    [25] Chi-Man Liu, Thomas Wong, Edward Wu, Ruibang Luo, Siu-Ming Yiu, Yingrui Li, Bingqiang Wang, Chang Yu, Xiaowen Chu, Kaiyong Zhao, et al. Soap3: ultra-fast gpu-based parallel alignment tool for short reads. Bioinformatics, 28(6):878–879, 2012.
    [26] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
    [27] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
    [28] S Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 657–666. Society for Industrial and Applied Mathematics, 2002.
    [29] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 2013.
    [30] Enno Ohlebusch, Johannes Fischer, and Simon Gog. Cst++. In String Processing and Information Retrieval, pages 322–333. Springer, 2010.
    [31] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. 1999.
    [32] Manish Patil, Sharma V Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, and Sabrina Chandrasekaran. Inverted indexes for phrases and strings. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 555–564. ACM, 2011.
    [33] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
    [34] Gerard Salton, Edward A Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
    [35] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT’08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.
    [36] Ian H Witten, Alistair Moffat, and Timothy C Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.

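    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)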
