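Index Framework for Efficient Time-Travel Phrase Queries on Versioned Documents (版本文件中對時間查詢最佳化之索引架構) | National Tsing Hua University Theses and Dissertations Repository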


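Author:	Kuo, Chun-Ting (郭俊廷)
Thesis title:	Index Framework for Efficient Time-Travel Phrase Queries on Versioned Documents (版本文件中對時間查詢最佳化之索引架構)
Advisor:	Hon, Wing Kai (韓永楷)
Committee members:	Lee, Cherung (李哲榮); Yiu, Siu-Ming (姚兆明)
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication:	2014
Academic year of graduation:	102 (ROC calendar)
Language:	English
Number of pages:	37
Keywords (Chinese):	文件檢索、索引、版本文件、片語搜尋
Keywords (English):	Document Retrieval, Indexing, Versioned Document, Phrase Searching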

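Versioned documents have been growing at a rapid pace in recent years. Searches over such documents are usually restricted to a time range. How to index documents of this kind and answer time-travel phrase queries has been widely discussed in the information retrieval community. In this thesis, we propose a new suffix-tree-based framework for indexing these documents. The framework stores the index of versioned documents with a guaranteed space complexity of O(n log n + n log T) bits and a query complexity of O(p log n + k), and it supports queries for any pattern. For the implementation, we discuss several practical issues with the framework and use succinct data structures to reduce the space to nearly O(n log T) bits. We also conduct experiments as a proof of concept. Finally, we briefly present possible extensions of the framework.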

The volume of versioned documents is growing very quickly. How to exploit the redundancy among the documents to index them compactly while supporting efficient time-constrained keyword queries has recently been a hot topic [Anand et al., SIGIR’11, SIGIR’12; He et al., CIKM’09, CIKM’10, SIGIR’11, SIGIR’12]. In this paper, we propose a new framework to index versioned documents, and extend keyword queries to the more general phrase queries. Our index is based on a suffix tree, takes O(n) space, and answers a one-sided time-constrained phrase query for any phrase P in O((|P| + k) log n) time, where n is the dataset size and k is the output size. We discuss how to tune our framework under realistic assumptions; our experiments show that, under similar space budgets, our index answers queries five times faster than the baseline inverted lists when the query length |P| is at least four.
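To make the query type concrete, the following is a minimal Python sketch of a one-sided time-constrained phrase query. It is not the index described in the thesis: it swaps the suffix tree and range-query machinery for a naive word-level suffix array with a linear timestamp filter, and it assumes that "one-sided" means "report every version whose timestamp is at most t". The names `build_index`, `time_travel_query`, and the `(version_id, timestamp, text)` tuple layout are hypothetical and chosen only for illustration.

```python
def build_index(versions):
    """Build a naive word-level suffix array.

    versions: list of (version_id, timestamp, text) tuples (assumed layout).
    Every suffix is stored explicitly, so space is quadratic in the worst
    case; a real index would store positions or use a suffix tree.
    """
    suffixes = []
    for vid, ts, text in versions:
        tokens = text.lower().split()
        for i in range(len(tokens)):
            suffixes.append((tuple(tokens[i:]), ts, vid))
    suffixes.sort(key=lambda s: s[0])
    keys = [s[0] for s in suffixes]
    return keys, suffixes


def _bound(keys, phrase, strict):
    """First index whose length-|phrase| head is >= phrase (or > if strict)."""
    m = len(phrase)
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        head = keys[mid][:m]
        if head < phrase or (strict and head == phrase):
            lo = mid + 1
        else:
            hi = mid
    return lo


def time_travel_query(index, phrase, t):
    """Return sorted ids of versions containing `phrase` with timestamp <= t."""
    keys, suffixes = index
    p = tuple(phrase.lower().split())
    left = _bound(keys, p, strict=False)
    right = _bound(keys, p, strict=True)
    # Linear scan over the matching suffix range; the thesis replaces this
    # step with a range-query structure to reach its stated query bound.
    hits = {vid for _, ts, vid in suffixes[left:right] if ts <= t}
    return sorted(hits)
```

For example, after indexing three short versions with timestamps 1, 2, and 3, calling `time_travel_query(index, "quick brown", 2)` returns only the versions that contain the phrase and were created at time 2 or earlier. The two binary searches cost O(|P| log n) token comparisons, while the filter step is linear in the number of occurrences rather than in the output size k.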

Introduction
1 Organization
Preliminaries
1 Generalized Suffix Tree
2 2-Dimensional Orthogonal Range Query
The Framework
1 Problem Definition
2 Index Method
3 Query Method
4 An Example Instantiation
Practical Optimization
1 Data Preprocessing
2 Framework Minimization
3 Query Enhancement
Experiment
1 Experimenting the Index Space
2 Experimenting the Query Performance
Further Discussion
Conclusion
                                

[1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Efficient temporal keyword queries over versioned text. In Proc. of ACM CIKM Conf, 2010.
[2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Temporal index sharding for space-time efficiency in archive search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 545–554. ACM, 2011.
[3] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Index maintenance for time-travel text search. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 235–244. ACM, 2012.
[4] Peter G Anick and Rex A Flynn. Versioning a full-text information retrieval system. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 98–111. ACM, 1992.
[5] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, Luo Si, et al. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
[6] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. A time machine for text search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 519–526. ACM, 2007.
[7] Gobinda Chowdhury. Introduction to modern information retrieval. Facet publishing, 2010.
[8] J Shane Culpepper, Matthias Petri, and Falk Scholer. Efficient in-memory top-k document retrieval. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 225–234. ACM, 2012.
[9] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Compressing and searching XML data via two zips. In Proceedings of the 15th international conference on World Wide Web, pages 751–760. ACM, 2006.
[10] Paolo Ferragina and Rossano Venturini. System and method for string processing and searching using a compressed permuterm index, August 29 2007. US Patent App. 11/897,427.
[11] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
[12] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[13] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
[14] Jinru He and Torsten Suel. Faster temporal range queries over versioned text. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 565–574. ACM, 2011.
[15] Jinru He and Torsten Suel. Optimizing positional index structures for versioned document collections. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 245–254. ACM, 2012.
[16] Jinru He, Hao Yan, and Torsten Suel. Compact full-text indexing of versioned document collections. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 415–424. ACM, 2009.
[17] Jinru He, Junyuan Zeng, and Torsten Suel. Improved index compression techniques for versioned document collections. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1239–1248. ACM, 2010.
[18] Hitwise. Hitwise: Search queries are getting longer.
[19] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 713–722. IEEE, 2009.
[20] http://www.keyworddiscovery.com/. Keyword and search engines statistics.
[21] Stefan Kurtz. Reducing the space requirement of suffix trees. Software: Practice and Experience, 29(13):1149–1171, 1999.
[22] Ben Langmead, Cole Trapnell, Mihai Pop, Steven L Salzberg, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10(3):R25, 2009.
[23] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[24] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, 2009.
[25] Chi-Man Liu, Thomas Wong, Edward Wu, Ruibang Luo, Siu-Ming Yiu, Yingrui Li, Bingqiang Wang, Chang Yu, Xiaowen Chu, Kaiyong Zhao, et al. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics, 28(6):878–879, 2012.
[26] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
[27] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
[28] S Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 657–666. Society for Industrial and Applied Mathematics, 2002.
[29] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 2013.
[30] Enno Ohlebusch, Johannes Fischer, and Simon Gog. CST++. In String Processing and Information Retrieval, pages 322–333. Springer, 2010.
[31] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[32] Manish Patil, Sharma V Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, and Sabrina Chandrasekaran. Inverted indexes for phrases and strings. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 555–564. ACM, 2011.
[33] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
[34] Gerard Salton, Edward A Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
[35] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT’08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.
[36] Ian H Witten, Alistair Moffat, and Timothy C Bell. Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.

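Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)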
