Graduate student: | 辛忠翰 Hsin, Chung-Han |
---|---|
Thesis title: | 版本文件中對時間查詢最佳化之索引架構的更省空間實作 More Space Efficient and Practical Framework for Time Travel Phrase Queries on Versioned Documents |
Advisor: | 韓永楷 Hon, Wing-Kai |
Oral defense committee: | 李哲榮 Lee, Che-Rung; 盧錦隆 Lu, Chin-Lung |
Degree: | Master |
Department: | |
Year of publication: | 2017 |
Academic year of graduation: | 105 |
Language: | English |
Number of pages: | 32 |
Chinese keywords: | time-travel query, suffix tree, compressed suffix tree, inverted index, common ancestor |
Foreign-language keywords: | time-travel query, compressed suffix tree, inverted list, least common ancestor, k^2 treaps, Orthogonal Range Query |
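The number of versioned documents is growing ever faster, so the need to answer time-travel phrase queries is becoming increasingly pressing. Many methods exist for improving the performance of answering time-travel phrase queries. In this thesis, we take the implementation of "Space-Efficient Index Framework for Time-Travel Phrase Query on Versioned Document" as our reference. However, we found that this implementation uses too much index space, so we aim to provide new implementations that reduce its index space requirement. To this end, we built two new methods: Impl1, a compressed suffix tree that treats each version as an independent file, and Impl2, a compressed suffix tree combined with K^2-Treaps. Compared with Kuo's implementation, Impl1 reduces index space by 63% and Impl2 by 80%. Like Kuo's implementation, both also support time-travel phrase queries and Top-k queries.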
Versioned documents are growing rapidly, and answering time-travel phrase queries is becoming increasingly necessary. Many methods have been proposed to improve the performance of such queries. In this thesis, we build on the implementation of "Space-Efficient Index Framework for Time-Travel Phrase Query on Versioned Document". However, we find that this implementation (Kuo's) uses a large amount of index space, so we aim to reduce it. To this end, we present two implementations: Impl1, a compressed suffix tree that treats versions as different documents, and Impl2, a compressed suffix tree combined with K^2-Treaps. Compared with the total index size of Kuo's implementation, Impl1 reduces index space by 63% and Impl2 by 80%. Both implementations also support time-travel phrase queries and Top-k queries.
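To make the query type concrete, the following is a minimal brute-force sketch of what a time-travel phrase query and a Top-k query return. It is not the index built in this thesis: it scans every version instead of using a compressed suffix tree or K^2-Treaps. The names `Version`, `time_travel_phrase_query`, and `top_k`, the half-open validity interval, and the ranking by phrase frequency are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Version:
    """One version of a document (hypothetical layout, for illustration only)."""
    doc_id: int
    start: int  # first timestamp at which this version is valid (inclusive)
    end: int    # timestamp at which this version is superseded (exclusive)
    text: str


def time_travel_phrase_query(versions: List[Version], phrase: str,
                             t_start: int, t_end: int) -> List[Version]:
    """Return versions that contain `phrase` and are valid at some time in [t_start, t_end]."""
    return [
        v for v in versions
        # Interval overlap test between [v.start, v.end) and [t_start, t_end].
        if v.start <= t_end and v.end > t_start and phrase in v.text
    ]


def top_k(versions: List[Version], phrase: str,
          t_start: int, t_end: int, k: int) -> List[Version]:
    """Return the k matching versions with the most occurrences of `phrase`.

    Ranking by occurrence count is an assumed scoring function.
    """
    hits = time_travel_phrase_query(versions, phrase, t_start, t_end)
    return sorted(hits, key=lambda v: v.text.count(phrase), reverse=True)[:k]


versions = [
    Version(1, 0, 5, "suffix tree index"),
    Version(1, 5, 10, "compressed suffix tree and suffix tree"),
    Version(2, 3, 8, "inverted list"),
]
recent = time_travel_phrase_query(versions, "suffix tree", 6, 7)
best = top_k(versions, "suffix tree", 0, 10, 1)
```

Here `recent` holds only the second version of document 1, since the first version expired before time 6 and document 2 does not contain the phrase. An index such as those described above aims to answer the same question without scanning every version.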
[1] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Efficient temporal keyword queries over versioned text. In Proc. of ACM CIKM Conf, 2010.
[2] Avishek Anand, Srikanta Bedathur, Klaus Berberich, and Ralf Schenkel. Temporal index sharding for space-time efficiency in archive search. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 545–554. ACM, 2011.
[3] Ira Assent, Ralph Krieger, Farzad Afschari, and Thomas Seidl. The ts-tree: efficient time series search and retrieval. In Proceedings of the 11th international conference on Extending database technology: Advances in database technology, pages 252–263. ACM, 2008.
[4] Klaus Berberich, Srikanta Bedathur, Thomas Neumann, and Gerhard Weikum. Fluxcapacitor: efficient time-travel text search. In Proceedings of the 33rd international conference on Very large data bases, pages 1414–1417. VLDB Endowment, 2007.
[5] Nieves R Brisaboa, Guillermo De Bernardo, Roberto Konow, and Gonzalo Navarro. k2-treaps: Range top-k queries in compact space. In International Symposium on String Processing and Information Retrieval, pages 215–226. Springer, 2014.
[6] Shu Yao Chien, Vassilis J Tsotras, and Carlo Zaniolo. Xml document versioning. ACM SIGMOD Record, 30(3):46–53, 2001.
[7] Shu Yao Chien, Vassilis J Tsotras, Carlo Zaniolo, and Donghui Zhang. Storing and querying multiversion xml documents using durable node numbers. In Web Information Systems Engineering, 2001. Proceedings of the Second International Conference on, volume 1, pages 232–241. IEEE, 2001.
[8] Francisco Claude, Antonio Fariña, Miguel A Martínez-Prieto, and Gonzalo Navarro. Indexes for highly repetitive document collections. In Proceedings of the 20th ACM international conference on Information and knowledge management, pages 463–468. ACM, 2011.
[9] Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science, 410(51):5354–5364, 2009.
[10] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005.
[11] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
[12] Jinru He and Torsten Suel. Faster temporal range queries over versioned text. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 565–574. ACM, 2011.
[13] Jinru He and Torsten Suel. Optimizing positional index structures for versioned document collections. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 245–254. ACM, 2012.
[14] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Space-efficient framework for top-k string retrieval problems. In Foundations of Computer Science, 2009. FOCS’09. 50th Annual IEEE Symposium on, pages 713–722. IEEE, 2009.
[15] Chun-Ting Kuo and Wing-Kai Hon. Practical index framework for efficient time-travel phrase queries on versioned documents. In Data Compression Conference (DCC), 2016, pages 556–565. IEEE, 2016.
[16] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
[17] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
[18] J Ian Munro, Venkatesh Raman, and S Srinivasa Rao. Space efficient suffix trees. Journal of Algorithms, 39(2):205–222, 2001.
[19] Enno Ohlebusch, Johannes Fischer, and Simon Gog. Cst++. In String Processing and Information Retrieval, pages 322–333. Springer, 2010.
[20] Enno Ohlebusch and Simon Gog. A compressed enhanced suffix array supporting fast string matching. In String Processing and Information Retrieval, pages 51–62. Springer, 2009.
[21] Giuseppe Ottaviano and Rossano Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 273–282. ACM, 2014.
[22] Luís MS Russo, Gonzalo Navarro, and Arlindo L Oliveira. Fully-compressed suffix trees. In LATIN 2008: Theoretical Informatics, pages 362–373. Springer, 2008.
[23] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589–607, 2007.
[24] R Schneider, B Seeger, N Beckmann, and H Kriegel. The r*-tree: an efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Symposium on Principles of Database Systems, pages 322–331, 1990.
[25] Chien Shu-Yao, Vassilis J Tsotras, and Carlo Zaniolo. Version management of xml documents. In The world wide web and databases, pages 184–200. Springer, 2001.
[26] Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In String Processing and Information Retrieval, pages 164–175. Springer, 2009.
[27] Vyacheslav Zholudev and Michael Kohlhase. Tntbase: a versioned storage for xml. In Proceedings of Balisage: The Markup Conference, volume 3, page 64, 2009.