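Author: 吳仕斌
Wu, Shih-Bin
Thesis title: 有效率的索引文件及建構
Efficient Index for Retrieving Top-k Most Frequent Documents and Its Construction
Advisor: 韓永楷
Hon, Win-Kai
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2009
Academic year of graduation: 97
Language: English
Pages: 47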
Chinese keywords: document indexing, data indexing, suffix tree
English keywords: document retrieval, text retrieval, suffix tree, top-k
  • In the document retrieval problem, we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that, for any subsequent input pattern P, we can identify which documents in the collection contain P. In this thesis, we study a natural extension of this problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, the focus is on identifying the k documents with the most occurrences of P. This problem forms a basis for search-engine tasks that retrieve documents ranked by the TF-IDF metric.
    A related problem has been studied, in which the emphasis is on retrieving all documents in which the number of occurrences of P exceeds some frequency threshold f. However, from the information retrieval point of view, it is hard for a user to specify such a threshold f and to have a sense of how many documents will be output. We develop some additional building blocks that help the user overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem, answering queries in O(|P| + log D log log D + k) time and taking O(D log D) space. Our approach is based on a novel use of the suffix tree, called the induced generalized suffix tree (IGST).


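    In the document retrieval problem, we are given a collection of documents (long strings) of total length D, and the goal is to build an index over this collection so that, for any string P, we can quickly determine which documents contain P. In this thesis, we propose an extension of this problem, called top-k document retrieval. In this problem, we do not list all documents that contain P; instead, we list only the k documents in which P occurs most often. This problem is a foundation of search engines.

    Muthukrishnan proposed a related problem: finding the documents in which P occurs more than f times. From the information retrieval point of view, however, it is hard for a user to know what value of f will yield a reasonable or meaningful number of output documents. In this thesis, we propose some solutions for this situation and use them to obtain an efficient index for top-k document retrieval. Our index answers queries in O(|P| + log D log log D + k) time and uses only O(D log D) space. Our approach is based on a variant of the widely used suffix tree, which we call the Induced Generalized Suffix Tree.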
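
    To make the two query types in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the IGST-based index of the thesis: it rescans every document for each query, so it costs time proportional to D per query rather than O(|P| + log D log log D + k). The function names (count_occurrences, top_k_documents, documents_above_threshold) are hypothetical, and counting overlapping occurrences and breaking ties by smaller document number are assumptions made for illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps (assumption)."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (doc_id, count) pairs, most occurrences first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by smaller doc_id (an assumption for determinism).
    """
    counted = []
    for doc_id, doc in enumerate(documents):
        c = count_occurrences(doc, pattern)
        if c > 0:
            counted.append((doc_id, c))
    return heapq.nlargest(k, counted, key=lambda pair: (pair[1], -pair[0]))


def documents_above_threshold(documents, pattern, f):
    """Return (doc_id, count) for documents where the count exceeds f."""
    result = []
    for doc_id, doc in enumerate(documents):
        c = count_occurrences(doc, pattern)
        if c > f:
            result.append((doc_id, c))
    return result


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabbage", "aaaa"]
    print(top_k_documents(docs, "a", 2))
    print(documents_above_threshold(docs, "a", 2))
```

    With the threshold query, the caller must guess f without knowing how many documents will come back; with the top-k query, the output size is fixed by k, which is the limitation the abstract describes.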

    1 Introduction
        1.1 Our Problems
        1.2 Thesis Organization
    2 Basic Tools
        2.1 Generalized Suffix Tree
        2.2 Optimal Index for Colored Range Query
        2.3 Y-Fast Trie for Efficient Successor Query
    3 Induced Generalized Suffix Tree
        3.1 IGST-f: The IGST for Frequency f
        3.2 Array Representation of IGST-f
    4 Efficient Index for Top-k Document Retrieval
    5 Construction Algorithms
        5.1 Augmenting The Count Values
            5.1.1 Time Analysis
    6 Experimental Results
        6.1 The Experimental Setup
            6.1.1 The Data Set
            6.1.2 The Indexes
            6.1.3 The Platform
        6.2 Our Limitations
        6.3 Performance Comparison Among Various Indexes
            6.3.1 Benefit of Count Information
            6.3.2 Adaptation vs Ours: Which One to Choose
            6.3.3 Array vs Y-Fast Trie: Which One to Choose
        6.4 Case Study: Our Index with Array Representation
            6.4.1 Index Space Distribution
            6.4.2 Query Time Distribution
            6.4.3 Benefit of Heuristic I
    7 Conclusion and Open Problems

    [1] M. A. Bender and M. Farach-Colton. The LCA Problem Revisited. In Proceedings of Latin American Symposium on Theoretical Informatics, pages 88-94, 2000.
    [2] I. Bialynicka-Birula and R. Grossi. Rank-Sensitive Data Structures. In Proceedings of International Symposium on String Processing and Information Retrieval, pages 79-90, 2005.
    [3] R. S. Boyer and J. S. Moore. A Fast String Searching Algorithm. Communications of the ACM, 20(10):762-772, 1977.
    [4] Human Mitochondria Genome Database. Department of Genetics and Pathology, Uppsala University, Sweden, http://www.genpat.uu.se/mtDB.
    [5] eMule. Wikipedia, the free encyclopedia (based on 28 July 2008 version). http://en.wikipedia.org/wiki/EMule.
    [6] W. K. Hon, R. Shah, and J. S. Vitter. Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Department of CS, Purdue University, 2006.
    [7] R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. Technical Report TR-31-81, Aiken Computational Laboratory, Harvard University, 1981.
    [8] D. E. Knuth, J. H. Morris, and V. B. Pratt. Fast Pattern Matching in Strings. SIAM Journal on Computing, 6(2):323-350, 1977.
    [9] Zipf's Law. Wikipedia, the free encyclopedia (based on 22 July 2008 version). http://en.wikipedia.org/wiki/Zipf's law.
    [10] V. Mäkinen and G. Navarro. Position-Restricted Substring Searching. In Proceedings of LATIN, pages 703-714, 2006.
    [11] Y. Matias, S. Muthukrishnan, S. C. Sahinalp, and J. Ziv. Augmenting Suffix Trees, with Applications. In Proceedings of European Symposium on Algorithms, pages 67-78, 1998.
    [12] E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262-272, 1976.
    [13] S. Muthukrishnan. Efficient Algorithms for Document Retrieval Problems. In Proceedings of Symposium on Discrete Algorithms, pages 657-666, 2002.
    [14] K. Sadakane. Succinct Representations of LCP Information and Improvements in the Compressed Suffix Arrays. In Proceedings of Symposium on Discrete Algorithms, pages 225-232, 2002.
    [15] SourceForge. SourceForge.net: Open Source Software. http://sourceforge.net.
    [16] N. Välimäki and V. Mäkinen. Space-Efficient Algorithms for Document Retrieval. In Proceedings of Symposium on Combinatorial Pattern Matching, pages 205-215, 2007.
    [17] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of Symposium on Switching and Automata Theory, pages 1-11, 1973.
    [18] D. E. Willard. Log-Logarithmic Worst-Case Range Queries are Possible in Space Θ(N). Information Processing Letters, 17(2):81-84, 1983.
    [19] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA, USA, 1999.
