
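Author: 盧彥光 (Yen-Kuang Lu)
Thesis title: Mining Frequent Subtrees over Data Streams Using Closed Subtrees (在資料串流環境中使用封閉子樹作頻繁子樹探勘)
Advisor: 陳良弼 (Arbee L.P. Chen)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 43
Keywords (Chinese): 資料串流, 探勘, 常見
Keywords (English): data stream, mining, frequent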
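  • Tree structures are used in many applications, such as biotechnology, XML databases, and web log databases. Mining frequent tree structures is valuable for query optimization, structure classification, and website recommendation. However, one challenge in frequent tree mining is that the number of candidate subtrees grows exponentially with the size of the tree. One possible way to handle this volume of information is to use closed subtrees. In this thesis, we propose a method called CSTMer that finds frequent subtrees by first finding the closed subtrees. In this method, we mainly introduce a compressed data structure, the closed prefix tree, which keeps the information needed to find the closed subtrees. The number of closed subtrees is usually smaller than the number of all subtrees, so the memory required for mining frequent subtrees is reduced. In addition, we compare our method with STMer, an algorithm for mining frequent subtrees. The experimental results show that our method substantially reduces the memory requirement.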


    Tree structures are commonly adopted in many applications, such as bioinformatics, XML databases, and web log databases. Mining frequent tree patterns can be valuable for query optimization, pattern classification, and web recommendation. However, one of the challenges in mining frequent subtrees is that the number of candidate subtrees may grow exponentially with the tree size. To deal with such massive information, closed subtrees are one of the promising solutions. In this thesis, we propose an approach, named CSTMer (Closed Stream Tree Miner), for mining frequent subtrees over data streams by discovering the closed subtrees first. In this approach, we mainly introduce a compact structure, named the closed global prefix tree, which maintains the information needed for deriving the closed subtrees. The set of closed subtrees is usually smaller than the set of all frequent subtrees, so the memory needed for mining frequent subtrees can be reduced. In addition, we compare the proposed approach with STMer, which discovers all frequent subtrees. The experimental results show that our approach greatly reduces memory consumption.
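To make the notion of closedness concrete, the following is a minimal sketch; it is not the thesis's CSTMer algorithm and does not model the closed global prefix tree. It assumes labeled ordered trees encoded as nested tuples, induced-subtree containment, a static in-memory database rather than a stream, support counted as the number of database trees containing a pattern, and a hand-written candidate list. The names `node`, `matches_at`, `contains`, `support`, and `closed_subtrees` are hypothetical. A frequent subtree is treated as closed when no proper supertree among the candidates has the same support.

```python
def node(label, *children):
    """Build a labeled ordered tree as (label, tuple_of_children)."""
    return (label, tuple(children))


def matches_at(p, t):
    """True if pattern p embeds at the root of t as an induced ordered subtree."""
    if p[0] != t[0]:
        return False
    i = 0  # index of the next pattern child that still needs a match
    for child in t[1]:
        # Greedy left-to-right matching keeps the sibling order intact.
        if i < len(p[1]) and matches_at(p[1][i], child):
            i += 1
    return i == len(p[1])


def contains(t, p):
    """True if p embeds at some node of t."""
    return matches_at(p, t) or any(contains(c, p) for c in t[1])


def support(p, db):
    """Number of trees in db that contain p (assumed support definition)."""
    return sum(contains(t, p) for t in db)


def closed_subtrees(candidates, db, min_support):
    """Keep frequent candidates that have no proper supertree of equal support.

    Assumption: `candidates` already lists every frequent subtree, so the
    result is only correct relative to that list.
    """
    sup = {p: support(p, db) for p in candidates}
    frequent = [p for p in candidates if sup[p] >= min_support]
    return [
        p for p in frequent
        if not any(q != p and sup[q] == sup[p] and contains(q, p)
                   for q in frequent)
    ]


# Toy database: two copies of A(B, C) and one A(B).
db = [
    node("A", node("B"), node("C")),
    node("A", node("B"), node("C")),
    node("A", node("B")),
]
# Every induced subtree that occurs in the toy database.
candidates = [
    node("A"), node("B"), node("C"),
    node("A", node("B")),
    node("A", node("C")),
    node("A", node("B"), node("C")),
]
closed = closed_subtrees(candidates, db, min_support=2)
```

In this toy database the six frequent subtrees reduce to two closed ones, A(B) with support 3 and A(B, C) with support 2. The support of every other frequent subtree equals the largest support among the closed subtrees that contain it, which is why storing only the closed subtrees loses no information.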

    Abstract
    List of Figures
    Chapter 1: Introduction
      1.1 Main difficulties
      1.2 Related work
      1.3 Our contributions
      1.4 Organization
    Chapter 2: Preliminaries
      2.1 Graph concepts
        2.1.1 Labeled ordered trees
        2.1.2 Induced subtrees
      2.2 Problem definition
        2.2.1 Online frequent tree mining problem
        2.2.2 Closed frequent subtrees
      2.3 Representing trees as strings
      2.4 Lossy Counting
    Chapter 3: The STMer Method
      3.1 The framework of STMer and the GPT
      3.2 Subtree generation
      3.3 Subtree maintenance
    Chapter 4: The CSTMer algorithm
      4.1 The Closed Global Prefix Tree
      4.2 Maintenance of the CGPT
        4.2.1 The pruning technique
        4.2.2 The closedness link
      4.3 Discovering the closed frequent subtrees
      4.4 The adaptation of Lossy Counting
      4.5 The accuracy of Lossy Counting
    Chapter 5: Experiments
      5.1 Experimental dataset
      5.2 Performance comparison
    Chapter 6: Conclusion and future work
    References


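    Full-text release date: full text not authorized for public release (campus network)
    Full-text release date: full text not authorized for public release (off-campus network)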
