Mining Frequent Subtrees over Data Streams Using Closed Subtrees

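Graduate student:	盧彥光 Yen-Kuang Lu
Thesis title:	Mining Frequent Subtrees over Data Streams Using Closed Subtrees (在資料串流環境中使用封閉子樹作頻繁子樹探勘)
Advisor:	陳良弼 Arbee L.P. Chen
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	English
Number of pages:	43
Chinese keywords:	資料串流 (data stream), 探勘 (mining), 常見 (frequent)
English keywords:	data stream, mining, frequent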

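Tree structures are used in many applications, such as biotechnology, XML databases, and web log databases. Mining frequent tree structures is valuable for query optimization, structure classification, and website recommendation. However, one challenge of frequent tree mining is that the number of candidate subtrees grows exponentially with the size of the tree. One possible way to handle this volume of information is to use closed subtrees. In this thesis, we propose a method called CSTMer, which finds frequent subtrees by first finding the closed subtrees. In this method, we mainly introduce a compact data structure, the closed prefix tree, which stores the information needed to find the closed subtrees. The number of closed subtrees is usually smaller than the number of all subtrees, so the memory required for mining frequent subtrees is reduced. In addition, we compare our method with STMer, an algorithm for mining frequent subtrees. The experimental results show that our method greatly reduces memory requirements.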

Tree structures are commonly adopted in many applications, such as bioinformatics, XML databases, and web log databases. Mining frequent tree patterns can be valuable for query optimization, pattern classification, and web recommendation. However, one of the challenges in mining frequent subtrees is that the number of candidate subtrees may grow exponentially with the tree size. Closed subtrees are a promising way to deal with such massive information. In this thesis, we propose an approach, named CSTMer (Closed Stream Tree Miner), for mining frequent subtrees over data streams by discovering the closed subtrees first. In this approach, we introduce a compact structure, named the closed global prefix tree (CGPT), which maintains the information needed for deriving the closed subtrees. The set of closed subtrees is usually smaller than the set of all frequent subtrees, so the memory needed for mining frequent subtrees can be reduced. In addition, we compare the proposed approach with STMer, which discovers all frequent subtrees. The experimental results show that our approach greatly reduces memory consumption.
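To make the notion of a closed subtree concrete, the following Python sketch may help. It is not the CSTMer algorithm and does not use the closed global prefix tree. It assumes the usual definition (a frequent pattern is closed when no proper super-pattern has the same support) and restricts itself to chain-shaped trees, written as tuples of labels, so that "subtree of" reduces to "contiguous segment of". The names `is_proper_subpath`, `closed_only`, `recover_support`, and the toy support counts are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# Assumption: a chain-shaped tree (every node has at most one child),
# written root first, e.g. ("a", "b", "c") for a -> b -> c.
Path = Tuple[str, ...]


def is_proper_subpath(small: Path, big: Path) -> bool:
    """True if `small` is a strictly shorter contiguous segment of `big`."""
    if len(small) >= len(big):
        return False
    return any(big[i:i + len(small)] == small
               for i in range(len(big) - len(small) + 1))


def closed_only(support: Dict[Path, int]) -> Dict[Path, int]:
    """Keep the patterns that have no proper super-pattern with equal support."""
    return {
        p: s for p, s in support.items()
        if not any(is_proper_subpath(p, q) and support[q] == s for q in support)
    }


def recover_support(pattern: Path, closed: Dict[Path, int]) -> int:
    """Support of a frequent pattern: the largest support among the
    closed patterns that contain it."""
    return max(s for q, s in closed.items()
               if q == pattern or is_proper_subpath(pattern, q))


# Toy support counts (made up for illustration) for the three chains
# a->b->c, a->b->c and a->b, with a minimum support of 2.
toy_support: Dict[Path, int] = {
    ("a",): 3, ("b",): 3, ("c",): 2,
    ("a", "b"): 3, ("b", "c"): 2,
    ("a", "b", "c"): 2,
}

toy_closed = closed_only(toy_support)
```

In this toy case six frequent patterns collapse to two closed ones, `("a", "b")` and `("a", "b", "c")`, and `recover_support` can still return the support of every frequent pattern from those two entries alone. That is the sense in which storing only closed subtrees can save memory without losing the frequent subtrees.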

Abstract
List of Figures
Chapter 1: Introduction
1 Main difficulties
2 Related work
3 Our contributions
4 Organization
Chapter 2: Preliminaries
1 Graph concepts
1.1 Labeled ordered trees
1.2 Induced subtrees
2 Problem definition
2.1 Online frequent tree mining problem
2.2 Closed frequent subtrees
3 Representing trees as strings
4 Lossy Counting
Chapter 3: The STMer Method
1 The framework of STMer and the GPT
2 Subtree generation
3 Subtree maintenance
Chapter 4: The CSTMer Algorithm
1 The Closed Global Prefix Tree
2 Maintenance of the CGPT
2.1 The pruning technique
2.2 The closedness link
3 Discovering the closed frequent subtrees
4 The adaptation of Lossy Counting
5 The accuracy of Lossy Counting
Chapter 5: Experiments
1 Experimental dataset
2 Performance comparison
Chapter 6: Conclusion and future work
References

                                


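Full-text release date: full text not authorized for public release (campus network)
Full-text release date: full text not authorized for public release (off-campus network)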
