
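Author: 林志祥 (Chih-Hsiang Lin)
Thesis title: 在資料串流環境中探勘具時間性滑動窗限制之頻繁項目集
Mining Frequent Itemsets in Time-Sensitive Sliding Window over Data Streams
Advisor: 陳良弼 (Arbee L.P. Chen)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: English
Number of pages: 48
Keywords (Chinese): 頻繁項目集, 資料串流, 資料探勘
Keywords (English): Frequent Itemset, Data Stream, Data Mining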
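  • Frequent itemset mining has been widely discussed and applied over the last decade. Past research on frequent itemset mining has usually targeted static transaction databases. However, as data streams grow in importance, how to apply the traditional frequent itemset mining problem to the emerging data stream environment has become an interesting and important issue.
    A data stream has the following characteristics: its length is infinite, its data arrive at a very high rate, and memory in a data stream environment is usually limited. Because of these characteristics, processing such data must satisfy the following restrictions: data must be processed quickly, storing all the data that flow through the stream is not allowed, and backtracking over past data in the stream is not allowed either. Under these restrictions, the main problem in mining frequent itemsets in a data stream environment is how to continuously mine the complete set of frequent itemsets, without missing any, when not all of the data can be seen. In this thesis, we introduce the concept of a time-sensitive sliding window and propose an efficient method for mining frequent itemsets in this streaming environment under the time-sensitive sliding window. Our method consists mainly of a frequent itemset storage structure and a discounting table of variable size. By building and maintaining the frequent itemset storage structure and the discounting table, all frequent itemsets are found completely. Experimental results also show that both the execution time and the memory usage of our method are very modest.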


    Mining frequent itemsets has been widely studied over the last decade. Past research focuses on mining frequent itemsets from static transaction databases, and it is challenging to extend these techniques to the new data stream environment. This environment has the following characteristics: the length of the data stream is infinite, the data arrival rate is high, and only limited memory can be used. Because of these characteristics, processing a data stream must obey the following restrictions: the response time must be short, the complete data stream cannot be stored, and no backtracking over the data stream is allowed. Under these restrictions, the main difficulty of mining frequent itemsets lies in continuously discovering the complete set of frequent itemsets. In this thesis, we propose a new approach for mining frequent itemsets in the time-sensitive sliding window model over data streams, with a guarantee of either no false alarms or no false dismissals. A time-sensitive sliding window is a variation of the sliding window that uses time as the basic counting unit. Our approach consists of a frequent itemset storage structure that captures all possible frequent itemsets and a discounting table of adaptable size that provides approximate counts of the expired data. By constructing and maintaining the storage structure and the discounting table, the complete set of frequent itemsets can be mined. Experimental results demonstrate that the execution time of our approach is small across different minimum support thresholds, discounting table sizes, and data sets.
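The time-sensitive sliding window described in the abstract can be illustrated with a minimal Python sketch. It is a naive exact baseline and not the thesis's method: it assumes transactions arrive grouped into numbered time units, and it keeps a full table of itemset counts for every unit still in the window so that expired counts can be subtracted exactly, whereas the thesis uses a discounting table to provide approximate counts of the expired data. The class name `TimeWindowItemsetCounter`, its parameters, and the `max_size` cap on itemset length are hypothetical names introduced only for illustration.

```python
from collections import Counter, deque
from itertools import combinations


class TimeWindowItemsetCounter:
    """Naive exact baseline (assumption: not the thesis's data structures).

    Keeps one Counter of itemset counts per time unit in the window.
    """

    def __init__(self, window_units, min_support, max_size=2):
        self.window_units = window_units  # window length, in time units
        self.min_support = min_support    # minimum support, a fraction in (0, 1]
        self.max_size = max_size          # hypothetical cap on itemset size
        self.units = deque()              # (unit_id, itemset counts, n_transactions)
        self.totals = Counter()           # itemset counts over the whole window
        self.n_transactions = 0           # transactions currently in the window

    def add_unit(self, unit_id, transactions):
        """Add the transactions of one time unit, then expire old units."""
        counts = Counter()
        for transaction in transactions:
            items = sorted(set(transaction))
            for k in range(1, self.max_size + 1):
                # Each k-item subset of the transaction is counted once.
                counts.update(combinations(items, k))
        self.units.append((unit_id, counts, len(transactions)))
        self.totals.update(counts)
        self.n_transactions += len(transactions)
        # Time, not the number of transactions, decides what expires.
        while self.units and self.units[0][0] <= unit_id - self.window_units:
            _, old_counts, old_n = self.units.popleft()
            self.totals.subtract(old_counts)
            self.n_transactions -= old_n

    def frequent_itemsets(self):
        """Return itemsets whose count meets the support threshold."""
        threshold = self.min_support * self.n_transactions
        return {s: c for s, c in self.totals.items() if c > 0 and c >= threshold}
```

Calling `add_unit(unit_id, transactions)` once per time unit and then `frequent_itemsets()` returns the itemsets that are frequent in the current window. Storing exact per-unit counts is what makes this baseline memory-hungry on a real stream, which is the cost that an approximate summary of expired data is meant to avoid.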

    Contents
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1. Introduction
    2. Preliminaries and System Framework
        2-1 Preliminaries
        2-2 System Framework
    3. Derive Approximate Count of Itemsets
        3-1 Discounting Table
        3-2 Discounting Table With Merging Loss
    4. Mining Algorithms and Proofs of Guarantees
        4-1 Algorithms for frequent itemsets mining over data streams
        4-2 Proof of no false dismissal guarantee
        4-3 Proof of no false alarm guarantee
    5. Experiments
        5-1 Experiment Set-Up
        5-2 Experiment Results
    6. Conclusion and Future Work
    References


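    Full-text release date: this full text is not authorized for public release (on-campus network)
    Full-text release date: this full text is not authorized for public release (off-campus network)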