
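Author: Wang, En-Tzu (王恩慈)
Thesis title: Mining Frequent Itemsets over Distributed Data Streams (於分散式資料串流上探勘高頻樣型)
Advisor: Chen, Arbee L.P. (陳良弼)
Oral defense committee:
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 89
Keywords (Chinese): 分散式資料串流 (distributed data streams), 資料探勘 (data mining), 高頻樣型 (frequent itemsets), 連續分散式模型 (continuous distributed model), 以雜湊為基礎之方法 (hash-based approach)
Keywords (English): Distributed Data Streams, Data Mining, Frequent Itemsets, Continuous Distributed Model, Hash-based Approach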
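  • In many present-day applications, data are generated continuously in the form of streams. Because uncovering the knowledge hidden in stream data can be highly beneficial, mining over data streams has become an active research topic. Mining frequent itemsets over a single data stream has been studied extensively; in many applications, however, multiple data streams may be distributed across different remote computers. In this dissertation we therefore focus on mining global frequent itemsets over distributed data streams. Rather than gathering all stream data at a central server, which wastes the computing resources of the remote computers, we adopt a distributed computation approach. To speed up mining and support flexible mining requests, we are the first to define a new problem: given that every remote computer builds a local synopsis for the data stream it monitors, continuously maintain at a central server the global synopsis formed by all the data streams. When a user requests the global frequent itemsets, the mining result can be obtained immediately from the global synopsis currently held at the central server. We propose a distributed computation framework for this problem, which combines two tasks: (1) designing a new synopsis structure, and (2) designing, on top of this structure, the communication strategies and merging method needed to maintain the global synopsis continuously.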

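    Most studies on mining frequent itemsets over a single stream assume that the synopsis structure can be stored in full, overlooking the fact that the non-frequent itemsets kept in the system can sharply reduce memory utilization. We therefore store the information of all itemsets in compressed form in a fixed-size hash table and develop a hash-based online-processing method for mining frequent itemsets over a single stream. The proposed method stores the complete stream information in a hash table and includes a new technique for estimating the support counts of non-frequent itemsets, so only the frequent itemsets need to be stored in the synopsis structure, which speeds up mining. Based on this newly designed synopsis structure, we further propose two communication strategies and a merging method to maintain the global synopsis continuously. The communication strategies are designed around the accuracy of the mining results and tell the remote computers when to send which data to the central server; the merging method describes how the information received by the central server is merged into the global synopsis. Together they achieve the goal of continuously maintaining a global synopsis and thereby speeding up global frequent itemset mining. Every algorithm proposed in this dissertation comes with a corresponding correctness proof, and the accuracy guarantee of the distributed computation approach is analyzed. We also design a series of simulation experiments on several synthetic datasets and one real dataset to verify the effectiveness and usefulness of the proposed methods.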


    In many applications, data are now generated in the form of continuous data streams. Because discovering the knowledge behind data streams can yield substantial benefits, mining over data streams has become an important research issue. Many approaches for mining frequent itemsets over a single data stream have been proposed. In many applications, however, multiple data streams are generated at distributed remote sites. In this dissertation, we concentrate on the problem of mining global frequent itemsets over a collection of data streams distributed at distinct remote sites. Instead of collecting and processing all the data at a central server, which wastes the computation resources of the remote sites, we perform distributed computations over the data streams. Moreover, to further speed up the mining process and provide more flexibility for mining requests, we make the first attempt to address a new problem: continuously maintaining, at the central server (called the coordinator), the global synopsis for the union of all the distributed data streams, under the condition that each remote site maintains its own local synopsis for the local stream it monitors. The global frequent itemsets can then be produced on demand by directly processing the global synopsis. We propose a distributed computation framework for continuously maintaining the global synopsis, which consists of two main tasks: 1) designing a local synopsis to summarize the local stream (equivalent to solving the problem of mining frequent itemsets over a single data stream) and 2) devising communication strategies and a merging operation based on the newly designed synopsis to achieve continuous maintenance.
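    The coordinator/remote-site arrangement described above can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the class names `RemoteSite` and `Coordinator`, the `drift_threshold` parameter, and the simple "send once the unsent count reaches a threshold" rule are hypothetical stand-ins, not the dissertation's local synopsis, communication strategies, or merging operation.

```python
from collections import Counter


class RemoteSite:
    """Hypothetical remote site keeping exact local counts (an assumption)."""

    def __init__(self, drift_threshold):
        self.local_counts = Counter()   # local synopsis stand-in
        self.last_sent = Counter()      # counts already reported to the coordinator
        self.drift_threshold = drift_threshold

    def observe(self, itemset):
        """Count one occurrence; return (itemset, delta) once the unsent
        drift reaches the threshold, otherwise None."""
        key = frozenset(itemset)
        self.local_counts[key] += 1
        drift = self.local_counts[key] - self.last_sent[key]
        if drift >= self.drift_threshold:
            self.last_sent[key] = self.local_counts[key]
            return key, drift
        return None


class Coordinator:
    """Hypothetical coordinator holding the global synopsis stand-in."""

    def __init__(self):
        self.global_counts = Counter()

    def merge(self, message):
        """Merge one (itemset, delta) message into the global counts."""
        key, delta = message
        self.global_counts[key] += delta

    def frequent_itemsets(self, min_count):
        """Answer a mining request directly from the global counts."""
        return {k for k, c in self.global_counts.items() if c >= min_count}
```

    In this sketch the coordinator's count for an itemset lags the true total by less than `drift_threshold` per site, which is the kind of trade-off between communication and accuracy that a communication strategy has to manage.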

    Most existing approaches to mining frequent itemsets over a single data stream assume that the synopses of data streams can be kept in memory, ignoring the fact that the information on non-frequent itemsets kept in the synopses may significantly degrade memory utilization. We therefore compress the information of all itemsets into a fixed-size structure using a hashing technique and propose a hash-based approach for mining frequent itemsets over a single data stream that operates in an online-processing mode, processing each transaction as soon as it arrives. This hash-based approach summarizes the information of the whole data stream in a hash table, provides a novel technique for estimating the support counts of non-frequent itemsets, and keeps only the frequent itemsets to speed up the mining process. Based on the newly designed synopsis used in the hash-based approach, two communication strategies are then designed for the distributed computation framework according to an accuracy guarantee on the mining results; they decide when and what the remote sites should transmit to the coordinator. A merging operation is also proposed for merging the information received from the remote sites into the global synopsis maintained at the coordinator. With these strategies and the merging operation, the global synopsis can be maintained continuously for efficient global frequent itemset mining. Correctness guarantees for all the proposed algorithms and an accuracy guarantee analysis of the distributed computation framework are presented. A series of experiments on synthetic datasets and a real dataset is also performed to show the effectiveness and efficiency of the hash-based approach over a single data stream and of the distributed computation framework over distributed data streams.
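    As a rough illustration of summarizing itemset counts in a fixed-size hash table, the sketch below uses several hashed counter rows and takes the minimum as the estimate. It is a generic counting sketch written under assumptions, not the hMiner algorithm or the hSynopsis structure: the class name `HashedItemsetCounter` and the parameters `num_rows`, `num_buckets`, and `max_itemset_size` are hypothetical, and items are assumed to be strings.

```python
import hashlib
from itertools import combinations


class HashedItemsetCounter:
    """Fixed-size table of counters shared by all itemsets (illustrative only)."""

    def __init__(self, num_rows=3, num_buckets=64, max_itemset_size=2):
        self.num_rows = num_rows
        self.num_buckets = num_buckets
        self.max_itemset_size = max_itemset_size  # assumed cap on enumerated itemsets
        self.table = [[0] * num_buckets for _ in range(num_rows)]
        self.num_transactions = 0

    def _bucket(self, row, itemset):
        # Deterministic per-row hash of the itemset, independent of item order.
        key = f"{row}|{','.join(sorted(itemset))}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def add_transaction(self, items):
        """Process one transaction immediately (online mode)."""
        self.num_transactions += 1
        unique_items = sorted(set(items))
        for size in range(1, self.max_itemset_size + 1):
            for itemset in combinations(unique_items, size):
                for row in range(self.num_rows):
                    self.table[row][self._bucket(row, itemset)] += 1

    def estimate(self, itemset):
        """Estimated support count; collisions can only inflate it."""
        return min(
            self.table[row][self._bucket(row, itemset)]
            for row in range(self.num_rows)
        )

    def is_frequent(self, itemset, min_support):
        """Check the estimate against a relative support threshold."""
        return self.estimate(itemset) >= min_support * self.num_transactions
```

    Memory use stays at `num_rows * num_buckets` counters no matter how many distinct itemsets appear; the price is that hash collisions can overestimate a count, so an estimate is never below the true count for the enumerated itemset sizes.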

    1 Introduction
    2 Related Works
      2.1 Related Works on Continuous Distributed Model
      2.2 Related Works on Mining Frequent Item(set)s over a Single Data Stream
    3 Mining Frequent Itemsets over a Single Data Stream
      3.1 Preliminaries
        3.1.1 Problem Definition
        3.1.2 Basic Concepts of hCount and Lossy Counting
          3.1.2.1 Introduction to hCount
          3.1.2.2 Introduction to Lossy Counting
          3.1.2.3 Drawbacks of hCount and Lossy Counting
        3.1.3 Basic Concepts of the Hash Function
      3.2 The Mining Approach: hMiner
        3.2.1 Data Structure of hSynopsis
        3.2.2 Maintenance of hSynopsis
          3.2.2.1 Phase I: Identify the Frequent Itemsets
          3.2.2.2 Phase II: Remove the Non-Frequent Itemsets
          3.2.2.3 An Example for hSynopsis Maintenance
        3.2.3 Mining Frequent Itemsets
        3.2.4 Correctness Guarantee
      3.3 An Adaption of hMiner: Batch-hMiner
      3.4 Accuracy Guarantee Analyses
        3.4.1 Parameter Setting and Error Analysis of the Hash Table
        3.4.2 Discussion on Accuracy Guarantee of hMiner
      3.5 Performance Evaluation
        3.5.1 Performance Criteria
        3.5.2 Experiment Setup
        3.5.3 Experiment Results
          3.5.3.1 Experiment Results on a Varying rho
          3.5.3.2 Experiment Results on a Varying epsilon
          3.5.3.3 Experiment Results on Scalability
          3.5.3.4 Experiment Results on Testing Distinct Datasets
          3.5.3.5 Discussion on the Limitations of hMiner
      3.6 A Brief Summary of hMiner
    4 Mining Frequent Itemsets over Distributed Data Streams
      4.1 Problem Formulation
      4.2 Overview of the Distributed Computation Framework
      4.3 Merging Operation
      4.4 Communication Strategies
        4.4.1 The Less-Communication-Oriented Strategy (LCO)
        4.4.2 The Status-Changing-Alarm Strategy (SCA)
        4.4.3 Comparing the Two Communication Strategies
      4.5 Global Mining Algorithm: GhMiner
      4.6 Correctness and Accuracy Guarantees
        4.6.1 Correctness Guarantees
        4.6.2 Accuracy Guarantee
      4.7 Performance Evaluation
        4.7.1 Experiment Setup
        4.7.2 Experiment Results
          4.7.2.1 Experiment Results over a Varying delta
          4.7.2.2 Experiment Results on a Varying mu
          4.7.2.3 Experiment Results on a Varying k
          4.7.2.4 Experiment Results on Scalability
          4.7.2.5 Experiment Results on Testing Distinct Datasets
        4.7.3 Discussion on the Efficiency of Distributed Computation Framework
      4.8 A Brief Summary on the Distributed Computation Framework
    5 Conclusions and Future Works
    References


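    Full-text release date: the full text is not authorized for public release (on-campus network)
    Full-text release date: the full text is not authorized for public release (off-campus network)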
