Graduate student: | 王恩慈 Wang, En-Tzu |
---|---|
Thesis title: | 於分散式資料串流上探勘高頻樣型 Mining Frequent Itemsets over Distributed Data Streams |
Advisor: | 陳良弼 Chen, Arbee L.P. |
Oral defense committee: | |
Degree: | Doctor (博士) |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2010 |
Academic year of graduation: | 98 |
Language: | English |
Number of pages: | 89 |
Keywords (Chinese): | 分散式資料串流、資料探勘、高頻樣型、連續分散式模型、以雜湊為基礎之方法 |
Keywords (English): | Distributed Data Streams, Data Mining, Frequent Itemsets, Continuous Distributed Model, Hash-based Approach |
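In many of today's applications, data are continuously generated in the form of streams. Because uncovering the knowledge hidden in stream data can bring great benefit, mining over data streams has become a popular research topic. Mining frequent itemsets over a single data stream has been studied extensively; in many applications, however, there may be multiple data streams distributed across different remote computers. In this dissertation we therefore focus on mining global frequent itemsets over distributed data streams. Rather than gathering all stream data at a central server, which wastes the computing resources of the remote computers, we adopt a distributed computation approach. To speed up the mining process and support flexible mining requests, we are the first to define a new problem: continuously maintaining, at a central server, the global synopsis formed by all the data streams, given that each remote computer builds a local synopsis for the stream it monitors. When a user requests the global frequent itemsets, the result can be obtained immediately from the global synopsis currently held at the central server. We propose a distributed computation framework for this problem, which combines two tasks: (1) designing a new synopsis structure, and (2) designing, on top of this structure, the communication strategies and merging method needed to continuously maintain the global synopsis.
Most studies on mining frequent itemsets over a single stream assume that the synopsis structure can be stored in full, overlooking the fact that non-frequent itemsets kept in the system can sharply reduce memory utilization. We therefore store the information of all itemsets in compressed form in a fixed-size hash table and develop a hash-based online processing method for mining frequent itemsets over a single stream. The proposed method stores the complete stream information in a hash table and includes a new technique for estimating the support counts of non-frequent itemsets, so only the frequent itemsets need to be kept in the synopsis structure, which speeds up mining. Based on this newly designed synopsis structure, we further propose two communication strategies and a merging method for continuously maintaining the global synopsis. The communication strategies are designed around the accuracy of the mining results and tell the remote computers when to send which data to the central server; the merging method describes how the information received by the central server is merged into the global synopsis. Together they achieve the goal of continuously maintaining a global synopsis and thereby speeding up global frequent itemset mining. Every algorithm proposed in this dissertation comes with a corresponding correctness proof, and the accuracy guarantee provided by the distributed computation approach is analyzed. We also design a series of simulation experiments on several synthetic datasets and one real dataset to verify the effectiveness and usefulness of the proposed methods.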
In many recent applications, data are generated in the form of continuous data streams. Because handling data streams is necessary and discovering the knowledge behind them can yield substantial benefits, mining over data streams has become an important research issue. Many approaches for mining frequent itemsets over a single data stream have been proposed. In many applications, however, multiple data streams are generated at distributed remote sites. In this dissertation, we concentrate on the problem of mining global frequent itemsets over a collection of data streams distributed at distinct remote sites. Instead of collecting and processing all the data at a central server, which wastes the computation resources of the remote sites, we perform distributed computations over the data streams. Moreover, to further speed up the mining process and provide more flexibility for mining requests, we make the first attempt to address a new problem: continuously maintaining, at the central server (named the coordinator), the global synopsis for the union of all the distributed data streams, under the condition that each remote site maintains its own local synopsis for the local stream it monitors. The global frequent itemsets can therefore be produced on demand by directly processing the global synopsis. We propose a distributed computation framework for continuously maintaining the global synopsis, which consists of two main tasks: 1) designing a local synopsis to summarize the local stream (equivalent to solving the problem of mining frequent itemsets over a single data stream) and 2) devising communication strategies and a merging operation rooted in the newly designed synopsis to achieve continuous maintenance.
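The coordinator and remote-site roles described above can be illustrated with a small Python sketch. It is not the framework proposed in the dissertation: the class names `RemoteSite` and `Coordinator`, the exact `Counter`-based synopses, and the "send after every `batch_size` transactions" rule are all assumptions made for illustration, standing in for the hash-based synopsis and the accuracy-driven communication strategies.

```python
from collections import Counter
from itertools import combinations


def itemsets(transaction, max_size=2):
    """Enumerate the non-empty itemsets of a transaction with at most max_size items."""
    items = sorted(set(transaction))
    for k in range(1, max_size + 1):
        yield from combinations(items, k)


class RemoteSite:
    """Hypothetical remote site; its local synopsis is a plain exact counter."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size   # placeholder rule for when to transmit
        self.local = Counter()         # local synopsis: itemset -> count
        self.unsent = Counter()        # counts not yet sent to the coordinator
        self.unsent_transactions = 0

    def observe(self, transaction):
        """Process one transaction of the local stream as it arrives."""
        self.unsent_transactions += 1
        for itemset in itemsets(transaction):
            self.local[itemset] += 1
            self.unsent[itemset] += 1

    def ready(self):
        """Assumed strategy: transmit once batch_size transactions are pending."""
        return self.unsent_transactions >= self.batch_size

    def flush(self):
        """Hand over the pending counts and start a new batch."""
        delta, n = self.unsent, self.unsent_transactions
        self.unsent, self.unsent_transactions = Counter(), 0
        return delta, n


class Coordinator:
    """Hypothetical coordinator holding the global synopsis."""

    def __init__(self):
        self.global_counts = Counter()
        self.total = 0                 # transactions merged so far

    def merge(self, delta, n):
        """Merge the counts received from one remote site."""
        self.global_counts.update(delta)
        self.total += n

    def frequent(self, min_support):
        """Answer a mining request directly from the global synopsis."""
        threshold = min_support * self.total
        return {s for s, c in self.global_counts.items() if c >= threshold}


sites = [RemoteSite(), RemoteSite()]
coordinator = Coordinator()
streams = [[["a", "b"], ["a", "c"]], [["a", "b"], ["b"]]]
for site, stream in zip(sites, streams):
    for transaction in stream:
        site.observe(transaction)
        if site.ready():
            coordinator.merge(*site.flush())
print(sorted(coordinator.frequent(0.5)))
```

Because the coordinator only merges deltas, a mining request never has to contact the remote sites; the trade-off is that the global synopsis lags behind each site by whatever counts are still pending there.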
Most existing approaches for mining frequent itemsets over a single data stream assume that the synopses of the data streams can be kept in memory, and ignore the fact that information on non-frequent itemsets kept in the synopses may significantly degrade memory utilization. We therefore compress the information of all itemsets into a fixed-size structure using a hashing technique and propose a hash-based approach that operates in an online-processing mode, processing each transaction immediately on arrival, for mining frequent itemsets over a single data stream. This hash-based approach summarizes the information of the whole data stream in a hash table, provides a novel technique for estimating the support counts of non-frequent itemsets, and keeps only the frequent itemsets in order to speed up the mining process. Based on the newly designed synopsis used in this hash-based approach, two communication strategies are designed for the distributed computation framework according to an accuracy guarantee on the mining results; they decide when and what the remote sites should transmit to the coordinator. A merging operation is also proposed for merging the information received from the remote sites into the global synopsis maintained at the coordinator. With these strategies and the merging operation, the goal of continuously maintaining the global synopsis for efficient global frequent itemset mining is achieved. The correctness guarantees of all the proposed algorithms and an analysis of the accuracy guarantee of the distributed computation framework are presented. A series of experiments on synthetic datasets and a real dataset is also performed to show the effectiveness and efficiency of the hash-based approach over a single data stream and of the distributed computation framework over distributed data streams.
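The idea of compressing the counts of all itemsets into a fixed-size hash table can be illustrated with a minimal Python sketch. It is a generic single-table hashed counter, not the synopsis or the estimation technique of the dissertation; the class name `HashedItemsetCounter`, the use of `zlib.crc32` as the hash function, and the limit of two items per itemset are assumptions chosen to keep the example short.

```python
import zlib
from itertools import combinations


class HashedItemsetCounter:
    """Illustrative fixed-size synopsis: all itemsets share `size` counters."""

    def __init__(self, size=1024, max_itemset_size=2):
        self.buckets = [0] * size          # memory use is fixed up front
        self.max_itemset_size = max_itemset_size
        self.transactions = 0

    def _bucket(self, itemset):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        key = "\x1f".join(itemset).encode("utf-8")
        return zlib.crc32(key) % len(self.buckets)

    def add(self, transaction):
        """Online processing: update the table as each transaction arrives."""
        self.transactions += 1
        items = sorted(set(transaction))
        for k in range(1, self.max_itemset_size + 1):
            for itemset in combinations(items, k):
                self.buckets[self._bucket(itemset)] += 1

    def estimate(self, itemset):
        """Upper bound on the support count; collisions can only inflate it."""
        return self.buckets[self._bucket(tuple(sorted(set(itemset))))]

    def is_possibly_frequent(self, itemset, min_support):
        """Never misses a frequent itemset, but may report false positives."""
        return self.estimate(itemset) >= min_support * self.transactions


counter = HashedItemsetCounter(size=64)
for transaction in [["a", "b"], ["a", "c"], ["a", "b"]]:
    counter.add(transaction)
print(counter.estimate(["a", "b"]), counter.is_possibly_frequent(["a"], 0.5))
```

In this simplified form the table size directly trades memory for accuracy: fewer buckets mean more collisions and looser estimates, while the memory footprint stays constant no matter how many distinct itemsets the stream contains.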