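Graduate student: | 林俊學 Lin, Chun-Hsueh |
---|---|
Thesis title: | 利用微簇之方法在資料串流上執行階層式分群 Hierarchical Clustering on Streaming Data Using Micro Cluster |
Advisor: | 陳良弼 Chen, Arbee L. P. |
Oral defense committee: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2009 |
Academic year of graduation: | 97 |
Language: | English |
Number of pages: | 35 |
Keywords (Chinese): | 資料探勘 (data mining), 資料串流分群 (data stream clustering), 微簇 (micro cluster) |
Keywords (English): | data mining, data stream clustering, micro cluster, Delaunay triangulation |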
With advances in technology, data in many recent applications take the form of data streams rather than stored data sets. Research on analyzing data streams has therefore received considerable attention in recent years. Among the open issues, finding clusters over data streams is particularly important because clustering is widely used in applications such as behavior analysis. This thesis studies a setting in which multiple clustering tasks with different parameters are requested simultaneously over data streams. In this setting, the approaches proposed in the existing literature for finding clusters do not work effectively or efficiently. We therefore develop a novel clustering algorithm for efficiently processing multiple clustering tasks with different parameters.
The proposed algorithm is based on the hierarchical clustering method, which progressively merges clusters, so that multiple clustering tasks can be processed efficiently. Through extensive experiments with synthetic data, we demonstrate the efficiency of the proposed approach as a solution for finding clusters over data streams.
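The idea that one progressive merge sequence can serve several clustering tasks can be illustrated with a small sketch. This is not the algorithm of the thesis, which is not detailed in this record. It assumes single-linkage merging with Euclidean distance over a handful of made-up micro-cluster centers, and the names `build_merge_history` and `cut_into_k` are hypothetical. The merge history is built once, and each task with a different number of clusters `k` only replays a prefix of it.

```python
from itertools import combinations
from math import dist


def _find(parent, i):
    """Return the representative of i's group (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def build_merge_history(centers):
    """Single-linkage agglomerative merging over micro-cluster centers.

    Returns the merges in the order they happen, as (distance, i, j).
    Assumption: Euclidean distance, single linkage (Kruskal-style).
    """
    parent = list(range(len(centers)))
    edges = sorted(
        (dist(centers[i], centers[j]), i, j)
        for i, j in combinations(range(len(centers)), 2)
    )
    merges = []
    for d, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # the two centers are still in different clusters
            parent[rj] = ri
            merges.append((d, i, j))
    return merges


def cut_into_k(n, merges, k):
    """Replay the first n - k merges to obtain k clusters; one label per center."""
    parent = list(range(n))
    for _, i, j in merges[: n - k]:
        parent[_find(parent, j)] = _find(parent, i)
    ids = {}
    # Relabel groups as 0, 1, 2, ... in order of first appearance.
    return [ids.setdefault(_find(parent, i), len(ids)) for i in range(n)]


# Hypothetical micro-cluster centers (illustrative data only).
centers = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
history = build_merge_history(centers)  # built once

# Several clustering tasks with different k share the same history.
for k in (3, 2):
    print(k, cut_into_k(len(centers), history, k))
```

In this sketch the expensive step, sorting pairwise distances and merging, happens once, and each additional task costs only a replay of the stored merges.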