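Graduate student: | 吳自強 Tzu-Chiang Wu |
---|---|
Thesis title: | 在資料串流環境下估計移動和 Maintaining Moving Sums over Data Streams |
Advisor: | 陳良弼 Arbee L.P. Chen |
Oral defense committee: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2005 |
Academic year of graduation: | 93 (ROC calendar) |
Language: | English |
Number of pages: | 45 |
Keywords (Chinese): | 資料串流 (data stream), 移動和 (moving sum), 滑動視窗 (sliding window) |
Keywords (English): | data stream, moving sum, sliding window |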
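In a data stream environment, the elements of the stream are numerical values that come from different data sources. We consider how to design a concise data structure for estimating, for each data source, the sum of its elements within the sliding window.
The difficulty of the problem lies in handling two variables at once: the number of data sources and the number of elements in the sliding window. With a large number of data sources, computing the sum for each source requires a considerable number of counters. With a large number of elements in the sliding window, retaining every element in the window also requires a large amount of storage.
We propose two methods to handle these two variables. With our methods, counters can be shared efficiently to record the sums of different data sources, which reduces the number of counters required. At the same time, the elements in the sliding window are merged systematically, which reduces the space needed to store them.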
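The data structure built by our methods takes two parameters, epsilon and delta. Epsilon controls the range of the estimate, and delta determines the confidence level that the estimate falls within that range. Our methods guarantee that, at a confidence level of 1-delta, the estimate falls within the range given by epsilon.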
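To illustrate how two parameters such as epsilon and delta can size a structure of shared counters, here is a minimal Python sketch in the style of a count-min sketch. It is an assumption, not the thesis's method: the abstract does not name the data structure it uses. The class name `SharedCounterSketch`, the sizing formulas, and the salted use of Python's built-in `hash` are all hypothetical choices made for brevity, and the sketch ignores the sliding window entirely.

```python
import math
import random


class SharedCounterSketch:
    """Count-min-style grid of counters shared by many data sources.

    Hypothetical illustration; not the data structure from the thesis.
    """

    def __init__(self, epsilon, delta, seed=0):
        # Standard count-min sizing (an assumption here): a smaller epsilon
        # gives more counters per row, a smaller delta gives more rows.
        self.width = max(1, math.ceil(math.e / epsilon))
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        rng = random.Random(seed)
        # One random salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.rows = [[0.0] * self.width for _ in range(self.depth)]

    def _column(self, row, source):
        # Map a source to one counter in the given row.
        return hash((self.salts[row], source)) % self.width

    def add(self, source, value):
        """Add a non-negative value to the running sum of `source`."""
        for r in range(self.depth):
            self.rows[r][self._column(r, source)] += value

    def estimate(self, source):
        """Return an estimate that is never below the true sum of `source`."""
        return min(
            self.rows[r][self._column(r, source)] for r in range(self.depth)
        )
```

Because sources that collide share a counter, each row can only overestimate a source's sum when values are non-negative. Taking the minimum across rows keeps the smallest overestimate.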
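We further examine how the proposed methods differ with respect to properties such as space complexity and data timeliness. Finally, we compare through experiments the accuracy of the two methods' estimates under different settings, as well as the relationship between accuracy and space requirements.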
Given a data stream of numerical data elements generated from multiple sources, we consider the problem of maintaining the sum of the elements for each data source over a sliding window of the data stream. The difficulty of the problem is twofold: the number of data sources and the number of elements in the sliding window. For a massive number of data sources, we need a significant number of counters to maintain the sum for each data source, while for a large number of data elements in the sliding window, we need a huge amount of space to keep all of them. We propose two methods, which share the counters efficiently and merge the data elements systematically, so that we are able to estimate the sums using a concise data structure. Two parameters, epsilon and delta, are needed to construct the data structure. Epsilon controls the bounds of the estimate, and delta determines the confidence level that the estimate is within the bounds. The estimates of both methods are proven to be bounded within a factor of epsilon with probability 1-delta. A qualitative analysis is presented to contrast the two methods, and the experimental results further show their performance.
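The idea of merging window elements to save space can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not describe the merging scheme, so the sketch uses fixed-span time buckets, one partial sum per bucket. The names `BucketedWindowSum`, `window`, and `bucket_span` are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class BucketedWindowSum:
    """Approximate sum of the values seen in the last `window` time units.

    Hypothetical illustration; not the merging scheme from the thesis.
    """

    def __init__(self, window, bucket_span):
        self.window = window
        self.bucket_span = bucket_span
        self.buckets = deque()  # each entry: [bucket_start_time, partial_sum]
        self.total = 0.0

    def add(self, timestamp, value):
        """Merge a value into the bucket that covers `timestamp`."""
        start = (timestamp // self.bucket_span) * self.bucket_span
        if self.buckets and self.buckets[-1][0] == start:
            # Same bucket as the previous element: merge instead of storing.
            self.buckets[-1][1] += value
        else:
            self.buckets.append([start, value])
        self.total += value
        self._expire(timestamp)

    def _expire(self, now):
        # Drop buckets that end at or before the window's left edge.
        cutoff = now - self.window
        while self.buckets and self.buckets[0][0] + self.bucket_span <= cutoff:
            _, partial = self.buckets.popleft()
            self.total -= partial

    def estimate(self):
        """Sum of the retained buckets, as of the most recent `add`."""
        return self.total
```

Space is bounded by the number of buckets rather than the number of elements. The cost is accuracy: the oldest retained bucket may straddle the window boundary, so the estimate can include some values that have already expired. A smaller `bucket_span` reduces that error and uses more buckets.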