Graduate student: | 彭義凱 (Yi-Kai Peng) |
---|---|
Thesis title: | Applying Curve Fitting Techniques to Construct the Synopsis of Data Streams (應用曲線擬合建立資料串流之摘要結構) |
Advisor: | 鍾葉青 (Yeh-Ching Chung) |
Oral defense committee: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2006 |
Academic year of graduation: | 94 |
Language: | English |
Number of pages: | 25 |
Keywords (Chinese): | 資料串流 (data stream), 曲線擬合 (curve fitting), Synopsis |
Keywords (English): | Data Stream, Curve Fitting, Synopsis |
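A data stream is a real-time, continuous, and ordered sequence of data objects, and it is a data format now widely adopted for handling large volumes of dynamic data. Constantly changing content and unbounded storage requirements are the two main characteristics of data streams, and they are also the issues that must be overcome when processing them. For changing data, approximate answering is a widely used approach that provides answers matching the actual values to a certain degree. For storage space, several data structures have been proposed that store partial information about a data stream instead of all of its data. The purpose of building a synopsis of a data stream is to store partial information about the stream and to process user requests with specific algorithms, providing approximate answers. In this thesis, each object in the data stream is treated as a point in a planar coordinate system, so that curve fitting can summarize the data stream as a polynomial and calculus can be used to compute approximate answers. Experiments show that the algorithm can store the N objects of a data stream in O(log N) space and that the approximate answers reach 95% accuracy.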
A data stream is a real-time, continuous, and ordered sequence of data items. It is a widely used data format for handling large amounts of dynamic data. Dynamic content and an unbounded storage requirement are the two main characteristics of data streams, and both must be addressed when processing them. For the dynamic content issue, approximate answering is a widely used approach to processing queries on data streams. For the unbounded storage issue, several data structures have been proposed that summarize a data stream and keep the required storage space small. A synopsis is a data structure that summarizes a data stream. With suitable algorithms, users can obtain approximate answers about the data stream from the summarized information stored in the synopsis. In this thesis, we use curve fitting to construct the synopsis of a data stream in the form of a curve expressed by a polynomial function. We also propose algorithms for constructing the synopsis data structure and for querying the data stream. We prove that the storage space required by the proposed method is O(log N). From the experimental results, we observe that our approach can achieve 95% accuracy on data contents for the queries.
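The general idea of a polynomial synopsis can be illustrated with a minimal sketch. It assumes each stream item is treated as a point (position, value), that the polynomial is obtained by ordinary least squares, and that a range-sum query is approximated by integrating the fitted polynomial. The function names `fit_polynomial`, `integrate`, and `approx_range_sum` are hypothetical, and the sketch does not reproduce the thesis's own construction and query algorithms or its O(log N) space bound; it only stores one fixed-degree polynomial for the whole stream.

```python
from typing import List, Sequence


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares fit; returns a[0..degree] for p(x) = sum(a[k] * x**k)."""
    n = degree + 1
    # Normal equations (V^T V) a = V^T y, with V the Vandermonde matrix of xs.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def integrate(coeffs: Sequence[float], lo: float, hi: float) -> float:
    """Definite integral of the polynomial over [lo, hi]."""
    def antiderivative(x: float) -> float:
        return sum(a * x ** (k + 1) / (k + 1) for k, a in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def approx_range_sum(coeffs: Sequence[float], first: int, last: int) -> float:
    """Approximate the sum of items first..last (inclusive).

    Assumption: each item contributes the integral of p over a unit-width
    interval centred on its position (a midpoint-rule style estimate).
    """
    return integrate(coeffs, first - 0.5, last + 0.5)


if __name__ == "__main__":
    stream = [2 * i + 1 for i in range(100)]          # synthetic, illustrative data
    synopsis = fit_polynomial(range(len(stream)), stream, degree=2)
    print("exact sum of items 10..19:", sum(stream[10:20]))
    print("approximate sum from synopsis:", approx_range_sum(synopsis, 10, 19))
```

In this sketch the synopsis is just the coefficient list, so its size depends on the chosen degree rather than on the number of items seen. Solving the normal equations directly is adequate for low degrees and short position ranges; for higher degrees a better-conditioned basis would be a safer choice.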