研究生: |
陳鴻元 Hung-Yuan Chen |
---|---|
論文名稱: |
使用線段估計改進長條圖近似 Improving Histogram Approximation with Line Representation |
指導教授: |
陳良弼
Arbee L. P. Chen |
口試委員: | |
學位類別: |
碩士 Master |
系所名稱: |
電機資訊學院 - 資訊工程學系 Computer Science |
論文出版年: | 2006 |
畢業學年度: | 94 |
語文別: | 英文 |
論文頁數: | 42 |
中文關鍵詞: | 長條圖 、線段 、近似 |
外文關鍵詞: | Histogram, Line, Approximate |
相關次數: | 點閱:2 下載:0 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
長條圖是靠著的一些小量的函式來儲存,而且通常使用來估計資料的分佈情形。在資料庫中,記錄每一個單一屬性的長條圖可以幫助我們來評佔資料庫運算所需的代價。而為了執行查詢的最佳化,通常需要估量這種運算的代價,來決定一個有效率的查詢計畫。另外,長條圖也廣泛的使用在近似查詢系統以及資料探詢上面。這種在資料庫中事先儲存長條條圖的技術,需要用到大量的記憶體空間。所以,如何能夠把長條圖壓縮在一個固定的空間中,而使的壓縮造成的誤差為最小,已經被科學家們視為重要的問題,並且研究了多年。
一般最常用在壓縮長條圖的演算法,是將長條圖切成好幾個格子,然後把每個格子裡的頻率視為相同的數目。於是,這個問題就轉變成,我們要如何去決定最佳格子的邊界而使得壓縮的誤差是最小。對於這個問題,之前的研究已經提供了些不錯的解答。然而,在現實生活中,我們所知道的許多資料分布是非常偏斜的。所以,之前的方法對於這種資料分佈並不是有太好的估計效果,因為它們沒有考慮到資料分布的傾向。在這篇論文中,我們提出一個用線段來估計每一個格子的演算法。在資料分布是非常偏斜或是在現實生活的資料中,它可以更為準確的來壓縮長條圖。我們做了一連串的實驗,並且從實驗的結果中顯示,我們的方法在偏斜的資料分布下是更為精準的。
Histograms are popularly used to approximate data distribution by a small number of step functions. Maintaining histograms for every single attribute in databases helps us estimate the cost of database operations. The query optimizers usually require such estimation of cost to decide an efficient access query plan. Histograms are also widely used in approximate query answering systems and data mining. The techniques that store precomputed histograms in the database require some overhead of memory consumption. Therefore, the problem of compressing the histogram in a fixed amount of space with the least error is considered as an important issue and has been investigated by researchers for many years.
The most common algorithm to compress the histogram is to divide the histogram into buckets and estimate every bucket by uniform representation. The problem becomes how to choose the bucket boundaries to minimize the estimation error for a given number of buckets. The pervious approach has provided a desirable solution to this problem of bucket boundaries selection. However, many data distributions in real-life are well known to be extremely skewed. The pervious algorithms do not perform well for the real data because they do not consider the tendency of data distribution. In this paper, we propose an algorithm that utilizes a line segment to estimate each bucket in replace of uniform representation. The algorithm can construct the histogram more precisely when the data distribution is skewed and is more suitable for the real-world data. We performed a series of experiments, and the results show that our method has better accuracy when the data distribution is skewed.
[1] Acton, F. S. Analysis of Straight-Line Data. New York: Dover, 1966.
[2] S. Acharya, P. Hibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. Proceedings of ACM SIGMOD, pages 275-286, June 1999.
[3] S. Acharya, P. Hibbons, V. Poosala, and S. Ramaswamy. The Aqua Approximate Query Answering System. Proceedings of ACM SIGMOD, pages 574-578, June 1999
[4] M. N. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. Proceedings of ACM SIGMOD, 2002.
[5] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. Proceedings of ACM SIGMOD, Santa Barbara, May 2001
[6] S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and performance evaluation. Proceeding of ICDE, 2002
[7] S. Guha, N. Koudas, and K. Shim. Data-Streams and Histograms. In Proc. of STOC, 2001
[8] Y. Ioannidis. and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, pages 233-244, May 1995
[9] H. V. Jagadish, Nick Koudas, and S. Muthukrishnan. Mining Deviants in a Time Series Database. Proceedings of VLDB, 1999
[10] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, pages 275-286, August 1998.
[11] H. V. Jagadish, Nick Koudas, and K. C. Sevcik. Choosing Bucket Boundaries for a Histogram. University of Toronto Technical Report, TR-375, May 1998
[12] J. Lee, D. Kim, and C. Chung. Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. Proceedings of ACM SIGMOD, pages 205-214, June 1999.
[13] S. Muthukrishnan, R. Shah, and J. S. Vitter. Mining Deviants in Time Series Data Streams. Proceedings of SSDBM, Santorini Island, Greece, June 2004
[14] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proceedings of ACM SIGMOD, 1998.
[15] Otto B. Linear Algebra with Applications, 2nd ed. Prentice Hall, 2001
[16] V. Poosala and Y. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of VLDB, Athens Greece, pages 486-495, August 1997
[17] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improve Histograms for Selectivity Estimation of Range Predicates. Proceedings of ACM SIGMOD, pages 294-305, June 1996
[18] G. P. Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. Proceedings of ACM SIGMOD, pages 256-276, 1984.
[19] http://www.ics.uci.edu/~mlearn/MLRepository.html
[20] http://kdd.ics.uci.edu/