簡易檢索 / 詳目顯示

研究生: 陳弘軒
Hung-Hsuan Chen
論文名稱: 尋找資料串流之代表元素
Finding frequent items in data stream environments
指導教授: 王家祥
Jia-Shung Wang
口試委員:
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Computer Science
論文出版年: 2006
畢業學年度: 94
語文別: 英文
論文頁數: 45
中文關鍵詞: 資料串流熱門元素即時收視率調查阻斷服務攻擊偵測
外文關鍵詞: data stream, frequent item, on-line TV rating, DoS attack detection
相關次數: 點閱:3下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  •   隨著電子化時代的來臨,資訊量呈現爆炸性的增加,資料串流 (data stream) 所扮演的角色愈來愈重要,許多應用程式的資訊都以資料串流的型式傳輸與處理,諸如:無線感知網路 (wireless sensor networks) 的監控、即時熱門電視節目調查、阻斷服務攻擊 (deny of service attack) 的偵測等。資料串流的特性是:(1) 總
    資料量龐大,遠超過傳統資料庫或記憶裝置 (如:硬碟) 所能紀錄的量;(2) 資料更新頻率高;(3) 使用者對於資料的要求不再是單次的快照檢索 (one-time query),而是長期的連續檢索 (continuous query)。這些與傳統資料處理的相異點,使資料串流中,資訊統計量的計算變得很困難。
      這篇論文中,我們將提出一種在資料串流中尋找熱門元素的方式,其中,熱門元素指的是在資料串流中,出現頻率超過特定比例的元素。為了能符合資料串流的特性並提高實用性,我們的演算法將達到以下的目標:(1) 給定有限範圍內的任何一段時間,我們的系統將能有效率的回傳這段時間內的熱門元素;(2) 可採用分散式處理。我們的演算法必須能在局部刪整資料,且這個刪整資料的動作不可影響最後熱門元素統計的正確性;(3) 有效率的儲存方式。由於資料串流的資料量龐大,遠超過儲存裝置所能記載的數量,我們必須設計出一個有效率的資料結構來儲存熱門元素;(4) 低計算時間複雜度。資料串流的更新頻率頻繁,因此,此演算法必須能快速的計算出熱門元素的統計資訊。
      我們已經在數學上證明:當熱門元素存在時,此演算法保證正確。實驗也顯示:即使在熱門元素較稀少的環境中,我們的演算法回傳的結果也有很高的比例是出現次數較高的元素。


      In many real world applications, such as monitoring in sensor networks, online TV ratings, and DoS (deny of service) attack detection, information delivering is by the
    form of data streams. Differs from the traditional database systems, data streams have the following characteristics: (1) total amount of data is enormous, much more than traditional database systems or memory device (e.g. hard disk) can afford; (2) highly updating rate; (3) rather than one-time query, users tend to issue sort of continuous query instead. These differences make it difficult to calculate the statistics of data streams.
      In this thesis, a method to disclose frequent items in masses of data streams is proposed. Frequent items refer to those whose occurrence rates exceed a given frequency threshold, which can be specified by inquirers. In order to meet the characteristics of data stream processing and without losing the practicability, the proposed approach has the following objectives: (1) any query interval: Given a time period within the system-affordable range, the frequent items of this period will be responded efficiently; (2) distributed processing: the approach can separately prune away many occurrences from bottom to top without losing any fidelity; (3) storage saving: due to the enormous amount of occurrences, we have to design a mechanism to conserve the potential frequent items using only limited local counters; (4) computation saving: because of the highly updating rate, the proposed approach should be able to respond the statistical results of frequent items without delay.
      We proposed a distributed algorithm which fulfills the above requirements in this thesis as well as proved its correctness when frequent items do exist. To demonstrate
    the practicability in locating the proper frequent items we conduct some simulations, the experimental results show that most of the items generated are of eminent occurrence rates even though the frequent items are eccentric.

    Chapter 1 Introduction ...................................1 Chapter 2 Related Works ..................................3 2.1 The Concept of Data Stream Processing.................3 2.2 Reveal Frequent Items in Continuous Query Processing .5 Chapter 3 The Proposed Approach and Corresponding Algorithms................................................7 3.1 The Basic Majority Algorithm.................................................7 3.2 Distributed Majority Algorithm................................................11 3.3 Majority Algorithm for Data Stream Input ............17 3.3.1 Majority Counting with Data Stream Input ..........17 3.3.2 Majority Counting with Limited Storage Space ......20 3.4 Finding Frequent Items ..............................26 3.5 Some Improvements ...................................35 3.5.1 Raise Success Ratio ...............................35 3.5.2 Reduce the Impact of the Order of Input Elements ..36 Chapter 4 Experimental Results...........................37 4.1 Normal Distribution..................................37 4.2 Data Generated by Zipf Distribution ............................................39 Chapter 5 Conclusions and Future Works...................42 References...............................................44

    [1] The STREAM Project homepage: http://www-db.stanford.edu/stream/
    [2] The Aurora Project homepage: http://www.cs.brown.edu/research/aurora/
    [3] The Niagara Project homepage: http://www.cs.wisc.edu/niagara/
    [4] The Telegraph Project homepage: http://telegraph.cs.berkeley.edu/
    [5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” in Proceedings of the ACM Symposium on Principles of
    Database Systems (PODS), 2002.
    [6] M. Datar, A. Gionis, P. Indyk, and R. Motwani, Maintaining Stream Statistics over Sliding Windows,” in Proceedings of the 13th ACM SIAM Symposium on
    Discrete Algorithms, 2002.
    [7] Robert S. Boyer and J Strother Moore, “MJRTY - A Fast Majority Vote Algorithm,” in Technical Report ICSCA-CMP-32, 1982.
    [8] G. S. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams,” in Proceedings of Very Large Data Bases (VLDB), 2002.
    [9] M. Charikar, K. Chen, and M. Farach-Colton, “Finding Frequent Items in Data Streams,” in Proceedings of International Colloquium on Automata, Language, and Programming (ICALP), 2002
    [10] J. Misra and D. Gries, “Finding Repeated Elements,”in Science of Computer Programming, 1982.
    [11] E. Demaine, A. Lopez-Ortiz, and J. I. Munro: Frequency Estimation of Internet Packet Streams with Limited Space,” in Proceedings of the 10th Annual European Symposium on Algorithms, 2002.
    [12] G. Cormode and S. Muthukrishman, “What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically,” in ACM Principles of Database Systems (PODS), 2003.
    [13] A. Arasu and G. Manku, “Approximate Counts and Quantiles over Sliding Windows,” in ACM Principles of Database Systems (PODS), 2004.
    [14] Nan Jiang and Le Gruenwald, “Research Issues in Data Stream Association Rule Mining,” in ACM Special Interest Group on Management of Data (SIGMOD), 2006.
    [15] S. K. Park and K. W. Miller, “Random Number Generators: Good Ones Are Hard To Find,” in Communications of ACM, 1988.

    無法下載圖示 全文公開日期 本全文未授權公開 (校內網路)
    全文公開日期 本全文未授權公開 (校外網路)
    全文公開日期 本全文未授權公開 (國家圖書館:臺灣博碩士論文系統)
    QR CODE