
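Author: 李東穎 (Tung-Ying Lee)
Thesis title: Mining Serial Episode Rules with Successor Lag Times over Multiple Data Streams (Chinese title: 在多重資料串流環境中探勘序列性段落規則及其後繼之延遲時間)
Advisor: 陳良弼 (Arbee L.P. Chen)
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 42
Keywords (Chinese): 段落規則, 延遲時間, 資料串流, 資料探勘
Keywords (English): episode rule, lag time, data stream, data mining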
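    Mining serial episode rules from traditional databases is an important problem that has been studied for many years. A serial episode is a set of events that occur in order within a short span of time. If a serial episode occurs often, it is called a frequent serial episode. A serial episode rule consists of two frequent serial episodes with a time constraint between them: after one serial episode occurs, the other often occurs some time later. In recent years, applications such as network monitoring and traffic monitoring have made the concept of data streams increasingly important. Streaming data may arrive rapidly and may be unbounded; because computing and memory resources are limited, discovering serial episode rules in a data stream environment is challenging. In this thesis we propose a framework for mining serial episode rules over data streams. We use a prefix tree to represent all frequent serial episodes compactly: a path in the prefix tree represents a frequent serial episode, and its occurrence count is recorded at the last node of that path. If the time points at which each episode occurs are kept in the corresponding node, the serial episode rules can be found by traversing the prefix tree. To save memory, we periodically remove episodes that occur too rarely. However, the structure may still be too large, so it is changed to record only the count of each time difference between any two events. The number of event types is far smaller than the number of episode types, so the advanced method saves space; we also prove that every episode rule the original method can discover can also be found by the advanced method. Experiments show that the advanced method outperforms the basic method in both memory usage and mining computation time.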
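    The prefix-tree idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `Node`, `EpisodeTrie`, `insert`, `prune`, and `episodes` are hypothetical, and the sketch assumes that episode occurrences have already been detected and are handed to the trie one at a time. It only shows how a shared-prefix structure can hold a count and the time positions at the last node of each path, and how rarely seen episodes can be removed periodically.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One trie node; the path from the root to it spells an episode."""
    count: int = 0                                 # occurrences of the episode ending here
    times: list = field(default_factory=list)      # time positions of those occurrences
    children: dict = field(default_factory=dict)   # event type -> child Node


class EpisodeTrie:
    """Hypothetical container, assumed for illustration only."""

    def __init__(self):
        self.root = Node()

    def insert(self, episode, time):
        """Record one occurrence of `episode` (a tuple of event types) at `time`."""
        node = self.root
        for event in episode:
            node = node.children.setdefault(event, Node())
        node.count += 1
        node.times.append(time)

    def prune(self, min_count):
        """Drop episodes seen fewer than `min_count` times.

        A node whose own count is too low is kept only as a structural
        prefix when some longer episode below it survives.
        """
        def keep(node):
            node.children = {e: c for e, c in node.children.items() if keep(c)}
            if node.count < min_count:
                node.count, node.times = 0, []
            return node.count > 0 or bool(node.children)

        keep(self.root)

    def episodes(self):
        """Yield (episode, count, times) for every episode still recorded."""
        stack = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.count:
                yield prefix, node.count, list(node.times)
            for event, child in node.children.items():
                stack.append((prefix + (event,), child))


# Made-up occurrences: (episode, time position).
trie = EpisodeTrie()
for episode, t in [(("A", "B"), 3), (("A", "B"), 9), (("A", "C"), 5),
                   (("A",), 1), (("A",), 7)]:
    trie.insert(episode, t)

trie.prune(min_count=2)  # removes ("A", "C"), which was seen only once
found = {ep: (count, times) for ep, count, times in trie.episodes()}
print(found)
```

    Because the time positions are stored per episode, memory grows with the number of occurrences, which is the cost that motivates the more compact structure described in the abstract.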


    Mining serial episode rules from large databases is an important problem that has been studied by many researchers. A serial episode is a sequence of events that occur in order and close to each other in time, and it is frequent if it occurs often. A serial episode rule is composed of two frequent serial episodes X and Y, where Y often occurs after X within a time constraint. In recent years, applications such as network monitoring and traffic monitoring have motivated the concept of data streams. Streaming data arrive at a high rate and can be unbounded, so discovering serial episode rules over data streams is challenging given limited CPU and memory resources. In this work, we propose a framework for serial episode rule mining over data streams. We use a prefix trie so that frequent serial episodes sharing the same prefix are represented compactly. Each path of the trie represents an episode X, and the frequency of X is recorded in the last node of the path, together with the time positions at which X occurs. Since all frequent episodes and the information needed to form rules are recorded, the serial episode rules can be generated by traversing the trie. To save memory, we periodically remove serial episodes with low frequencies from the trie. However, the trie can still be too large. The trie is therefore revised to keep only the time differences between any two events instead of all the time positions of each serial episode. Because the number of event types is much smaller than the number of episodes, this advanced method is space-efficient. We also prove that each episode rule can still be generated from the revised data structure. The experiments show that the advanced method outperforms the original one in both memory usage and rule generation time.
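    The space-saving idea of keeping only time differences between pairs of events can be sketched as follows. This is an illustrative assumption, not the data structure defined in the thesis: the class `LagCounts`, its methods, the `max_lag` bound, and the sample stream are all hypothetical. The sketch counts, for every ordered pair of event types, how often each positive time difference up to `max_lag` is observed, so its size depends on the number of event types and on `max_lag` rather than on the number of distinct episodes.

```python
from collections import Counter, deque


class LagCounts:
    """Hypothetical table: (earlier_event, later_event, lag) -> count."""

    def __init__(self, max_lag):
        self.max_lag = max_lag
        self.recent = deque()    # (time, event) pairs within max_lag of the newest event
        self.counts = Counter()

    def add(self, time, event):
        """Consume one event; events are assumed to arrive in non-decreasing time order."""
        # Forget events that are too old to pair with the new one.
        while self.recent and time - self.recent[0][0] > self.max_lag:
            self.recent.popleft()
        # Pair the new event with every retained earlier event.
        for earlier_time, earlier_event in self.recent:
            lag = time - earlier_time
            if lag > 0:
                self.counts[(earlier_event, event, lag)] += 1
        self.recent.append((time, event))

    def lag_histogram(self, first, second):
        """Return {lag: count} for `second` following `first`."""
        return {lag: n for (a, b, lag), n in self.counts.items()
                if (a, b) == (first, second)}


# Made-up stream of (time, event type) pairs.
stream = [(1, "A"), (2, "B"), (4, "A"), (5, "B"), (6, "C"), (8, "A"), (9, "B")]
table = LagCounts(max_lag=3)
for t, e in stream:
    table.add(t, e)

hist = table.lag_histogram("A", "B")
print(hist)  # {1: 3}: "B" followed "A" by one time unit three times
```

    In this toy stream, "B" follows "A" after exactly one time unit on three occasions, which is the kind of evidence a rule with a successor lag time would rest on. How the thesis turns such pairwise information into full episode rules is not specified in the abstract and is not reproduced here.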

    Abstract (in Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1. Introduction
    2. Preliminaries and Related Works
        2.1. Episodes, Episode Rules, and Time lags
            2.1.1. Episode and Episode Rules
            2.1.2. Episode Rules with Time Lags
            2.1.3. Counting Strategy
        2.2. Mining over Data Streams and Space-Efficient Synopses
            2.2.1. Mining Frequent Items
            2.2.2. Mining Frequent Itemsets
            2.2.3. Mining Serial Episodes
            2.2.4. Mining Serial Episode Rules
        2.3. The Prefix Tree and Serial Episodes
        2.4. Traffic Flow Theory
            2.4.1. Traffic Flow Characteristics and Traffic Flow Regimes
    3. Problem Formulation
        3.1. Notations and Definitions
        3.2. Problem
        3.3. Example
    4. Observation
        4.1. Minimal Occurrence
        4.2. Single-Item Successors
        4.3. Single-Item Precursors
    5. Methods
        5.1. Naive Prefix Tree Method
        5.2. Simplified Prefix Tree Method
            5.2.1. Lag Bitmap Structures
            5.2.2. Multiple Data Streams
        5.3. Table-based Method
    6. Analysis
        6.1. Properties
        6.2. Correctness of Naive Prefix Tree Method
        6.3. Correctness of Simplified Prefix Tree Method
        6.4. Guarantee and Correctness of Table-based Method
    7. Experimental Results
        7.1. Datasets
        7.2. Parameter Settings
        7.3. Updating Time and Reporting Time and Memory Space
        7.4. Comparison
    8. Conclusion and the Future Work
        8.1. Conclusion
        8.2. Future Work
            8.2.1. Numerical and Symbolic Data Streams
            8.2.2. Episode Extension
            8.2.3. Distributed Environment
            8.2.4. Compact Representation
            8.2.5. Lag Time Intervals
    9. Reference


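    Full-text availability: not authorized for public release (on-campus network)
    Full-text availability: not authorized for public release (off-campus network)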
