Mining Serial Episode Rules with Successor Lag Times over Multiple Data Streams

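Graduate student:	Tung-Ying Lee (李東穎)
Thesis title:	Mining Serial Episode Rules with Successor Lag Times over Multiple Data Streams (在多重資料串流環境中探勘序列性段落規則及其後繼之延遲時間)
Advisor:	Arbee L.P. Chen (陳良弼)
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	English
Number of pages:	42
Keywords (Chinese):	段落規則、延遲時間、資料串流、資料探勘
Keywords (English):	episode rule, lag time, data stream, data mining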

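Mining serial episode rules from traditional databases is an important problem that has been studied for many years. A serial episode is a set of events that occur in order within a short span of time. If a serial episode occurs often, it is called a frequent serial episode. A serial episode rule consists of two frequent serial episodes linked by a time constraint: after the first episode occurs, the second often occurs a certain time later. In recent years, applications such as network monitoring and traffic monitoring have made the concept of data streams increasingly important. Streaming data may arrive rapidly and may be unbounded; because computing and memory resources are limited, discovering serial episode rules in a data stream environment is challenging. In this thesis we propose a framework for mining serial episode rules over data streams. We use a prefix tree to represent all frequent serial episodes compactly: a path in the prefix tree represents a frequent serial episode, and its number of occurrences is recorded at the last node of the path. If the time points at which each episode occurs are also kept at the corresponding node, the serial episode rules can be found by traversing the prefix tree. To save memory, we periodically remove episodes that occur too rarely. The structure may nevertheless still be too large, so it is changed to record only, for any two events, the counts of the time differences between their occurrences. The number of event types is far smaller than the number of episodes, so this advanced method saves space, and we also prove that every episode rule discoverable by the original method can be found by the advanced method. Experiments show that the advanced method outperforms the basic method in both memory usage and mining time.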
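
The prefix-tree idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `TrieNode`, `EpisodeTrie`, `add_occurrence`, `lookup`, and `prune` are hypothetical, and the pruning rule (a single count threshold, keeping any node that still has surviving descendants) is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node per episode prefix; the path from the root spells the episode."""
    count: int = 0                                  # occurrences of the episode ending at this node
    end_times: list = field(default_factory=list)   # time positions of those occurrences
    children: dict = field(default_factory=dict)    # event type -> TrieNode


class EpisodeTrie:
    """Hypothetical prefix trie over serial episodes (tuples of event types)."""

    def __init__(self):
        self.root = TrieNode()

    def add_occurrence(self, episode, end_time):
        """Record one occurrence of `episode` that ends at `end_time`."""
        node = self.root
        for event in episode:
            node = node.children.setdefault(event, TrieNode())
        node.count += 1
        node.end_times.append(end_time)

    def lookup(self, episode):
        """Return the node for `episode`, or None if it is not stored."""
        node = self.root
        for event in episode:
            node = node.children.get(event)
            if node is None:
                return None
        return node

    def prune(self, min_count):
        """Remove episodes seen fewer than `min_count` times (assumed rule).

        A low-count node is kept while it still has a surviving descendant,
        so that longer episodes sharing its prefix remain reachable.
        """
        def keep(node):
            node.children = {e: c for e, c in node.children.items() if keep(c)}
            return node.count >= min_count or bool(node.children)

        keep(self.root)
```

Episodes that share a prefix, such as `("A", "B")` and `("A", "C")`, share the node for `"A"`, which is where the compactness comes from. Calling `prune` periodically corresponds to the memory-saving step described above.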

Mining serial episode rules from large databases is an important problem that has been studied by many researchers. A serial episode is a sequence of events that occur close to each other in time, and it is called frequent if it occurs often. A serial episode rule is composed of two frequent serial episodes X and Y, where Y often occurs after X within a time constraint. In recent years, applications such as network monitoring and traffic monitoring have motivated the concept of data streams. Streaming data arrive at a high rate and can be unbounded, so discovering serial episode rules over data streams is challenging given limited CPU and memory resources. In this work, we propose a framework for serial episode rule mining over data streams. We use a prefix trie so that frequent serial episodes sharing the same prefix can be represented compactly. Each path of the trie represents an episode X, and the corresponding frequency is recorded in the last node of the path, together with the time positions at which X occurs. Since all the frequent episodes and the necessary information are recorded, the serial episode rules can be generated by traversing the trie. To save memory, we periodically remove serial episodes with low frequencies from the trie. However, the trie can still be too large. The trie is therefore revised to keep only the time differences between any two events instead of all the time positions of each serial episode. Because the number of event types is much smaller than the number of episodes, this advanced method is space-efficient. We also prove that every episode rule can still be generated from the revised data structure. The experiments show that the advanced method outperforms the original one in both memory usage and rule-generation time.
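
The revised structure, which keeps time differences between pairs of events rather than all time positions, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's data structure: the class name `PairLagTable`, the `max_lag` bound, and the assumption that events arrive with non-decreasing timestamps are all hypothetical.

```python
from collections import Counter, defaultdict, deque


class PairLagTable:
    """Hypothetical table: for each ordered event pair (a, b), count each lag b - a."""

    def __init__(self, max_lag):
        self.max_lag = max_lag                   # assumed upper bound on lags of interest
        self.recent = deque()                    # (time, event) pairs within the last max_lag units
        self.lag_counts = defaultdict(Counter)   # (a, b) -> Counter mapping lag -> count

    def update(self, time, event):
        """Process one stream element; timestamps are assumed non-decreasing."""
        # Forget events that are too old to pair with anything arriving now.
        while self.recent and time - self.recent[0][0] > self.max_lag:
            self.recent.popleft()
        # Pair the new event with every strictly earlier event still in the window.
        for past_time, past_event in self.recent:
            lag = time - past_time
            if lag > 0:
                self.lag_counts[(past_event, event)][lag] += 1
        self.recent.append((time, event))

    def best_lag(self, a, b):
        """Return (lag, count) for the most frequent lag from a to b, or None."""
        counts = self.lag_counts.get((a, b))
        if not counts:
            return None
        return counts.most_common(1)[0]
```

The table's size depends on the number of event-type pairs and on `max_lag`, not on the number of distinct episodes, which is the space argument made in the abstract.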

Abstract (in Chinese)    II
Abstract    III
Acknowledgements    IV
Contents    V
List of Figures    VII
List of Tables    VIII
   Introduction    1
   Preliminaries and Related Works    6
1.    Episodes, Episode Rules, and Time Lags    6
1.1.    Episode and Episode Rules    6
1.2.    Episode Rules with Time Lags    7
1.3.    Counting Strategy    7
2.    Mining over Data Streams and Space-Efficient Synopses    7
2.1.    Mining Frequent Items    8
2.2.    Mining Frequent Itemsets    8
2.3.    Mining Serial Episodes    9
2.4.    Mining Serial Episode Rules    9
3.    The Prefix Tree and Serial Episodes    10
4.    Traffic Flow Theory    11
4.1.    Traffic Flow Characteristics and Traffic Flow Regimes    11
   Problem Formulation    12
1.    Notations and Definitions    12
2.    Problem    13
3.    Example    13
   Observation    14
1.    Minimal Occurrence    14
2.    Single-Item Successors    15
3.    Single-Item Precursors    15
   Methods    16
1.    Naive Prefix Tree Method    16
2.    Simplified Prefix Tree Method    18
2.1.    Lag Bitmap Structures    19
2.2.    Multiple Data Streams    20
3.    Table-based Method    22
   Analysis    25
1.    Properties    25
2.    Correctness of Naive Prefix Tree Method    27
3.    Correctness of Simplified Prefix Tree Method    27
4.    Guarantee and Correctness of Table-based Method    29
   Experimental Results    30
1.    Datasets    30
2.    Parameter Settings    31
3.    Updating Time and Reporting Time and Memory Space    34
4.    Comparison    36
   Conclusion and the Future Work    38
1.    Conclusion    38
2.    Future Work    38
2.1.    Numerical and Symbolic Data Streams    38
2.2.    Episode Extension    38
2.3.    Distributed Environment    39
2.4.    Compact Representation    39
2.5.    Lag Time Intervals    39
   Reference    40

                                

[1] P. P. Angelov. Evolving Rule-Based Models: A Tool for Design of Flexible Adaptive Systems. Physica-Verlag, 2002.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216.
[3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996.
[4] G. Cormode, and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 2004.
[5] G. Das, K. I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. Proc. of KDD, 1998.
[6] D. Y. Feng, L. J. Wei, S. J. De, and S. H. Ying. Association Rule Mining and its application in postal EMS service. ICII 2001.
[7] M. Greenwald, and S. Khanna. Space-efficient online computation of quantile summaries. Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, 2001.
[8] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.
[9] S. K. Harms, J. Deogun, and T. Tadesse. Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, pp. 432-441, 2002.
[10] S. K. Harms, J. Deogun, and T. Tadesse. Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences. Foundations of Intelligent Systems: 13th International Symposium, ISMIS 2002, Lyon, France, June 27-29, 2002.
[11] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 2001.
[12] P. Karras, and N. Mamoulis. One-Pass Wavelet Synopses for Maximum-Error Metrics. Proceedings of the 31st ACM VLDB International Conference on Very Large Data Bases, 2005.
[13] E. G. Keogh, and J. G. Lin. Clustering of time-series subsequences is meaningless: implications for previous and future research. Knowledge and Information Systems 8(2), pp 154-177, 2005.
[14] M. Klemettinen, H. Mannila, and H. Toivonen. Rule Discovery in Telecommunication Alarm Data. Journal of Network and Systems Management, Vol. 7, No. 4, 1999.
[15] G. S. Manku, and R. Motwani. Approximate Frequency Counts Over Data Streams. Proceedings of the Twenty-Eighth International Conference on Very Large Data Bases, 2002.
[16] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. KDD, 1995.
[17] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery 1, 259-289, 1997.
[18] A. Metwally, D. Agrawal, and A. E. Abbadi. Using Association Rules for Fraud Detection in Web Advertising Networks. Proceedings of the 31st ACM VLDB International Conference on Very Large Data Bases, 2005.
[19] T. Mielikainen. Discovery of Serial Episodes from Streams of Events. Scientific and Statistical Database Management, 2004.
[20] T. Oates, and P. R. Cohen. Searching for Structure in Multiple Streams of Data. ICML 1996.
[21] T. Tadesse, D. A. Wilhite, S. K. Harms, M. J. Hayes, and S. Goddard. Drought Monitoring Using Data Mining Techniques: A Case Study for Nebraska, USA. Natural Hazards, 2004.
[22] T. Tadesse, D. A. Wilhite, and M. J. Hayes. Discovering Associations between Climatic and Oceanic Parameters to Monitor Drought in Nebraska Using Data-Mining Techniques. Journal of Climate, Vol. 18, Issue 10, May 2005, p. 1541, 10 pp.
[23] J. G. Wardrop. Some theoretical aspects of road traffic research. Proceedings of the Institution of Civil Engineers, volume 1 of 2, 1952.
[24] J. Yu, Z. Chong, H. Lu, and A. Zhou. False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams. Proceedings of the 30th ACM VLDB International Conference on Very Large Data Bases, 2004.
[25] http://jisao.washington.edu/data_sets/pdo/
[26] http://tdrl1.d.umn.edu/services.htm.
[27] http://www.cdc.noaa.gov/people/klaus.wolter/MEI/

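Full-text release date: the full text is not authorized for public release (on-campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)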
