
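Graduate student: 左聰文 (Cho, Chung-Wen)
Thesis title: Mining and Retrieving Various Sequential Patterns in Sequence Databases (從序列資料庫中探勘和擷取循序型樣)
Advisor: 陳良弼 (Chen, Arbee L. P.)
Oral defense committee:
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2008
Academic year of graduation: 97 (ROC calendar)
Language: English
Number of pages: 72
Chinese keywords: 序列資料庫, 事件串流, 資料探勘, 頻繁序列, 事件法則, 擷取, 預測
Keywords: Sequence database, event stream, data mining, frequent sequence, episode rule, retrieval, prediction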
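    In recent years, the volume of sequence data has grown rapidly; examples include customer transactions ordered by purchase time (customer sequences) and event sequences. Analyzing these data efficiently and correctly, and applying the results of the analysis, has therefore become an important issue. In this thesis, we propose a framework for deriving and retrieving meaningful rules from sequence databases. Within this framework, we derive customer buying behavior (sequence rules) from customer transaction sequences to support applications such as item recommendation. In addition, we apply rules derived from sequences of past events (episode rules) to predict future events. For each of these two topics, we develop efficient and correct methods for sequence mining and rule retrieval, respectively.
    Sequence rules of the form X→Y mined from a large number of customer sequences are helpful for item recommendation. A sequence rule X→Y means that if a customer exhibits the buying behavior described by the sequence X (the predicate), the customer is likely to buy the items described in Y (the consequent). A sequence (an ordered list of itemsets) is said to be frequent if the number of customer sequences containing it reaches a user-defined threshold. A 1-sequence is a special sequence because it contains only a single itemset rather than an ordered list, while a k-sequence is a sequence containing k itemsets. Mining 1-sequences takes far less time than mining k-sequences (k ≥ 2).
    In this thesis, we adopt a two-phase architecture to mine the two kinds of frequent sequences (1-sequences and k-sequences) separately. With the two-phase architecture, a more efficient algorithm can be designed for mining frequent k-sequences alone. When mining frequent k-sequences, each frequent 1-sequence is usually encoded as a new item (the encoding ensures that the frequent k-sequences are mined completely), and each customer sequence is transformed, according to the frequent 1-sequences it contains, into a customer sequence consisting only of those frequent 1-sequences. We find that not every frequent 1-sequence needs to be encoded, and we use the already mined frequent 2-sequences for the encoding, so the transformed database is much smaller. The transformed customer sequences are scanned repeatedly until all frequent sequences have been found. We develop a more efficient representation of a customer sequence, under which all subsequences contained in the customer sequence can be enumerated without repetition.
    In some applications, alarms in a telecommunication network or stock price movements, for example, can be regarded as events, and these events are typically generated as a stream. In such applications, predicting upcoming events is very important. In an event stream, a series of incoming events that matches the predicate of an episode rule within a bounded time span is called an occurrence of that predicate. Once such an occurrence has been matched, the consequent event of the episode rule can be predicted to occur within a certain future time interval. However, we find that the occurrence times of some consequent events are computed repeatedly; these time points are redundant and do not help the prediction. Our problem is therefore how to avoid predicting events redundantly.
    In this thesis, we first propose an effective matching scheme that avoids matching predicate occurrences that would predict redundant time points. Based on this scheme, we propose two algorithms for efficiently matching predicate occurrences. The first algorithm builds an event filter for each rule predicate. As events keep arriving, the filter immediately decides whether an event can form part of a predicate occurrence and, if so, retains it. In this way, we can determine at once whether a predicate occurrence has been matched, and we avoid scanning backward over the event stream. The second algorithm first builds an efficient tree structure (an index) to store the incoming events. In this approach, we normally only maintain the tree, and we match predicate occurrences against it only when certain key events arrive. Based on this tree structure, we develop an efficient matching method that avoids scanning every event in the structure one by one.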


    In recent years, the amount of sequence data, such as logs of customer transactions and events, has grown enormously. Providing efficient ways to analyze this large amount of sequence data and to make use of the results has become an important issue.
    In this thesis, we propose a framework for efficiently and effectively deriving and retrieving significant rules from sequence datasets. Given a set of sequences composed of customer transactions, we derive customer buying behavior (sequence rules) for the recommendation of hot items. In addition, we use rules discovered from past event sequences (episode rules) to predict coming events. For each of these tasks, we develop efficient and effective methods for mining the sequence data and retrieving the discovered rules.
    Given a large number of customer sequences, mining sequence rules of the form X→Y can be useful in some applications, e.g., item recommendation. The rule X→Y implies that if a customer exhibits the buying behavior corresponding to X (the predicate), she/he will likely buy the items in Y (the consequent). In general, sequence rules are derived from frequent sequences. A sequence (an ordered list of itemsets) is frequent if the number of customer sequences containing it satisfies the user-specified threshold. A 1-sequence is a special type of sequence because it consists of only a single itemset instead of an ordered list, while a k-sequence is a sequence composed of k itemsets. Compared with the cost of mining frequent k-sequences (k≥2), the cost of mining frequent 1-sequences is negligible.
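    The containment and frequency test described above can be illustrated with a minimal Python sketch. It assumes that a customer sequence contains a pattern when the pattern's itemsets can be matched, in order, to itemsets of the customer sequence that are supersets of them, and that the threshold is an absolute count. The names `contains`, `support`, `is_frequent`, and `min_count` are hypothetical and are not taken from the thesis.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Sequence = List[Itemset]


def contains(customer_seq: Sequence, pattern: Sequence) -> bool:
    """Return True if the itemsets of `pattern` can be matched, in order,
    to distinct itemsets of `customer_seq` that are supersets of them."""
    pos = 0  # index of the next pattern itemset still to be matched
    for itemset in customer_seq:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1  # greedy earliest match is sufficient for containment
    return pos == len(pattern)


def support(db: List[Sequence], pattern: Sequence) -> int:
    """Number of customer sequences in `db` that contain `pattern`."""
    return sum(1 for seq in db if contains(seq, pattern))


def is_frequent(db: List[Sequence], pattern: Sequence, min_count: int) -> bool:
    """A pattern is frequent if its support reaches the given threshold
    (assumed here to be an absolute count)."""
    return support(db, pattern) >= min_count
```

    This naive test scans every customer sequence once per pattern; it only makes the definition concrete and says nothing about how the thesis computes support.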
    In this thesis, we adopt a two-phase architecture that finds the two types of frequent sequences separately, so that the discovery of frequent k-sequences can be designed and optimized on its own. For efficient frequent k-sequence mining, every frequent 1-sequence is encoded as a unique symbol and the database is transformed into one composed of these symbols. We find that it is unnecessary to encode all the frequent 1-sequences, and we make full use of the discovered frequent 1-sequences to transform the database into a smaller one. For every k≥2, the customer sequences in the transformed database are scanned to find all the frequent k-sequences. We devise a compact representation for a customer sequence and describe a method that enumerates all distinct subsequences of a customer sequence without redundant scans.
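    The baseline transformation mentioned above, in which every frequent 1-sequence (frequent itemset) becomes a symbol and each transaction is replaced by the symbols of the frequent itemsets it contains, might look like the following Python sketch. It deliberately encodes all frequent itemsets and does not reproduce the reduced encoding proposed in the thesis; `encode_itemsets` and `transform` are hypothetical names, and the integer symbols are an assumption.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def encode_itemsets(frequent_itemsets: List[Itemset]) -> Dict[Itemset, int]:
    """Assign each frequent itemset a unique integer symbol."""
    return {itemset: code for code, itemset in enumerate(frequent_itemsets)}


def transform(
    db: List[List[Itemset]], codes: Dict[Itemset, int]
) -> List[List[FrozenSet[int]]]:
    """Replace each transaction by the symbols of the frequent itemsets it
    contains; drop transactions and sequences that become empty."""
    transformed = []
    for seq in db:
        new_seq = []
        for transaction in seq:
            symbols = frozenset(
                code for itemset, code in codes.items() if itemset <= transaction
            )
            if symbols:  # transactions without any frequent itemset vanish
                new_seq.append(symbols)
        if new_seq:
            transformed.append(new_seq)
    return transformed
```

    Because infrequent items and empty transactions disappear, the transformed database is never larger than the original in number of transactions; encoding fewer itemsets, as the paragraph suggests, would shrink it further.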
    In many applications, events such as alarms in telecommunication networks and price fluctuations in the stock market often arrive as a stream. Predicting coming events is of great importance in such applications. In an event stream, a sequence of events that matches the predicate of an episode rule and satisfies a specified time constraint is called an occurrence of the predicate. Once an occurrence is found, the consequent event can be predicted to occur within a certain time interval. However, the predicting intervals computed from some occurrences can be contained in the intervals computed from other occurrences, and are therefore duplicates.
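    To make the notion of a duplicate predicting interval concrete, the Python sketch below removes every interval that is contained in another one. How the intervals are derived from predicate occurrences is not specified here, so they are taken as given `(start, end)` pairs; the function name `drop_contained` is hypothetical, and this post-hoc filter is only an illustration, not the matching scheme of the thesis, which avoids producing such occurrences in the first place.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def drop_contained(intervals: List[Interval]) -> List[Interval]:
    """Keep only the intervals that are not contained in another interval.
    Identical intervals are reported once."""
    kept: List[Interval] = []
    max_end = float("-inf")
    # Sort by start ascending and, for equal starts, by end descending, so
    # any interval that could contain the current one has already been seen.
    for start, end in sorted(set(intervals), key=lambda iv: (iv[0], -iv[1])):
        if end > max_end:  # not covered by any earlier-starting interval
            kept.append((start, end))
            max_end = end
    return kept
```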
    In this thesis, we propose an effective scheme that avoids matching predicate events corresponding to duplicate predicting intervals. Based on this scheme, we propose two algorithms for efficiently matching predicate events over event streams. The first algorithm constructs an event filter that incrementally maintains parts of the matched results as events arrive, and thus avoids backward scans of the event stream. The second algorithm maintains the recently arrived events in a tree structure. Matching of predicate events is triggered only by distinguishable events, and an efficient matching algorithm based on the tree structure avoids exhaustive scans of the arrived events.
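    The idea of maintaining partial matches incrementally, so that the stream is never scanned backward, can be sketched generically in Python. This is not the ToFel or CBS-tree algorithm of the thesis. It assumes a serial predicate (an ordered list of event types), a time constraint of the form "last event time minus first event time is at most `window`", and that reporting the occurrence with the latest start is what is wanted; the class name `SerialPredicateFilter` and its methods are hypothetical.

```python
from typing import List, Optional, Tuple


class SerialPredicateFilter:
    """Incrementally matches an ordered list of event types within a time window."""

    def __init__(self, predicate: List[str], window: float) -> None:
        if not predicate:
            raise ValueError("predicate must contain at least one event type")
        self.predicate = predicate
        self.window = window
        # starts[i] is the latest start time of a partial match of the first
        # i predicate events (None if there is none); starts[0] is unused.
        self.starts: List[Optional[float]] = [None] * (len(predicate) + 1)

    def feed(self, event_type: str, timestamp: float) -> Optional[Tuple[float, float]]:
        """Consume one event; return (start, end) if it completes an occurrence."""
        occurrence = None
        k = len(self.predicate)
        # Visit longer prefixes first so that a single event is never used
        # for two positions of the same match.
        for i in range(k, 0, -1):
            if self.predicate[i - 1] != event_type:
                continue
            if i == 1:
                start = timestamp  # the event opens a new partial match
            else:
                start = self.starts[i - 1]
                if start is None or timestamp - start > self.window:
                    continue  # nothing to extend, or the window has expired
            self.starts[i] = start
            if i == k:
                occurrence = (start, timestamp)
        return occurrence
```

    Each arriving event costs time proportional to the predicate length, and the state holds only one start time per prefix, which is why no earlier events need to be revisited.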

    Abstract
    1 Introduction
      1.1 Rule derivation
      1.2 Rule retrieval
    2 Related works
      2.1 Related work on sequence rule derivation
      2.2 Related work on episode rule retrieval
    3 Sequence rule derivation from customer sequences
      3.1 Introduction
      3.2 Frequent itemset mining
      3.3 Frequent sequence mining
        3.3.1 Database transformation
        3.3.2 Candidate generation
        3.3.3 Support computation
      3.4 Implementation details and pivot simplification
        3.4.1 Implementation details
        3.4.2 Pivot simplification
      3.5 Performance evaluation
        3.5.1 Comparisons on minsup
        3.5.2 Comparisons on scalability
        3.5.3 Studies on the influence of sequence density
    4 Episode rule retrieval over event streams
      4.1 Introduction
      4.2 Problem statement
        4.2.1 Preliminary
        4.2.2 Basic concept
      4.3 The ToFel approach
        4.3.1 Queue maintenance
        4.3.2 Queue-based retrieval
      4.4 The CBS-tree approach
        4.4.1 CBS-tree
        4.4.2 CBS-tree-based retrieval
      4.5 Performance evaluation
        4.5.1 Comparisons on synthetic data
        4.5.2 Comparisons on real data
    5 Conclusion and future works
    References


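    Full-text release date (on-campus network): full text not authorized for public release
    Full-text release date (off-campus network): full text not authorized for public release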
