
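Author: 謝承恩 (Cheng-Enn Hsieh)
Thesis title: 由資料串流環境探勘常見樹狀結構 (Discovering Frequent Tree Patterns over Data Streams)
Advisor: 陳良弼 (Arbee L.P. Chen)
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2005
Academic year of graduation: 93 (ROC calendar)
Language: English
Number of pages: 52
Keywords (Chinese): 資料串流, 探勘, 常見, 樹狀結構, 演算法, 線上
Keywords (English): data stream, mining, frequent, tree pattern, algorithm, online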
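  • In recent years, semi-structured data such as XML documents have been widely used on the Internet as a standard way to represent and exchange data. Mining frequent tree structures from streams of semi-structured data has therefore become an interesting research topic. In this thesis, we target the data model most common in tree-structured applications, the labeled ordered tree (for example, XML documents and XQuery), and design an online algorithm that discovers all subtree structures whose occurrence ratio in the stream exceeds a user-defined threshold. The approach focuses on using limited physical resources, such as CPU and memory, to mine an unbounded data stream, and it returns a corresponding approximate answer set.
    Under this goal, one important issue is how to enumerate all possible frequent subtrees efficiently while avoiding duplicate enumeration. To this end, we design an efficient and fast enumeration method, advanced tail-expansion. Another important issue is the storage and management of the subtrees already generated. As stream data arrive, the occurrence ratio of some subtrees may fall from above the threshold to below it, so those subtrees must be removed from the answer set. Conversely, other subtrees may accumulate a sufficient occurrence ratio as stream data arrive and become part of the answer the user seeks. Storing only the currently frequent subtrees therefore cannot support both cases. Hence, besides designing a compact data structure for storing subtrees, we also adapt the Lossy Counting algorithm as the mechanism for managing the stored subtrees.
    At the end of this thesis, experimental data show the efficiency and scalability of the algorithm. In addition, theoretical analysis shows that the algorithm returns answers with guaranteed accuracy and correctness. This property further strengthens its generality.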
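    The subtree maintenance described above builds on Lossy Counting. The record does not describe the thesis's adaptation, so the following is only a minimal sketch of the standard Lossy Counting scheme over plain hashable items (here a subtree would be represented by some string encoding). The class name `LossyCounter`, its method names, and the sample stream are illustrative assumptions, not the thesis's STMer implementation.

```python
import math
from typing import Dict, Hashable, List, Tuple


class LossyCounter:
    """Standard Lossy Counting over a stream of hashable items (illustrative sketch)."""

    def __init__(self, epsilon: float) -> None:
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        # item -> [count, delta], where delta bounds the undercount
        self.entries: Dict[Hashable, List[int]] = {}

    def add(self, item: Hashable) -> None:
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # A new item may have been missed in up to (bucket - 1) earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support: float) -> List[Tuple[Hashable, int]]:
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return [(key, count) for key, (count, _) in self.entries.items()
                if count >= threshold]


counter = LossyCounter(epsilon=0.2)
for item in ["a", "b", "a", "c", "a", "a", "b", "a", "d", "a"]:
    counter.add(item)
print(counter.frequent(support=0.5))
```

    Keeping the `delta` field is what lets an item that was pruned earlier re-enter the summary and still be reported later with a bounded undercount, which matches the need stated above to track subtrees that are not frequent yet.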


    Since semi-structured data such as XML files are widely used for data representation and exchange over the Internet, discovering frequent tree patterns over semi-structured data streams has become an interesting problem. In this thesis, we propose an online algorithm to continuously compute the current set of frequent tree patterns from the data stream. A novel technique is introduced to incrementally generate all candidate tree patterns efficiently and without duplicates. Moreover, a framework for counting the approximate frequency of the candidate tree patterns is adopted. Combining these techniques, the proposed algorithm is capable of computing frequent tree patterns with guarantees on completeness and accuracy. The experimental results show that this algorithm is both efficient and scalable.
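    To make the notion of candidate tree patterns concrete, here is a brute-force sketch that lists every induced subtree of a small labeled ordered tree and counts, per pattern, how many trees contain it. It is not the thesis's advanced tail-expansion, which the record does not describe. The `(label, children)` tuple representation, the parenthesized string encoding, the reading of "induced" as preserving parent-child edges and sibling order, and the per-tree support count are all assumptions made for illustration. Labels are assumed to contain no spaces or parentheses.

```python
from collections import Counter
from itertools import product
from typing import List, Set, Tuple

# A tree node is (label, [child nodes]); children are ordered left to right.
Tree = Tuple[str, list]


def rooted_patterns(node: Tree) -> List[str]:
    """Encodings of all induced subtrees whose root is this node."""
    label, children = node
    # For each child: skip it (None) or attach one pattern rooted at that child.
    choices = [[None] + rooted_patterns(child) for child in children]
    patterns = []
    for combo in product(*choices):
        kept = [enc for enc in combo if enc is not None]
        patterns.append(label + ("(" + " ".join(kept) + ")" if kept else ""))
    return patterns


def all_patterns(tree: Tree) -> Set[str]:
    """Distinct induced subtree encodings rooted anywhere in the tree."""
    found = set(rooted_patterns(tree))
    for child in tree[1]:
        found |= all_patterns(child)
    return found


def pattern_support(trees: List[Tree]) -> Counter:
    """For each pattern, the number of trees that contain it at least once."""
    counts: Counter = Counter()
    for tree in trees:
        counts.update(all_patterns(tree))
    return counts


t1 = ("A", [("B", []), ("C", [])])
t2 = ("A", [("B", [])])
support = pattern_support([t1, t2])
print(sorted(support.items()))
```

    The number of patterns grows exponentially with tree size under this exhaustive enumeration, and the set-based deduplication needs every encoding in memory, which is why a streaming setting calls for an incremental, duplicate-free generation scheme of the kind the abstract mentions.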

    Abstract
    Acknowledgements
    List of Figures
    List of Algorithms
    Chapter 1: Introduction
      1.1 Main difficulties
      1.2 Motivation
      1.3 Related work
      1.4 Organization
    Chapter 2: Preliminaries and problem definition
      2.1 Basic terminologies
        2.1.1 Semi-structured data
        2.1.2 Labeled ordered tree
        2.1.3 Induced sub-tree
        2.1.4 Semi-structured data stream
      2.2 Problem definition
      2.3 An example
    Chapter 3: Methodology
      3.1 Algorithm overview
      3.2 String encoding of trees
      3.3 Global Prefix Tree (GPT)
      3.4 Sub-tree generation
        3.4.1 Tail-expansion
        3.4.2 Advanced tail-expansion
      3.5 Sub-tree maintenance
        3.5.1 Lossy Counting
        3.5.2 The adaptation of Lossy Counting
      3.6 Extensions of STMer
    Chapter 4: Experiments
      4.1 Experimental dataset
      4.2 Sensitivity experiments
      4.3 A comparative experiment
    Chapter 5: Conclusion and future work
    References


