
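Graduate student: 鄭昶延
Chang-Yeng Cheng
Thesis title: 串流資料上快速式樣偵測技術
Fast Pattern Detection in Stream Data
Advisor: 許奮輝
Fenn-Huei Simon Sheu
Oral defense committee:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: English
Number of pages: 30
Chinese keywords: 串流資料, 數位污染, 字串比對, 入侵偵測, 決策樹
English keywords: data stream, digital pollution, string matching, intrusion detection, decision tree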
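  • With the spread of the Internet, it has become easy to obtain all kinds of documents (both XML-formatted documents and ordinary unformatted ones) and to receive multimedia streams (such as audio and video), which greatly increases the amount of available data. Data sources are not scarce; rather, much time must be spent finding what we actually want among so much data. In addition, a network intrusion detection system (NIDS) matches intrusion signatures against every inspected packet, so detection speed and accuracy are key indicators of NIDS performance. To find data efficiently in such massive stream data, or to find a specified string within a document, we propose an approach that differs from traditional search: searching with a tree data structure. The characters at each position of the query string are classified and then merged into a search tree, which we name the C-tree (Comparison tree) and which serves as the index during search. Using the C-tree, we can quickly discard non-matching data and save unnecessary comparison time. Because search indexed by the C-tree eliminates the possible matching shifts that cannot occur using the fewest comparisons, this parallel comparison reduces comparison time. Experiments show that indexing with the C-tree is faster than existing methods. The speed-up is especially pronounced when searching for long strings. We believe that integrating the C-tree into applications that require heavy searching (such as virus signature scanning and network intrusion pattern detection) will yield search results efficiently and quickly.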


    Digital pollution is emerging as an overwhelming threat to the Internet, whose ubiquitous connectivity in turn fosters widespread outbreaks of such pollution. A considerable amount of human effort and network resources is wasted at little cost to the few polluters. To prevent the contamination from flooding the network, classical string matching schemes and their variants can serve as first aid for establishing an effective quarantine and censorship. The features of typical pollutants are extracted and refined into so-called signatures. Every transfer post then scans the incoming stream data for these signatures. Upon detecting any such pattern, the post can block the connection and raise an alert to inform higher-level security systems. The processing speed and accuracy of pattern detection schemes are therefore crucial to the effectiveness of security systems. To expedite the scrutiny, we propose a novel pattern detection technique based on decision tree induction that seeks a significant improvement over the classical schemes. According to the intrinsic properties of the pattern, the tree is grown adaptively to minimize the number of symbols in the data stream that need to be examined. This yields a unique, contextually optimized order in which to inspect the symbols, as opposed to the fixed order followed by the other schemes. In other words, the strategy inspects the symbols of all possible matching positions in parallel and rules out the positions that have at least one mismatching symbol. Finally, only the possible positions that agree with the previously inspected symbols need to be checked against the entire pattern. This considerably reduces the number of symbol checks needed to confirm a match. A performance study indicates that our approach achieves a speed-up of five times or more over the best competitors.
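    The strategy described in the abstract can be illustrated with a small sketch. This is not the thesis's Comparison Tree construction, which is not given in this record; it is a simplified decision tree built under assumptions made here for illustration: the text is processed in windows of m start positions (m being the pattern length), and each node greedily picks the window offset whose worst-case number of surviving shifts is smallest. The names `build_tree` and `find_all` are hypothetical.

```python
from functools import lru_cache


def build_tree(pattern):
    """Build a decision tree over candidate shifts 0..m-1 inside one window.

    Each inner node names a window offset to inspect; each leaf lists the
    shifts that are still possible and must be verified in full.
    Assumption: the offset is chosen greedily by worst-case survivors.
    """
    m = len(pattern)

    @lru_cache(maxsize=None)
    def build(shifts):
        if len(shifts) <= 1:
            return {"leaf": shifts}
        best = None
        for j in range(2 * m - 1):  # offsets any shift in the window can cover
            covering = [s for s in shifts if 0 <= j - s < m]
            if not covering:
                continue
            groups = {}
            for s in covering:
                # Shift s expects symbol pattern[j - s] at window offset j.
                groups.setdefault(pattern[j - s], []).append(s)
            uncovered = len(shifts) - len(covering)
            worst = uncovered + max(len(g) for g in groups.values())
            if best is None or worst < best[0]:
                best = (worst, j, groups)
        worst, j, groups = best
        if worst == len(shifts):
            # No offset is guaranteed to rule anything out; verify them all.
            return {"leaf": shifts}
        # Shifts that do not cover offset j survive whatever symbol is read.
        rest = tuple(s for s in shifts if not 0 <= j - s < m)
        children = {
            sym: build(tuple(sorted(rest + tuple(group))))
            for sym, group in groups.items()
        }
        return {"pos": j, "children": children, "default": build(rest)}

    return build(tuple(range(m)))


def find_all(text, pattern):
    """Return every start index at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    tree = build_tree(pattern)
    hits = []
    for w in range(0, len(text), m):  # one window per block of m start positions
        node = tree
        while "leaf" not in node:
            i = w + node["pos"]
            sym = text[i] if i < len(text) else None  # None: past end of text
            node = node["children"].get(sym, node["default"])
        for s in node["leaf"]:
            # Only the surviving shifts are compared with the whole pattern.
            if text[w + s:w + s + m] == pattern:
                hits.append(w + s)
    return hits


print(find_all("abracadabra abracadabra", "abra"))  # [0, 7, 12, 19]
```

    A shift is discarded only when an inspected symbol contradicts it, so the tree never drops a true match; highly periodic patterns such as "aaaa" simply end in leaves that hold several shifts, each of which is then verified in full.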

    Chapter 1 Introduction
    Chapter 2 Related Work
    Chapter 3 The Proposed Solution: Comparison Tree (CT)
      3.1 Construction of Comparison Tree
      3.2 Exact String Matching Strategy
    Chapter 4 Performance Study
      4.1 String Matching in Speech
      4.2 String Matching in DNA String
      4.3 Realistic Search Time for String Matching
    Chapter 5 Discussion and Future Work
      5.1 Shortage of CT
      5.2 Applications of CT
    Chapter 6 Concluding Remark
    Bibliography


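    Full text release date: full text not authorized for public release (on-campus network)
    Full text release date: full text not authorized for public release (off-campus network)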
