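| Graduate student: | 李大雋 Li, Dai-Jiun |
|---|---|
| Thesis title: | On Continuous Top-k Similarity Joins |
| Advisor: | 陳良弼 Chen, Arbee L.P. |
| Oral defense committee: | |
| Degree: | 碩士 Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
| Year of publication: | 2010 |
| Academic year of graduation: | 99 |
| Language: | English |
| Number of pages: | 26 |
| Keywords (Chinese): | 相似連結 |
| Keywords (English): | similarity join |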
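Given a similarity function and a threshold s between 0 and 1, a similarity join returns to the user every pair of records, drawn from two different sets, whose similarity exceeds the threshold. Because the similarity join is a common operation with a wide range of applications, many researchers have taken up its study; example applications include duplicate detection, data integration, and pattern recognition. To avoid the difficulty of choosing a threshold, a recent study turned the similarity join, which originally required a threshold, into an operation that needs none; this new operation is called the top-k similarity join. To speed up similarity joins that must be computed continuously in many applications, while also avoiding the threshold problem described above, this thesis proposes a new top-k similarity join for the data stream setting, which we call the continuous top-k similarity join. The problem is as follows: given a set called the query set, we return the top-k similarity join answers between the query set and a monitored data stream, and we return updated top-k answers whenever the stream is updated. We propose two algorithms for this problem. The first uses an existing top-k similarity join algorithm to find the k most similar pairs between the query set and the newest data and stores them in a candidate set, from which the final answer is produced. The second preprocesses the query set into a new data structure that lets each arriving stream record be compared with all records in the query set at the same time, reducing the overall running time. A series of experiments on the two algorithms shows that the method with preprocessing performs better than the method adapted from the existing approach.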
Given a similarity function and a threshold s in the range [0, 1], a similarity join query between two sets of records returns the pairs of records, one from each set, whose similarity is at least s. Similarity joins have received much research attention because they are a fundamental operation with wide applications such as duplicate detection, data integration, and pattern recognition. Recently, a variant called the top-k similarity join was proposed to avoid the need to set the threshold s. Since data in many applications are generated as continuous data streams, in this paper we make the first attempt to solve the problem of top-k similarity joins in a dynamic environment involving a data stream, which we name continuous top-k similarity joins. Given a set of records as the query, we continuously maintain the top-k pairs of records, ranked by their similarity values, between the query and the most recent data in a monitored data stream. Two algorithms are proposed to solve this problem. The first extends an existing approach for static datasets to find the top-k pairs between the query and the newly arrived data, and then keeps the obtained pairs in a candidate set, from which the top-k pairs are found. In the other algorithm, the records in the query are preprocessed and indexed using a novel data structure. With this structure, each record in the monitored stream can be compared with all records in the query at one time, substantially reducing the time needed to find the top-k results. A series of experiments is performed to evaluate the two proposed algorithms, and the experimental results demonstrate that the algorithm with preprocessing outperforms the algorithm extended from an existing approach.
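To make the three problem definitions above concrete, the following is a minimal brute-force sketch in Python. It is not either of the two algorithms proposed in the thesis. Several choices are assumptions made only for illustration: records are modeled as token sets, Jaccard is used as the similarity function, "the most recent data" is modeled as a count-based sliding window, and all names (`jaccard`, `threshold_join`, `top_k_join`, `ContinuousTopK`, `window_size`) are hypothetical.

```python
import heapq
from collections import deque
from itertools import count


def jaccard(a, b):
    """Jaccard similarity of two token sets, in [0, 1] (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_join(query, data, s):
    """Similarity join: all (similarity, q_index, d_index) with similarity >= s."""
    result = []
    for i, q in enumerate(query):
        for j, d in enumerate(data):
            sim = jaccard(q, d)
            if sim >= s:
                result.append((sim, i, j))
    return result


def top_k_join(query, data, k):
    """Top-k similarity join: the k most similar pairs, no threshold needed."""
    pairs = ((jaccard(q, d), i, j)
             for i, q in enumerate(query)
             for j, d in enumerate(data))
    return heapq.nlargest(k, pairs)


class ContinuousTopK:
    """Naive continuous top-k join over a count-based sliding window.

    Each arriving record contributes its own k best pairs with the query to
    a candidate set. Pairs of expired records are dropped, and the current
    answer is the k best candidates.
    """

    def __init__(self, query, k, window_size):
        self.query = [frozenset(q) for q in query]
        self.k = k
        self.window_size = window_size
        self.window = deque()   # arrival ids of records still in the window
        self.candidates = []    # (similarity, q_index, arrival_id)
        self._ids = count()

    def insert(self, record):
        """Process one stream record and return the updated top-k pairs."""
        rid = next(self._ids)
        rec = frozenset(record)
        self.window.append(rid)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.candidates = [c for c in self.candidates if c[2] != expired]
        # A record can hold at most k slots in the global answer, so its
        # k best pairs are all that need to be remembered for it.
        self.candidates.extend(heapq.nlargest(
            self.k,
            ((jaccard(q, rec), i, rid) for i, q in enumerate(self.query))))
        return heapq.nlargest(self.k, self.candidates)


query = [{"a", "b", "c"}, {"x", "y"}]
monitor = ContinuousTopK(query, k=2, window_size=2)
for record in [{"a", "b"}, {"x", "y"}, {"a", "b", "c", "d"}]:
    print(monitor.insert(record))
```

This sketch compares every arriving record against every query record, so each update costs time proportional to the size of the query. According to the abstract, the second proposed algorithm avoids that per-record scan by indexing the query in advance.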