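Graduate student: 卡洛斯 (Sanchez Rosa, Carlos Amilcar)
Thesis title: 叢聚分析的研究:改良穩定度的量測
A Study on Cluster Analysis: Improving Performance on Stability Measurement
Advisor: 陳朝欽 (Chen, Chaur-Chin)
Oral defense committee: 陳宜欣 (Chen, Yi-Shin)
張隆紋 (Chang, Long-Wen)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: English
Number of pages: 31
Chinese keywords: 叢聚分析 (cluster analysis)
English keywords: Critical Stability, Cluster Stability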
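  • A Study on Cluster Analysis: Improving Performance on Stability Measurement (Chinese abstract, in translation)

    Many techniques and algorithms exist for the clustering problem, all attempting to uncover the latent structure of unlabeled data. Validating a clustering result, however, remains a major difficulty and raises a number of related questions. For example: Which clustering method should be chosen? Which distance or similarity measure should be used to compare data? How many clusters should there be?
    This study focuses on clustering stability, which is examined by computing the expected clustering distance after perturbing data points are added to the original data. This approach is commonly used to find the best number of clusters. When the dataset is too large, however, the approach becomes hard to carry out and the performance of the measurement suffers. To address this, the study replaces the usual randomly added perturbations with the proposed Critical Stability algorithm, which concentrates on the critical data points most likely to break the clustering, thereby improving the performance of the stability measurement. To validate the Critical Stability algorithm, the study uses real datasets and well-known artificial benchmark datasets, comparing the results and running time of the traditional algorithm and the Critical Stability algorithm.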


    A Study on Cluster Analysis: Improving Performance on Stability Measurement

    Many techniques and algorithms have been designed for clustering. Because they aim to describe the hidden structure of unlabeled data, validating both the output and the method is a hard task. This raises several problems, some of which remain unsolved. For example: Which clustering method should we use? Which distance or similarity measure should we choose to compare data? How should we assess the significance of a cluster? How can we determine the number of clusters? Several approaches have been proposed to address these problems. In this research, we focus on the stability score, which is measured by calculating the expected clustering distance under perturbations of the original data. This score is commonly used to determine the number of clusters in a dataset; however, the method becomes difficult to run as the data size increases, and computing on large data also degrades the performance of the measurement. Depending on the nature of the data, it can take several weeks to obtain a precise result. To address this issue, we present a variation of the stability algorithm named “Critical Stability”, which focuses on the main perturbations that can destroy patterns and uses them in place of randomly generated ones, thus improving the performance of the measurement. To validate the new algorithm, we tested real and artificial datasets with known patterns and compared the running time and results of the standard stability algorithm and Critical Stability.
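    The random-perturbation stability score described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the helper names (`kmeans_labels`, `clustering_distance`, `instability`), the Gaussian noise model, the `copies` and `sigma` parameters, and the pairwise-disagreement distance (1 minus the Rand index) are all assumptions chosen for illustration. Critical Stability itself is not sketched, because the record does not describe how its perturbations are selected.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, iters=50):
    """Lloyd's k-means on 2-D points with deterministic farthest-point init."""
    def d2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels


def clustering_distance(a, b):
    """Fraction of point pairs on which two labelings disagree (1 - Rand index)."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)


def instability(points, k, copies=20, sigma=0.2, seed=0):
    """Mean distance between the clustering of the data and of noisy copies."""
    rng = random.Random(seed)
    base = kmeans_labels(points, k)
    total = 0.0
    for _ in range(copies):
        # Assumed perturbation: independent Gaussian noise on every coordinate.
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]
        total += clustering_distance(base, kmeans_labels(noisy, k))
    return total / copies


# Toy data (hypothetical): three well-separated blobs of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]

scores = {k: instability(points, k) for k in range(2, 6)}
for k, score in scores.items():
    print(k, round(score, 3))
```

    On this toy data the blobs are far enough apart that k = 3 gives an instability of 0; candidate values of k are then compared by their scores, lower being more stable. Every perturbed copy requires a full re-clustering, so the cost grows with both `copies` and the data size, which is the bottleneck the abstract attributes to the random-perturbation approach.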

    1 Introduction
    2 Related Work
      2.1 Stability Measurement Variations
      2.2 Cluster Stability in a Large Scale for Phylogenetic Analysis
    3 Methodology
      3.1 Stability Definition
      3.2 Datasets
      3.3 Importance of a large number of copies b
      3.4 Algorithm Implementation
      3.5 Instable Cluster Analysis
      3.6 Critical Stability
      3.7 Complexity Analysis
    4 Experiment and Results
      4.1 Artificial Datasets
      4.2 Real Datasets
      4.3 Time Comparison
    5 Conclusion
    6 References

