| Field | Value |
|---|---|
| Graduate Student | 陳俗玄 Chen, Su-Hsuan |
| Thesis Title | 運用基因演算法發展差異性極大化之集成式分類器 (Using Genetic Algorithm to Optimize the Diversity of Classifier Ensemble) |
| Advisor | 蘇朝墩 Su, Chao-Ton; 薛友仁 Shiue, Yeou-Ren |
| Committee Members | 薛友仁 Shiue, Yeou-Ren; 陳穆臻 Chen, Mu-Chen |
| Degree | Master |
| Department | 工學院 - 工業工程與工程管理學系 (Department of Industrial Engineering and Engineering Management, College of Engineering) |
| Year of Publication | 2012 |
| Academic Year of Graduation | 100 (ROC calendar) |
| Language | Chinese |
| Number of Pages | 45 |
| Keywords (Chinese) | 集成式分類器、基因演算法、決策樹、資料探勘 |
| Keywords (English) | Classifier Ensemble, Genetic Algorithm, Decision Tree, Data Mining |
Classification is one of the main tasks of data mining. The classifiers most often proposed in the literature, such as decision trees and artificial neural networks, are all individual classifiers. In recent years, many researchers have pointed out that a classifier ensemble, formed by combining several individual classifiers, can classify more accurately than a single classifier. An ensemble classifies by integrating the outputs of its individual classifiers to reach a final decision, so whether the member classifiers differ from one another, that is, their diversity, becomes an important factor in classification performance.
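To make the combination step concrete, the sketch below aggregates the predictions of several decision trees by majority vote. It assumes scikit-learn and bootstrap resampling purely for illustration; the thesis itself builds its classifiers with C4.5 in Weka [17], [21], so none of the specifics below are taken from the thesis.

```python
# Minimal sketch of a classifier ensemble: each member votes, the majority wins.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
members = []
for _ in range(10):
    # Each member is trained on a bootstrap resample of the training data,
    # so the members differ from one another (the source of diversity here).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    members.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# Combine the individual outputs by majority vote.
votes = np.array([m.predict(X_te) for m in members])   # shape (n_members, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble accuracy:", (majority == y_te).mean())
```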
Because diversity is believed to affect classification accuracy, and few past studies have tried to maximize it, this study generates diversity by manipulating the training samples and proposes an ensemble algorithm, DECRTS, that uses a genetic algorithm to maximize the diversity among the classifiers. DECRTS is compared with representative ensemble methods from the literature in experiments on 21 UCI benchmark data sets. The results show that, among the six algorithms examined, the proposed DECRTS achieves the best average classification accuracy (82.19%), and the difference is statistically significant, indicating that DECRTS improves classification accuracy on most of the data sets studied. The results also show that methods that generate diversity in different ways can give better results on particular data sets.
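The abstract does not state which diversity measure DECRTS maximizes. As one common choice, the average pairwise disagreement, one of the measures surveyed by Kuncheva and Whitaker [7], can be computed from the members' predictions as in the sketch below; this is an illustrative assumption, not the thesis's own definition.

```python
import numpy as np

def pairwise_disagreement(votes: np.ndarray) -> float:
    """Average pairwise disagreement among ensemble members.

    votes has shape (n_members, n_samples) and holds each member's predicted
    labels.  For every pair of members, the disagreement is the fraction of
    samples on which their predictions differ; the returned diversity is the
    average over all pairs.
    """
    n_members = votes.shape[0]
    total, n_pairs = 0.0, 0
    for i in range(n_members):
        for j in range(i + 1, n_members):
            total += float(np.mean(votes[i] != votes[j]))
            n_pairs += 1
    return total / n_pairs

# Toy example: three members' predictions on five samples.
votes = np.array([[0, 1, 1, 0, 2],
                  [0, 1, 2, 0, 2],
                  [1, 1, 1, 0, 2]])
print(pairwise_disagreement(votes))  # ~0.267
```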
Classification is one of the main tasks of data mining. The literature offers many classic base inducers for training classifiers, such as decision trees and neural networks, all of which are individual classifiers. In recent years, many studies have shown that a classifier ensemble, composed of more than one individual classifier, is more effective than any of its individual members. The main idea of a classifier ensemble when classifying a new sample is to combine the outputs of the individual classifiers and then reach a final decision. The diversity among the classifiers is therefore considered an important factor in classification accuracy.
Because few studies have examined how to maximize this diversity, this thesis proposes an ensemble method, DECRTS (Diversity by Evolutionary Computing Resampling Training Subset), that uses a genetic algorithm to encourage diversity among the classifiers by manipulating the training data set. We design an experiment on 21 data sets from the UCI Repository of machine learning databases and compare DECRTS with an individual classifier and with other classifier ensembles. The results show that DECRTS achieves the best average accuracy in our experiment (82.19%) and differs significantly from every other method except AdaBoost (81.99%). Moreover, the experiment shows that methods that create diversity in different ways can perform better on particular data sets.
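The exact chromosome encoding, genetic operators, and fitness function of DECRTS are not specified in this record, so the following is only a rough sketch of the general idea of resampling training subsets with a genetic algorithm to maximize ensemble diversity. Every detail below (binary subset masks per member, disagreement-based fitness, tournament selection, uniform crossover, bit-flip mutation) is an assumption made for illustration, not the thesis's method.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
N, K = len(X), 5                      # training samples, ensemble members
rng = np.random.default_rng(0)

def diversity(masks):
    """Fitness of one candidate ensemble: average pairwise disagreement of the
    members trained on the training subsets selected by the boolean masks
    (masks has shape (K, N))."""
    preds = []
    for m in masks:
        if m.sum() < 10:              # guard against degenerate, near-empty subsets
            return 0.0
        tree = DecisionTreeClassifier(random_state=0).fit(X[m], y[m])
        preds.append(tree.predict(X))
    preds = np.array(preds)
    pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
    return float(np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs]))

# Each GA individual is a (K, N) boolean array: one training-subset mask per member.
pop = [rng.random((K, N)) < 0.7 for _ in range(12)]
for generation in range(10):
    fitness = np.array([diversity(ind) for ind in pop])
    new_pop = [pop[int(fitness.argmax())]]            # elitism: carry over the best
    while len(new_pop) < len(pop):
        # Tournament selection of two parents.
        a = rng.integers(0, len(pop), 2)
        b = rng.integers(0, len(pop), 2)
        p1 = pop[a[0]] if fitness[a[0]] >= fitness[a[1]] else pop[a[1]]
        p2 = pop[b[0]] if fitness[b[0]] >= fitness[b[1]] else pop[b[1]]
        # Uniform crossover followed by bit-flip mutation.
        cross = rng.random((K, N)) < 0.5
        child = np.where(cross, p1, p2)
        child ^= rng.random((K, N)) < 0.01
        new_pop.append(child)
    pop = new_pop

fitness = np.array([diversity(ind) for ind in pop])
print("diversity of the best evolved ensemble:", float(fitness.max()))
```

A complete method would normally balance diversity against member accuracy rather than optimize diversity alone; this sketch optimizes diversity only to mirror the wording of the abstract.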
[1] Rokach, L., 2010, “Ensemble-based classifiers,” Artificial Intelligence Review, Vol. 33, No.1-2, pp.1-39.
[2] Breiman, L., 1996, “Bagging predictors,” Machine Learning, Vol. 24, No.2, pp.123-140.
[3] Freund, Y. and Schapire, R. E., 1996, “Experiments with a new boosting algorithm,” In Proceedings of the 13th International Conference on Machine Learning, pp.148-156, San Francisco, CA: Morgan Kaufmann.
[4] Tumer, K. and Ghosh, J., 1996, “Error Correlation and Error Reduction in Ensemble Classifiers,” Connection Science, Special issue on combining artificial neural networks: ensemble approaches, Vol. 8, No.3-4, pp.385-404.
[5] Krogh, A., and Vedelsby, J., 1995, “Neural network ensembles, cross validation and active learning,” In Advances in Neural Information Processing Systems 7, pp.231-238.
[6] Kuncheva, L.I., 2005, “Using diversity measures for generating error-correcting output codes in classifier ensembles,” Pattern Recognition Letters, Vol. 26, pp.83-90.
[7] Kuncheva, L. and Whitaker, C., 2003, “Measures of diversity in classifier ensembles and their relationship with ensemble accuracy,” Machine Learning, pp.181-207.
[8] Kuncheva, L.I., 2005, “Diversity in multiple classifier systems,” Information Fusion, Vol. 6, No.1, pp.3-4.
[9] Hu, X., 2001, “Using Rough Sets Theory and Database Operations to Construct a Good Ensemble of Classifiers for Data Mining Applications,” ICDM2001, Proceedings IEEE International Conference, pp.233-240.
[10] Zenobi, G. and Cunningham, P., 2001, “Using diversity in preparing ensembles of classifiers based on different feature subsets to minimize generalization error,” In Proceedings of the European Conference on Machine Learning, pp.576-587.
[11] Brown, G., Wyatt, J., Harris, R., Yao, X., 2005, “Diversity creation methods: a survey and categorisation,” Information Fusion, Vol. 6, No.1, pp.5-20.
[12] Skalak, D., 1996, “The sources of increased accuracy for two proposed boosting algorithms,” In Proc. American. Association for Artificial Intelligence, AAAI-96, Integrating Multiple Learned Models Workshop, pp. 120-125.
[13] Tang, E.K., Suganthan, P. N., Yao X., 2006, “An analysis of diversity measures,” Machine Learning, Vol. 65, No. 1, pp.247-271.
[14] Melville, P., and Mooney, R. J., 2005, “Creating diversity in ensembles using artificial data,” Information fusion, Vol. 6, No.1, pp.99-111.
[15] Breiman, L., Friedman, J., Olshen, R., and Stone, C., 1984, “Classification and Regression Trees,” Reading, MA: Wadsworth.
[16] Blake, C.L., Merz, C.J., 1998, “UCI Repository of machine learning databases,” Irvine, CA: University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[17] Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/index.html.
[18] Kohavi, R., 1995, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” C.S. Mellish, Proceedings IJCAI-95, pp.1137-1143, Montreal, Que., Morgan Kaufmann, Los Altos, CA.
[19] Demsar, J., 2006, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, Vol. 7, pp.1-30.
[20] Provost, F., and Fawcett, T., 2001, “Robust classification for imprecise environments,” Machine Learning, Vol. 42, No.3, pp.203-231.
[21] Quinlan, J.R., 1993, “C4.5: Programs for Machine Learning,” Reading, Morgan Kaufmann, Los Altos.
[22] Michalewicz, Z., 1992, “Genetic algorithms + data structures = evolution programs,” New York: Springer.
[23] Quinlan, J.R., 1996, “Bagging, Boosting, and C4.5,” In: Proceedings of the thirteenth national conference on artificial intelligence, pp.725-730.
[24] Maclin, R., and Opitz, D., 1997, “An empirical evaluation of bagging and boosting,” In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp.546-551, Cambridge, MA: MIT Press.
[25] Bauer, E., and Kohavi, R., 1999, “An empirical comparison of voting classification algorithms: Bagging, boosting, and variants,” Machine Learning, Vol. 36, No.1-2, pp.105-139.
[26] Goldberg, D. E., 1989, “Genetic Algorithms in search, Optimization and Machine learning,” Reading, MA: Addison-Wesley.
[27] Shiue, Y.R., Guh, R.S., and Tseng, T.Y., 2009, “GA-based learning bias selection mechanism for real time scheduling systems,” Expert Systems with Applications, Vol. 36, No. 9, pp.11451-11460.
[28] Breiman, L., 2000, “Randomizing outputs to increase prediction accuracy,” Machine Learning, Vol. 40, No.3, pp.229-242.
[29] Rokach, L., 2010, “Pattern Classification Using Ensemble Methods,” Reading, Singapore: World Scientific Publishing.
[30] Dudoit, S., Fridlyand, J., 2001, “Bagging to Improve the Accuracy of a Clustering Procedure,” Technical Report 600, Department of Statistics, University of California, Berkeley; also published in Bioinformatics, Vol. 19, No. 9, pp.1090-1099.
[31] Dos Santos, E.M., Sabourin, R., Maupin, P., 2006, “Single and multi-objective genetic algorithms for the selection of ensemble of classifiers,” Proceedings of International Joint Conference on Neural Networks, pp.5377-5384.
[32] Blanco, A., Delgado, M., Pegalajr, M.C., 2001, “A real-coded genetic algorithm for training recurrent neural networks,” Neural Networks, Vol. 14, pp.93-105.
[33] 蘇朝墩, 2002, 品質工程 (Quality Engineering), 中華民國品質學會 (Chinese Society for Quality).
[34] Tan, P.-N., Steinbach, M., and Kumar, V., 2008, 資料探勘 (Data Mining), translated by 施雅月 and 賴錦慧, Pearson, ISBN: 978-986-154-657-5.