
Graduate Student: Yang, Yu Tai (楊御台)
Thesis Title: A Hybrid Filter/Wrapper Method Using Simplified Swarm Optimization for Feature Selection in High-Dimensional Imbalanced Data
Advisor: Yeh, Wei Chang (葉維彰)
Committee Members: 黃佳玲, 劉淑範
Degree: Master
Department: College of Engineering, Department of Industrial Engineering and Engineering Management
Year of Publication: 2016
Graduation Academic Year: 104 (ROC calendar)
Language: English
Number of Pages: 44
Keywords: feature selection, imbalanced data, simplified swarm optimization


    In recent years, feature selection has become an important field in data mining and has been widely applied in numerous areas. The purpose of feature selection is to select an optimal subset of features from the existing data so as to maximize classification accuracy. However, few studies have investigated the impact of data imbalance, i.e., the situation in which one class has far fewer instances than the others, on the feature selection problem. The aim of this study is therefore to provide a feature selection method that improves classification accuracy on high-dimensional imbalanced datasets. To this end, we propose a hybrid method that can find a better feature subset.
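    To see why imbalance matters, consider the toy sketch below (Python; the 95:5 class split and all numbers are illustrative assumptions, not figures from the thesis): a classifier that always predicts the majority class reaches 95% accuracy while never detecting the minority class.

```python
# Toy illustration (assumed 95:5 class split) of why plain accuracy
# is misleading on imbalanced data: a trivial majority-class predictor
# scores high accuracy with zero minority recall.
import numpy as np

y = np.array([0] * 95 + [1] * 5)     # 95 majority, 5 minority labels
y_pred = np.zeros_like(y)            # always predict the majority class
accuracy = (y_pred == y).mean()      # 0.95
minority_recall = (y_pred[y == 1] == 1).mean()  # 0.0
print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```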
    In the proposed method, information gain is first used as a filter to select the most informative features from the original dataset. The class imbalance of the dataset with the selected features is then corrected by the synthetic minority over-sampling technique (SMOTE). Next, simplified swarm optimization (SSO) is employed as the search engine that guides the search for an optimal feature subset. Finally, a support vector machine (SVM) serves as the classifier that evaluates the performance of the proposed method. To evaluate the proposed algorithm, we apply it to ten benchmark datasets and compare the results with those of existing algorithms. The results show that the proposed algorithm achieves better accuracy than its competitors.
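    The sketch below shows one plausible rendering of this filter/wrapper pipeline in Python. It is a minimal illustration under stated assumptions, not the thesis implementation: information gain is approximated with scikit-learn's mutual_info_classif, SMOTE is taken from the imbalanced-learn package, and the SSO update uses the common three-threshold rule (each bit is kept, copied from the particle's best, copied from the global best, or re-randomized, according to a random number against Cw < Cp < Cg). All parameter values (n_top, Cw, Cp, Cg, population size, iteration count) are placeholders.

```python
# Minimal sketch of an IG (filter) -> SMOTE -> SSO (wrapper) -> SVM pipeline.
# All names and parameter values are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

def ig_filter(X, y, n_top=200):
    """Filter stage: keep the n_top features with the highest information
    gain (approximated here by mutual information)."""
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:n_top]

def fitness(mask, X, y):
    """Wrapper fitness: 5-fold cross-validated SVM accuracy on the
    features selected by the binary mask."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=5).mean()

def sso_feature_select(X, y, n_particles=20, n_iter=50,
                       Cw=0.15, Cp=0.45, Cg=0.85, seed=0):
    """Binary SSO search: per bit, keep the current value, copy the
    particle best, copy the global best, or re-randomize, depending on
    where a uniform random number falls relative to Cw < Cp < Cg."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.integers(0, 2, size=(n_particles, d))
    pbest = pop.copy()
    pfit = np.array([fitness(m, X, y) for m in pop])
    g_idx = int(pfit.argmax())
    gbest, gfit = pbest[g_idx].copy(), float(pfit[g_idx])
    for _ in range(n_iter):
        for i in range(n_particles):
            rho = rng.random(d)
            new = np.where(rho < Cw, pop[i],
                  np.where(rho < Cp, pbest[i],
                  np.where(rho < Cg, gbest,
                           rng.integers(0, 2, size=d))))
            pop[i] = new
            f = fitness(new, X, y)
            if f > pfit[i]:
                pbest[i], pfit[i] = new.copy(), f
                if f > gfit:
                    gbest, gfit = new.copy(), f
    return gbest, gfit

# End-to-end usage (X, y are a feature matrix and labels):
# keep = ig_filter(X, y); X_f = X[:, keep]
# X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_f, y)
# best_mask, best_acc = sso_feature_select(X_bal, y_bal)
```

    Note that the nested np.where vectorizes the four-way SSO choice across all bits of a particle at once, which keeps the wrapper loop short and readable.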

    Table of Contents:
    Acknowledgement
    Abstract
    List of Tables
    List of Illustrations
    Chapter 1 Introduction
      1.1 Background and Motivation
      1.2 Framework and Organization
    Chapter 2 Literature Review
      2.1 Imbalanced Data
      2.2 Feature Selection
    Chapter 3 Methodology
      3.1 Imbalanced Data
      3.2 Information Gain
      3.3 Simplified Swarm Optimization (SSO)
      3.4 Support Vector Machine (SVM)
    Chapter 4 Proposed Method
      4.1 Encoding Method
      4.2 Fitness Function
      4.3 The Proposed Method
    Chapter 5 Experiment Result
      5.1 Experiment Datasets
      5.2 Result
    Chapter 6 Conclusion
      6.1 Conclusion
      6.2 Limitation and Future Study
    Reference
    Appendix
      A.1 Selected features in IG-SSO


    Full Text Availability: Not authorized for public release (campus network, off-campus network, and the National Central Library Taiwan thesis system).