Graduate Student: | Chen, Shih-Cheng |
Thesis Title: | An Improved Synthetic Minority Over-sampling Technique for Imbalanced Data Set Learning |
Advisor: | 林華君 Lin, Hwa-Chun |
Committee Members: | 陳俊良, 蔡榮宗 |
Degree: | Master |
Department: | |
Year of Publication: | 2017 |
Academic Year of Graduation: | 105 |
Language: | Chinese |
Pages: | 59 |
Chinese Keywords: | class imbalance problem, over-sampling technique |
Foreign Keywords: | Over-sampling Technique, Imbalanced Data Set Learning |
When the minority class of a data set has far fewer instances than the other classes, the data set may suffer from the class imbalance problem: a classification model trained on it is likely to misclassify minority-class instances as majority-class instances, because minority-class instances are encountered with low probability during training.
One remedy is to generate synthetic minority-class instances to balance the distribution between the majority and minority classes, and a variety of algorithms have been designed on this principle. This study proposes a novel algorithm, ISMOTE, to address the class imbalance problem. Unlike previous algorithms, ISMOTE does not consider only the distribution of the minority class; it also measures the relative density of the minority and majority classes and uses this as the basis for weighting. In addition, our method generates each synthetic instance using a minority-class instance and its nearest majority-class instance as reference points. This reduces the chance of generating erroneous synthetic instances that make the classifier harder to train, and the synthetic instances produced this way better support the classifier's learning.
Each minority-class instance carries a weight reflecting how difficult it is for the classifier to learn; the weighting formula is designed to be proportional to this degree of learning difficulty. ISMOTE can therefore generate, for each minority-class instance, a number of synthetic instances corresponding to its weight, gradually shifting the decision boundary toward the harder-to-learn regions.
When the minority class of a data set has far fewer instances than the other classes, the data set may imply a class imbalance problem, meaning that the trained classification model is likely to misclassify minority-class instances as majority-class instances because minority-class instances are encountered with low probability during training.
Generating synthetic minority-class instances to balance the distribution between the majority and minority classes is one solution strategy, and a variety of algorithms have been designed based on this concept. This study proposes a novel algorithm, ISMOTE, to solve the class imbalance problem. ISMOTE differs from previous algorithms in that it does not take only the minority-class distribution into account; it also measures the relative density of the minority and majority classes and uses this as the basis for weighting. In addition, our approach generates each synthetic instance using a minority-class instance and its nearest majority-class instance as reference points. This approach reduces the situations in which erroneous synthetic instances make the classifier's learning more difficult, and the synthetic instances produced this way better help the classifier to learn.
Each minority-class instance has a weight representing how difficult that instance is for the classifier to learn. The weighting formula is designed to be proportional to this degree of learning difficulty. ISMOTE can therefore generate, for each minority-class instance, a number of synthetic instances corresponding to its weight, gradually shifting the classification decision boundary toward the harder-to-learn regions.
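The weighting-and-generation scheme described in the abstract can be sketched roughly as follows. This is an illustrative approximation only, not the thesis's exact formulas: the function name, the neighbour-count difficulty heuristic, and the 0–0.5 interpolation gap are all assumptions made for the sketch.

```python
import numpy as np

def ismote_like_oversample(X_min, X_maj, n_synth, k=5, seed=None):
    """Illustrative ISMOTE-style oversampler (not the thesis's exact method).

    1. Weight each minority instance by how hard it is to learn, approximated
       here by the fraction of majority points among its k nearest neighbours.
    2. Allocate the n_synth synthetic instances across minority points in
       proportion to those weights.
    3. Generate each synthetic point on the segment between a minority
       instance and its nearest majority instance, keeping it on the
       minority side (interpolation gap drawn from [0, 0.5)).
    """
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)

    # Step 1: difficulty weight from the local class mix around each
    # minority instance (indices >= n_min in X_all are majority points).
    weights = np.empty(n_min)
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn = np.argsort(d)[:k]
        weights[i] = np.mean(nn >= n_min)
    if weights.sum() == 0:                 # no hard points: weight uniformly
        weights[:] = 1.0
    weights /= weights.sum()

    # Step 2: proportional allocation of synthetic instances.
    counts = rng.multinomial(n_synth, weights)

    # Step 3: interpolate toward the nearest majority instance.
    synth = []
    for i, c in enumerate(counts):
        if c == 0:
            continue
        d_maj = np.linalg.norm(X_maj - X_min[i], axis=1)
        ref = X_maj[np.argmin(d_maj)]      # nearest majority reference point
        for _ in range(c):
            gap = rng.uniform(0.0, 0.5)    # stay on the minority side
            synth.append(X_min[i] + gap * (ref - X_min[i]))
    return np.array(synth)
```

Because harder minority instances (those surrounded by majority points) receive larger weights, more synthetic instances appear near the decision boundary, which is the boundary-shifting effect the abstract describes.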