| Field | Value |
|---|---|
| Graduate Student | Yang, Chang-Ming (楊長鳴) |
| Thesis Title | Sampling, Weighting and Probability Correction for Classifying Imbalanced Data Using Decision Trees |
| Advisor | Shmueli, Galit (徐茉莉) |
| Committee Members | Ray, Soumya (雷松亞); Lin, Fu-Ren (林福仁) |
| Degree | Master |
| Department | College of Technology Management – Institute of Service Science |
| Publication Year | 2017 |
| Graduation Academic Year | 105 (ROC calendar) |
| Language | English |
| Pages | 53 |
| Keywords (Chinese, translated) | Imbalanced data; Decision trees; Explanatory and predictive models; Weights; Probability; Sampling; Correction |
| Keywords (English) | Explanatory modeling; Predictive modeling; Sampling; Weights; Correction |
This thesis explores imbalanced data from three angles: (1) how to estimate the population class distribution when observations from one class are hard to obtain; (2) how the explanatory and predictive models built with logistic regression, discriminant analysis, and decision trees differ when handling imbalanced data; and (3) how to choose performance metrics suited to the analytical goal.
The research question examines logistic regression, discriminant analysis, and decision trees on imbalanced data: classification models are built on data processed by majority-class undersampling, the models are then adjusted with weights and with probability correction (data correction), and the relationship between the two adjustments is studied. We verify the following:
(1) The relationship between weighted and unweighted decision tree models built from undersampled data (see the sketch after this list).
(2) Whether a decision tree can be corrected through its probabilities so as to recover the population model.
(3) The differences between explanatory and predictive decision tree models built on imbalanced data.
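As a minimal sketch of point (1), assuming scikit-learn and synthetic data (not the datasets used in the thesis), the following trains an unweighted and a weighted decision tree on a majority-undersampled sample; the weights give each retained majority case the inverse of its sampling rate, so the weighted sample again represents the population:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: roughly 7% minority class.
X = rng.normal(size=(10_000, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=10_000) > 3.3).astype(int)

# Undersample the majority class to a 1:1 ratio.
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size, replace=False)
idx = np.concatenate([minority, majority])

# Unweighted tree on the balanced sample.
tree_plain = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_plain.fit(X[idx], y[idx])

# Weighted tree: each sampled majority case is up-weighted by the inverse
# of the fraction of majority cases kept, undoing the undersampling.
beta = minority.size / (y == 0).sum()
w = np.where(y[idx] == 0, 1.0 / beta, 1.0)
tree_weighted = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_weighted.fit(X[idx], y[idx], sample_weight=w)
```

Comparing the split variables and leaf probabilities of the two trees is one concrete way to study the relationship in point (1).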
We experiment with different datasets and data corrections, and find that when logistic regression or discriminant analysis is used to correct for imbalance, the resulting explanatory and predictive models do not differ. In contrast, correcting imbalanced data with decision trees yields different explanatory and predictive models; when modeling imbalanced data with decision trees, the analytical goal must therefore be settled clearly in advance.
In this thesis, we study three analytical goals related to imbalanced data: (1) determining the population class distribution when it is difficult to obtain sufficient observations on one of the classes; (2) comparing explanatory and predictive modeling of imbalanced data using logistic regression, discriminant analysis, and decision trees; and (3) considering suitable performance evaluation metrics for different purposes.
Our research question focuses on comparing weighting and intercept correction for undersampled imbalanced data across three models: logistic regression, discriminant analysis, and decision trees (a standard form of the intercept correction is given after the list below). Specifically, we study the following:
(1) The relationship between a weighted and an unweighted decision tree trained on undersampled data.
(2) Whether rules exist for correcting the probability cutoff of a decision tree so as to recover the population model.
(3) The difference between explanatory and predictive modeling of imbalanced data.
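For logistic regression, the intercept correction referenced above has a well-known closed form (King and Zeng's prior correction). As a sketch in notation introduced here, let $\tau$ be the population proportion of the minority class and $\bar{y}$ its proportion in the undersampled training sample; only the intercept of the fitted model needs adjusting:

$$
\hat{\beta}_0^{\mathrm{corrected}} \;=\; \hat{\beta}_0 \;-\; \ln\!\left[\left(\frac{1-\tau}{\tau}\right)\!\left(\frac{\bar{y}}{1-\bar{y}}\right)\right]
$$

The slope coefficients are unchanged, so the correction shifts every predicted log-odds by the same constant; this is consistent with the finding below that logistic regression gives the same explanatory and predictive conclusions after correction.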
We study these questions using several datasets and different corrections. We find that, when models are trained across different class distributions and the imbalanced dataset is very large, weighting and intercept correction with logistic regression and discriminant analysis lead to consistent results in both explanatory and predictive tasks. In contrast, decision trees produce different results when we investigate explanatory factors (the variables in the tree) versus when we predict or rank new observations. Imbalanced data should therefore be handled with care when modeling with decision trees.
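To make question (2) concrete, one common probability correction from the undersampling calibration literature (an Elkan-style adjustment, not necessarily the exact correction used in the thesis) maps the minority-class scores of a tree trained on undersampled data back to the population scale. A minimal sketch, assuming NumPy and a hypothetical undersampling rate `beta` (the fraction of majority-class cases kept):

```python
import numpy as np

def correct_undersampled_probability(p_s, beta):
    """Map minority-class scores from a model trained on undersampled data
    back to population-scale probabilities.

    p_s  : score(s) P(y=1 | x) estimated on the undersampled training data
    beta : fraction of majority-class cases retained by undersampling
    """
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: keeping 10% of the majority class inflates a true ~9.1%
# probability to 50% on the balanced sample; the correction undoes this.
print(correct_undersampled_probability(0.5, beta=0.1))  # ~0.0909
```

Applying such a correction to leaf probabilities changes rankings and cutoffs, which is one reason the explanatory and predictive views of a decision tree can diverge on imbalanced data.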