| Field | Value |
|---|---|
| Graduate Student | Yang, Chang-Ming (楊長鳴) |
| Thesis Title | Sampling, Weighting and Probability Correction for Classifying Imbalanced Data Using Decision Trees |
| Advisor | Shmueli, Galit (徐茉莉) |
| Committee Members | Ray, Soumya (雷松亞); Lin, Fu-Ren (林福仁) |
| Degree | Master |
| Department | College of Technology Management – Institute of Service Science |
| Publication Year | 2017 |
| Graduation Academic Year | 105 (ROC calendar) |
| Language | English |
| Pages | 53 |
| Keywords (Chinese, translated) | Imbalanced data; Decision trees; Explanatory and predictive models; Weights; Probability; Sampling; Correction |
| Keywords (English) | Explanatory modeling; Predictive modeling; Sampling; Weights; Correction |
This thesis explores imbalanced data from three angles: (1) how to estimate the population class distribution when observations from one class are hard to obtain; (2) how the explanatory and predictive models built with logistic regression, discriminant analysis, and decision trees differ when handling imbalanced data; and (3) how to choose performance metrics suited to the analytical goal.
The research question examines logistic regression, discriminant analysis, and decision trees on imbalanced data: classification models are built on data processed by majority-class undersampling, the models are then adjusted with weights and with probability correction (data correction), and the relationship between the two adjustments is studied. We verify the following:
(1) The relationship between weighted and unweighted decision tree models built from undersampled data (see the sketch after this list).
(2) Whether a decision tree can be corrected through its probabilities so as to recover the population model.
(3) The differences between explanatory and predictive decision tree models built on imbalanced data.
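As a minimal sketch of point (1), assuming scikit-learn and synthetic data (not the datasets used in the thesis), the following trains an unweighted and a weighted decision tree on a majority-undersampled sample; the weights give each retained majority case the inverse of its sampling rate, so the weighted sample again represents the population:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: roughly 7% minority class.
X = rng.normal(size=(10_000, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=10_000) > 3.3).astype(int)

# Undersample the majority class to a 1:1 ratio.
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size, replace=False)
idx = np.concatenate([minority, majority])

# Unweighted tree on the balanced sample.
tree_plain = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_plain.fit(X[idx], y[idx])

# Weighted tree: each sampled majority case is up-weighted by the inverse
# of the fraction of majority cases kept, undoing the undersampling.
beta = minority.size / (y == 0).sum()
w = np.where(y[idx] == 0, 1.0 / beta, 1.0)
tree_weighted = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_weighted.fit(X[idx], y[idx], sample_weight=w)
```

Comparing the split variables and leaf probabilities of the two trees is one concrete way to study the relationship in point (1).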
We experiment with different datasets and data corrections, and find that when logistic regression or discriminant analysis is used to correct for imbalance, the resulting explanatory and predictive models do not differ. In contrast, correcting imbalanced data with decision trees yields different explanatory and predictive models; when modeling imbalanced data with decision trees, the analytical goal must therefore be settled clearly in advance.
In this thesis, we study three analytical goals related to imbalanced data: (1) determining the population class distribution when it is difficult to obtain sufficient observations on one of the classes; (2) comparing explanatory and predictive modeling of imbalanced data using logistic regression, discriminant analysis, and decision trees; and (3) considering suitable performance evaluation metrics for different purposes.
Our research question focuses on comparing weighting and intercept correction for undersampled imbalanced data across three models: logistic regression, discriminant analysis, and decision trees (a standard form of the intercept correction is given after the list below). Specifically, we study the following:
(1) The relationship between a weighted and an unweighted decision tree trained on undersampled data.
(2) Whether rules exist for correcting the probability cutoff of a decision tree so as to recover the population model.
(3) The difference between explanatory and predictive modeling of imbalanced data.
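For logistic regression, the intercept correction referenced above has a well-known closed form (King and Zeng's prior correction). As a sketch in notation introduced here, let $\tau$ be the population proportion of the minority class and $\bar{y}$ its proportion in the undersampled training sample; only the intercept of the fitted model needs adjusting:

$$
\hat{\beta}_0^{\mathrm{corrected}} \;=\; \hat{\beta}_0 \;-\; \ln\!\left[\left(\frac{1-\tau}{\tau}\right)\!\left(\frac{\bar{y}}{1-\bar{y}}\right)\right]
$$

The slope coefficients are unchanged, so the correction shifts every predicted log-odds by the same constant; this is consistent with the finding below that logistic regression gives the same explanatory and predictive conclusions after correction.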
We study these questions using several datasets and different corrections. We find that, when models are trained across different class distributions and the imbalanced dataset is very large, weighting and intercept correction with logistic regression and discriminant analysis lead to consistent results in both explanatory and predictive tasks. In contrast, decision trees produce different results when we investigate explanatory factors (the variables in the tree) versus when we predict or rank new observations. Imbalanced data should therefore be handled with care when modeling with decision trees.
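To make question (2) concrete, one common probability correction from the undersampling calibration literature (an Elkan-style adjustment, not necessarily the exact correction used in the thesis) maps the minority-class scores of a tree trained on undersampled data back to the population scale. A minimal sketch, assuming NumPy and a hypothetical undersampling rate `beta` (the fraction of majority-class cases kept):

```python
import numpy as np

def correct_undersampled_probability(p_s, beta):
    """Map minority-class scores from a model trained on undersampled data
    back to population-scale probabilities.

    p_s  : score(s) P(y=1 | x) estimated on the undersampled training data
    beta : fraction of majority-class cases retained by undersampling
    """
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# Example: keeping 10% of the majority class inflates a true ~9.1%
# probability to 50% on the balanced sample; the correction undoes this.
print(correct_undersampled_probability(0.5, beta=0.1))  # ~0.0909
```

Applying such a correction to leaf probabilities changes rankings and cutoffs, which is one reason the explanatory and predictive views of a decision tree can diverge on imbalanced data.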