Graduate Student: 林若涵 Lin, Jo-Han
Thesis Title: 利用分類樹演算法偵測代表性不足之高同質性群體 Using Classification Trees for Detecting (Almost)-Perfectly-Classified Minority Groups
Advisor: 徐茉莉 Shmueli, Galit
Committee Members: 林福仁 Lin, Fu-Ren; 李曉惠 Lee, Hsiao-Hui
Degree: Master (碩士)
Department: Institute of Service Science, College of Technology Management (科技管理學院 - 服務科學研究所)
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Number of Pages: 42
Keywords (Chinese): 演算法偏差、不平衡問題、不平衡預測變數組合、分類樹
Keywords (English): Algorithmic Bias, Data Imbalance, Predictor Combination Imbalance, Classification Tree
Abstract (Chinese):

The emergence of algorithmic bias has drawn considerable attention in the data mining field. Although algorithms are intended to reduce the human biases involved in decision making, they rely on training with large amounts of data; when the data suffer from problems such as a lack of diversity, the data themselves introduce bias into the algorithm, producing unfavorable outcomes for certain groups. Most previous research has focused on bias caused by imbalance in the training data, including class imbalance and predictor imbalance.
This study aims to detect under-represented, highly homogeneous subgroups defined by combinations of predictors. We extend the problem from detecting discriminated individuals or groups to identifying specific behaviors, such as a subset of users who exhibit very similar outcomes on a mobile application. By detecting highly homogeneous subgroups and studying their patterns, we can provide app designers with recommendations for interface design and optimization; in addition, this information can indicate which users warrant further qualitative research.
In this study we compare the suitability of impurity-based classification trees and statistics-based classification trees for this task. We find that, compared with statistics-based trees that rely on permutation tests, impurity-based trees, which evaluate every candidate split point of each predictor with an impurity measure (in particular the Gini impurity measure) and do not automatically correct for predictor bias, are more likely to detect under-represented, highly homogeneous subgroups. To illustrate this issue, we apply the classification tree algorithms to a large dataset on app user behavior collected by an electric-scooter sharing rental service company.
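For reference, the Gini impurity measure mentioned above has a standard formulation (the textbook CART definition, not reproduced from the thesis): for a node $t$ with class proportions $p_1, \dots, p_K$,

$$G(t) = 1 - \sum_{k=1}^{K} p_k^2,$$

and a candidate split of $t$ into children $t_L$ and $t_R$ (containing $n_L$ and $n_R$ of the node's $n$ records) is scored by the impurity reduction $\Delta G = G(t) - \tfrac{n_L}{n}\,G(t_L) - \tfrac{n_R}{n}\,G(t_R)$. An (almost-)perfectly-classified subgroup then corresponds to a leaf whose Gini impurity is at or near zero.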
Abstract (English):

The presence of algorithmic bias has recently attracted a lot of attention in the data mining community. Although data mining algorithms are designed to tackle and reduce human bias in decision making, the algorithms are trained on data, which can itself introduce bias into the algorithms and thus generate unwanted outcomes that discriminate against certain categories of people. Previous studies have focused on biases that arise from imbalance issues in the training data: both class imbalance (an unbalanced outcome) and predictor imbalance can lead to bias towards the majority class.
In this research, we extend the study of detecting minority subgroups by considering combinations of predictors that create subgroups which are almost perfectly classified. We also extend the problem from detecting discriminated individuals or groups to identifying specific behavior profiles, such as on web and mobile applications, that have extremely homogeneous outcomes. Detecting homogeneous subgroups and studying their patterns and profiles can provide insights for app designers; such information can also point to patterns that require further qualitative investigation.
We focus on decision trees and compare the suitability of impurity-based trees and statistics-based trees for this task. We find that the most potent approach is using impurity-based CART-type trees, such as those constructed by rpart in R, which do not correct for predictor bias and use impurity measures for selecting splits. Specifically, we find that using the Gini impurity measure is most suitable. This approach is more likely to find homogeneous subgroups compared to the two-step, permutation-test-based approach taken by statistics-based trees such as ctree. To illustrate these issues, we apply the different tree approaches to a large dataset on app user behavior collected by a leading e-scooter sharing economy service.
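As a rough illustration of the comparison above, the R sketch below fits both tree types to simulated app-usage data containing one almost perfectly classified predictor combination. The dataset, variable names (e.g., `os`, `completed`), and control settings are illustrative assumptions, not the thesis's actual data or code; only the rpart and ctree calls mirror the two approaches named in the abstract.

```r
# Minimal sketch: impurity-based CART tree (rpart, Gini splits) vs.
# statistics-based conditional inference tree (ctree, permutation tests)
# on simulated app-usage data with one nearly pure predictor combination.
library(rpart)
library(partykit)

set.seed(1)
n <- 5000
app <- data.frame(
  os       = factor(sample(c("iOS", "Android"), n, replace = TRUE)),
  weekday  = factor(sample(c("weekday", "weekend"), n, replace = TRUE)),
  sessions = rpois(n, lambda = 3)
)
# Outcome is almost perfectly classified inside one predictor combination.
app$completed <- factor(rbinom(n, 1,
  ifelse(app$os == "iOS" & app$weekday == "weekend", 0.98, 0.50)))

# Impurity-based tree: Gini splits, no correction for predictor selection bias.
# Loose cp/minbucket settings keep small, highly pure leaves from being pruned.
fit_rpart <- rpart(completed ~ ., data = app, method = "class",
                   parms = list(split = "gini"),
                   control = rpart.control(cp = 0.001, minbucket = 20))

# Statistics-based tree: splits chosen by permutation tests (default alpha = 0.05).
fit_ctree <- ctree(completed ~ ., data = app,
                   control = ctree_control(alpha = 0.05))

# Inspect the leaves: look for small nodes with near-pure class distributions.
print(fit_rpart)
print(fit_ctree)
```

Comparing the printed trees shows which leaves each method isolates and how pure they are; per the thesis's findings, the Gini-based tree is the one more likely to carve out small, nearly pure subgroups.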
References:

Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica. Retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Baer, T. (2019). Understand, manage, and prevent algorithmic bias: A guide for business users and data scientists. Apress.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Hung, T. H. (2018). Investigating the effects of unbalanced predictors on identifying discriminatory predictors (Master's thesis). National Tsing Hua University, Hsinchu, Taiwan.
Jokar, P., Arianpoo, N., & Leung, V. C. (2015). Electricity theft detection in AMI using customers’ consumption patterns. IEEE Transactions on Smart Grid, 7(1), 216-226.
Kumar, M., & Sheshadri, H. S. (2012). On the classification of imbalanced datasets. International Journal of Computer Applications, 44(8), 1-7.
Marshall, D. (2013). Recognizing your unconscious bias. Business Matters. Retrieved from https://www.bmmagazine.co.uk/in-business/recognising-unconscious-bias/
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1), 50-59.
Rokach, L., & Maimon, O. (2005). Decision trees. In Data mining and knowledge discovery handbook (pp. 165-192). Springer, Boston, MA.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R. John Wiley & Sons.
Spanakis, E. K., & Golden, S. H. (2013). Race/ethnic difference in diabetes and diabetic complications. Current Diabetes Reports, 13(6), 814-823.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25.
Suresh, H., & Guttag, J. V. (2019). A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002.
Therneau, T., & Atkinson, B. (2019). rpart: Recursive partitioning and regression trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909. http://jmlr.org/papers/v16/hothorn15a.html