Graduate Student: 林若涵 Lin, Jo-Han
Thesis Title: 利用分類樹演算法偵測代表性不足之高同質性群體 Using Classification Trees for Detecting (Almost)-Perfectly-Classified Minority Groups
Advisor: 徐茉莉 Shmueli, Galit
Committee Members: 林福仁 Lin, Fu-Ren; 李曉惠 Lee, Hsiao-Hui
Degree: Master (碩士)
Department: Institute of Service Science, College of Technology Management (科技管理學院 - 服務科學研究所)
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Number of Pages: 42
Keywords (Chinese): 演算法偏差、不平衡問題、不平衡預測變數組合、分類樹
Keywords (English): Algorithmic Bias, Data Imbalance, Predictor Combination Imbalance, Classification Tree
Abstract (Chinese):

The emergence of algorithmic bias has drawn considerable attention in the data mining field. Although algorithms are intended to reduce the human biases involved in decision making, they rely on training with large amounts of data; when the data suffer from problems such as a lack of diversity, the data themselves introduce bias into the algorithm, producing unfavorable outcomes for certain groups. Most previous research has focused on bias caused by imbalance in the training data, including class imbalance and predictor imbalance.
This study aims to detect under-represented, highly homogeneous subgroups defined by combinations of predictors. We extend the problem from detecting discriminated individuals or groups to identifying specific behaviors, such as a subset of users who exhibit very similar outcomes on a mobile application. By detecting highly homogeneous subgroups and studying their patterns, we can provide app designers with recommendations for interface design and optimization; in addition, this information can indicate which users warrant further qualitative research.
In this study we compare the suitability of impurity-based classification trees and statistics-based classification trees for this task. We find that, compared with statistics-based trees that rely on permutation tests, impurity-based trees, which evaluate every candidate split point of each predictor with an impurity measure (in particular the Gini impurity measure) and do not automatically correct for predictor bias, are more likely to detect under-represented, highly homogeneous subgroups. To illustrate this issue, we apply the classification tree algorithms to a large dataset on app user behavior collected by an electric-scooter sharing rental service company.
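For reference, the Gini impurity measure mentioned above has a standard formulation (the textbook CART definition, not reproduced from the thesis): for a node $t$ with class proportions $p_1, \dots, p_K$,

$$G(t) = 1 - \sum_{k=1}^{K} p_k^2,$$

and a candidate split of $t$ into children $t_L$ and $t_R$ (containing $n_L$ and $n_R$ of the node's $n$ records) is scored by the impurity reduction $\Delta G = G(t) - \tfrac{n_L}{n}\,G(t_L) - \tfrac{n_R}{n}\,G(t_R)$. An (almost-)perfectly-classified subgroup then corresponds to a leaf whose Gini impurity is at or near zero.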
Abstract (English):

The presence of algorithmic bias has recently attracted a lot of attention in the data mining community. Although data mining algorithms are designed to tackle and reduce human bias in decision making, the algorithms are trained on data, which can itself introduce bias into the algorithms and thus generate unwanted outcomes that discriminate against certain categories of people. Previous studies have focused on biases that arise from imbalance issues in the training data: both class imbalance (an unbalanced outcome) and predictor imbalance can lead to bias towards the majority class.
In this research, we extend the study of detecting minority subgroups by considering combinations of predictors that create subgroups which are almost perfectly classified. We also extend the problem from detecting discriminated individuals or groups to identifying specific behavior profiles, such as on web and mobile applications, that have extremely homogeneous outcomes. Detecting homogeneous subgroups and studying their patterns and profiles can provide insights for app designers; such information can also point to patterns that require further qualitative investigation.
We focus on decision trees and compare the suitability of impurity-based trees and statistics-based trees for this task. We find that the most potent approach is using impurity-based CART-type trees, such as those constructed by rpart in R, which do not correct for predictor bias and use impurity measures for selecting splits. Specifically, we find that using the Gini impurity measure is most suitable. This approach is more likely to find homogeneous subgroups compared to the two-step, permutation-test-based approach taken by statistics-based trees such as ctree. To illustrate these issues, we apply the different tree approaches to a large dataset on app user behavior collected by a leading e-scooter sharing economy service.
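As a rough illustration of the comparison above, the R sketch below fits both tree types to simulated app-usage data containing one almost perfectly classified predictor combination. The dataset, variable names (e.g., `os`, `completed`), and control settings are illustrative assumptions, not the thesis's actual data or code; only the rpart and ctree calls mirror the two approaches named in the abstract.

```r
# Minimal sketch: impurity-based CART tree (rpart, Gini splits) vs.
# statistics-based conditional inference tree (ctree, permutation tests)
# on simulated app-usage data with one nearly pure predictor combination.
library(rpart)
library(partykit)

set.seed(1)
n <- 5000
app <- data.frame(
  os       = factor(sample(c("iOS", "Android"), n, replace = TRUE)),
  weekday  = factor(sample(c("weekday", "weekend"), n, replace = TRUE)),
  sessions = rpois(n, lambda = 3)
)
# Outcome is almost perfectly classified inside one predictor combination.
app$completed <- factor(rbinom(n, 1,
  ifelse(app$os == "iOS" & app$weekday == "weekend", 0.98, 0.50)))

# Impurity-based tree: Gini splits, no correction for predictor selection bias.
# Loose cp/minbucket settings keep small, highly pure leaves from being pruned.
fit_rpart <- rpart(completed ~ ., data = app, method = "class",
                   parms = list(split = "gini"),
                   control = rpart.control(cp = 0.001, minbucket = 20))

# Statistics-based tree: splits chosen by permutation tests (default alpha = 0.05).
fit_ctree <- ctree(completed ~ ., data = app,
                   control = ctree_control(alpha = 0.05))

# Inspect the leaves: look for small nodes with near-pure class distributions.
print(fit_rpart)
print(fit_ctree)
```

Comparing the printed trees shows which leaves each method isolates and how pure they are; per the thesis's findings, the Gini-based tree is the one more likely to carve out small, nearly pure subgroups.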
References:

Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias. ProPublica. Retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
Baer, T. (2019). Understand, manage, and prevent algorithmic bias: A guide for business users and data scientists. Apress.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Hung, T. H. (2018). Investigating the effects of unbalanced predictors on identifying discriminatory predictors (Master's thesis). National Tsing Hua University, Hsinchu, Taiwan.
Jokar, P., Arianpoo, N., & Leung, V. C. (2015). Electricity theft detection in AMI using customers’ consumption patterns. IEEE Transactions on Smart Grid, 7(1), 216-226.
Kumar, M., & Sheshadri, H. S. (2012). On the classification of imbalanced datasets. International Journal of Computer Applications, 44(8), 1-7.
Marshall, D. (2013). Recognizing your unconscious bias. Business Matters. Retrieved from https://www.bmmagazine.co.uk/in-business/recognising-unconscious-bias/
Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1), 50-59.
Rokach, L., & Maimon, O. (2005). Decision trees. In Data mining and knowledge discovery handbook (pp. 165-192). Springer, Boston, MA.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R. John Wiley & Sons.
Spanakis, E. K., & Golden, S. H. (2013). Race/ethnic difference in diabetes and diabetic complications. Current Diabetes Reports, 13(6), 814-823.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25.
Suresh, H., & Guttag, J. V. (2019). A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002.
Therneau, T., & Atkinson, B. (2019). rpart: Recursive partitioning and regression trees. R package version 4.1-15. https://CRAN.R-project.org/package=rpart
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909. http://jmlr.org/papers/v16/hothorn15a.html