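Graduate student: | 吳思葦 Wu, Sz-Wei |
---|---|
Thesis title: | 釐清 Gains chart 與 Lift chart 之混淆以增進實務中的有效應用 Clarifying Confusions about Gains and Lift Charts to Improve Their Current Underuse in Practice |
Advisor: | 徐茉莉 Shmueli, Galit |
Oral defense committee: | 林福仁 Lin, Fu-Ren; 李曉慧 Lee, Hsiao-Hui |
Degree: | 碩士 Master |
Department: | College of Technology Management - Institute of Service Science (科技管理學院 - 服務科學研究所) |
Year of publication: | 2019 |
Academic year of graduation: | 107 (ROC calendar) |
Language: | English |
Number of pages: | 57 |
Keywords (Chinese): | 增益圖 (gains chart), 累積增益圖 (cumulative gains chart), 分類排序 (classification ranking), 資料探勘 (data mining) |
Keywords (English): | gains chart, lift chart, cumulative gains, cumulative lift |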
Gains charts and lift charts are two useful data mining performance measures for evaluating ranking problems. Both charts are built by ranking the records by their predicted classification probability, which helps in choosing a threshold for targeting a subset of top-ranked records. Although they are deployed in some application areas and mentioned in textbooks and papers, the terminology and definitions surrounding gains and lift charts are inconsistent, which leads to difficulty or misinterpretation when using them. In this research, we clarify these confusions to address the charts' current underuse in practice.
We motivate the issue by showing, through a literature search, the dominance of other classification evaluation criteria such as accuracy, the ROC (Receiver Operating Characteristic) curve, sensitivity, and specificity. Our survey also shows that the names and definitions of gains charts and lift charts are often mixed up in both publications and data mining software. We organize the disparate terminology, computation approaches, and perspectives on gains and lift charts in a clear, methodical way to clarify their uses and improve reproducibility. We then introduce decile, profit, and non-cumulative charts based on gains and lift values. To integrate this work, we created the gainslift R package, which provides consistent and clear gains and lift charts. Finally, we propose three uses of the charts for comparing the performance of data mining algorithms under different circumstances, and illustrate them with a practical case from the Kaggle platform. Example gains and lift charts produced with our package are also provided for this case.
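The ranking-based computation that the abstract describes can be illustrated with a minimal Python sketch. This is not the gainslift R package and does not reproduce its interface. The function names `cumulative_gains_and_lift` and `binned_lift`, as well as the toy data, are hypothetical. Because the thesis notes that naming varies across sources, the sketch assumes one common convention: cumulative gains is the share of all positive records captured among the top-ranked records, cumulative lift is the positive rate among the top-ranked records divided by the overall positive rate, and the non-cumulative (decile-style) lift applies the same ratio within each rank bin.

```python
from typing import List, Tuple


def _rank_by_score(scores: List[float]) -> List[int]:
    # Indices ordered from highest to lowest predicted probability.
    # sorted() is stable, so tied scores keep their original order.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def cumulative_gains_and_lift(
    y_true: List[int], scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (fraction_targeted, cumulative_gains, cumulative_lift) per rank.

    Assumed convention (hypothetical helper, not the gainslift API):
      gains = positives captured in the top k / all positives
      lift  = positive rate in the top k / overall positive rate
    """
    n = len(y_true)
    if n == 0 or n != len(scores):
        raise ValueError("y_true and scores must be non-empty and equal length")
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("at least one positive record is required")

    rows = []
    captured = 0
    for k, idx in enumerate(_rank_by_score(scores), start=1):
        captured += y_true[idx]
        gains = captured / total_pos
        lift = (captured / k) / (total_pos / n)
        rows.append((k / n, gains, lift))
    return rows


def binned_lift(
    y_true: List[int], scores: List[float], n_bins: int = 10
) -> List[float]:
    """Non-cumulative lift per rank bin (deciles when n_bins is 10)."""
    n = len(y_true)
    if n == 0 or n != len(scores):
        raise ValueError("y_true and scores must be non-empty and equal length")
    if n_bins < 1 or n_bins > n:
        raise ValueError("n_bins must be between 1 and the number of records")
    overall_rate = sum(y_true) / n
    if overall_rate == 0:
        raise ValueError("at least one positive record is required")

    order = _rank_by_score(scores)
    lifts = []
    for b in range(n_bins):
        # Integer arithmetic splits the ranking into near-equal bins.
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        bin_rate = sum(y_true[i] for i in members) / len(members)
        lifts.append(bin_rate / overall_rate)
    return lifts


if __name__ == "__main__":
    # Toy data for illustration only: 10 records, 4 of them positive.
    labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    for frac, gains, lift in cumulative_gains_and_lift(labels, probs):
        print(f"top {frac:.0%}: gains={gains:.2f}, lift={lift:.2f}")
    print(binned_lift(labels, probs, n_bins=5))
```

Under this convention, a lift above 1 at a given targeting fraction means the ranking captures positives faster than random selection would, which is the basis for choosing a threshold on the top-ranked records.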