Graduate student: | 李逢嘉 Li, Feng-Chia |
---|---|
Thesis title: | 特徵選取為基礎之複合分類預測模式-以信用資料為例 Constructing a Compound Classification Model Based on Features Selection: An Empirical Study on Credit Scoring |
Advisor: | 陳飛龍 Chen, Fei-Long |
Oral defense committee: | |
Degree: | 博士 Doctor |
Department: | 工學院 - 工業工程與工程管理學系 College of Engineering - Department of Industrial Engineering and Engineering Management |
Publication year: | 2010 |
Graduation academic year: | 98 (ROC calendar) |
Language: | Chinese |
Pages: | 110 |
Keywords (Chinese): | 特徵選取, 邏輯斯迴歸, 類神經網路, K最鄰近法, 支援向量機 |
Keywords (English): | Features Selection, Logistic Regression, Neural Network, K Nearest Neighborhood, Support Vector Machine |
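Classification is the process of assigning objects to classes according to their characteristics, and it is the most frequently studied problem in data mining. Many methodologies exist for classifying large volumes of data, each suited to different situations and data properties. To safeguard classification quality, most methods recommend selecting features first so that redundant or irrelevant features do not degrade accuracy. Reducing the number of features through diverse selection methods improves accuracy and has given rise to many different classification models; among them, compound data mining classification methods are often used to build effective classifiers.
This study uses data mining classification methodology to construct compound classification prediction models. Five feature selection techniques, Linear Discriminant Analysis (LDA), the Rough Sets Theory approach (RST), Decision Tree (DT), F-score, and Grey Relational Analysis (GRA), are used to screen for the attributes that most affect classification. These are then combined with four classifiers, Neural Network (NN), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Logistic Regression (LR), to form different compound classification models. For each compound model, the average classification accuracy on the test samples is computed and the predicted classes are compared with the actual ones. The study also compares different feature selection methods applied to the same classifier, and the same feature selection method applied to different classifiers. In addition, the nonparametric Wilcoxon signed-rank test is used to examine whether the models differ significantly. The results show that compound models built after feature screening classify significantly better than the original models, and that the relative importance rankings of the features retained by the compound models are similar. Compound models that use F-score for feature selection, combined with the different classifiers, achieve the best average classification accuracy.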
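As an illustration of the feature-screening step described above, the following is a minimal sketch of F-score feature ranking. It assumes the F-score definition commonly attributed to Chen and Lin (2005), a binary good/bad credit label, and numeric features. The function names `f_scores` and `select_top_k` and the toy data are hypothetical and are not taken from the thesis.

```python
from statistics import mean, variance


def f_scores(X, y):
    """Return one F-score per feature column for binary labels (1 or 0).

    Assumed definition: squared distances of the two class means from the
    overall mean, divided by the sum of the two within-class sample variances.
    Each class needs at least two samples so the sample variance is defined.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        col_all = [row[j] for row in X]
        col_pos = [row[j] for row in pos]
        col_neg = [row[j] for row in neg]
        m_all, m_pos, m_neg = mean(col_all), mean(col_pos), mean(col_neg)
        between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
        within = variance(col_pos) + variance(col_neg)  # (n - 1) denominators
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the column indices of the k highest-scoring features, ascending."""
    scores = f_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
y = [0, 0, 0, 1, 1, 1]
print(select_top_k(X, y, 1))  # -> [0]
```

In a compound model of the kind described here, the selected column indices would then be used to reduce the training and test sets before fitting any of the four classifiers.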
Classification is the process of assigning objects to classes according to their attributes, and it is among the most widely discussed topics in data mining. Many classification methodologies exist for handling large data sets, each suited to different situations and data characteristics. Most of them recommend feature selection as a first step, so that redundant or irrelevant features do not reduce classification accuracy. Reducing the feature set in different ways yields a variety of classification models that improve on the accuracy of the original models, and compound data classification methods are commonly used to build effective ones.
This research builds classification prediction models using data mining methodology. Five feature selection approaches extract the important attributes and are combined with four classifiers to optimize the feature space. The average accuracy of each approach is compared across the classifiers, and the nonparametric Wilcoxon signed-rank test is used to check whether the models differ significantly. The experimental results show that the proposed compound models outperform the original methods and that F-score is a promising feature selection approach for data mining.
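The model comparison mentioned above can be sketched as a paired Wilcoxon signed-rank test on per-fold accuracies. This is a minimal illustration only. The exact two-sided p-value is obtained by enumerating all sign assignments, which is an assumption made here for small samples and is not necessarily the procedure used in the thesis. The function names and the accuracy figures are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Return (W, exact two-sided p-value) for paired samples a and b.

    Zero differences are dropped. The p-value enumerates all 2**n sign
    patterns, so this sketch is only practical for small n (roughly n <= 20).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** len(ranks)


# Hypothetical per-fold accuracies (%), not results from the thesis.
baseline = [70, 72, 68, 75, 71, 69, 74, 73, 70, 72]
compound = [71, 74, 71, 79, 76, 75, 81, 81, 79, 82]
w, p = wilcoxon_signed_rank(baseline, compound)
print(w, p)
```

A small p-value here would indicate that the paired accuracy differences are unlikely under the hypothesis of no difference between the two models.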