Graduate student: Li-Fei Chen (陳麗妃)
Thesis title: A Hybrid Data Mining Framework with Rough Set Theory, Support Vector Machine, and Decision Tree and its Case Studies (整合約略集合論、支援向量機與決策樹之資料挖礦架構及其個案研究)
Advisor: Chen-Fu Chien (簡禎富)
Oral defense committee:
Degree: Doctor
Department: College of Engineering - Department of Industrial Engineering and Engineering Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 136
Keywords (Chinese): 資料挖礦, 分類, 約略集合論, 支援向量機, 決策樹, 人力資源管理
Keywords (English): data mining, classification, rough set theory, support vector machine, decision tree, human resource management
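  • For classification and prediction problems, decision makers usually care not only about accuracy but also about whether the model can provide concise, understandable rules from which managerial implications can be drawn. Human resource management is one such case. Among the techniques used for classification prediction in data mining, rough set theory (RST), support vector machine (SVM), and decision tree (DT) have attracted the attention of many researchers. RST and DT can generate rules, whereas SVM cannot. On the other hand, RST is notable for its attribute selection ability, while SVM and DT are valued for their predictive power. This research combines the strengths of the three techniques and develops a four-stage hybrid data mining framework to improve prediction accuracy and the quality of the generated rules. The first stage uses RST to select the important attributes, removing redundant attributes without losing classification information. The second stage uses SVM with cross validation to reduce noise in the samples. The distribution of the new sample is then examined; if class imbalance is observed, a third stage of class adjustment is carried out, using the rules generated by RST to adjust the unevenly distributed classes. After the first three stages, more representative attributes and samples are obtained, and the class distribution is more even. Finally, DT is applied to these samples to build the prediction model and generate the associated rules. In addition, the data handled in human resource management prediction are usually high-dimensional, complex, and highly uncertain, which leaves traditional statistical prediction methods with low statistical power. Using the proposed hybrid method, this research conducts empirical analyses of the selection data for direct and indirect personnel at two high-tech companies in Hsinchu. The results show that the proposed method effectively improves prediction accuracy and the quality of the generated rules, and that it performs better than traditional RST, SVM, and DT.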
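    The first stage, removing redundant attributes without losing classification information, can be illustrated with a small sketch. This is not the procedure or software used in the thesis: it assumes the standard rough-set dependency degree (the share of objects in the positive region) as the measure of classification information, and it uses a simple backward elimination. The function names and the toy personnel table are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, attrs, decision):
    """Fraction of rows in the positive region (blocks with a single decision value)."""
    if not rows:
        return 1.0
    consistent = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            consistent += len(block)
    return consistent / len(rows)


def drop_redundant(rows, attrs, decision):
    """Backward elimination: drop an attribute when the dependency degree is unchanged."""
    full = dependency(rows, attrs, decision)
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if dependency(rows, trial, decision) == full:
            kept = trial
    return kept


# Hypothetical toy decision table (not data from the thesis).
rows = [
    {"degree": "MS", "experience": "high", "shift": "day", "rating": "good"},
    {"degree": "MS", "experience": "low", "shift": "night", "rating": "poor"},
    {"degree": "BS", "experience": "high", "shift": "night", "rating": "good"},
    {"degree": "BS", "experience": "low", "shift": "day", "rating": "poor"},
]
reduced = drop_redundant(rows, ["degree", "experience", "shift"], "rating")
print(reduced)  # ['experience']
```

    In this toy table "experience" alone determines the rating, so the other two attributes are dropped. The result of a greedy elimination depends on the order in which attributes are tried, so it returns one reduced attribute set, not necessarily the smallest one.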


    Support vector machine (SVM), rough set theory (RST), and decision tree (DT) are methodologies applied to various data mining problems, especially classification and prediction tasks. Studies have shown the ability of RST in feature selection, while SVM and DT are noted for their predictive power. This research integrates the advantages of SVM, RST, and DT into a hybrid framework that enhances the quality of both class prediction and rule generation. Beyond building a classification model with acceptable accuracy, the ability to explain and explore how a decision is made, using simple, understandable, and useful rules, is a critical issue for human resource management. DT and RST can generate such rules; SVM cannot. The framework consists of four main stages. The first stage selects the most important attributes: RST is applied to eliminate redundant and irrelevant attributes without losing any classification information. The second stage reduces noisy objects by means of cross validation with SVM. If the resulting data set exhibits a class imbalance problem, the rules generated by RST are used to adjust the class distribution (stage 3). These stages yield a data set with fewer dimensions, a higher degree of purity, and a more balanced class distribution, which DT then uses to generate rules in the last stage. In addition, decisions concerning personnel selection prediction typically involve data with high dimensionality, uncertainty, and complexity, which causes traditional statistical methods to suffer from low statistical power. For validation, real cases of personnel selection for direct and indirect labor at two high-tech companies in Hsinchu, Taiwan are studied using the proposed hybrid data mining framework. Implementation results show that the proposed approach is effective and performs better than traditional SVM, RST, and DT.
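    The second stage, reducing noisy objects through cross validation, can be sketched as follows. The abstract does not spell out the filtering rule, so this sketch assumes that a sample is treated as noise when a model trained on the other folds misclassifies it. To stay dependency-free it uses a nearest-centroid classifier as a stand-in for the SVM; the function names and the toy data are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT an SVM): predict the class with the closest mean vector."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def cross_validation_filter(X, y, fit, k=3):
    """Return indices of samples whose held-out prediction agrees with their label."""
    kept = []
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        kept.extend(i for i in test_idx if predict(X[i]) == y[i])
    return sorted(kept)


# Hypothetical one-dimensional data; the last sample sits in class "a" territory
# but carries label "b", so it plays the role of a noisy object.
X = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1], [0.15]]
y = ["a", "a", "a", "b", "b", "b", "b"]
clean = cross_validation_filter(X, y, nearest_centroid_fit, k=3)
print(clean)  # [0, 1, 2, 3, 4, 5]
```

    Because the classifier is passed in as the `fit` argument, an SVM trainer with the same interface could be substituted without changing the filtering loop. The retained samples would then go on to the class adjustment and decision tree stages.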

    Contents
    摘要 (Abstract in Chinese)
    Abstract
    Acknowledgement
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Research Objectives
      1.4 Organization of Dissertation
    Chapter 2 Literature Review
      2.1 Data Mining
      2.2 Decision Tree
      2.3 Rough Set Theory
        2.3.1 Information System
        2.3.2 Indiscernibility Relation
        2.3.3 Approximation of Sets
        2.3.4 Attribute Reduction and Rules Extraction
      2.4 Support Vector Machine
      2.5 Human Resource Management
    Chapter 3 The Approach
      3.1 Proposed Hybrid Framework
        3.1.1 The Concept
        3.1.2 The Procedure
      3.2 A Numerical Example
        3.2.1 The Data Set
        3.2.2 Implementation
        3.2.3 Comparison
    Chapter 4 Case Study for Direct Labor
      4.1 The Problem
      4.2 Implementation
      4.3 Comparison and Discussion
    Chapter 5 Case Study for Indirect Labor
      5.1 Background and Significance
      5.2 Problem Definition and Objective
      5.3 Prediction of Job Performance
        5.3.1 Implementation for Job Performance Prediction
        5.3.2 Comparison and Discussion for Job Performance Prediction
      5.4 Prediction of Resignation within Three Months
        5.4.1 Implementation for Prediction of Resignation within Three Months
        5.4.2 Comparison and Discussion for Prediction of Resignation within Three Months
      5.5 Prediction of Resignation within One Year
        5.5.1 Implementation for Prediction of Resignation within One Year
        5.5.2 Comparison and Discussion for Prediction of Resignation within One Year
      5.6 Prediction of Turnover Reasons
        5.6.1 Implementation for Prediction of Turnover Reasons
        5.6.2 Comparison and Discussion for Prediction of Turnover Reasons
      5.7 Discussion
    Chapter 6 Performance Comparison
      6.1 Comparison in accuracy
      6.2 Comparison in effective rule number
    Chapter 7 Conclusion
      7.1 Summary
      7.2 Future Research
    References

    List of Tables
    Table 2.1 A comparison of CART, CHAID, ID3, and C4.5
    Table 2.2 Rough set applications
    Table 2.3 An example of decision table
    Table 2.4 Literatures of decision support systems for human resource management
    Table 3.1 The class distribution for TAE example
    Table 3.2 The reduced sample through the cross validation by SVM for TAE example
    Table 3.3 The class distribution in new data set for TAE example
    Table 3.4 The rules validation results of the proposed approach using testing data set for TAE example
    Table 3.5 A comparison of the proposed approach with the other algorithms for TAE example
    Table 4.1 The original class distribution for DL case
    Table 4.2 The reduced sample through the cross validation by SVM for DL case
    Table 4.3 The class distribution in new data set for DL case
    Table 4.4 Validation of the rules and the corresponding instances to adjust the size of class R2
    Table 4.5 Validation of the rules and the corresponding instances to adjust the size of class R1
    Table 4.6 The rules validation results of the proposed approach using testing data set for DL case
    Table 4.7 A comparison of the proposed approach with the other algorithms for DL case
    Table 5.1 The general work descriptions of different job functions
    Table 5.2 The distribution of educational background for different job functions
    Table 5.3 The condition attributes and the corresponding levels for empirical case
    Table 5.4 The class distribution for job performance prediction in IDL case
    Table 5.5 The reduced samples through the cross validation by SVM for job performance prediction in IDL case
    Table 5.6 The class distribution in new data set for job performance prediction in IDL case
    Table 5.7 Validation of the rules and the corresponding instances to adjust the size of class “outstanding” for job performance prediction in IDL case
    Table 5.8 Validation of the rules and the corresponding instances to adjust the size of class “standard” for job performance prediction in IDL case
    Table 5.9 Validation of the rules and the corresponding instances to adjust the size of class “improvement needed” for job performance prediction in IDL case
    Table 5.10 The rules validation results of the proposed approach using testing data set for job performance prediction in IDL case
    Table 5.11 A comparison of proposed approach with the other algorithms for job performance prediction in IDL case
    Table 5.12 The class distribution for prediction of resignation within three months in IDL case
    Table 5.13 The reduced samples through the cross validation by SVM for prediction of resignation within three months in IDL case
    Table 5.14 The class distribution in new data set for resignation within 3 months
    Table 5.15 The rules validation results of the proposed approach using testing data set for prediction of resignation within three months in IDL case
    Table 5.16 A comparison of the proposed approach with the other algorithms for prediction of resignation within three months in IDL case
    Table 5.17 The class distribution for prediction of resignation within one year in IDL case
    Table 5.18 The reduced samples through the cross validation by SVM for prediction of resignation within one year in IDL case
    Table 5.19 The class distribution in new data set for resignation within 3 months in IDL case
    Table 5.20 The rules validation results of the proposed approach using testing data set for prediction of resignation within one year in IDL case
    Table 5.21 A comparison of the proposed approach with the other algorithms for prediction of resignation within one year in IDL case
    Table 5.22 The class distribution for prediction of turnover reasons in IDL case
    Table 5.23 The reduced samples through the cross validation by SVM for prediction of turnover reasons in IDL case
    Table 5.24 The class distribution in new data set for turnover reasons
    Table 5.25 The rules validation results of the proposed approach using testing data set for prediction of turnover reasons in IDL case
    Table 5.26 A comparison of the proposed approach with the other algorithms for prediction of turnover reasons in IDL case
    Table 5.27 Summary of prediction accuracy and the ranking for different algorithm
    Table 5.28 Summary of number of validated rules and the ranking for different algorithms
    Table 5.29 Summary of changing of attribute number by using proposed algorithm
    Table 5.30 Summary of changing of sample size by using proposed algorithm
    Table 6.1 Summary of prediction accuracy for different algorithm in each case
    Table 6.2 Summary of effective rule numbers for different algorithm in each case
    Table 6.3 The ranking for different algorithm

    List of Figures
    Figure 1.1 Conceptual framework
    Figure 2.1 Environmental effect on personnel selection
    Figure 3.1 The concept of proposed hybrid approach
    Figure 3.2 The proposed data mining framework
    Figure 3.3 Procedure of sample reduction using SVM with cross validation
    Figure 3.4 The mechanism to adjust imbalance class by RST
    Figure 3.5 Decision tree constructed by See5 for TAE example
    Figure 4.1 Decision tree constructed by See5 for DL case
    Figure 5.1 Analysis structure for the indirect labor example
    Figure 5.2 Data distribution with respect to decision attributes
    Figure 5.3 Comparison of prediction accuracy for different algorithms
    Figure 5.4 Comparison of the number of validated rules for different algorithms in the four experiments
    Figure 5.5 Comparison of changing of attribute number by using proposed algorithm
    Figure 5.6 Comparison of changing of sample size by using proposed algorithm
    Figure 6.1 ANOVA output for ACCURACY in case studies by SPSS
    Figure 6.2 Duncan’s multiple comparisons output for ACCURACY in case studies by SPSS
    Figure 6.3 Output of paired samples t test for algorithms DT and RST_DT by SPSS
    Figure 6.4 ANOVA output for EFFECTIVE RULE NUMBER in case studies by SPSS


    Full-text availability: not authorized for public release (campus network, off-campus network, or the National Central Library's Taiwan theses and dissertations system)