A Hybrid Data Mining Framework with Rough Set Theory, Support Vector Machine, and Decision Tree and its Case Studies

Graduate student: 陳麗妃 Li-Fei Chen
Thesis title: A Hybrid Data Mining Framework with Rough Set Theory, Support Vector Machine, and Decision Tree and its Case Studies (整合約略集合論、支援向量機與決策樹之資料挖礦架構及其個案研究)
Advisor: 簡禎富 Chen-Fu Chien
Degree: Doctor
Department: College of Engineering, Department of Industrial Engineering and Engineering Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 136
Keywords (Chinese): 資料挖礦、分類、約略集合論、支援向量機、決策樹、人力資源管理
Keywords (English): data mining, classification, rough set theory, support vector machine, decision tree, human resource management

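For classification and prediction problems, decision makers usually care not only about accuracy but also about whether the model yields concise, understandable rules from which managerial implications can be drawn. Human resource management is one such problem. Among data mining techniques for classification, rough set theory (RST), support vector machines (SVM), and decision trees (DT) have been favored by many researchers. RST and DT can generate rules, whereas SVM cannot. On the other hand, RST is notable for its attribute selection ability, while SVM and DT are valued for their predictive power. This study combines the strengths of the three techniques into a four-stage hybrid data mining framework that improves prediction accuracy and the quality of the generated rules. Stage 1 uses RST to select the important attributes, removing redundant attributes without losing classification information. Stage 2 uses SVM with cross validation to reduce noise in the samples. The distribution of the new sample is then examined; if class imbalance is present, stage 3 applies a class adjustment procedure that uses the rules generated by RST to adjust the unevenly distributed classes. After these three stages, the attributes and samples are more representative and the class distribution is more even. Finally, DT is applied to these samples to build the prediction model and generate the associated rules. Moreover, the data handled in human resource management prediction are usually high-dimensional, complex, and highly uncertain, which leaves traditional statistical prediction methods with low statistical power. The proposed approach is applied in empirical studies of personnel selection data for direct and indirect labor at two high-tech companies in Hsinchu. The results show that the proposed method effectively improves prediction accuracy and the quality of the generated rules, and that it outperforms traditional RST, SVM, and DT.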
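
As a rough illustration of the first stage, the sketch below performs a greedy rough-set attribute reduction in plain Python. It drops an attribute whenever the positive region (the objects whose attribute values determine their class unambiguously) keeps the same size. The applicant table, its attribute names, and the greedy removal order are assumptions made for this example; they are not taken from the thesis, which may use a different reduct algorithm.

```python
from collections import defaultdict


def positive_region_size(rows, attrs, decision):
    """Count objects whose indiscernibility class (w.r.t. attrs) has one decision value."""
    decisions = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        decisions[key].add(row[decision])
        counts[key] += 1
    return sum(counts[k] for k, ds in decisions.items() if len(ds) == 1)


def greedy_reduct(rows, attrs, decision):
    """Remove attributes one by one as long as the positive region does not shrink."""
    target = positive_region_size(rows, attrs, decision)
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if positive_region_size(rows, trial, decision) == target:
            kept = trial
    return kept


# Hypothetical decision table (illustrative values only).
rows = [
    {"degree": "BS", "experience": "low", "shift": "day", "hire": "no"},
    {"degree": "BS", "experience": "high", "shift": "day", "hire": "yes"},
    {"degree": "MS", "experience": "low", "shift": "night", "hire": "no"},
    {"degree": "MS", "experience": "high", "shift": "night", "hire": "yes"},
]
attrs = ["degree", "experience", "shift"]
reduct = greedy_reduct(rows, attrs, "hire")
print(reduct)
```

In this toy table the decision depends only on `experience`, so the other two attributes can be removed without losing classification information. A greedy pass finds one reduct, not necessarily the smallest one.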

Support vector machines (SVM), rough set theory (RST), and decision trees (DT) are widely applied to data mining problems, especially classification and prediction tasks. Studies have shown the strength of RST in feature selection, while SVM and DT are noted for their predictive power. This research integrates the advantages of SVM, RST, and DT into a hybrid framework that enhances the quality of both class prediction and rule generation. Beyond building a classification model with acceptable accuracy, the ability to explain how a decision is made through simple, understandable, and useful rules is critical for human resource management. DT and RST can generate such rules; SVM cannot. The framework consists of four main stages. The first stage selects the most important attributes: RST is applied to eliminate redundant and irrelevant attributes without losing any classification information. The second stage reduces noisy objects through cross validation with SVM. If the resulting data set exhibits a class imbalance problem, the rules generated by RST are used to adjust the class distribution (stage 3). These three stages yield a data set with fewer dimensions, higher purity, and a more even class distribution, which DT then uses to generate rules in the final stage. In addition, decisions involving personnel selection prediction typically handle data with high dimensionality, uncertainty, and complexity, which leaves traditional statistical methods with low statistical power. For validation, real personnel selection cases covering direct and indirect labor at two high-tech companies in Hsinchu, Taiwan, are studied using the proposed hybrid data mining framework.
Implementation results show that the proposed approach is effective and outperforms traditional SVM, RST, and DT.
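
The second stage (noise reduction through cross validation) can be sketched as follows. This is a minimal, dependency-free illustration, not the thesis procedure: a nearest-centroid classifier stands in for the SVM, the folds are assigned by index, and the rule "keep a sample only if a model trained on the other folds classifies it correctly" is an assumption about how noisy objects are identified. The data points are invented.

```python
def nearest_centroid_fit(X, y):
    """Stand-in learner (NOT an SVM): compute one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def nearest_centroid_predict(model, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, model[c]))
    return min(model, key=dist2)


def cv_filter(X, y, k=3):
    """Indices of samples that a model trained on the other k-1 folds classifies correctly."""
    kept = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        kept += [i for i in held_out if nearest_centroid_predict(model, X[i]) == y[i]]
    return sorted(kept)


# Two well-separated groups; the last sample carries a deliberately wrong label.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
     [0.1, 0.1]]
y = ["A", "A", "A", "B", "B", "B", "B"]
kept = cv_filter(X, y, k=3)
print(kept)
```

Here the mislabeled sample at index 6 is the only one filtered out. Removing samples this way can change the class proportions, which is why the framework checks for class imbalance before building the decision tree.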

Contents

Abstract (in Chinese)
Abstract
Acknowledgement
Table of Contents
List of Tables
List of Figures

Chapter 1  Introduction
1.1  Background
1.2  Motivation
1.3  Research Objectives
1.4  Organization of Dissertation

Chapter 2  Literature Review
2.1  Data Mining
2.2  Decision Tree
2.3  Rough Set Theory
2.3.1  Information System
2.3.2  Indiscernibility Relation
2.3.3  Approximation of Sets
2.3.4  Attribute Reduction and Rules Extraction
2.4  Support Vector Machine
2.5  Human Resource Management

Chapter 3  The Approach
3.1  Proposed Hybrid Framework
3.1.1  The Concept
3.1.2  The Procedure
3.2  A Numerical Example
3.2.1  The Data Set
3.2.2  Implementation
3.2.3  Comparison

Chapter 4  Case Study for Direct Labor
4.1  The Problem
4.2  Implementation
4.3  Comparison and Discussion

Chapter 5  Case Study for Indirect Labor
5.1  Background and Significance
5.2  Problem Definition and Objective
5.3  Prediction of Job Performance
5.3.1  Implementation for Job Performance Prediction
5.3.2  Comparison and Discussion for Job Performance Prediction
5.4  Prediction of Resignation within Three Months
5.4.1  Implementation for Prediction of Resignation within Three Months
5.4.2  Comparison and Discussion for Prediction of Resignation within Three Months
5.5  Prediction of Resignation within One Year
5.5.1  Implementation for Prediction of Resignation within One Year
5.5.2  Comparison and Discussion for Prediction of Resignation within One Year
5.6  Prediction of Turnover Reasons
5.6.1  Implementation for Prediction of Turnover Reasons
5.6.2  Comparison and Discussion for Prediction of Turnover Reasons
5.7  Discussion

Chapter 6  Performance Comparison
6.1  Comparison in Accuracy
6.2  Comparison in Effective Rule Number

Chapter 7  Conclusion
7.1  Summary
7.2  Future Research

References



List of Tables

Table 2.1  A comparison of CART, CHAID, ID3, and C4.5
Table 2.2  Rough set applications
Table 2.3  An example of decision table
Table 2.4  Literatures of decision support systems for human resource management
Table 3.1  The class distribution for TAE example
Table 3.2  The reduced sample through the cross validation by SVM for TAE example
Table 3.3  The class distribution in new data set for TAE example
Table 3.4  The rules validation results of the proposed approach using testing data set for TAE example
Table 3.5  A comparison of the proposed approach with the other algorithms for TAE example
Table 4.1  The original class distribution for DL case
Table 4.2  The reduced sample through the cross validation by SVM for DL case
Table 4.3  The class distribution in new data set for DL case
Table 4.4  Validation of the rules and the corresponding instances to adjust the size of class R2
Table 4.5  Validation of the rules and the corresponding instances to adjust the size of class R1
Table 4.6  The rules validation results of the proposed approach using testing data set for DL case
Table 4.7  A comparison of the proposed approach with the other algorithms for DL case
Table 5.1  The general work descriptions of different job functions
Table 5.2  The distribution of educational background for different job functions
Table 5.3  The condition attributes and the corresponding levels for empirical case
Table 5.4  The class distribution for job performance prediction in IDL case
Table 5.5  The reduced samples through the cross validation by SVM for job performance prediction in IDL case
Table 5.6  The class distribution in new data set for job performance prediction in IDL case
Table 5.7  Validation of the rules and the corresponding instances to adjust the size of class “outstanding” for job performance prediction in IDL case
Table 5.8  Validation of the rules and the corresponding instances to adjust the size of class “standard” for job performance prediction in IDL case
Table 5.9  Validation of the rules and the corresponding instances to adjust the size of class “improvement needed” for job performance prediction in IDL case
Table 5.10  The rules validation results of the proposed approach using testing data set for job performance prediction in IDL case
Table 5.11  A comparison of proposed approach with the other algorithms for job performance prediction in IDL case
Table 5.12  The class distribution for prediction of resignation within three months in IDL case
Table 5.13  The reduced samples through the cross validation by SVM for prediction of resignation within three months in IDL case
Table 5.14  The class distribution in new data set for resignation within 3 months
Table 5.15  The rules validation results of the proposed approach using testing data set for prediction of resignation within three months in IDL case
Table 5.16  A comparison of the proposed approach with the other algorithms for prediction of resignation within three months in IDL case
Table 5.17  The class distribution for prediction of resignation within one year in IDL case
Table 5.18  The reduced samples through the cross validation by SVM for prediction of resignation within one year in IDL case
Table 5.19  The class distribution in new data set for resignation within 3 months in IDL case
Table 5.20  The rules validation results of the proposed approach using testing data set for prediction of resignation within one year in IDL case
Table 5.21  A comparison of the proposed approach with the other algorithms for prediction of resignation within one year in IDL case
Table 5.22  The class distribution for prediction of turnover reasons in IDL case
Table 5.23  The reduced samples through the cross validation by SVM for prediction of turnover reasons in IDL case
Table 5.24  The class distribution in new data set for turnover reasons
Table 5.25  The rules validation results of the proposed approach using testing data set for prediction of turnover reasons in IDL case
Table 5.26  A comparison of the proposed approach with the other algorithms for prediction of turnover reasons in IDL case
Table 5.27  Summary of prediction accuracy and the ranking for different algorithm
Table 5.28  Summary of number of validated rules and the ranking for different algorithms
Table 5.29  Summary of changing of attribute number by using proposed algorithm
Table 5.30  Summary of changing of sample size by using proposed algorithm
Table 6.1  Summary of prediction accuracy for different algorithm in each case
Table 6.2  Summary of effective rule numbers for different algorithm in each case
Table 6.3  The ranking for different algorithm



List of Figures

Figure 1.1  Conceptual framework
Figure 2.1  Environmental effect on personnel selection
Figure 3.1  The concept of proposed hybrid approach
Figure 3.2  The proposed data mining framework
Figure 3.3  Procedure of sample reduction using SVM with cross validation
Figure 3.4  The mechanism to adjust imbalance class by RST
Figure 3.5  Decision tree constructed by See5 for TAE example
Figure 4.1  Decision tree constructed by See5 for DL case
Figure 5.1  Analysis structure for the indirect labor example
Figure 5.2  Data distribution with respect to decision attributes
Figure 5.3  Comparison of prediction accuracy for different algorithms
Figure 5.4  Comparison of the number of validated rules for different algorithms in the four experiments
Figure 5.5  Comparison of changing of attribute number by using proposed algorithm
Figure 5.6  Comparison of changing of sample size by using proposed algorithm
Figure 6.1  ANOVA output for ACCURACY in case studies by SPSS
Figure 6.2  Duncan’s multiple comparisons output for ACCURACY in case studies by SPSS
Figure 6.3  Output of paired samples t test for algorithms DT and RST_DT by SPSS
Figure 6.4  ANOVA output for EFFECTIVE RULE NUMBER in case studies by SPSS

                                


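Full-text release: not authorized for public access (on-campus network)
Full-text release: not authorized for public access (off-campus network)
Full-text release: not authorized for public access (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)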
