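Applying Data Mining Approaches to an Empirical Study of Physical Examination Dataset (應用資料挖礦方法於健檢資料之實證研究) | National Tsing Hua University Theses and Dissertations Repository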


Author:	Chen, Yi-Wen (陳易妏)
Thesis title:	Applying Data Mining Approaches to an Empirical Study of Physical Examination Dataset (應用資料挖礦方法於健檢資料之實證研究)
Advisor:	Chiu, Ming-Chuan (邱銘傳)
Committee members:	Wang, Chih-Hsuan (王志軒); Hsu, Chia-Yu (許嘉裕)
Degree:	Master
Department:	College of Engineering - Department of Industrial Engineering and Engineering Management
Year of publication:	2017
Academic year of graduation:	105 (ROC calendar)
Language:	English
Number of pages:	65
Keywords (Chinese):	健康檢查、隨機森林演算法、不完整資料、模糊c-medoids分群演算法、部分距離策略
Keywords (English):	Physical Examination, Random Forest Algorithm, Incomplete Dataset, Fuzzy C-medoids Clustering, Partial Distance Strategy

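Health examination is a common service offered by medical institutions. Through a series of physical tests, examinees learn about their own health status, and physicians can give more rigorous and complete advice and treatment, which in turn improves examinees' health. In general, an examination report benefits only the examinee it belongs to. If the results of many examinees are gathered into a single dataset for analysis and comparison, the benefit of health examination is no longer limited to the individual.
This study therefore applies two data mining approaches to uncover useful information hidden in a health examination dataset. The first approach uses the random forest algorithm to identify the risk factors that most strongly influence examinees' health status, then uses statistical tests and visual comparison to show how abnormal risk factors and disease risk differ across groups of examinees. The second approach combines the fuzzy c-medoids algorithm with the partial distance strategy into a new algorithm for clustering incomplete data, and assigns examinees to different risk levels. The empirical results of both approaches can help health examination centers identify potential patients, so that examinees in different groups receive their own health management or treatment plans. This can improve the operating efficiency of medical institutions and avoid unnecessary medical expenses.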

Physical examination (PE) is a common service in medical organizations. Through a series of tests, examinees can learn their health status, and doctors can provide health advice or treatment in a more conscientious and comprehensive way. If the PE results of many examinees are collected and analyzed together to recognize abnormal trends across the group, the benefit of PE is no longer limited to the individual examinee.
Therefore, this study applies two data mining approaches to explore potential information in a PE dataset. In the first approach, the random forest algorithm is applied to identify the influence of risk factors on examinees' health status; the abnormality and morbidity of examinees are then compared based on their workplaces. The second approach separates examinees into groups with different risk levels via incomplete dataset clustering. The experimental results of both approaches can assist medical organizations in identifying potential patients. Health management tailored to different kinds of examinees can also be suggested to raise service efficiency and consequently lower unnecessary medical costs.
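
As a rough illustration of the partial distance strategy listed in the keywords, the Python sketch below computes a distance between two records with missing values by summing squared differences over the features observed in both records and rescaling by the ratio of total to observed features. This is the general form commonly described in the incomplete-data clustering literature, not the thesis's own implementation, which this record does not show. The function name `partial_distance`, the use of NaN to mark missing values, and the sample values are assumptions made for illustration.

```python
import math


def partial_distance(x, y):
    """Squared Euclidean distance over jointly observed features,
    rescaled to the full feature count.

    Missing values are assumed to be encoded as float('nan')
    (an illustrative convention, not taken from the thesis).
    """
    if len(x) != len(y):
        raise ValueError("records must have the same number of features")

    # Keep only the feature pairs where both records have a value.
    observed = [
        (a, b) for a, b in zip(x, y)
        if not (math.isnan(a) or math.isnan(b))
    ]
    if not observed:
        raise ValueError("no feature is observed in both records")

    squared_sum = sum((a - b) ** 2 for a, b in observed)

    # Rescale so records with many missing features are not
    # artificially close to everything else.
    return len(x) / len(observed) * squared_sum


# Hypothetical records: only the first feature is observed in both.
a = [1.0, math.nan, 3.0]
b = [2.0, 5.0, math.nan]
print(partial_distance(a, b))  # 3.0
```

In a medoid-based clustering loop, a function like this could stand in for the ordinary distance when comparing each record with a candidate medoid, so that incomplete records can be clustered without imputation.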

Abstract
Table of Contents
List of Figures
List of Tables
Introduction
Literature Review
    1    Data Mining
    2    Variable Importance
    3    Incomplete Dataset
    4    Clustering
    5    Summary
Methodology
    1    Description of the PE Dataset
    2    Description of Selected Risk Factors and Symptoms
    3    Approach I: Significant Factor Identification and Visual Comparison
    4    Approach II: Cluster Analysis for Risk Level Classification
Experiment Results
    1    Approach I
        1.1    Result
        1.2    Discussion
    2    Approach II
        2.1    Result
        2.2    Discussion
Conclusion
References
Appendix: Mackay IRB Certification

