| Graduate student: | 劉惟昇 Liu, Wei-Sheng |
|---|---|
| Thesis title: | 應用馬氏田口系統與機器學習於肺部結節分類之比較研究 Applying MTS and Machine Learning for the Classification of Lung Nodules: A Comparative Study |
| Advisor: | 蘇朝墩 Su, Chao-Ton |
| Oral defense committee: | 蕭宇翔, 許俊欽 |
| Degree: | Master |
| Department: | College of Engineering - Department of Industrial Engineering and Engineering Management |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Number of pages: | 63 |
| Keywords: | lung nodules, digital health, classification, feature selection, SMOTETomek, MTS, bagging, random forest, XGBoost, LightGBM |
Lung cancer is one of the most common cancers and the leading cause of cancer death worldwide. Because its early symptoms are not obvious, it is difficult to identify and diagnose. Low-dose computed tomography (LDCT) is now commonly used for screening and can accurately detect the location and number of lung nodules, but it is also prone to overdiagnosis, which leaves people overly nervous and anxious about their results. Medical development today emphasizes digital health and smart healthcare. This study applies data analysis to lung medical data, aiming to support early diagnosis, classify lung nodules effectively, and improve the effectiveness of lung cancer screening.
This study applies the Mahalanobis-Taguchi System (MTS) and machine learning models to classify questionnaire data collected before LDCT examination for lung nodules. On the machine learning side, a Python package is used to pre-screen candidate models, and the SMOTETomek algorithm is used to resample the imbalanced data. Five models (MTS, bagging, random forest, XGBoost, and LightGBM) are compared; reduced models are built after feature selection, and the important features are summarized. A comparison of the reduced models shows that, on imbalanced data, MTS is more robust and classifies the minority class better than the other four machine learning models. In addition, the reduced models maintain good performance, showing that the data can still be classified effectively after feature selection. The original 14 attributes are reduced to 6 lung nodule risk factors, which can inform recommendations for people who have these risk factors, reduce overdiagnosis caused by LDCT, and improve the quality of lung cancer diagnosis through data science.
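The abstract does not show how MTS scores a questionnaire record, so the following is only a minimal sketch of the scaled Mahalanobis distance on which MTS is built, not the thesis's implementation. The three features (age, smoking, family history), the toy values, and all function names are hypothetical. The sketch standardizes each record against a "normal" reference group and computes MD = zᵀR⁻¹z / k, where R is the correlation matrix of that group and k is the number of features.

```python
import math

def column_stats(rows):
    """Mean and population standard deviation of each column."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(k)]
    return means, stds

def standardize(row, means, stds):
    """Z-score one record against the reference group's statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]

def correlation_matrix(z_rows):
    """Correlation matrix of already standardized rows."""
    n, k = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / n for j in range(k)]
            for i in range(k)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def fit_mahalanobis_space(normal_rows):
    """Build the reference space from the normal group only."""
    means, stds = column_stats(normal_rows)
    z_rows = [standardize(r, means, stds) for r in normal_rows]
    return means, stds, invert(correlation_matrix(z_rows))

def scaled_md(row, space):
    """Scaled Mahalanobis distance: z^T R^-1 z divided by feature count."""
    means, stds, r_inv = space
    z = standardize(row, means, stds)
    k = len(z)
    return sum(z[i] * r_inv[i][j] * z[j]
               for i in range(k) for j in range(k)) / k

# Hypothetical normal group: [age, smoker (0/1), family history (0/1)].
normal_group = [
    [52.0, 0.0, 1.0],
    [47.0, 1.0, 0.0],
    [60.0, 0.0, 0.0],
    [55.0, 1.0, 1.0],
    [49.0, 0.0, 1.0],
    [58.0, 1.0, 0.0],
]
space = fit_mahalanobis_space(normal_group)
normal_mds = [scaled_md(r, space) for r in normal_group]
suspect_md = scaled_md([85.0, 1.0, 1.0], space)  # hypothetical new record
print(normal_mds, suspect_md)
```

With population standard deviations, the scaled distances of the normal group average to 1, which is why MTS treats values well above 1 as candidates for the abnormal class. The classification threshold and the orthogonal-array feature screening used in a full MTS are not shown here.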