Graduate student: | 陳奕安 Chen, Yi An |
---|---|
Thesis title: | 基於健保資料預測中風之研究並以Hadoop作為一種快速擷取特徵工具 Predicting Stroke based on Health Insurance Records by Using Hadoop as a Fast Feature Extraction Tool |
Advisor: | 李祈均 Lee, Chi Chun |
Oral defense committee: | 藍祚鴻 Lan, Tsuo Hung; 林敬恒 Lin, Ching Heng; 曹昱 Tsao, Yu |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2016 |
Academic year of graduation: | 104 (ROC calendar) |
Language: | Chinese |
Number of pages: | 104 |
Keywords (Chinese): | Hadoop, GBDT, 全民健康保險研究資料庫, 醫療衛生資料分析 |
Keywords (English): | Hadoop, GBDT, National Health Insurance Research Database, analysis of healthcare data |
Faster computers, advances in storage technology, and the development of communications technology have greatly increased the amount of data available, and this has driven the rise of big data research. That rise has in turn advanced the field of data mining, making it possible to extract useful information from large data sets. Applied to medicine, big data research could improve care, save lives, and lower costs. As data volumes keep growing, however, processing them sequentially on a single machine takes a great deal of time, and the delay creates further problems of its own. A distributed parallel computing framework that lets many machines process the data together can reduce computing time sharply.

Many previous studies have reported that taking SSRIs (selective serotonin reuptake inhibitors), which are commonly prescribed for depression and related mental health conditions, increases the risk of stroke. Building on these findings, this thesis analyzes healthcare data from the National Health Insurance Research Database and builds a machine learning model to predict stroke in people who have taken SSRI-related drugs, using the distributed computing tool Hadoop to speed up data processing. Compared with our lab's previous method on the same data set, preprocessing is approximately 35 times faster and feature extraction is approximately 585 times faster. The processed data are analyzed with GBDT as the classifier, and the much higher processing speed makes it possible to extract more features.

Examining the 20 most important features shows that our model reveals more risk factors than our lab's previous method, a result that may be of clinical value.
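The feature extraction step described in the abstract follows the map, sort, and reduce pattern that Hadoop runs across machines. The following is only a minimal single-machine sketch of that pattern, not the thesis's implementation: the record layout `(patient_id, diagnosis_code)`, the sample data, and the function names `map_record`, `reduce_counts`, and `run_job` are all hypothetical and do not reflect the actual National Health Insurance Research Database schema.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical claim records: (patient_id, diagnosis_code).
# This is an illustrative layout, not the real database schema.
RECORDS = [
    ("p1", "401"), ("p1", "250"), ("p1", "401"),
    ("p2", "250"), ("p2", "346"),
]


def map_record(record):
    """Map step: emit ((patient_id, code), 1) for one claim record."""
    patient_id, code = record
    yield (patient_id, code), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one (patient_id, code) key."""
    return key, sum(values)


def run_job(records):
    """Simulate map -> sort by key -> reduce on a single machine."""
    mapped = [pair for record in records for pair in map_record(record)]
    # A MapReduce framework sorts and groups by key between the two steps.
    mapped.sort(key=itemgetter(0))
    features = defaultdict(dict)
    for key, group in groupby(mapped, key=itemgetter(0)):
        (patient_id, code), count = reduce_counts(key, (v for _, v in group))
        features[patient_id][code] = count
    return dict(features)


if __name__ == "__main__":
    for patient_id, counts in sorted(run_job(RECORDS).items()):
        print(patient_id, counts)
```

Because each key is reduced independently, a framework such as Hadoop can spread the keys over many machines, which is the property the reported speedup relies on. The resulting per-patient count vectors are the kind of input a classifier such as GBDT could be trained on.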