Graduate student: | 張庭瑋 Chang, Ting-Wei |
---|---|
Thesis title: | A Vector Model for Relatedness Computation of UMLS CUI |
Advisor: | 林華君 Lin, Hwa-Chun |
Oral defense committee: | 陳俊良, 蔡榮宗 |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2019 |
Academic year of graduation: | 107 |
Language: | Chinese |
Number of pages: | 49 |
Keywords (Chinese, translated): | biomedical informatics, Unified Medical Language System, biomedical concept, word vector, relatedness |
Keywords (English): | bioinformatics, unified medical language system, biomedical concept, word embedding, relatedness |
The relatedness between UMLS (Unified Medical Language System) CUIs (Concept Unique Identifiers) can be used in many biomedical NLP (Natural Language Processing) tasks. Published methods for computing CUI relatedness fall into two types: path-based approaches and corpus-driven approaches. Both types share a common drawback: neither can compute the relatedness of every possible pair of UMLS CUIs. Human-written biomedical text (for example, medical records and biomedical literature) usually contains many biomedical concepts represented by UMLS CUIs, and if the relatedness between all of the CUIs in such text cannot be computed, subsequent NLP tasks may be adversely affected. To solve this problem, this thesis presents a CUI vector model that contains a vector for every non-obsolete CUI in UMLS. We use three datasets to evaluate the model's performance in relatedness computation; the most reliable one (MiniMayoSRS) includes relatedness judgements made by physicians and by biomedical coders. The Spearman correlation between the relatedness computed by our best model and the physicians' judgements is 0.759, and it is 0.842 for the coders' judgements. Finally, we compare our models with vector models proposed by other research groups, using correlation and a new evaluation method of our own. The results show that our best model not only achieves high CUI coverage but also performs well.
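The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that relatedness between two CUI vectors is scored with cosine similarity, which the abstract does not state, and it uses made-up two-dimensional vectors, placeholder identifiers (`CUI_A` and so on), and invented ratings rather than real UMLS data or MiniMayoSRS judgements. The function names are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, rated_pairs):
    """Return (Spearman correlation, pair coverage) for a vector model.

    Pairs with a CUI that has no vector are skipped and lower the coverage.
    """
    scored = [
        (cosine(vectors[a], vectors[b]), rating)
        for a, b, rating in rated_pairs
        if a in vectors and b in vectors
    ]
    coverage = len(scored) / len(rated_pairs)
    model_scores = [score for score, _ in scored]
    human_ratings = [rating for _, rating in scored]
    return spearman(model_scores, human_ratings), coverage


# Toy data (assumption): placeholder identifiers, not real UMLS concepts.
toy_vectors = {
    "CUI_A": [1.0, 0.0],
    "CUI_B": [0.9, 0.1],
    "CUI_C": [0.0, 1.0],
    "CUI_D": [0.5, 0.5],
}
# (cui_1, cui_2, invented human relatedness rating)
toy_pairs = [
    ("CUI_A", "CUI_B", 4.0),
    ("CUI_A", "CUI_D", 2.5),
    ("CUI_A", "CUI_C", 1.0),
    ("CUI_B", "CUI_X", 3.0),  # CUI_X has no vector, so this pair is not covered
]

rho, coverage = evaluate(toy_vectors, toy_pairs)
print(f"Spearman correlation: {rho:.3f}, pair coverage: {coverage:.2f}")
```

In this sketch a pair is dropped whenever either CUI lacks a vector, which is why a model covering every non-obsolete CUI can be scored on all pairs of a dataset while a model with lower coverage can only be scored on a subset.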