Graduate Student: 梁芷蘋 (Liang, Fanny C.P.)
Thesis Title: 機器學習應用於碳吸收技術智慧財產權分析之圖形摘要系統 (IP Analytics and Machine Learning Applied to Create Graph Summarization for Carbon Absorption Utility Patents)
Advisor: 張瑞芬 (Trappey, Amy J. C.)
Committee Members: 樊晉源 (Fan, Chin-Yuan); 張力元 (Trappey, Charles V.)
Degree: Master
Department: College of Engineering — Department of Industrial Engineering and Engineering Management
Publication Year: 2022
Academic Year: 110
Language: Chinese
Pages: 68
Chinese Keywords: 自動摘要 (automatic summarization), 文本資料視覺化 (text data visualization), 自然語言處理 (natural language processing)
English Keywords: Graph Visualization, Natural Language Processing, Automatic Summarization
In materials and chemical engineering research, industry competition makes accelerating research progress — to enter the market early and secure commercial positioning — a primary goal of research centers. To explore a research field, researchers must read large volumes of patent literature to understand the market, so effectively reducing the time spent reading and analyzing patent documents is an important problem. To shorten patent reading time and improve the efficiency of text classification, this study proposes a patent-text knowledge-graph construction system that presents the knowledge in patent documents to users in a structured form. Carbon absorption patents from the United States Patent and Trademark Office (USPTO) are used as the case study. The framework comprises two subsystems: a knowledge-graph visualization of experimental procedures, and a graph visualization of text summaries. For the first subsystem, ALBERT (A Lite BERT) is trained as a text classifier, mainly to extract content related to experimental procedures, and the Python tool ChemDataExtractor extracts the chemical data associated with those procedures for visualization. For the second subsystem, SBERT (Sentence-BERT) first vectorizes the sentences in the text, the LexRank algorithm then extracts the most important sentences to form the summary, KeyBERT extracts keywords from the summary to serve as the graph's nodes, and the relationships among keywords within sentences are computed to form the graph's edges.
Finally, Cytoscape is used to construct the visual graph, presenting the overall knowledge of the document as a network. This visual text-summary graph assists users in reading and revising, so that they can grasp the overview of a document more quickly.
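The summarization step above — embed sentences, build a cosine-similarity graph, rank sentences by centrality — can be sketched as follows. This is a minimal illustration, not the thesis implementation: it uses toy bag-of-words vectors where the pipeline would use SBERT (sentence-transformers) embeddings, and the degree-centrality variant of LexRank rather than the full power-iteration version.

```python
import numpy as np

def embed_bow(sentences):
    """Toy bag-of-words embeddings; in the described pipeline, SBERT
    would produce the sentence vectors instead."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            vecs[i, index[w]] += 1.0
    return vecs

def lexrank_summary(sentences, embeddings, top_k=2, threshold=0.1):
    """Degree-centrality LexRank: score each sentence by how many other
    sentences it is similar to, then keep the top_k in original order."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)               # ignore self-similarity
    degree = (sim > threshold).sum(axis=1)   # degree centrality per sentence
    ranked = np.argsort(-degree)[:top_k]
    return [sentences[i] for i in sorted(ranked)]

sentences = [
    "The sorbent absorbs carbon dioxide from flue gas.",
    "Carbon dioxide is absorbed by the amine sorbent at low temperature.",
    "The solution is heated to release the carbon dioxide.",
    "Heating the amine solution releases the absorbed carbon dioxide.",
    "The reactor vessel is painted blue.",
]
summary = lexrank_summary(sentences, embed_bow(sentences), top_k=2)
```

Swapping `embed_bow` for `SentenceTransformer.encode` turns this into the SBERT + LexRank combination the abstract describes; the centrality ranking itself is unchanged.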
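The node/edge construction can be sketched the same way: keywords become nodes, and two keywords that co-occur in the same summary sentence are joined by a weighted edge. In this sketch the keyword list is given directly (standing in for KeyBERT output), and the edges are serialized in SIF, a simple interaction format that Cytoscape imports directly.

```python
from itertools import combinations

def keyword_graph(sentences, keywords):
    """Keywords become nodes; two keywords co-occurring in the same
    sentence get an edge whose weight counts the co-occurrences."""
    edges = {}
    for s in sentences:
        text = s.lower()
        present = sorted({k for k in keywords if k in text})
        for a, b in combinations(present, 2):
            edges[(a, b)] = edges.get((a, b), 0) + 1
    return sorted(keywords), edges

def to_sif(edges):
    """Serialize edges as tab-delimited SIF lines for Cytoscape import."""
    return "\n".join(f"{a}\tco_occurs\t{b}" for (a, b) in sorted(edges))

sentences = [
    "the amine sorbent absorbs carbon dioxide",
    "carbon dioxide is released when the sorbent is heated",
]
keywords = ["amine", "sorbent", "carbon dioxide"]  # stand-in for KeyBERT output
nodes, edges = keyword_graph(sentences, keywords)
```

Edge weights could be carried into Cytoscape as an edge attribute table to drive line thickness, mirroring the visual summary graphs the thesis produces.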