| Graduate Student | 張簡宇傑 (Chang Chien, Jack Y. C.) |
|---|---|
| Thesis Title | 基於文字探勘之智慧工程文件摘要系統 (Engineering Document Summarization System Using Text Mining Methods) |
| Advisor | 張瑞芬 (Trappey, Amy J. C.) |
| Committee Members | 吳政隆 (Wu, Jheng-Long); 樊晉源 (Fan, Chin-Yuan) |
| Degree | Master |
| Department | Department of Industrial Engineering and Engineering Management, College of Engineering |
| Year of Publication | 2020 |
| Graduation Academic Year | 108 (2019–2020) |
| Language | Chinese |
| Pages | 74 |
| Keywords (Chinese) | 關鍵字提取、詞嵌入、自動化摘要、分群 |
| Keywords (English) | Key term extraction, Word embedding, Automatic summarization, Clustering |
An engineering request for quotation (RFQ) is a document frequently used in highly customized industries. In large power transformer manufacturing, for example, when a customer (such as a power plant or a large industrial facility) starts a transformer procurement process, it first issues an RFQ and invites capable transformer manufacturers to propose designs, estimate costs, and submit quotations. RFQs are long and complex, and their requirements on design specifications, manufacturing technology, and standards are stringent. A manufacturer that intends to bid must digest all of the important information in the RFQ within a short time, miss none of the specification requirements, and quickly estimate the cost of every procurement item in order to produce the best possible quotation, a task that is time-consuming and demands senior engineering expertise. Taking RFQs as the case, this research develops an automatic summarization process dedicated to engineering documents. Before summaries can be generated, a large collection of historical RFQs is assembled (1,331 RFQs for 69 kV–230 kV transformers), and key terms are extracted and ranked by importance using the TF-IDF and N-gram algorithms. In addition, three training datasets, namely the 1,331 RFQs, 1.2 million Wikipedia articles, and 1,000 transformer technical papers, are combined in different ways to train the unsupervised word-embedding (Word2vec) algorithm and to identify the better model, so that the terms in this domain's documents are accurately represented by their corresponding vectors.

The purpose of key term extraction is to automatically pre-filter the important sentences of an RFQ, keeping those that contain key terms. The filtered sentences are then converted into vectors by the Word2vec model, TextRank ranks the importance of similar sentences, and the highly ranked sentences covering the various key terms are automatically assembled into a concise, high-quality summary. Effectiveness is evaluated by the compression ratio and retention ratio of the summaries. Forty RFQs are used to test the Word2vec models trained on the different training sets and to evaluate the summaries they generate. This research also develops a summary table that is filled in automatically according to specifications; the original document and the generated summary are each entered into the table, and their compression and retention ratios are compared to identify the best Word2vec model. Furthermore, because a transformer has many specifications and each specification carries different requirements, the sentences in the 1,331 RFQs that describe the same specification parameters (voltage, impedance, capacity, etc.) are clustered with K-means and their key terms are extracted, so that customers' common, similar specification requirements can be consolidated and managed; 40 new RFQs are used for validation. In this way, when a new RFQ is read, the chance of missing specification requirements is reduced, and the accuracy of product design, cost evaluation, and quotation is increased.
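The pipeline above names its building blocks (TF-IDF and N-gram key term extraction, Word2vec training on combined corpora) but not an implementation. The snippet below is a minimal sketch of those first two steps, assuming scikit-learn and gensim (4.x parameter names); the function names, parameters, and preprocessing are illustrative assumptions rather than the code used in the thesis.

```python
# Sketch of key-term extraction and embedding training (assumed libraries:
# scikit-learn, gensim 4.x; all parameters are illustrative, not the thesis settings).
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_key_terms(rfq_texts, top_k=50):
    """Rank unigram/bigram candidates from the RFQ corpus by aggregate TF-IDF weight."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    tfidf = vectorizer.fit_transform(rfq_texts)              # documents x terms
    scores = np.asarray(tfidf.sum(axis=0)).ravel()           # total weight per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: -pair[1])
    return [term for term, _ in ranked[:top_k]]

def train_word2vec(tokenized_sentences):
    """Train a Word2vec model on a chosen combination of the three corpora
    (RFQ documents, Wikipedia articles, transformer papers)."""
    return Word2Vec(sentences=tokenized_sentences, vector_size=300,
                    window=5, min_count=5, sg=1, workers=4)
```

Different combinations of the three corpora would be passed to `train_word2vec` as tokenized sentences, and the resulting models compared on the 40 test RFQs as described above.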
A request for quotation (RFQ) is an engineering document commonly used in highly customized industries such as large power transformer manufacturing. RFQs are lengthy and complicated, and it is hard to obtain their key information in a short time. This research takes the RFQ as a case to develop a novel summarization approach. First, 1,331 historical RFQ cases were collected for two purposes: to acquire key terms with TF-IDF and N-gram, which are used to filter the content before summarization, and to train the Word2vec model. When a new RFQ is received, its content is decomposed into sentences and these sentences are filtered by the key terms. The trained Word2vec model then vectorizes the filtered sentences, and TextRank, an extractive summarization technique, scores the importance of each sentence; the sentences with higher importance are selected as the summary. To test the effect, 40 new RFQ cases and Word2vec models trained on different corpora were evaluated with the compression and retention ratios, using an auto-fill table that classifies sentences according to specification key terms: the original RFQ and the generated summary are each inserted into the table, and the two results are compared. Because a transformer has different kinds of specifications and each specification has various requirements, the sentences containing the same specification key terms were collected from the 1,331 RFQ cases, vectorized with the Word2vec model, and clustered by K-means for each specification; key terms were then extracted for each cluster. Another 40 RFQ cases were used, and the sentences under the same specifications were extracted and assigned to the clusters, so that the requirements of each specification can be obtained. With this approach, engineers reading an RFQ document can check whether every requirement is met, which increases the accuracy of product design, cost evaluation, and quotation.
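The summarization step itself (key-term filtering, sentence vectorization with the trained Word2vec model, and TextRank ranking) can likewise be sketched in a few lines. The version below is a minimal, assumed implementation: sentence vectors are taken as the mean of word vectors, and TextRank is approximated by PageRank (via networkx) over a cosine-similarity graph; none of these choices or thresholds come from the thesis.

```python
# Sketch of key-term filtering, sentence embedding, and TextRank-style ranking
# (assumed libraries: numpy, networkx; all details are illustrative assumptions).
import numpy as np
import networkx as nx

def sentence_vector(tokens, w2v):
    """Average the Word2vec vectors of the in-vocabulary tokens of one sentence."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def summarize(sentences, key_terms, w2v, top_n=10):
    # 1. Keep only sentences that mention at least one extracted key term.
    kept = [s for s in sentences if any(k in s.lower() for k in key_terms)]
    vectors = [sentence_vector(s.lower().split(), w2v) for s in kept]

    # 2. Build a graph whose edge weights are cosine similarities between sentences.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(kept)))
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            denom = np.linalg.norm(vectors[i]) * np.linalg.norm(vectors[j])
            sim = float(vectors[i] @ vectors[j]) / denom if denom else 0.0
            if sim > 0:
                graph.add_edge(i, j, weight=sim)

    # 3. PageRank over the similarity graph approximates TextRank sentence importance.
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [kept[i] for i in sorted(top)]      # keep the original sentence order
```

The specification-clustering step could reuse the same sentence vectors, for example with scikit-learn's KMeans, before extracting key terms per cluster; that step is omitted from the sketch.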