
Graduate Student: Wu, Kuan-Wei (吳冠緯)
Thesis Title: LeCAR: Leveraging Context for Enhanced Automotive Specification Retrieval (利用上下文增強車輛規格相關文件之檢索)
Advisor: Chen, Yi-shin (陳宜欣)
Committee Members: Peng, Wen-chih (彭文志); Hung, Chih-chieh (洪智傑)
Degree: Master
Department: Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: English
Number of Pages: 62
Keywords (Chinese): 資訊檢索、稠密檢索、知識圖譜、特定領域
Keywords (English): Information retrieval, Dense retrieval, Knowledge graph, Specific domain
Abstract (translated from Chinese): In the automotive manufacturing field, specification documents are complex files containing many data formats; they describe every detail of a product, design, or service and serve as the technical standard for vehicle manufacturing. To manufacture products according to these specifications, a company must first extract the information required for manufacturing from the documents and integrate it before carrying out in-house or outsourced design. Beyond the lack of usable training data in the automotive manufacturing domain, this task faces three main challenges: first, automotive specification documents contain unstructured textual data in diverse formats; second, user-input query sentences are usually short and ambiguous; and finally, our analysis shows that not all tokens in a query are equally important. To address these challenges, we propose the LeCAR framework, which processes unstructured text, disambiguates query sentences, and narrows the search focus. Experimental results show that, without requiring additional training data, our method effectively improves recall compared with mainstream approaches that rely solely on pre-trained language models.


Abstract (English): In the automotive manufacturing field, a specification is a complex document that outlines all the details of a product, design, or service and serves as a technical standard. To manufacture products based on these specifications, companies must first extract the essential information from the documents. Beyond the limited data available in the automotive manufacturing domain, this work faces three primary challenges: first, automotive specifications comprise unstructured textual data in diverse formats; second, user-input query sentences are usually short and ambiguous; and lastly, our analysis reveals that not all tokens within a query hold equal weight. To address these challenges, we propose LeCAR, a method that processes the unstructured data, disambiguates query sentences, and narrows the search focus. Experimental results demonstrate that our approach overcomes the limitations of state-of-the-art approaches that rely solely on pre-trained language models, without requiring extra training data.
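The final step sketched in both abstracts, scoring candidate passages by cosine similarity while weighting query tokens unequally, can be made concrete with a minimal sketch. Everything below is an assumption for illustration: the pooling scheme, the weight values, and the function names are hypothetical, and random vectors stand in for the token and passage embeddings that a pre-trained bi-encoder would produce; this is not the thesis's actual implementation.

```python
# Minimal sketch of weighted-token dense retrieval (illustrative only;
# not the thesis's actual LeCAR implementation).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_query_vector(token_vecs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Pool token embeddings into one query vector, emphasizing
    high-weight tokens ("not all tokens hold equal weight")."""
    w = weights / weights.sum()            # normalize weights to sum to 1
    return (w[:, None] * token_vecs).sum(axis=0)

def rank_passages(query_vec: np.ndarray, passage_vecs: np.ndarray) -> np.ndarray:
    """Return passage indices sorted by descending cosine similarity."""
    scores = np.array([cosine(query_vec, p) for p in passage_vecs])
    return np.argsort(-scores)

# Toy usage: 3 query tokens and 4 candidate passages in a 5-dim space.
rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(3, 5))       # stand-in token embeddings
weights = np.array([0.2, 0.7, 0.1])        # e.g., a domain term weighted up
passage_vecs = rng.normal(size=(4, 5))     # stand-in passage embeddings
print(rank_passages(weighted_query_vector(token_vecs, weights), passage_vecs))
```

In a real system the weights would come from the query-disambiguation step (for example, terms matched against a knowledge graph) and the embeddings from a bi-encoder; the ranking function itself would be unchanged.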

Table of Contents:
* 1 Introduction
* 2 Related Work
  * 2.1 Term-based Retrieval
  * 2.2 Dense Retrieval
    * 2.2.1 Dense Vector Representations
    * 2.2.2 Bi-encoder
    * 2.2.3 Pretrained Language Model
  * 2.3 Domain adaptation
  * 2.4 Relation Extraction
* 3 Methodology
  * 3.1 Problem Definition
  * 3.2 Overview
  * 3.3 Define minimum context unit
    * 3.3.1 Define minimum context unit - Text
    * 3.3.2 Define minimum context unit - Table
  * 3.4 Disambiguate the query
    * 3.4.1 Construction of knowledge graphs
    * 3.4.2 Clarifying context-related terms selection
  * 3.5 Narrow down search focus
    * 3.5.1 Cosine similarity calculation
    * 3.5.2 Implement prioritized search with weighted tokens
* 4 Experiments and Discussion
  * 4.1 Experiment setup
    * 4.1.1 Dataset Description
    * 4.1.2 Graph Information
  * 4.2 Baseline
  * 4.3 Result and discussion
    * 4.3.1 Experiment result
    * 4.3.2 Retrieved passages discussions
    * 4.3.3 Analysis of different strategies in methodology
  * 4.4 Result of Ranking Score
  * 4.5 Ablation Study
  * 4.6 Analysis
  * 4.7 Parameter Experiment
* 5 Conclusions and Future Work
* 6 Acknowledgement
* References

