Automatic Acquisition of Domain Specific Regular Expressions from Patent Documents

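Graduate student:	李吉峰 Chi-Feng Lee
Thesis title:	從專利文件自動產生擷取領域相關之正規表示法 Automatic Acquisition of Domain Specific Regular Expressions from Patent Documents
Advisor:	蘇豐文 Von-Wun Soo
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication:	2006
Academic year of graduation:	94 (ROC calendar)
Language:	English
Number of pages:	61
Keywords (Chinese, translated):	patent, regular expression
Keywords (English):	patent, string edit distance, regular expression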

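A patent specification is a legally effective document that records information about technical developments and how they are carried out. Analyzing and using these patent documents not only promotes industrial progress but also shows the current level of development and the key technologies of a technology industry, giving timely access to the latest market information. However, the specialized usage and writing style of patent documents make analysis extremely difficult, and there are not enough automated tools to assist with it, so most analysis still relies on manual processing, which is time-consuming and labor-intensive. Meanwhile, as technology industries advance, the volume of patent documents is growing at a remarkable rate. Continuing to analyze them one by one by hand will inevitably stall industry and research or waste manpower and resources, so only automated analysis tools can resolve the current difficulties in patent document analysis.
This thesis proposes a method that automatically learns, from patent documents, structural descriptions for extracting the claims, and presents the result graphically to users to ease and speed up their analysis of patent documents and reduce wasted time and labor. The research has two parts: element extraction and triple relation extraction. For element extraction, we use a statistical approach, counting multi-word term frequencies and applying part-of-speech rules, to decide what constitutes a patent element. For triple relation extraction, we provide a web-based annotation method to speed up the collection of training data, convert the training data into corresponding regular expression rules, and generalize those rules appropriately so that they can be used to extract the claim structure.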
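The element extraction step described above (multi-word term frequency combined with part-of-speech rules) might be sketched roughly as follows. This is only an illustration, not the thesis's implementation: the tag pattern `ELEMENT_TAGS`, the function name `candidate_elements`, the thresholds, and the hand-tagged sample are all assumptions, and the input is taken to be already POS-tagged with Penn Treebank style tags.

```python
from collections import Counter
import re

# Assumed POS rule (not from the thesis): optional adjectives/nouns
# followed by a head noun, using Penn Treebank style tags.
ELEMENT_TAGS = re.compile(r"(?:(?:JJ|NN|NNS) )*(?:NN|NNS)")


def candidate_elements(tagged, max_n=3, min_count=2):
    """Count 1..max_n word n-grams whose tag sequence fits ELEMENT_TAGS.

    `tagged` is a list of (word, tag) pairs. Returns a dict mapping each
    lower-cased phrase seen at least `min_count` times to its count.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = " ".join(tag for _, tag in window)
            if ELEMENT_TAGS.fullmatch(tags):
                phrase = " ".join(word.lower() for word, _ in window)
                counts[phrase] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


# Hypothetical hand-tagged fragment, for illustration only.
sample = [
    ("a", "DT"), ("polishing", "NN"), ("pad", "NN"), ("and", "CC"),
    ("a", "DT"), ("polishing", "NN"), ("pad", "NN"), ("holder", "NN"),
]
elements = candidate_elements(sample)
```

In this sketch a phrase is kept only if it both matches the tag rule and recurs often enough, so a one-off sequence such as "pad holder" is filtered out by the frequency threshold even though its tags fit the rule.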

A patent specification is a legally binding document that records technical information and the methods for carrying it out. Effective analysis and use of patent documents not only stimulates industrial development and scientific progress, but also reveals the current technical level and key technologies of an industry and provides up-to-date market information. However, because of the specialized terminology and writing formats of patent documents, they are hard for human readers to understand within a limited time. In addition, the number of patent documents is growing rapidly. If analysis continues to rely on manual effort, it will be very costly in time and labor.
In this thesis, we present an approach that automatically extracts the claim structure from patent documents and visualizes the results, so that readers can understand claims more easily and efficiently. The research has two major goals: element extraction and triple extraction. For element extraction, we use a statistical method that counts the frequency of word n-grams and combines it with part-of-speech patterns to identify elements. For triple extraction, we develop a wrapper environment in which users annotate training data and the corresponding patterns are generated automatically with minimal user effort. We also provide an induction algorithm, an adaptation of string edit distance, that generalizes the training results into abstract extraction patterns, which are then used to extract the claim structure.
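To make the induction idea concrete, the sketch below aligns two tokenized training patterns with plain token-level Levenshtein edit distance and turns the alignment into a single regular expression. It is a simplified stand-in, not the thesis's adapted algorithm: the `ELEMENT` placeholder, the function names `align` and `induce`, and the generalization rules (a substitution becomes an alternation, an insertion or deletion becomes an optional token) are assumptions made for illustration.

```python
import re

SEP = r"(?:\s+|$)"  # token separator: whitespace, or end of string


def align(a, b):
    """Token-level Levenshtein alignment of lists a and b.

    Returns (distance, pairs); each pair is (token_a, token_b), with None
    marking a gap (an insertion or deletion).
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    # Trace back from the bottom-right corner to recover one alignment.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return d[m][n], pairs


def induce(a, b):
    """Generalize two token sequences into one regex string."""
    _, pairs = align(a, b)
    parts = []
    for x, y in pairs:
        if x == y:                      # same token: keep it literally
            parts.append(re.escape(x) + SEP)
        elif x is None or y is None:    # gap: make the token optional
            parts.append("(?:" + re.escape(x or y) + SEP + ")?")
        else:                           # substitution: allow either token
            parts.append("(?:" + re.escape(x) + "|" + re.escape(y) + ")" + SEP)
    return "".join(parts)


# Hypothetical training patterns with elements already abstracted.
p1 = "ELEMENT is connected to ELEMENT".split()
p2 = "ELEMENT is electrically coupled to ELEMENT".split()
distance, _ = align(p1, p2)
pattern = re.compile(induce(p1, p2))
```

The induced pattern accepts both training sentences and also unseen mixtures such as "ELEMENT is electrically connected to ELEMENT", which is the kind of generalization the abstract attributes to the induction step. How aggressively to generalize is a design choice that this sketch does not attempt to settle.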

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1.    Introduction
1.1.    Preface
1.2.    Motivation and Objective
1.3.    Research Restrictions
1.4.    Organization of this thesis
Chapter 2.    Related Work
2.1.    Patent Document Overview
2.1.1.    Content in a Patent Document
2.1.2.    Claim
2.1.3.    The Writing Format of the Claim
2.2.    Information Extraction
Chapter 3.    Problem Analysis
3.1.    Regular Expressions
3.2.    Parser
3.3.    Regular Expressions vs a Parser
Chapter 4.    System Architecture
4.1.    Claim Structure Category
4.2.    Pre-Processing Phase
4.2.1.    Information Classification
4.2.2.    Sentence Normalization & Annotation
4.2.3.    Element Extraction
4.3.    Induction Learning Phase
4.3.1.    Wrapper
4.3.2.    Induction Algorithm
4.4.    Extraction Result
Chapter 5.    Evaluation
5.1.    Experiment-1 (Element Extraction)
5.1.1.    Experiment Description and Objective
5.1.2.    Experiment Design
5.1.3.    Experiment Result and Discussion
5.2.    Experiment-2 (Triple)
5.2.1.    Experiment Description and Objective
5.2.2.    Experiment Design
5.2.3.    Experiment Result and Discussion
Chapter 6.    Conclusions and Future Work
References
Appendix A.    U.S. Patent Manual
Appendix B.    Regular Expression Language
Appendix C.    Stanford Part-Of-Speech Tagger
Appendix D.    CMP Patent Number List

                                


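Full-text release date: the full text is not authorized for public release (on-campus network)
Full-text release date: the full text is not authorized for public release (off-campus network)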
