
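Graduate student: 王輝龍 (Huei-Long Wang)
Thesis title: 網際網路上表格資訊擷取代理人
Internet Tabular Information Extraction and Personal Navigating Agent
Advisor: 石維寬 (Wei-Kuan Shih)
Oral defense committee:
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2002
Academic year of graduation: 90 (ROC calendar)
Language: Chinese
Number of pages: 117
Keywords (Chinese): 表格, 資訊擷取, 代理人, 網際網路
Keywords (English): Table, Information Extraction, Agent, Internet, Web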
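  • Besides reading with the naked eye, information extraction (IE) technology can automatically extract the information we need from electronic documents on the network and organize it. An IE system relies on rules that specify the keywords, the semantic features, and the possible orderings of a piece of information in order to find and match a conforming span of text. IE systems generally build these rules through machine learning. However, even when the same kind of information is extracted, rules learned from the documents of one web site usually cannot be used on other web sites. Even on the same web site, the rules become useless once the site's editing style is changed. The techniques we developed for the Internet tabular information extraction agent overcome this dependence on writing style. Through knowledge construction and format analysis, these techniques can be reused across different web sites.
    We designed the information extraction agent to browse related web pages while looking for information, as a person would. Guided by a navigating map, it starts by extracting information from a single web page. When it finds a hyperlink, it first determines whether the target page is a related page, and only then follows the link and continues extracting. In addition to the pattern matching rules commonly used by IE systems, the agent performs HTML structure analysis in order to integrate and link related information. Except for structure analysis, which applies only to HTML, all of these techniques use knowledge built for a single domain and must work on all kinds of web sites.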

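    While extracting information from web pages, we found that extracting information from tables raises problems that general IE techniques cannot solve. In some tables, the information in certain cells is shared by several case frames; two-dimensional tables are the typical example. In this thesis we therefore also develop a dedicated tabular information extraction technique. Like the agent, it uses matching knowledge built for a single domain and, combined with table layout recognition, applies to tables of different formats on different web sites.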

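    We further merge the semantic rules and the navigating map used by the agent into a knowledge map. The knowledge map is the most effective tool for extracting tabular information. It provides a rigorous concept model that describes the structure of objects in a knowledge domain, which helps remove ambiguity, helps link the information in related cells, and defines the possible case frames. For each kind of information in a domain, the knowledge map also records the semantic rules used to recognize it. In a knowledge map, we can define the details of an item of information, and the details of those details. This lets us define the possible arrangements of the details when the information is expressed, and it also helps us extract the detail parts of an item.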

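    The other tool for extracting tabular information is a set of rules that capture the possible arrangements of information in a table so that they can be pattern matched. For each of these table layout recognition rules, we also define a corresponding transformation rule that converts the table layout into a database table format. These rules are independent of the domain of the content and therefore apply to all domains. They are defined with a layout description syntax that we designed. Among these rules we can find the one that best fits a table's layout, which also helps remove ambiguities that can only be resolved through layout.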


    Information extraction (IE) systems make online text more accessible. Some IE applications need a separate set of rules tuned to the domain and to the writing style. In particular, rules created without careful learning are too restrictive to be reused on other web sites. Even on the same web site, rules cannot be reused when the writing style changes slightly. The Personal Navigating Agent (PNA) and Tabular Information Extraction (TIE) systems overcome these writing-style restrictions. Both systems can extract information for one domain from different web sites with just one set of rules or knowledge.
    PNA is designed to follow the guidance of domain knowledge and to search for information by navigating related web pages as a person would. It extracts information from one web page. When it recognizes a hyperlink, it can navigate to the page that the hyperlink points to and extract information again, after recognizing that related information can be found on that page. PNA uses the pattern matching technique common to most IE systems to extract information from a single text input. It also uses advanced techniques such as HTML structure analysis to associate extracted data.
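
The navigation behaviour described above can be illustrated with a minimal sketch. It is not the PNA implementation: the in-memory `PAGES` dictionary, the two regular expressions, and the `navigate` function are all hypothetical, and the relatedness check is reduced to matching the anchor text of a hyperlink, which is only one assumed way of deciding whether a linked page is related.

```python
import re

# Hypothetical in-memory "web": URL -> (page text, [(anchor text, target URL), ...]).
PAGES = {
    "/home": ("Welcome to Example Cinema.",
              [("Showtimes", "/times"), ("About us", "/about")]),
    "/times": ("Movie: Spirited Away Time: 19:30", [("Home", "/home")]),
    "/about": ("We opened in 1990.", []),
}

# Hypothetical domain knowledge: which anchors lead to related pages,
# and which text pattern carries the information we want.
LINK_PATTERN = re.compile(r"showtimes|schedule", re.IGNORECASE)
INFO_PATTERN = re.compile(r"Movie: (?P<title>.+?) Time: (?P<time>\d{1,2}:\d{2})")


def navigate(start):
    """Extract from the start page, then follow only hyperlinks judged related."""
    found, seen, queue = [], set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        text, links = PAGES[url]
        # Pattern matching on the text of the current page.
        for match in INFO_PATTERN.finditer(text):
            found.append((url, match.group("title"), match.group("time")))
        # Decide, per hyperlink, whether the target is worth visiting.
        for anchor, target in links:
            if LINK_PATTERN.search(anchor) and target in PAGES:
                queue.append(target)
    return found


print(navigate("/home"))
```

In this toy site the "About us" page is never visited, because its anchor text does not match the assumed link pattern.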

    Problems also arise when common techniques are used to extract information from tables. Some IE systems cannot extract slots that are shared by several case frames, or cannot handle text that cannot be divided into one segment per case frame. Two-dimensional tables are such a case, and there are other kinds of tables as well. TIE is designed to extract information specifically from tables. Like PNA, TIE is a domain-dependent and web-site-independent system. Moreover, the semantic templates and navigating maps of PNA are merged in TIE into knowledge maps, which are highly structured and reusable. TIE also includes a generic framework to identify and extract information for database queries.

    The knowledge map is TIE's main tool for extracting information from tabular documents in each domain. A knowledge map is a concept model that describes the structure of objects in a domain. For tables, it describes the attributes a case frame may have. For attributes, it also describes the rules for recognizing their labels and entry values. It allows an attribute to define sub-attributes, so attributes can be defined hierarchically. Based on the knowledge map, we can apply information tagging to all cells in a table.
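
As a rough illustration of the idea, the sketch below models a knowledge map as a small tree of attributes, each with a label rule and an optional value rule, and tags a cell with every attribute whose rule it matches. The `Attribute` class, the train timetable domain, and the regular expressions are assumptions made for this example; the thesis's actual knowledge representation is not given in this record.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    label_rule: str                    # regex that recognizes a label (header) cell
    value_rule: Optional[str] = None   # regex that recognizes an entry value cell
    sub_attributes: List["Attribute"] = field(default_factory=list)


def flatten(attr, prefix=""):
    """Yield (dotted path, attribute) for an attribute and all its descendants."""
    path = prefix + attr.name
    yield path, attr
    for sub in attr.sub_attributes:
        yield from flatten(sub, path + ".")


def tag_cell(text, knowledge_map):
    """Return every (kind, attribute path) tag whose rule matches the cell text."""
    text = text.strip()
    tags = []
    for root in knowledge_map:
        for path, attr in flatten(root):
            if re.fullmatch(attr.label_rule, text, re.IGNORECASE):
                tags.append(("label", path))
            if attr.value_rule and re.fullmatch(attr.value_rule, text, re.IGNORECASE):
                tags.append(("value", path))
    return tags


# Hypothetical knowledge map for a train timetable domain.
KNOWLEDGE_MAP = [
    Attribute("train", r"train( no\.?)?", r"\d{3,4}"),
    Attribute("departure", r"depart(ure)?", sub_attributes=[
        Attribute("station", r"from|station", r"taipei|hsinchu"),
        Attribute("time", r"time", r"\d{1,2}:\d{2}"),
    ]),
]

for cell in ["Train No.", "1021", "Time", "08:15", "Taipei"]:
    print(cell, tag_cell(cell, KNOWLEDGE_MAP))
```

Returning a list of tags rather than a single tag leaves room for cells that match more than one rule, so that a later step can choose among the candidates.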

    TIE's second tool is a set of rules that recognize table layouts after information tagging. These rules also define a semantics-preserving transition that outputs a new table in the form of a relational database table. The rules are domain independent and can be applied to all tables. They are defined with a layout description syntax that can describe tables in a concept model. Based on these rules, we can find the best-fit layout and remove ambiguity. After the layout is recognized, we can transform the input table into a relational database table. Extracting case frames from such a layout is as simple as extracting records from a relational database table.
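
The following sketch shows the general shape of layout recognition followed by a transformation into relational records. It is an assumption-laden stand-in: the thesis's layout description syntax is not reproduced in this record, so the two hard-coded layout checks, the `L`/`V` tag grid, and the fare table are hypothetical. It does show why a two-dimensional table needs special handling: each row label and each column label is shared by several case frames.

```python
def recognize_layout(kinds):
    """kinds: grid of per-cell tags, 'L' for label and 'V' for value.

    The top-left corner cell is ignored for the two-dimensional check.
    """
    top_labels = all(k == "L" for k in kinds[0][1:])
    left_labels = all(row[0] == "L" for row in kinds[1:])
    body_values = all(k == "V" for row in kinds[1:] for k in row[1:])
    if top_labels and left_labels and body_values:
        return "two-dimensional"
    if (top_labels and kinds[0][0] == "L"
            and all(k == "V" for row in kinds[1:] for k in row)):
        return "one-dimensional"
    return "unknown"


def to_records(cells, layout, row_attr="row", col_attr="column", value_attr="value"):
    """Rewrite a recognized table as a list of relational records (dicts)."""
    if layout == "one-dimensional":
        header = cells[0]
        return [dict(zip(header, row)) for row in cells[1:]]
    if layout == "two-dimensional":
        # Every body cell becomes one record that repeats its row and column labels.
        return [
            {row_attr: row[0], col_attr: col, value_attr: value}
            for row in cells[1:]
            for col, value in zip(cells[0][1:], row[1:])
        ]
    raise ValueError("unrecognized layout: " + layout)


# Hypothetical fare table and its per-cell tags ('-' marks the empty corner).
cells = [["", "Adult", "Child"],
         ["Taipei", "100", "50"],
         ["Hsinchu", "180", "90"]]
kinds = [["-", "L", "L"],
         ["L", "V", "V"],
         ["L", "V", "V"]]

layout = recognize_layout(kinds)
records = to_records(cells, layout, "destination", "fare_type", "fare")
for record in records:
    print(record)
```

Once the table is in this form, each record corresponds to one case frame and can be queried like a row in a relational table.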

    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Information extraction
      1.3 Writing styles
      1.4 Semi-structured text information extraction
      1.5 Information in tables
      1.6 Information across Web pages
      1.7 Information closure
      1.8 Information appearance
      1.9 Extracting rules comparison
    Chapter 2 Personal Navigating Agent
      2.1 Introduction
      2.2 The tagging system
        2.2.1 Page type tags
        2.2.2 Hyperlink tags
        2.2.3 Information tags
      2.3 Semantic templates
      2.4 Navigation tour of PNA
    Chapter 3 Tabular information extraction
      3.1 Knowledge and information
      3.2 Knowledge levels
      3.3 Table classification
        3.3.1 One-dimensional tables
        3.3.2 Two-dimensional tables
        3.3.3 List tables
        3.3.4 Complex tables
          3.3.4.1 Over-expanded cells
          3.3.4.2 Partition label
          3.3.4.3 Hierarchical labels
          3.3.4.4 Combination of multiple tables
          3.3.4.5 Hierarchical Partitions
          3.3.4.6 Multiple items in a cell
          3.3.4.7 Vertical writing
          3.3.4.8 Forward reduction
          3.3.4.9 Notes
      3.4 Semantic database table
      3.5 Approach of TIE1
        3.5.1 Modeling
        3.5.2 Basic abstract model
        3.5.3 Extension
        3.5.4 Knowledge representation base
        3.5.5 Information extraction
        3.5.6 Category identification
        3.5.7 Reading path construction
        3.5.8 Record collection
      3.6 Approach of TIE2
        3.6.1 Reformat a HTML table to a virtual table
        3.6.2 Convert a tagged table from a virtual table
        3.6.3 Remove tagging ambiguity
        3.6.4 Restructure to a database table
      3.7 Techniques
        3.7.1 Knowledge Base
        3.7.2 Tagging
        3.7.3 Tag Selection
        3.7.4 Layout Transformation
        3.7.5 Semantics Preserving Transition
      3.8 Illustrative example
    Chapter 4 Experiment Results
      4.1 Results of TIE1 evaluation
      4.2 Results of TIE2 evaluation
      4.3 Results of PNA evaluation
    Chapter 5 Conclusion and future works
    Bibliography

