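Graduate student: | 王輝龍 Huei-Long Wang |
---|---|
Thesis title: | 網際網路上表格資訊擷取代理人 Internet Tabular Information Extraction and Personal Navigating Agent |
Advisor: | 石維寬 Wei-Kuan Shih |
Oral defense committee: | |
Degree: | 博士 Doctor |
Department: | 電機資訊學院 - 資訊工程學系 College of Electrical Engineering and Computer Science - Department of Computer Science |
Year of publication: | 2002 |
Academic year of graduation: | 90 (ROC calendar) |
Language: | Chinese |
Number of pages: | 117 |
Keywords (Chinese): | 表格、資訊擷取、代理人、網際網路 |
Keywords (English): | Table, Information Extraction, Agent, Internet, Web |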
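Beyond reading by eye, information extraction (IE) technology can automatically extract the information we need from online electronic documents and organize it. An IE system relies on rules that specify features of the target information, such as keywords and semantics, together with their possible orderings, in order to find and match a conforming span of text. Typical IE systems build these rules through machine learning. However, even when the same kind of information is being extracted, rules learned from documents on one web site usually cannot be used on other web sites. Even on the same web site, the rules become useless once the site's editing style changes. The techniques we developed for the Internet tabular information extraction agent overcome these writing-style limitations. By building knowledge and analyzing layout, the techniques can be reused across different web sites.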
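We designed the information extraction agent to browse related web pages while searching for information, as a person would. Guided by a navigating map, it starts by extracting information from a single web page. When it finds a hyperlink, it first determines whether the linked page is a related page, and only then follows the link and extracts information there. In addition to the pattern matching rules commonly used by IE systems, the agent performs HTML structure analysis to integrate and associate related information. Apart from structure analysis, which applies only to HTML, all other techniques use knowledge built for a single domain and must work across all kinds of web sites.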
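While extracting information from web pages, we found that extracting from tables raises problems that general IE techniques cannot solve. In some tables, the information in certain cells is shared by several case frames; two-dimensional tables are the typical example. In this thesis we therefore also develop a dedicated tabular information extraction technique. Like the agent, it uses matching knowledge built for a single domain and, combined with table layout recognition, applies to tables of different layouts on different web sites.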
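We further merge the semantic rules and the navigating map used by the agent into a knowledge map. The knowledge map is the main tool for extracting tabular information. It is a rigorous concept model that describes the structure of objects in a knowledge domain, which helps resolve ambiguous information, associate related cells, and define the possible case frames. For each kind of information in a domain, the knowledge map also records the semantic rules used to recognize it. Within a knowledge map, an item of information can define its details, and those details can define details of their own. This lets us specify the possible orderings of an item's details when it is expressed, and it also lets us extract individual details of an item.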
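The second tool for tabular extraction is a set of rules that encode the possible arrangements of information in a table, so that layouts can be recognized by pattern matching. For each layout recognition rule we also define a corresponding transformation rule that converts the table into relational database table form. These rules are independent of the domain of the content and therefore apply to all domains. They are written in a layout description syntax that we designed. Among these rules we can find the one that best fits a given table's layout, which helps remove ambiguities that can only be resolved through layout.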
Information extraction (IE) systems make online text more accessible. Some IE applications need a separate set of rules tuned to each domain and writing style. In particular, rules created without careful learning are too restrictive to be reused on other web sites. Even on the same web site, rules cannot be reused when the writing style changes slightly. The Personal Navigating Agent (PNA) and Tabular Information Extraction (TIE) systems overcome these writing-style restrictions. Both systems adapt to extract information for one domain from different web sites with just one set of rules or knowledge.
PNA is designed to follow the guidance of domain knowledge and to search for information by navigating related web pages as a person would. It first extracts information from one web page. When it recognizes a hyperlink, it determines whether related information can be found on the linked page; if so, it navigates to that page and extracts information again. PNA uses the pattern matching technique common to most IE systems to extract information from a single text input. It also uses advanced techniques such as HTML structure analysis to associate extracted data.
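The navigation behaviour described above can be illustrated with a minimal sketch. It is not PNA's implementation: the in-memory `PAGES` dictionary, the anchor-text relevance test `RELATED_LINK`, and the `RECORD` pattern are all hypothetical stand-ins for a real web, a navigating map, and domain extraction rules.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML text. A real agent would fetch pages.
PAGES = {
    "/index": '<a href="/courses">Course schedule</a> <a href="/news">Campus news</a>',
    "/courses": 'Course: Algorithms, Room: 101 <a href="/courses/2">Next page</a>',
    "/courses/2": "Course: Databases, Room: 202",
    "/news": "Sports day is on Friday.",
}

# Assumed domain knowledge: anchor-text cues that mark a link as related,
# and a single pattern-matching rule for the information to extract.
RELATED_LINK = re.compile(r"course|schedule|next page", re.IGNORECASE)
RECORD = re.compile(r"Course: (?P<course>\w+), Room: (?P<room>\d+)")
LINK = re.compile(r'<a href="(?P<url>[^"]+)">(?P<text>[^<]*)</a>')


def navigate(start):
    """Extract records from the start page and from every related linked page."""
    records, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")
        # Step 1: extract information from the current page.
        records += [m.groupdict() for m in RECORD.finditer(html)]
        # Step 2: follow a hyperlink only after judging it to be related.
        for link in LINK.finditer(html):
            target = link.group("url")
            if target not in seen and RELATED_LINK.search(link.group("text")):
                seen.add(target)
                queue.append(target)
    return records
```

Calling `navigate("/index")` visits the two course pages and skips the news page, because the news link's anchor text matches none of the assumed cues.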
Problems also arise when common techniques are used to extract information from tables. Some IE systems cannot extract slots shared by several case frames, and some cannot handle text that cannot be divided into segments of case frames. Two-dimensional tables are one such case, and there are other kinds of tables as well. TIE is designed to extract information specifically from tables. Like PNA, TIE is a domain-dependent and web-site-independent system. Moreover, the semantic templates and navigating maps of PNA are merged in TIE into knowledge maps, which are highly structured and reusable. TIE also includes a generic framework to identify and extract information for database queries.
The knowledge map is the main strength of TIE for extracting information from tabular documents in each domain. A knowledge map is a concept model that describes the structure of objects in a domain. For tables, it describes the attributes a case frame may have. For attributes, it describes the rules for recognizing their labels and entry values. It allows an attribute to define sub-attributes, so attributes can be defined hierarchically. Based on the knowledge map, we can apply information tagging to all cells in a table.
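As a rough illustration of the idea, a knowledge map can be modelled as a tree of attributes, each carrying a rule for its label and a rule for its entry values. The abstract does not give the actual rule format, so the `Attribute` class, the regular-expression rules, and the course-schedule domain below are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    label_rule: str = ""   # assumed: regex that recognizes a label cell
    value_rule: str = ""   # assumed: regex that recognizes an entry value
    sub_attributes: list = field(default_factory=list)

    def walk(self):
        """Yield this attribute and all of its sub-attributes, depth first."""
        yield self
        for sub in self.sub_attributes:
            yield from sub.walk()


# Hypothetical knowledge map for a course-schedule domain.
COURSE = Attribute("course", sub_attributes=[
    Attribute("title", label_rule=r"^(course|title)$", value_rule=r"^[A-Za-z ]+$"),
    Attribute("time", sub_attributes=[
        Attribute("day", label_rule=r"^day$", value_rule=r"^(Mon|Tue|Wed|Thu|Fri)$"),
        Attribute("hour", label_rule=r"^(hour|time)$", value_rule=r"^\d{1,2}:\d{2}$"),
    ]),
])


def tag_cell(text, root):
    """Return every (attribute name, role) pair whose rule matches the cell text."""
    tags = []
    for attr in root.walk():
        if attr.label_rule and re.search(attr.label_rule, text, re.IGNORECASE):
            tags.append((attr.name, "label"))
        if attr.value_rule and re.search(attr.value_rule, text, re.IGNORECASE):
            tags.append((attr.name, "value"))
    return tags
```

With these toy rules, `tag_cell("9:00", COURSE)` yields a single tag, while `tag_cell("Day", COURSE)` yields two candidates (a `title` value and a `day` label). Tagging alone can leave a cell ambiguous in this way, which is why a later layout step is needed to choose between candidates.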
The second strength of TIE is a set of rules that recognize table layouts after information tagging. These rules also define a semantics-preserving transformation that outputs a new table in the form of a relational database table. The rules are domain independent and can be applied to all tables. They are defined in a layout description syntax that can describe tables in a concept model. Based on these rules, we can find the best-fit layout and remove ambiguity. After the layout is recognized, we can transform the input table into a relational database table. Extracting case frames from such a layout is then as simple as extracting records from a relational database table.
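The following sketch shows, under simplifying assumptions, how a recognized layout can drive the conversion of a tagged table into relational records. The thesis's layout description syntax is not reproduced in this record, so the two hard-coded rules, the function names, and the tag grid (one attribute name per cell, `"label"` for header labels, `None` for empty cells) are hypothetical.

```python
def recognize_layout(tags):
    """Return the name of the first toy layout rule that fits the tag grid."""
    body_rows = tags[1:]
    # Rule 1: a header row of labels above columns that each hold one attribute.
    if set(tags[0]) == {"label"} and all(len(set(col)) == 1 for col in zip(*body_rows)):
        return "column_headed"
    # Rule 2: top row, left column and body each hold one distinct attribute,
    # so each header cell is shared by several case frames.
    top = set(tags[0][1:])
    left = {row[0] for row in body_rows}
    body = {tag for row in body_rows for tag in row[1:]}
    regions = top | left | body
    if (len(top) == len(left) == len(body) == 1
            and len(regions) == 3
            and not regions & {None, "label"}):
        return "two_dimensional"
    return "unknown"


def to_records(cells, tags, layout):
    """Apply the transformation paired with the layout: one dict per case frame."""
    if layout == "column_headed":
        names = tags[1]  # attribute held by each column
        return [dict(zip(names, row)) for row in cells[1:]]
    if layout == "two_dimensional":
        top_attr, left_attr, body_attr = tags[0][1], tags[1][0], tags[1][1]
        return [
            {left_attr: row[0], top_attr: head, body_attr: value}
            for row in cells[1:]
            for head, value in zip(cells[0][1:], row[1:])
        ]
    raise ValueError(f"no transformation rule for layout {layout!r}")


# A two-dimensional timetable: each day and hour is shared by several frames.
cells = [["", "Mon", "Tue"],
         ["9:00", "Algorithms", "Databases"],
         ["10:00", "Networks", "Compilers"]]
tags = [[None, "day", "day"],
        ["hour", "title", "title"],
        ["hour", "title", "title"]]
records = to_records(cells, tags, recognize_layout(tags))
```

Each body cell becomes one record that copies its shared row and column headers, so the four records in `records` can be queried like rows of a relational table.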