Graduate Student: | 江政韓 Cheng-Han Chiang |
---|---|
Thesis Title: | 非督導式細分類之網路具名實體自動發掘系統 Unsupervised Discovery of Named Entities with Fine-Grained Category on the Web |
Advisor: | 張俊盛 Jason S. Chang |
Committee Members: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of Publication: | 2005 |
Academic Year of Graduation: | 93 (ROC calendar) |
Language: | English |
Pages: | 54 |
Chinese Keywords: | 具名實體、相關字詞、表面型態、知識探勘、資訊擷取 |
English Keywords: | Named Entity, Related Words, Surface Pattern, Knowledge Discovery, Information Extraction |
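This thesis proposes a method that uses Web resources to automatically discover named entities of a specific type. Starting from a small number of named entities supplied by the user, we retrieve related information from the Web to serve as seeds. From these seeds we train a language model of the characteristics shared by entities of this type, together with surface patterns that can extract such entities, and then use the trained results to find named entities of the same type on the Web.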
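During training, we first use the Web to collect the required seed data automatically. Drawing on linguistic and statistical knowledge, we then derive from the seed data a table of key terms with their weights, and extract the words that occur near entities of this type and clearly characterize them. Finally, we compute association statistics over all of these words to obtain the extraction patterns best suited to this type of entity. At runtime, we use the two most frequent nouns in the probabilistic model, together with the year, to retrieve related data from the Web. The probabilistic model then keeps only the passages likely to concern entities of this type, and the learned surface patterns are applied to extract new named entities.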
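The key-term weighting and passage-filtering steps can be illustrated with a minimal Python sketch. The abstract does not give the actual weighting formula, so the smoothed log-ratio of seed to background document frequency used here is an assumption, as are the function names `learn_key_terms`, `score`, and `filter_passages`, the use of a background passage collection, and the threshold value.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn_key_terms(seed_passages, background_passages):
    """Build a key-term table from passages that mention seed NEs.

    Assumption: a term's weight is the log-ratio of its smoothed document
    frequency in seed passages to that in background passages. The thesis
    may use a different statistic.
    """
    seed_df = Counter(t for p in seed_passages for t in set(tokenize(p)))
    back_df = Counter(t for p in background_passages for t in set(tokenize(p)))
    n_seed, n_back = len(seed_passages), len(background_passages)
    weights = {}
    for term, df in seed_df.items():
        p_seed = (df + 1) / (n_seed + 2)             # add-one smoothing
        p_back = (back_df[term] + 1) / (n_back + 2)
        weight = math.log(p_seed / p_back)
        if weight > 0:                               # keep terms typical of the seeds
            weights[term] = weight
    return weights


def score(passage, weights):
    """Sum the weights of the distinct key terms found in a passage."""
    return sum(weights.get(t, 0.0) for t in set(tokenize(passage)))


def filter_passages(passages, weights, threshold):
    """Keep passages whose key-term score reaches the (hypothetical) threshold."""
    return [p for p in passages if score(p, weights) >= threshold]
```

In such a sketch, passages retrieved at runtime would pass through `filter_passages` before any surface pattern is applied, so that patterns only fire on text that already resembles the seed category.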
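We implemented the proposed idea as a prototype system and compared its experimental results with those of Google Sets. The experiments show that our method is on par with Google Sets in precision and far surpasses it in coverage. The thesis thus offers a simple and effective method for discovering named entities of the same type.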
We introduce a method for finding named entities (NEs) of the same category as a given set of seed named entities on the Web. In our approach, passages containing the given seed NEs are retrieved from the Web and subsequently used to construct a linguistic model aimed at discovering additional NEs of the same category from the Web.
The method involves generating a table of key terms with word classes from Web page summaries containing the seed NEs, and learning surface patterns containing the seed NEs from these passages. At runtime, we use salient key terms and word classes in the model to retrieve new Web page summaries, filter out unlikely passages, and extract new NEs from the remaining passages using the surface patterns.
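As a rough illustration of the surface-pattern step, the following Python sketch learns word-context patterns around seed NEs and applies them to new passages. It is a simplification under stated assumptions, not the thesis implementation: the fixed context window, the frequency cutoff `min_count`, the capitalized-word heuristic for the NE slot, and the names `learn_patterns`, `pattern_to_regex`, and `extract` are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a candidate NE is a run of capitalized words.
NE_REGEX = r"((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"


def learn_patterns(passages, seeds, window=2, min_count=2):
    """Collect (left context, right context) word windows around seed NEs.

    A context pair is kept as a surface pattern when it occurs at least
    `min_count` times across the seed passages.
    """
    counts = Counter()
    for passage in passages:
        for seed in seeds:
            for m in re.finditer(re.escape(seed), passage):
                left = tuple(passage[:m.start()].split()[-window:])
                right = tuple(passage[m.end():].split()[:window])
                if left or right:          # skip seeds with no context at all
                    counts[(left, right)] += 1
    return [pattern for pattern, c in counts.items() if c >= min_count]


def pattern_to_regex(left, right):
    """Turn a context pair into a regex with a capturing NE slot in the middle."""
    parts = [re.escape(w) for w in left] + [NE_REGEX] + [re.escape(w) for w in right]
    return re.compile(r"(?<!\w)" + r"\s+".join(parts) + r"(?!\w)")


def extract(passage, patterns):
    """Apply every learned pattern and return the captured NEs, without duplicates."""
    found = []
    for left, right in patterns:
        for m in pattern_to_regex(left, right).finditer(passage):
            if m.group(1) not in found:
                found.append(m.group(1))
    return found
```

With the seeds "Titanic" and "Gladiator" appearing in passages of the form "the film X was directed by ...", this sketch would learn the context pair ("the", "film") / ("was", "directed") and could then pull a new title out of a passage with the same wording.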
We present a prototype system, Name Finder, which applies the proposed method to discover additional NEs for a given set of seed NEs. We evaluate Name Finder and compare it with Google Sets. The experimental results show that our system produces more NEs, with an average precision rate comparable to that of Google Sets. Our methodology cleanly supports automatic knowledge discovery and ontology extension.