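Graduate student: | 楊捷扉 Jie-Fei Yang |
---|---|
Thesis title: | 人物搜尋之資訊擷取與分類 Information Extraction and Classification for Person Search |
Advisor: | 張俊盛 Jason S. Chang |
Oral defense committee: | |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications |
Year of publication: | 2006 |
Academic year of graduation: | 94 |
Language: | English |
Number of pages: | 50 |
Chinese keywords: | person name retrieval (人名檢索), information extraction (資訊擷取), document classification (文件分類) |
English keywords: | person search, information extraction, text categorization |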
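This thesis proposes a Web-based method for automatically collecting the career information and professional domain associated with a Chinese personal name. Extracting a person's career information and classifying his or her professional domain effectively addresses the problem of personal name ambiguity (Personal Name Disambiguation). Domain classification also allows personal information to be presented to users in a systematic and consistent way.
In the training phase, we use linguistic knowledge and statistical techniques to collect surface patterns of career information from the Web. These patterns serve as the basis for gathering information about a name from the Web and for extracting personal information. We also apply the bootstrapping method of Yarowsky (1995) to train a text classifier from Web resources. At runtime, career information for the input name is collected with the help of the surface patterns, and the career information and domain classification are then used to separate the information belonging to different people who share the same name.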
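As a rough illustration of the bootstrapping idea mentioned above, the following Python sketch grows per-domain cue words from a few seeds over unlabeled passages. It is a simplified stand-in, not the thesis implementation: Yarowsky (1995) ranks rules in a decision list by log-likelihood, whereas this sketch uses plain co-occurrence counts. The function names, the thresholds, and the toy English passages and seeds are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def classify(tokens, rules, min_margin=1):
    """Return the domain whose cue words best match tokens, or None if unsure."""
    scores = {d: sum(1 for t in tokens if t in words) for d, words in rules.items()}
    ranked = sorted(scores.values(), reverse=True)
    best_domain = max(scores, key=scores.get)
    runner_up = ranked[1] if len(ranked) > 1 else 0
    # Only accept a label when the best domain clearly beats the runner-up.
    if scores[best_domain] - runner_up >= min_margin:
        return best_domain
    return None

def bootstrap(docs, seeds, rounds=3, min_support=2):
    """Grow cue-word sets from seeds (simplified, count-based bootstrapping)."""
    rules = {d: set(words) for d, words in seeds.items()}
    for _ in range(rounds):
        # Step 1: label the passages the current rules are confident about.
        labels = [classify(doc, rules) for doc in docs]
        # Step 2: count, per domain, how many labeled passages contain each word.
        counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            if label is not None:
                counts[label].update(set(doc))
        # Step 3: promote words seen often in one domain and never in another.
        grew = False
        for domain, counter in counts.items():
            for word, n in counter.items():
                elsewhere = sum(c[word] for d, c in counts.items() if d != domain)
                claimed = any(word in rules[d] for d in rules if d != domain)
                if n >= min_support and elsewhere == 0 and not claimed \
                        and word not in rules[domain]:
                    rules[domain].add(word)
                    grew = True
        if not grew:
            break  # no new cue words: the rule set has converged
    return rules

# Toy data (hypothetical): tokenized passages and one seed word per domain.
docs = [
    ["professor", "university", "research"],
    ["professor", "lecture", "research"],
    ["lecture", "research", "seminar"],
    ["singer", "album", "concert"],
    ["singer", "concert", "tour"],
    ["concert", "tour", "stage"],
]
seeds = {"academia": {"professor"}, "entertainment": {"singer"}}
rules = bootstrap(docs, seeds)
```

Passages that no rule covers stay unlabeled in a given round, so each round can only extend the labeled set through newly promoted cue words. That gradual growth from a small seed set is the core of the bootstrapping approach.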
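We also describe a system implementation of the method. Experimental results show that our method effectively extracts the career information for a personal name and distinguishes people with the same name who work in different domains, making Web-based collection of personal information more effective.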
We introduce a method for automatically collecting a person's personal information and professional domain. In our approach, personal information is extracted and the domain is identified from Web-based data in order to disambiguate personal names.
In the training phase, the method involves generating surface patterns for personal information extraction based on linguistic and statistical information from the Web, and applying an unsupervised algorithm to construct a Web-based text categorizer. At runtime, a person name is submitted to a search engine, personal information is extracted, and the domain of each retrieved passage is identified with respect to the queried name. Finally, the referents are sorted by domain, personal information, and degree of popularity.
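The runtime steps can be pictured with a small Python sketch: fill the queried name into surface patterns, extract career descriptions from retrieved passages, assign each one a domain, and rank the resulting groups. Everything concrete here is an assumption made for illustration. The English patterns, the `DOMAIN_CUES` table standing in for the trained text categorizer, and the use of extraction counts as a proxy for popularity are not taken from the thesis, which works with Chinese names and learns its patterns from the Web.

```python
import re
from collections import defaultdict

# Hypothetical surface patterns: <NAME> is the slot for the queried name,
# and the named group "info" captures the career description.
PATTERNS = [
    "<NAME> is (?:a|an|the) (?P<info>[^.,;]+)",
    "<NAME>, (?:a|an|the) (?P<info>[^.,;]+),",
    "<NAME> (?:works|worked|serves|served) as (?:a |an |the )?(?P<info>[^.,;]+)",
]

# Hypothetical cue words standing in for a trained text categorizer.
DOMAIN_CUES = {
    "academia": {"professor", "researcher", "university"},
    "sports": {"pitcher", "coach", "team"},
}

def extract_info(name, passage):
    """Apply every surface pattern to one passage and collect the matches."""
    found = []
    for template in PATTERNS:
        regex = template.replace("<NAME>", re.escape(name))
        for match in re.finditer(regex, passage):
            found.append(match.group("info").strip())
    return found

def guess_domain(text):
    """Pick the domain sharing the most cue words with text, else 'unknown'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(DOMAIN_CUES, key=lambda d: len(words & DOMAIN_CUES[d]))
    return best if words & DOMAIN_CUES[best] else "unknown"

def group_referents(name, passages):
    """Group extracted descriptions by domain, most-supported domain first."""
    groups = defaultdict(list)
    for passage in passages:
        for info in extract_info(name, passage):
            groups[guess_domain(info)].append(info)
    # Popularity is approximated by the number of supporting extractions.
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

# Toy passages (hypothetical) such as a search engine might return.
passages = [
    "John Smith is a professor at Example University. He teaches physics.",
    "John Smith, a pitcher for the local team, threw nine innings.",
    "In 2001 John Smith served as a researcher in the lab.",
]
referents = group_referents("John Smith", passages)
```

Each group in `referents` corresponds to one candidate referent of the ambiguous name, which is how domain classification separates people who share a name.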
We also describe an implementation of the proposed method. Blind evaluation on a set of names shows that our method is effective at extracting personal information and cleanly classifying individuals' domain-specific knowledge. The method can help users quickly find information about a person by displaying personal information in a systematic and consistent way.
Al-Kamha, R. and Embley, D. W. Grouping Search-Engine Returned Citations for Person-Name Queries. In WIDM’04, pp. 96-103, Washington, DC, USA, 2004.
Bagga, A. and Baldwin, B. Entity-Based Cross-Document Coreferencing Using the Vector Space Model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 79-85, Montreal, Canada, 1998.
Bekkerman, R. and McCallum, A. Disambiguating Web Appearances of People in a Social Network. In Proceedings of the 14th World Wide Web Conference (WWW 2005), ACM Press, pp. 463-470, Chiba, Japan, 2005.
Bollegala, D., Matsuo, Y., and Ishizuka, M. Extracting Key Phrases to Disambiguate Personal Names on the Web. In Proceedings of CICLing, 2006.
Fleischman, M. B. and Hovy, E. Multi-document Person Name Resolution. In Proceedings of the Workshop on Reference Resolution, Barcelona, Spain, 2004.
Googlism. 2003. <http://www.googlism.com> (1 July 2006).
Guha, R. and Garg, A. Disambiguating People in Search. In Proceedings of the 13th World Wide Web Conference (WWW 2004), ACM Press, 2004.
Lloyd, L., Bhagwan, V., Gruhl, D., and Tomkins, A. Disambiguation of References to Individuals. Technical Report RJ10364 (A0510-011), IBM Research, 2005.
Malin, B. Unsupervised Name Disambiguation via Social Network Similarity. In Proceedings of the Workshop on Link Analysis, Counterterrorism, and Security, in conjunction with the SIAM International Conference on Data Mining, pp. 93-102, Newport Beach, CA, 2005.
Mann, G. S. and Yarowsky, D. Unsupervised Personal Name Disambiguation. In Proceedings of the 7th Conference on Computational Natural Language Learning (CoNLL-2003), pp. 33-40, Edmonton, Canada, 2003.
Manning, C. D. and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999, pp. 232, 249-252, 494, 575.
Peng, F., Weischedel, R., Licuanan, A., and Xu, J. Combining Deep Linguistics Analysis and Surface Pattern Learning: A Hybrid Approach to Chinese Definitional Question Answering, 2005. Retrieved June 2, 2006, from http://www.cs.umass.edu/fuchun/publication/HLT-EMNLP2005.pdf.
Soubbotin, M. M. Patterns of Potential Answer Expressions as Clues to the Right Answer. In Proceedings of the TREC-10 Conference, NIST, pp. 175-182, Gaithersburg, MD, 2001.
Vivisimo Inc. 2000. <http://www.vivisimo.com> (1 July 2006).
Voorhees, E. M. Overview of the TREC 2003 Question Answering Track. In Proceedings of the 12th Text Retrieval Conference (TREC 2003), pp. 54-68, Gaithersburg, MD, 2004.
Wan, X., Gao, J., Li, M., and Ding, B. Person Resolution in Person Search Results: WebHawk. In Proceedings of the ACM 14th Conference on Information and Knowledge Management (CIKM 2005), pp. 163-170, Bremen, Germany, 2005.
Yarowsky, D. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 88-95, Las Cruces, NM, 1994.
Yarowsky, D. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-196, 1995.