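A Semi-Supervised Method for Extracting Instances of a Certain Type in Chinese | National Tsing Hua University Theses and Dissertations Repository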


Author:	Chen, Ying-Chi (陳瑩綺)
Thesis title:	A Semi-Supervised Method for Extracting Instances of a Certain Type in Chinese (半督導式中文特定類型具名實體擷取之研究)
Advisors:	Chang, Jason S. (張俊盛); Jang, Jyh-Shing Roger (張智星)
Oral defense committee:
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication:	2009
Academic year of graduation:	97 (ROC calendar)
Language:	English
Number of pages:	68
Chinese keywords:	資料擷取, 具名實體辨識, 網路語料庫, 最大熵模型, 自動標記
English keywords:	Information extraction (IE), Named Entity Recognition (NER), Web corpus, Maximum Entropy model (ME), Automatic tagging

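This thesis describes a semi-supervised information extraction method that automatically extracts named entities of a specific type, within a given domain, from Chinese text. The method builds and tags a corpus automatically and trains a machine learning model on it, removing the manual annotation bottleneck of earlier supervised machine learning methods. In the training stage, we select a set of seeds from general-purpose, readily available thesauri, use them to collect a corpus from the Web, tag the collected corpus automatically with the seeds, and then train a machine learning model on the automatically tagged corpus. At run time, the trained model is applied to text written in natural language to extract the target named entities. Using exact-match evaluation, we show that the method extracts the target named entities effectively, with a highest precision of 78%. With only a small amount of seed data, the method removes the need for manual annotation, which indicates that it is more portable across domains than other supervised machine learning methods.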
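The seed-based automatic tagging step can be illustrated with a minimal sketch. It is not the thesis's implementation: the B/I/O tag scheme, the longest-match strategy, the function name `tag_with_seeds`, and the sample seeds and sentence are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
from typing import List, Set, Tuple


def tag_with_seeds(tokens: List[str], seeds: Set[Tuple[str, ...]]) -> List[str]:
    """Label tokens with B/I/O tags by longest-match lookup against seed instances.

    Assumption: each seed is a tuple of word tokens; the thesis may tag differently.
    """
    max_len = max((len(s) for s in seeds), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so longer seeds win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in seeds:
                matched = n
                break
        if matched:
            tags[i] = "B"                      # first token of a seed instance
            for j in range(i + 1, i + matched):
                tags[j] = "I"                  # remaining tokens of the instance
            i += matched
        else:
            i += 1                             # token stays "O" (outside)
    return tags


# Hypothetical seeds and a hypothetical word-segmented sentence.
seeds = {("蘋果",), ("奇異", "果")}
tokens = ["我", "喜歡", "吃", "蘋果", "和", "奇異", "果"]
tags = tag_with_seeds(tokens, seeds)
print(list(zip(tokens, tags)))
```

Token and tag pairs produced this way could then serve as training examples for a sequence tagger such as a maximum entropy model, without any hand annotation.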

We introduce a semi-supervised method for extracting instances of a certain type from Chinese text in a given domain. In our approach, a machine learning model for extraction is trained on an automatically collected and tagged corpus, with the aim of eliminating human annotation, the limiting factor of current supervised systems. The method involves selecting seed data of target instances from off-the-shelf general-purpose thesauri, using the seeds to automatically collect a corpus from the Web, automatically tagging the corpus with the seed data, and training a machine learning model on the tagged corpus. At run time, a natural language text is segmented into words, and the trained model is applied to the words to make the best tagging decisions, from which we extract target instances. An exact-match evaluation on a set of annotated test data shows that the method extracts target instances with a precision of 78%. Our methodology eliminates human annotation of training data by using a small amount of seed data, and the method is highly portable to other domains.
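The exact-match criterion used in the evaluation can be sketched as follows. This is an illustration under stated assumptions, not the thesis's evaluation script: the `Span` representation, the function name `exact_match_precision`, and the sample spans are hypothetical. A predicted instance counts as correct only if its boundaries and text coincide exactly with an annotated instance.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (start token index, end token index, surface text).
Span = Tuple[int, int, str]


def exact_match_precision(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Fraction of predicted spans that exactly match an annotated (gold) span."""
    pred_set = set(predicted)
    gold_set = set(gold)
    if not pred_set:
        return 0.0  # convention chosen here: no predictions gives precision 0
    return len(pred_set & gold_set) / len(pred_set)


# Illustrative data: the third prediction has a wrong right boundary,
# so it does not count under exact match.
predicted = [(3, 4, "蘋果"), (5, 7, "奇異果"), (9, 10, "香")]
gold = [(3, 4, "蘋果"), (5, 7, "奇異果"), (9, 11, "香蕉")]
precision = exact_match_precision(predicted, gold)
print(f"{precision:.2f}")
```

Under this criterion a partially overlapping extraction earns no credit, which makes the reported precision a strict measure.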

Acknowledgements
Abstract (Chinese)
ABSTRACT
Chapter 1  Introduction
Chapter 2  Related Work
Chapter 3  Method
   3.1  Problem Statement
   3.2  Training a machine learning model for extraction
      3.2.1  Data Collection
      3.2.2  Tag automatically a corpus
      3.2.3  Apply machine learning on tagged data
   3.3  Run-time extraction of target instances
Chapter 4  Experiment Setting and Result
   4.1  Experimental Setting
   4.2  Systems for comparison
   4.3  Evaluation Metrics and Annotation on Test Data
   4.4  Evaluation Results
Chapter 5  Conclusion and Future Work
References
Appendix A – Samples of Seed data
   Source: Chinese Wordnet
   Source: ToYiChi CiLin (同義詞詞典)
   Source: Chinese thesaurus provided by Academia Sinica
   Source: UDList
Appendix B – Samples of Test Data

                                

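Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)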
