
Graduate student: 陳瑩綺
Chen, Ying-Chi
Thesis title: 半督導式中文特定類型具名實體擷取之研究
A Semi-Supervised Method for Extracting Instances of a Certain Type in Chinese
Advisors: 張俊盛
Chang, Jason S.
張智星
Jang, Jyh-Shing Roger
Committee members:
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Information Systems and Applications
Year of publication: 2009
Academic year of graduation: 97
Language: English
Number of pages: 68
Chinese keywords: 資料擷取, 具名實體辨識, 網路語料庫, 最大熵模型, 自動標記
English keywords: Information Extraction (IE), Named Entity Recognition (NER), Web corpus, Maximum Entropy model (ME), Automatic tagging
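  • This thesis describes a semi-supervised information extraction method that automatically extracts named entities of a specific type within a given domain from Chinese text. The method automatically builds and tags a corpus and trains a machine learning model on it, removing the manual annotation bottleneck of earlier supervised machine learning methods. In the training stage, we select a set of seeds from readily available general-purpose thesauri, use them to collect a corpus from the Web, automatically tag the collected corpus with the seeds, and then train a machine learning model on the automatically tagged corpus. At run time, the trained model is applied to text written in natural language to extract the target named entities. Using exact-match evaluation, we show that the method extracts the target named entities effectively, with precision reaching up to 78%. These results show that a small amount of seed data is enough to remove the need for manual annotation, and indicate that the method ports to new domains more readily than other supervised machine learning methods.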


    We introduce a semi-supervised method for extracting instances of a given type from Chinese text in a specific domain. In our approach, a machine learning model for extraction is trained on an automatically collected and tagged corpus, with the aim of removing human annotation as the limiting factor of current supervised systems. The method involves selecting seed data for the target instances from off-the-shelf general-purpose thesauri, using the seeds to automatically collect a corpus from the Web, automatically tagging the corpus with the seed data, and training a machine learning model on the tagged corpus. At run time, a natural language text is segmented into words, and the trained model is applied to the words to make the best tagging decisions, from which we extract target instances. An exact-match evaluation on a set of annotated test data shows that the method extracts target instances with a precision of 78%. Our method removes the need for human annotation of training data by using a small amount of seed data, and it is highly portable to other domains.
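
    The automatic tagging step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the B/I/O tag set, the longest-match strategy, the function names `tag_with_seeds` and `extract_instances`, and the sample tokens and seeds are all assumptions made for illustration. The sketch assumes the text has already been segmented into words.

```python
from typing import List, Set, Tuple


def tag_with_seeds(tokens: List[str], seeds: Set[Tuple[str, ...]]) -> List[str]:
    """Label tokens with B/I/O tags by longest-match lookup against seed entries.

    The B/I/O scheme is an assumption; the source does not name its tag set.
    """
    max_len = max((len(seed) for seed in seeds), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so longer seeds win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in seeds:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def extract_instances(tokens: List[str], tags: List[str]) -> List[str]:
    """Join each B(I)* span of tokens into one extracted instance."""
    instances: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                instances.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                instances.append("".join(current))
            current = []
    if current:
        instances.append("".join(current))
    return instances


# Hypothetical segmented sentence and seed entries (not taken from the thesis).
tokens = ["他", "得", "了", "糖尿", "病", "和", "感冒"]
seeds = {("糖尿", "病"), ("感冒",)}
tags = tag_with_seeds(tokens, seeds)
instances = extract_instances(tokens, tags)
```

    In a pipeline of the kind the abstract outlines, tags produced this way would serve as training labels for the machine learning model, and `extract_instances` would be applied to the model's predicted tags at run time rather than to the seed-derived ones.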

    Acknowledgements
    Abstract (Chinese)
    Abstract (English)
    Chapter 1 Introduction
    Chapter 2 Related Work
    Chapter 3 Method
        3.1 Problem Statement
        3.2 Training a machine learning model for extraction
            3.2.1 Data Collection
            3.2.2 Tag automatically a corpus
            3.2.3 Apply machine learning on tagged data
        3.3 Run-time extraction of target instances
    Chapter 4 Experiment Setting and Result
        4.1 Experimental Setting
        4.2 Systems for comparison
        4.3 Evaluation Metrics and Annotation on Test Data
        4.4 Evaluation Results
    Chapter 5 Conclusion and Future Work
    References
    Appendix A – Samples of Seed data
        Source: Chinese Wordnet
        Source: ToYiChi CiLin (同義詞詞典)
        Source: Chinese thesaurus provided by Academia Sinica
        Source: UDList
    Appendix B – Samples of Test Data


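    Full-text release date: full text not authorized for public release (on-campus network)
    Full-text release date: full text not authorized for public release (off-campus network)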
