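Graduate student: | Wei-Chung Hsu (許偉忠) |
---|---|
Thesis title: | Cost-Sensitive Poly-Lingual Text Categorization (以成本敏感分類分析法建構之多語言文件分類技術) |
Advisor: | Chih-Ping Wei (魏志平) |
Oral defense committee: | |
Degree: | Master |
Department: | College of Technology Management, Institute of Technology Management |
Year of publication: | 2007 |
Academic year of graduation: | 95 (ROC calendar) |
Language: | English |
Number of pages: | 46 |
Chinese keywords: | 文件探勘 (text mining), 文件分類 (text categorization), 多語文件分類 (poly-lingual text categorization), 文件翻譯 (document translation), 成本敏感度學習 (cost-sensitive learning), 統計雙語詞典 (statistical bilingual thesaurus) |
Foreign-language keywords: | Text mining, Text categorization, Poly-lingual text categorization, Document translation, Cost-sensitive learning, statistical-based bilingual thesaurus |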
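With the rise of the Internet and the trend toward globalization, documents of all kinds have become easy to obtain, and they are written in many different languages. A growing number of organizations and individuals must therefore be able to handle poly-lingual documents. If such organizations and individuals hold a large collection of already-categorized poly-lingual documents, they can use it to build an automatic poly-lingual categorization system, which can then assign newly acquired poly-lingual documents to appropriate categories.
To date, however, poly-lingual text categorization systems remain rare, and the categorization accuracy of existing systems leaves room for improvement. This study therefore proposes a poly-lingual text categorization technique built on cost-sensitive classification analysis. Documents in different languages are translated through a statistical bilingual thesaurus, and each translated document is assigned a translation cost (quality), which serves as the cost incurred when that document is misclassified. The aim is to obtain the best categorization accuracy while keeping the misclassification cost as low as possible.
This study takes the feature-reinforcement poly-lingual text categorization technique as its benchmark. Experimental evaluation shows that the proposed technique outperforms the benchmark on both the Chinese and the English corpora.
Keywords: text mining, text categorization, poly-lingual text categorization, document translation, cost-sensitive learning, statistical bilingual thesaurus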
Because of the trend of globalization, organizations and individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). If organizations or individuals have already organized poly-lingual documents into categories and would like to use this set of preclassified documents as training documents for constructing text categorization models that can classify newly arrived poly-lingual documents into appropriate categories, they face the poly-lingual text categorization (PLTC) problem. PLTC refers to the automatic learning of one or more text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced model(s). Many text categorization techniques have been proposed in the literature; however, most of them deal with monolingual documents. In this study, we propose a cost-sensitive poly-lingual text categorization (CS-PLTC) technique that includes translated documents to expand the training set for PLTC and uses cost-sensitive learning to reflect the differing quality of the training documents. Using the existing feature-reinforcement-based PLTC (FR-PLTC) technique as the performance benchmark, our empirical evaluation results show that the proposed CS-PLTC technique outperforms the benchmark technique on both the English and the Chinese corpora.
Keywords: Text mining, Text categorization, Poly-lingual text categorization, Document translation, Cost-sensitive learning, statistical-based bilingual thesaurus
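The idea of letting translated training documents count for less than native ones can be illustrated with a small sketch. This is not the CS-PLTC implementation described in the thesis. It assumes a multinomial Naive Bayes learner in which every training document carries a weight, with native documents weighted 1.0 and translated documents weighted by a hypothetical translation-quality score. The function names, the toy documents, and the weight 0.2 are all illustrative assumptions.

```python
import math
from collections import defaultdict


def train_weighted_nb(docs):
    """Train a multinomial Naive Bayes model from (tokens, label, weight) triples.

    Each document's weight scales its contribution to the class prior and to
    the per-class term counts (an assumed form of example weighting).
    """
    class_weight = defaultdict(float)
    term_weight = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for tokens, label, weight in docs:
        class_weight[label] += weight
        for tok in tokens:
            term_weight[label][tok] += weight
            vocab.add(tok)
    return class_weight, term_weight, vocab


def classify(model, tokens):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_weight, term_weight, vocab = model
    total = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for label, cw in class_weight.items():
        denom = sum(term_weight[label].values()) + len(vocab)
        score = math.log(cw / total)
        for tok in tokens:
            if tok in vocab:  # ignore terms never seen in training
                score += math.log((term_weight[label].get(tok, 0.0) + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: two native documents and one noisy translated document.
# The weight 0.2 stands in for a hypothetical translation-quality score.
training = [
    (["stock", "market", "bank"], "finance", 1.0),
    (["match", "goal", "team"], "sports", 1.0),
    (["bank", "bank", "match"], "sports", 0.2),  # translated, low quality
]

model = train_weighted_nb(training)
print(classify(model, ["bank"]))  # prints "finance"
```

If the translated document were given the same weight as the native ones, the same query would be assigned to "sports", so the down-weighting is what keeps the noisy translation from dominating the decision in this toy case.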