
Graduate student: 陳昭錡 (Chao-Chi Chen)
Thesis title: Poly-Lingual Category Integration Technique (多語言文件類別整合技術)
Advisor: 魏志平 (Chih-Ping Wei)
Oral examination committee:
Degree: Master
Department: College of Technology Management, Institute of Technology Management
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: English
Number of pages: 61
Keywords (Chinese): 多語類別整合, 類別整合, 加強式貝氏分類, 特徵加強, 多語文件管理, 文件探勘
Keywords (English): Poly-lingual category integration, Category integration, Enhanced Naïve Bayes, Feature reinforcement, Multilingual document management, Text mining
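  • (Translated from the Chinese abstract) Document-category integration (category integration for short) is important in many e-commerce applications, including information gathering by intermediaries and supply chain management integration. With globalization, the need for category integration has extended from monolingual to poly-lingual settings. The goal of poly-lingual category integration is to integrate two document catalogs, each containing documents written in different languages. Several category integration techniques have been proposed in the literature, but they focus on monolingual rather than poly-lingual category integration. This study first proposes a feature-reinforcement-based poly-lingual category integration technique (FR-PLCI) that considers the target-catalog documents in all languages when documents in the source catalog are integrated into the target catalog. We then develop two extensions of this technique, the document-enhancement PLCI (DE-PLCI) technique and the probability-enhancement PLCI (PE-PLCI) technique, which add the source-catalog documents already integrated by FR-PLCI to the starting step of the process and repeat the feature-reinforcement-based integration iteratively. Using monolingual category integration as the benchmark, our experiments show that, for both English and Chinese category integration, the proposed FR-PLCI technique achieves higher integration accuracy than the monolingual technique. In addition, the results for DE-PLCI and PE-PLCI show that they improve on the original FR-PLCI technique when the similarity between the two catalogs is moderate to high.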


    Document-category integration (or category integration for short) is fundamental to many e-commerce applications, including information aggregation by intermediaries and supply chain management integration. With the trend toward globalization, the need for category integration has extended from monolingual to poly-lingual settings. Poly-lingual category integration (PLCI) aims to integrate two document catalogs, each of which consists of documents written in a mix of languages. Several category integration techniques have been proposed in the literature, but they address only monolingual category integration, not PLCI. In this study, we first propose a feature-reinforcement-based PLCI technique (FR-PLCI) that takes into account the master documents of all languages when integrating a source catalog into the master catalog. We then develop two extensions of FR-PLCI, the DE-PLCI and PE-PLCI techniques, which iteratively feed the already-integrated source documents back into the FR-PLCI process. Using the monolingual category integration technique (MnCI) as the performance benchmark, our empirical evaluation shows that the proposed FR-PLCI technique achieves higher integration accuracy than MnCI in both English and Chinese category integration. In addition, the extended techniques (DE-PLCI and PE-PLCI) generally improve on the performance of FR-PLCI in the homogeneous and comparable scenarios.
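The abstract names feature reinforcement but does not spell out how it works. As a rough illustration only, the sketch below assumes that each term has a per-language relevance score computed from the master catalog (for example, a chi-square statistic) and that a bilingual thesaurus links terms across languages; a term's score is then blended with the weighted scores of its cross-lingual counterparts before the top features are selected. The function names, the blending weight `alpha`, and the toy data are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Tuple


def reinforce_scores(
    own_scores: Dict[str, float],
    other_scores: Dict[str, float],
    thesaurus: Dict[str, List[Tuple[str, float]]],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend each term's score with the scores of its cross-lingual related terms.

    alpha = 1.0 ignores the other language (purely monolingual scoring);
    alpha = 0.0 relies entirely on the related terms' scores.
    This blending rule is an assumption made for illustration.
    """
    reinforced: Dict[str, float] = {}
    for term, score in own_scores.items():
        related = thesaurus.get(term, [])
        total_weight = sum(weight for _, weight in related)
        if total_weight > 0:
            # Weighted average of the related terms' scores in the other language;
            # related terms without a score contribute 0.
            cross = sum(
                weight * other_scores.get(other_term, 0.0)
                for other_term, weight in related
            ) / total_weight
            reinforced[term] = alpha * score + (1 - alpha) * cross
        else:
            # No cross-lingual evidence: keep the monolingual score unchanged.
            reinforced[term] = score
    return reinforced


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy data (hypothetical): per-language term scores and thesaurus links.
english_scores = {"computer": 4.0, "bank": 1.0, "the": 0.2}
chinese_scores = {"電腦": 6.0, "銀行": 3.0}
en_to_zh = {
    "computer": [("電腦", 1.0)],
    "bank": [("銀行", 0.5), ("河岸", 0.5)],
}

reinforced = reinforce_scores(english_scores, chinese_scores, en_to_zh, alpha=0.5)
top_features = select_top_k(reinforced, k=2)
print(reinforced)
print(top_features)
```

In this toy setup, "computer" gains weight because its Chinese counterpart scores highly, while "the" has no thesaurus entry and keeps its low monolingual score. The selected features could then be passed to any monolingual category integration step.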

    Chapter 1 Introduction
      1.1 Background
      1.2 Research Motivation and Objective
      1.3 Organization of the Thesis
    Chapter 2 Literature Review
      2.1 Existing Monolingual Category Integration Techniques
        2.1.1 Enhanced Naïve Bayes (ENB) Technique
        2.1.2 Cluster Shrinkage Technique
        2.1.3 Co-Bootstrapping Technique
        2.1.4 Cluster-based Category Integration (CCI) Technique
      2.2 Poly-Lingual Text Categorization Technique
    Chapter 3 Design of Feature-Reinforcement-Based Poly-Lingual Category Integration (FR-PLCI) Technique
      3.1 Bilingual Thesaurus Construction
      3.2 Feature Reinforcement and Selection
        3.2.1 Feature Extraction
        3.2.2 Feature Reinforcement and Selection
      3.3 Monolingual Category Integration
    Chapter 4 Enhancements of Feature-Reinforcement-Based Poly-Lingual Category Integration Technique
      4.1 Document-Enhancement PLCI (DE-PLCI)
      4.2 Probability-Enhancement PLCI (PE-PLCI)
    Chapter 5 Empirical Evaluation
      5.1 Data Collection
      5.2 Evaluation Design
        5.2.1 Dimension of Evaluations and Creation of Synthetic Catalogs
        5.2.2 Evaluation Procedure and Criteria
        5.2.3 Performance Benchmark
      5.3 Evaluation Results
        5.3.1 Parameter Tuning Experiments
        5.3.2 Comparative Evaluations
        5.3.3 Comparative Evaluations Under Automatic Homogeneity Scenario Identification
    Chapter 6 Conclusion and Future Work
    References


    Full text: not authorized for public release (on-campus or off-campus network).
