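A Document Classification Model Using Principal Component Analysis | National Tsing Hua University Electronic Theses and Dissertations Repository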

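Graduate student:	許邦輝
Thesis title:	A Document Classification Model Using Principal Component Analysis (以主成分分析法為基礎之文件自動分類模式)
Advisor:	侯建良
Oral examination committee:
Degree:	Master
Department:	College of Engineering - Department of Industrial Engineering and Engineering Management
Year of publication:	2006
Academic year of graduation:	94
Language:	Chinese
Number of pages:	147
Keywords:	principal component analysis, document classification, keyword extraction, knowledge management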

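In recent years, with the rapid development of computer and information technology, the volume of electronic documents held by enterprises and institutions has grown at a geometric rate. How to use automatic document classification to help these organizations manage their electronic documents, and thereby improve the effectiveness of enterprise knowledge management, has become an important topic in knowledge management and related research. Because the content of electronic documents is complex and diverse, judging document categories manually is neither cost-effective nor fast; moreover, the criteria for assigning categories are hard to keep consistent. In view of this, this research proposes a method for automatically determining document categories based on principal component analysis. The method first extracts the keywords of documents of known categories together with their frequencies, and merges these keywords to obtain the union of the keywords of all documents across all categories. The term frequencies over this union are then used to infer classification-basis keywords, that is, to find the keywords that are representative for classification. Next, based on these representative keywords, the keyword frequencies of each group of known-category documents and of the target document are extracted, and a membership value between each document category and the target document is calculated; the category of the target document is determined from these membership values. Finally, this research builds an automatic classification system for knowledge documents and uses a case to evaluate the effectiveness and feasibility of the model and technique.
In summary, the goal of this research is to improve the accuracy and efficiency of automatic document classification, so as to help enterprises and institutions manage their knowledge documents more effectively and, in turn, raise the utilization of enterprise knowledge. In addition, the research helps information seekers find the documents they need quickly and conveniently among the vast amount of information and documents on the Internet, saving the considerable time they would otherwise spend filtering and screening information.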

Owing to the rapid growth of information technology, the number of digital documents on the Internet and within organizations has increased significantly. To help enterprises manage their digital documents and domain knowledge more effectively, automatic document classification has become a key issue in enterprise knowledge management. Given the complexity of different types of digital documents, this research uses principal component analysis (PCA) to develop an algorithm for automatic document classification. Based on PCA, representative keywords of distinct document categories can be obtained. The category of a target document can then be determined from the frequencies of these representative keywords in that document.
In addition to the document classification algorithm, a Web-based document classification system is developed, and a demonstration case is used to verify the performance of the proposed approach. This research aims to improve the accuracy and efficiency of enterprise document classification technology and to enable a self-service knowledge management mechanism in organizations.
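
The pipeline outlined in the abstract (keyword frequencies over a keyword union, PCA to find representative keywords, then a membership value per category) can be illustrated with a minimal Python sketch. This is not the thesis's actual algorithm: the abstract does not give its formulas, so the loading-based keyword selection, the use of cosine similarity as the membership value, the function names `representative_keywords` and `classify`, and the toy frequency data are all assumptions made for illustration.

```python
import numpy as np

def representative_keywords(freq, n_components=2, top_k=4):
    """Return indices of keywords with the largest absolute loadings on the
    leading principal components (an assumed selection rule)."""
    centered = freq - freq.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # largest components first
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    loadings = np.abs(eigvecs[:, order]) * scale      # shape: (keywords, components)
    score = loadings.max(axis=1)
    top = np.argsort(score)[::-1][:top_k]
    return sorted(top.tolist())

def classify(target, freq, labels, selected):
    """Assign the target to the category whose mean frequency vector is most
    similar to it (cosine similarity stands in for the membership value)."""
    t = target[selected]
    memberships = {}
    for label in sorted(set(labels)):
        rows = [i for i, l in enumerate(labels) if l == label]
        centroid = freq[rows][:, selected].mean(axis=0)
        denom = np.linalg.norm(centroid) * np.linalg.norm(t)
        memberships[label] = float(centroid @ t / denom) if denom > 0 else 0.0
    best = max(memberships, key=memberships.get)
    return best, memberships

# Hypothetical toy data: rows are documents, columns are the keyword union.
keywords = ["engine", "wheel", "protein", "gene", "the"]
freq = np.array([
    [5, 4, 0, 0, 3],   # category "auto"
    [6, 3, 0, 1, 3],   # category "auto"
    [0, 0, 5, 6, 3],   # category "bio"
    [1, 0, 4, 5, 3],   # category "bio"
], dtype=float)
labels = ["auto", "auto", "bio", "bio"]

selected = representative_keywords(freq)
target = np.array([4, 5, 0, 0, 3], dtype=float)
label, memberships = classify(target, freq, labels, selected)
print([keywords[i] for i in selected], label, memberships)
```

In this toy data the keyword "the" occurs equally often in every document, so it has zero variance, contributes nothing to the leading components, and is not selected. A real system would also need the keyword extraction step that produces the frequency matrix, which is outside this sketch.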

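Chapter 1: Research Background
1 Research Motivation and Objectives
2 Research Procedure
3 Research Positioning
Chapter 2: Literature Review
1 Text Mining
1.1 Text Mining Methods
1.2 Text Mining Applications
2 Keyword Extraction
2.1 Statistical Analysis Methods
2.2 Lexicon Matching Methods
2.3 Grammar Parsing Methods
2.4 Other Methods
3 Document Classification
3.1 Classification Using Data Mining
3.2 Classification Using Statistical Methods
3.3 Classification Using Term Association
3.4 Classification Using Co-occurring Term Analysis
3.5 Classification Using Other Methods
Chapter 3: Automatic Category Determination Model for Knowledge Documents
1 Introduction to Principal Component Analysis
2 Inference Model for Classification-Basis Keywords
3 Document Category Determination Model
Chapter 4: System Architecture Planning
1 Process Framework of the Automatic Document Category Determination Model
2 System Functional Architecture
3 Data Model Definition
4 System Functional Flow
4.1 System Function Operation Flow
4.2 System Data Flow
5 System Development Tools
Chapter 5: System Implementation and Case Analysis
1 Verification and Evaluation of Patent Document Category Inference
2 System Application Scenarios
Chapter 6: Conclusions and Future Work
References
Appendix: System Function Operation Guide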

                                


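Full-text release date: full text not authorized for public release (on-campus network)
Full-text release date: full text not authorized for public release (off-campus network)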
