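Graduate student: | 陳奕廷 Chen, Yi-Ting |
---|---|
Thesis title: | 基於整合特徵詞擷取方法的文件分類系統 A Text Classification System Based On Integrated Feature Selection Approach |
Advisor: | 張適宇 Chang, Shih-Yu |
Committee members: | 徐正炘, 許慶賢 |
Degree: | Master (碩士) |
Department: | College of Electrical Engineering and Computer Science - Institute of Communications Engineering |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | English |
Number of pages: | 44 |
Chinese keywords: | feature selection (特徵詞擷取), text classification (文件分類), mutual information (互消息) |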
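In text classification, feature selection is a critical step because it strongly affects the accuracy of the classifier. When the original corpus is very large, selecting appropriate feature terms from it is especially helpful for reducing processing time. In surveying feature selection techniques for text classification, we found that few previous studies combine feature selection techniques effectively; most rely on a single technique to compute the weights of feature terms. In this study, we propose an integrated feature selection approach that combines the well-known mutual information (MI) and term frequency–inverse document frequency (TF-IDF) methods with an improved vector space model, in order to address the high dimensionality of the feature space and to improve classification performance. The approach first computes the MI value of each feature term so that less important terms in the training set can be identified, and then multiplies each term's MI value by its own TF-IDF value to further strengthen the weights of the more important terms. The study is applied to Chinese documents, with the document collection obtained from the Yahoo! Kimo News website in Taiwan. Experimental results confirm that the proposed integrated feature selection approach outperforms both the traditional TF-IDF method and the traditional MI method.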
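The two-stage weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact MI formula, the filtering threshold, or the details of the improved vector space model. The sketch assumes pointwise mutual information between a term and a category estimated from document counts (taking the maximum over categories), a plain `tf * log(N / df)` form of TF-IDF, and documents that have already been segmented into tokens. The function names `max_mutual_information` and `integrated_weights` and the parameter `mi_threshold` are hypothetical.

```python
import math
from collections import Counter


def max_mutual_information(docs, labels):
    """Return {term: max over categories of pointwise MI(term, category)}.

    Assumed estimate: MI(t, c) = log(P(t, c) / (P(t) * P(c))), with the
    probabilities taken from document counts.
    """
    n = len(docs)
    cat_count = Counter(labels)  # number of documents per category
    df = Counter()               # number of documents containing each term
    df_cat = Counter()           # (term, category) -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[(term, label)] += 1
    mi = {}
    for (term, cat), joint in df_cat.items():
        score = math.log(joint * n / (df[term] * cat_count[cat]))
        mi[term] = max(score, mi.get(term, float("-inf")))
    return mi


def integrated_weights(docs, labels, mi_threshold=0.0):
    """Return one {term: MI * TF-IDF} dict per document.

    Terms whose MI score is not above mi_threshold (an assumed cutoff)
    are dropped before weighting.
    """
    n = len(docs)
    mi = max_mutual_information(docs, labels)
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        vec = {}
        for term, count in Counter(tokens).items():
            if mi[term] <= mi_threshold:
                continue  # stage 1: filter out weakly informative terms
            tfidf = (count / len(tokens)) * math.log(n / df[term])
            vec[term] = mi[term] * tfidf  # stage 2: MI times TF-IDF
        vectors.append(vec)
    return vectors
```

Under these assumptions, a term that occurs in every document of every category receives an MI score of zero and is filtered out, while a term concentrated in one category keeps a positive weight that scales with its TF-IDF value.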