
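Author: Chen, Yi-Ting (陳奕廷)
Thesis title: A Text Classification System Based On Integrated Feature Selection Approach (基於整合特徵詞擷取方法的文件分類系統)
Advisor: Chang, Shih-Yu (張適宇)
Committee members: 徐正炘, 許慶賢
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Communications Engineering
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Pages: 44
Keywords (Chinese): 特徵詞擷取 (feature selection), 文件分類 (text classification), 互消息 (mutual information)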
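  • In text classification, feature selection is a critical step because it strongly affects classifier accuracy. When the original corpus is very large, selecting appropriate feature terms from it also helps reduce processing time. In surveying feature selection techniques for text classification, we found that few previous studies combine such techniques effectively; most use a single technique to compute term weights. In this study, we propose an integrated feature selection approach that combines the well-known mutual information (MI) and term frequency-inverse document frequency (TF-IDF) methods with an improved vector space model, in order to address the high dimensionality of the feature space and to improve classification performance. The approach first computes the MI value of each feature term so that the less important terms in the training set can be identified, and then multiplies each term's MI value by its own TF-IDF value to further strengthen the weights of the more important terms. The approach is applied to Chinese documents collected from the Yahoo! Kimo news site in Taiwan. Experimental results show that the proposed integrated approach outperforms both the traditional TF-IDF method and the MI method.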
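The two steps described in the abstract (filter terms by MI, then weight the surviving terms by MI times TF-IDF) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the record does not give the exact formulas, so the sketch assumes pointwise MI between term presence and class (taking the maximum over classes), raw term counts for TF, a natural-log IDF, and an arbitrary threshold of 0.5. The function names `mutual_information` and `integrated_weights` and the toy English tokens are hypothetical; the thesis works on word-segmented Chinese news text.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return (mi, df): per-term MI score and document frequency.

    Assumed definition: MI(t, c) = log(A * N / (df(t) * N_c)), where A is the
    number of class-c documents containing t. A term's score is its maximum
    over the classes in which it occurs.
    """
    n = len(docs)
    class_count = Counter(labels)
    df = Counter()           # term -> number of documents containing it
    df_in_class = Counter()  # (term, class) -> class documents containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1
    mi = {}
    for (term, label), a in df_in_class.items():
        score = math.log(a * n / (df[term] * class_count[label]))
        mi[term] = max(mi.get(term, float("-inf")), score)
    return mi, df


def integrated_weights(docs, labels, mi_threshold=0.5):
    """Build sparse document vectors weighted by MI(t) * TF(t, d) * IDF(t)."""
    n = len(docs)
    mi, df = mutual_information(docs, labels)
    # Step 1: drop terms whose MI falls below the (assumed) threshold.
    kept = {t for t, s in mi.items() if s >= mi_threshold}
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in kept)
        # Step 2: multiply each kept term's MI by its TF-IDF value.
        vectors.append(
            {t: mi[t] * count * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors


# Toy corpus (hypothetical): already tokenized, stop words already removed.
docs = [
    ["stock", "market", "today"],
    ["market", "rises", "today"],
    ["team", "wins", "today"],
    ["team", "loses", "match"],
]
labels = ["finance", "finance", "sports", "sports"]

vectors = integrated_weights(docs, labels)
print(vectors[0])
```

In this toy corpus the term "today" appears in both classes, so its MI falls below the threshold and it is excluded from every vector, while class-specific terms such as "stock" and "market" keep a boosted weight. The resulting sparse vectors could then be passed to a classifier such as an SVM, which the table of contents indicates the thesis uses.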


    1 Introduction and Literature Review
      1.1 Background
      1.2 Research Motivation
      1.3 Organization of Thesis
    2 Previous Work
      2.1 Vector Space Model
      2.2 Feature Selection Methods
        2.2.1 Mutual Information
        2.2.2 Term frequency-inverse document frequency
      2.3 Text Classifier
        2.3.1 Decision Tree
        2.3.2 K-Nearest Neighbors
        2.3.3 Naive Bayesian
      2.4 Support Vector Machine
    3 The Proposed Approach
      3.1 Architecture for a Text Classification System
      3.2 TF calculation and MI threshold
      3.3 The Improved Feature Selection Method
      3.4 Evaluate Weight with VSM
      3.5 Time Complexity Analysis
      3.6 The steps of the proposed method
        3.6.1 Documents collection
        3.6.2 Word segmentation
        3.6.3 Stop words removal
        3.6.4 Evaluate of the value of TF and MI and IDF
        3.6.5 Constructing a Vector Space Model
        3.6.6 Training SVM classifier with VSM of training documents
    4 Experiments Result
      4.1 Corpus
      4.2 Evaluation Metrics
      4.3 Results
        4.3.1 Experiment 1
        4.3.2 Experiment 2
        4.3.3 Experiment 3
    5 Conclusions
    Reference


    Full text: not authorized for public release (campus network, off-campus network, or the National Central Library's Taiwan thesis and dissertation system).