Graduate Student: Fontenelle-Augustin, Tiffany Natasha (蒂芙妮)
Thesis Title: Prototype Selection for Efficient Classification (原型的選擇以作有效率的分類)
Advisor: Soo, Von-Wun (蘇豐文)
Committee Members: Chen, Yi-Shin (陳宜欣); Chen, Chaur-Chin (陳朝欽)
Degree: Master
Department: Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Academic Year of Graduation: 106 (ROC calendar, 2017-2018)
Language: English
Number of Pages: 46
Keywords: prototype selection, classification, big data


Abstract

Big data has become ubiquitous and of great significance in academia. With the rapid growth in the volume of big data, many problems arise when manipulating the data for the purpose of forecasting. In this thesis, we highlight the problem of computational complexity encountered when dealing with big data, and we propose a heuristic that helps to solve it by altering the existing classification method so that it is more suitable for handling big data, thereby increasing efficiency. Our heuristic is not only better suited to handling big data but is also faster than traditional classification, while keeping accuracy approximately the same, if not higher. The heuristic combines prototype selection with the traditional classification process: a subset of the training data is selected as prototypes, the remaining training data is discarded, and the classifier is trained on the prototypes alone rather than on the entire training set as in the conventional method. The learning algorithm used in our heuristic is the J48 decision tree algorithm. We evaluated the heuristic by comparing the classification accuracy and running time of our algorithm (using prototypes) against the traditional decision tree and naïve Bayes algorithms (using the entire training set), and we also compared the amount of data used in our training phase with the amount used by conventional methods. We tested five data sets ranging in size from small to large. Our findings show that, for big data, our heuristic saves memory and runs 100% faster than traditional classification, with only a slight drop in accuracy.

Contents

1 Introduction
    1.1 Statement of the Problem
    1.2 Research Objective and Contributions
        1.2.1 Hypothesis
        1.2.2 Contributions
    1.3 Related Work
2 Methodology
    2.1 Definitions and Symbols
    2.2 Adapted PSC Algorithm
    2.3 Experiment
        2.3.1 Random Partitioning
        2.3.2 Selection of Prototypes
        2.3.3 Training and Testing using Prototypes in Conjunction with Decision Tree
        2.3.4 Training and Testing using the Original Training Set
3 Results
    3.1 Accuracy Results
    3.2 Time Results
    3.3 Memory Results
4 Evaluation
    4.1 Datasets
    4.2 Metrics
    4.3 Discussion
5 Conclusion
    5.1 Summary
    5.2 Limitations
    5.3 Future Work
References

