Graduate Student: Fontenelle-Augustin, Tiffany Natasha (蒂芙妮)
Thesis Title: Prototype Selection for Efficient Classification (原型的選擇以作有效率的分類)
Advisor: Soo, Von-Wun (蘇豐文)
Committee Members: Chen, Yi-Shin (陳宜欣); Chen, Chaur-Chin (陳朝欽)
Degree: Master
Department: Institute of Information Systems and Applications, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Academic Year of Graduation: 106 (ROC calendar, 2017-2018)
Language: English
Number of Pages: 46
Keywords: prototype selection, classification, big data


Abstract

Big data has become ubiquitous and of great significance in academia. With the rapid growth in the volume of big data, many problems arise when manipulating the data for the purpose of forecasting. In this thesis, we highlight the problem of computational complexity encountered when dealing with big data, and we propose a heuristic that helps to solve it by altering the existing classification method so that it is more suitable for handling big data, thereby increasing efficiency. Our heuristic is not only better suited to handling big data but is also faster than traditional classification, while keeping accuracy approximately the same, if not higher. The heuristic combines prototype selection with the traditional classification process: a subset of the training data is selected as prototypes, the remaining training data is discarded, and the classifier is trained on the prototypes alone rather than on the entire training set as in the conventional method. The learning algorithm used in our heuristic is the J48 decision tree algorithm. We evaluated the heuristic by comparing the classification accuracy and running time of our algorithm (using prototypes) against the traditional decision tree and naïve Bayes algorithms (using the entire training set), and we also compared the amount of data used in our training phase with the amount used by conventional methods. We tested five data sets ranging in size from small to large. Our findings show that, for big data, our heuristic saves memory and runs 100% faster than traditional classification, with only a slight drop in accuracy.

Contents

1 Introduction
    1.1 Statement of the Problem
    1.2 Research Objective and Contributions
        1.2.1 Hypothesis
        1.2.2 Contributions
    1.3 Related Work
2 Methodology
    2.1 Definitions and Symbols
    2.2 Adapted PSC Algorithm
    2.3 Experiment
        2.3.1 Random Partitioning
        2.3.2 Selection of Prototypes
        2.3.3 Training and Testing using Prototypes in Conjunction with Decision Tree
        2.3.4 Training and Testing using the Original Training Set
3 Results
    3.1 Accuracy Results
    3.2 Time Results
    3.3 Memory Results
4 Evaluation
    4.1 Datasets
    4.2 Metrics
    4.3 Discussion
5 Conclusion
    5.1 Summary
    5.2 Limitations
    5.3 Future Work
References

