
Author: 蔡亞勳 (Tsai, Ya Hsun)
Title: Learning Classification Models From Datasets with Block Missing
Advisors: 魏志平 (Wei, Chih-Ping); 林福仁 (Lin, Fu-Ren)
Oral examination committee: 魏志平 (Wei, Chih-Ping); 林福仁 (Lin, Fu-Ren); 楊錦生; 陳宏鎮
Degree: Master
Department: College of Technology Management, Institute of Service Science
Year of publication: 2012
Graduation academic year: 100 (2011–2012)
Language: English
Pages: 63
Keywords (Chinese): data mining, missing value imputation, missing value distortion, distortion-based bagging
Keywords (English): Data Mining, missing value imputation, missing data distortion, distortion-based bagging
  • In data mining research, the efficiency and predictive accuracy of a classification model are strongly affected by the training data used to build the classifier; in other words, the quality of a classification model is largely determined by the quality of its training data. Missing and incomplete training data have long been one of the factors affecting research results. In most cases, values go missing at random in a dataset because of privacy, confidentiality, or human factors during data collection, and the impact of such random missing values can largely be reduced by existing methods such as imputation. However, the evolution and diversity of data attributes have introduced another type of problem and challenge, so-called block missing. This situation may arise within a single dataset when attributes evolve over the collection period, so that all records gathered before a new attribute was introduced consistently lack its values; or it may result from merging datasets from multiple sources whose attribute sets are similar but not identical. Preliminary experiments show that common methods for handling random missing values, such as imputation, are not fully applicable to the block missing problem. We therefore propose an approach that reduces the uncertainty introduced when imputation fills in values. First, we compute distortion information for each missing value to capture the statistics of all its possible values. Then, following the concept of bagging, we propose a distortion-based bagging technique that, for the same prediction task, builds different classification models by filling in possible values according to the distortion information. We conduct a series of experiments that vary the importance of the block-missing attributes and the missing ratio. The experimental results show that our two proposed methods outperform the benchmark methods, especially when the missing attributes have high discriminative power.
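The idea of attaching distortion information to each KNN-imputed value, as described in the abstract, can be sketched as follows. This is a minimal illustration, not the thesis's exact formulation: the function name, the use of Euclidean distance over the observed attributes, and summarizing the candidate values by their mean and standard deviation are all my assumptions, and donor rows are assumed complete on the attributes used for the distance.

```python
import numpy as np

def knn_impute_with_distortion(X, row, col, k=5):
    """Impute the missing cell X[row, col] from the k nearest rows that
    observe attribute `col`, and return distortion information (here:
    the spread of the candidate values) alongside the imputed value."""
    donors = X[~np.isnan(X[:, col])]              # rows observing the target attribute
    obs = [j for j in range(X.shape[1])           # attributes observed in the query row
           if j != col and not np.isnan(X[row, j])]
    # Euclidean distance over the shared observed attributes
    dists = np.linalg.norm(donors[:, obs] - X[row, obs], axis=1)
    nearest = donors[np.argsort(dists)[:k], col]  # the k closest donors' values
    return nearest.mean(), nearest.std()          # imputed value, distortion info
```

A call on a toy dataset with one missing cell returns both the filled-in value and a measure of how much the plausible candidates disagree, which is what the later bagging step exploits.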


    The effectiveness of a classifier depends significantly on the quality of the training instances, given the nature of machine learning algorithms and data mining techniques. Traditionally, values may be randomly missing from a training dataset because of personal privacy, confidentiality, or operational mistakes. Such missing values can be handled by existing methods, such as simple imputation, so the performance of a classifier built on a training dataset with random missing values remains acceptable. However, with the growth of data sharing, another form of data incompleteness, block missing, has become a new challenge and cannot be solved the same way as random missing. Block missing usually arises in an evolving training dataset or in a dataset integrated from different sources: some instances lack values for certain specific attributes, which may be new attributes never used before or attributes exclusive to one source. Preliminary experiments demonstrate that the common imputation methods for handling random missing values are not applicable to block missing. To address this new challenge, we propose two novel methods that consider the uncertainty of each imputed value and build the corresponding classification models accordingly. Specifically, we first extract distortion information for each missing value to capture the statistics of all its possible values. Following the concept of bagging, we then apply the proposed distortion-based bagging technique to build different classifiers for the same prediction task, each trained on a distorted training dataset whose missing values are filled in according to the corresponding distortion information.
Finally, the prediction for a testing instance is the majority vote over all the classifiers built for this task. A series of experiments is then performed on three different kinds of training datasets. The experimental results show that our two proposed methods are superior to the benchmark methods, especially when block missing occurs in attributes with higher discriminative power for prediction.
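The distortion-based bagging step described in the abstract can be illustrated with the sketch below. The assumptions are mine, not the thesis's: Gaussian distortion information `(mean, std)` for each missing numeric cell, a nearest-centroid stand-in for the base learner, integer class labels, and a fixed ensemble size.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid(X_tr, y_tr, X_te):
    # Stand-in base learner: assign each test row to the closest class centroid.
    labels = np.unique(y_tr)
    cents = np.array([X_tr[y_tr == c].mean(axis=0) for c in labels])
    d = np.linalg.norm(X_te[:, None, :] - cents[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

def distortion_bagging_predict(X_train, y_train, X_test, distortion, n_models=11):
    """Build n_models classifiers, each trained on a copy of X_train whose
    missing cells are filled with a draw from that cell's distortion
    information (mean, std), and combine them by majority vote."""
    votes = np.empty((n_models, X_test.shape[0]), dtype=int)
    for m in range(n_models):
        X_filled = X_train.copy()
        for (i, j), (mu, sigma) in distortion.items():
            X_filled[i, j] = rng.normal(mu, sigma)   # one plausible imputation
        votes[m] = nearest_centroid(X_filled, y_train, X_test)
    # Majority vote per testing instance across the ensemble.
    return np.array([np.bincount(votes[:, t]).argmax()
                     for t in range(X_test.shape[0])])
```

Each ensemble member sees a different but equally plausible completion of the block-missing cells, so the vote averages out imputation uncertainty instead of committing to a single filled value.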

    Chapter 1 Introduction
      1.1 Research Background
      1.2 Scenario 1
      1.3 Scenario 2
      1.4 Motivation
      1.5 Objective & Challenge
    Chapter 2 Literature Review
    Chapter 3 Learning Classification Models Based on Distortion-Based Bagging Technique
      3.1 Generating Imputed Value with Distortion Information by KNN-Based Imputation Method
        3.1.1 Distortion Information for Categorical Attributes
        3.1.2 Distortion Information for Numerical Attributes
      3.2 Generating Imputed Value with Distortion Information by Model-Based Imputation Method
        3.2.1 Distortion Information for Categorical Attributes
        3.2.2 Distortion Information for Numerical Attributes
      3.3 Learning with Distortion-Based Bagging
      3.4 Prediction Combination
    Chapter 4 Empirical Evaluation
      4.1 Data Collection
      4.2 Evaluation Design and Procedure
      4.3 Comparative Evaluation
        4.3.1 Comparison of Imputation
        4.3.2 Comparison with Benchmarks
        4.3.3 Comparison with Bagging & without Bagging
    Chapter 5 Conclusion and Future Work
    References
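For categorical attributes (Sections 3.1.1 and 3.2.1 in the outline above), a natural form of distortion information is the empirical distribution over the candidate categories. A hypothetical sketch, assuming the imputed value is the mode of the neighbors' categories and the distortion is the probability mass not on the mode (both summaries are my assumptions, not the thesis's definition):

```python
from collections import Counter

def categorical_distortion_info(neighbor_values):
    """Summarize the candidate categories for one missing categorical cell:
    impute the mode, and report the probability mass not on the mode as a
    simple distortion measure, together with the full distribution."""
    counts = Counter(neighbor_values)
    dist = {cat: n / len(neighbor_values) for cat, n in counts.items()}
    imputed = counts.most_common(1)[0][0]     # most frequent category
    return imputed, 1.0 - dist[imputed], dist # value, distortion, distribution
```

The full distribution is what a distortion-based bagging round would sample from when producing one completed copy of the training data.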


    Full-text availability: not authorized for public release (campus network)
    Full-text availability: not authorized for public release (off-campus network)
