| Graduate student: | 林文彬 Lin, Wen Ben |
|---|---|
| Thesis title: | 處理長尾分布與屬性資料扭曲之資料探勘技術 Development of Novel Data Mining Techniques for Handling Long Tail Distribution and Attribute-Specific Distortion |
| Advisors: | 葉維彰 Yeh, Wei Chang; 魏志平 Wei, Chih Ping |
| Oral defense committee: | 曾新穆 Tseng, Shin Mu; 黃正魁 Huang, Cheng Kui; 劉敦仁 Liu, Duen Ren; 楊錦生 Yang, Chin Sheng; 黃佳玲 Huang, Chia Ling |
| Degree: | Doctor |
| Department: | College of Engineering - Department of Industrial Engineering and Engineering Management |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 |
| Language: | Chinese |
| Number of pages: | 109 |
| Keywords (Chinese): | 機器學習 (machine learning), 長尾分布 (long tail distribution), 類別不平衡 (class imbalance), 重取樣 (resampling), 扭曲資料 (distorted data) |
| Keywords (English): | Machine Learning, Long Tail Distribution, Class Imbalance, Resampling, Distortion Data |
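Long-tail distributed data and distorted attribute data are two common data characteristics, and both affect the size of prediction errors and the level of accuracy. Long-tail distributions occur in many domains. Because instances in the tail are scarce, analysis or prediction errors there are correspondingly larger, which hinders decision making. For the long-tail problem, this thesis proposes two techniques, each of which reduces the excessive prediction error on tail instances. The first technique converts the number of neighbors each instance has within a specified range into a distribution of sampling probabilities, draws multiple training sets by repeated sampling, and integrates the predictions of all resulting models into the final prediction. The thesis also proposes a new hybrid strategy that combines the strengths of the traditional model and the ensemble model to predict instances from different positions of the long-tail distribution. The second technique addresses the long-tail problem through resampling, using oversampling and undersampling. Likewise, the thesis proposes a new hybrid strategy for this second technique that combines the strengths of the traditional model and the improved model to predict instances from different positions of the long-tail distribution. According to the empirical evaluation, both techniques significantly reduce the error on tail instances, and each hybrid strategy lets the traditional and improved models offset each other's weaknesses.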
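The first technique above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say in which space neighbors are counted or how counts map to probabilities, so the sketch assumes neighbors are counted on the target value and that an instance's sampling weight is `1 / (1 + neighbors)`. The function names, the `radius` parameter, and the least-squares base learner are all hypothetical.

```python
import random
from statistics import mean


def neighbor_counts(targets, radius):
    """For each target value, count the other targets within `radius` of it.
    Assumption: neighbors are measured on the target value."""
    return [
        sum(1 for j, u in enumerate(targets) if j != i and abs(u - t) <= radius)
        for i, t in enumerate(targets)
    ]


def sampling_probabilities(counts):
    """Assumption: weight = 1 / (1 + neighbors), normalised to sum to 1,
    so sparsely populated (tail) instances are drawn more often."""
    weights = [1.0 / (1 + c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a stand-in base learner."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # all sampled x values identical: fall back to the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def density_bagging(xs, ys, radius, n_models=25, seed=0):
    """Draw weighted bootstrap samples, fit one model per sample,
    and return a predictor that averages the models' outputs."""
    rng = random.Random(seed)
    probs = sampling_probabilities(neighbor_counts(ys, radius))
    indices = list(range(len(xs)))
    models = []
    for _ in range(n_models):
        sample = rng.choices(indices, weights=probs, k=len(xs))
        models.append(fit_line([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: mean(a + b * x for a, b in models)
```

Calling `density_bagging(xs, ys, radius=1.0)` returns a function that maps a new `x` to the averaged prediction. Any regression learner could replace `fit_line`.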
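Besides the lack of sufficient data in long-tail distributions, the correctness of the data itself also affects the accuracy of analysis or prediction. Causes of incorrect data include random error and systematic error. Among systematic errors, respondent error is common in the datasets collected by all kinds of systems. In other words, observed values are not always the actual values. Existing methods are mostly limited to handling random error, or to handling distorted categorical data with decision tree algorithms only. For a categorical attribute with distorted data, this thesis uses prior information about the attribute supplied by experts to convert each observed outcome, according to different conditional probabilities, into outcomes that may be true, and represents them in multiple sample sets. For each sample set, repeated sampling yields several training sets. The models trained on these sets are integrated to give that sample set's prediction, and the predictions of all sample sets are then integrated into the final result. According to the empirical evaluation, this technique is significantly more accurate than traditional models in handling distorted categorical data.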
Data characteristics are critical to prediction effectiveness, especially in the long-tailed regression problem and the attribute-specific distortion problem. However, current techniques target general prediction tasks and cannot deal with such specific data characteristics. Two techniques, density bagging and bin-resampling, are developed to solve the long-tailed regression problem. However, both techniques sacrifice accuracy in the head and even the central part of the long-tailed distribution. This thesis therefore proposes two hybrid methods, one for density bagging and one for bin-resampling, which improve prediction performance for the tail of the long-tailed distribution without sacrificing further accuracy in the head and central parts. Three datasets are used to evaluate the proposed techniques and their hybrid methods, which are compared with several ensemble methods.
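As a rough illustration of the bin-resampling idea, the sketch below splits the target range into bins and resamples each non-empty bin to a common size. The abstract gives no algorithmic detail, so equal-width bins, the common size (the average number of rows per non-empty bin), and the names `bin_resample`, `target_of`, and `n_bins` are all assumptions made here.

```python
import random


def bin_resample(rows, target_of, n_bins=5, seed=0):
    """Resample `rows` so every non-empty target bin has the same size.
    Assumptions: equal-width bins over the target range; dense bins are
    undersampled without replacement, sparse bins are oversampled by
    duplicating randomly chosen members."""
    rng = random.Random(seed)
    values = [target_of(r) for r in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all targets equal

    # Assign each row to a bin; the maximum value goes into the last bin.
    bins = {}
    for row, v in zip(rows, values):
        k = min(int((v - lo) / width), n_bins - 1)
        bins.setdefault(k, []).append(row)

    per_bin = max(1, round(len(rows) / len(bins)))
    resampled = []
    for k in sorted(bins):
        members = bins[k]
        if len(members) >= per_bin:
            resampled.extend(rng.sample(members, per_bin))  # undersample
        else:
            resampled.extend(members)  # keep every original row
            resampled.extend(rng.choices(members, k=per_bin - len(members)))
    return resampled
```

A model trained on the output sees tail values as often as head values, which is one plausible reason such a model loses accuracy in the head and why the abstract pairs the technique with a hybrid method.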
The attribute-specific distortion problem arises when the observed outcome of an instance for an input attribute is not always the true outcome in real-world applications. We develop state populate bagging to handle attribute-specific distortion in classification analysis. We first transform several true datasets into observed datasets according to the distortion matrices of their specific attributes, and then transform each of them into a possible true dataset according to the reverse distortion matrices. The next step is to draw several samples of the same size, with replacement, from each possible true dataset. State populate bagging, with its two voting layers, not only converts an observed outcome into possible true outcomes but also captures the ensemble gain for any classification algorithm rather than being limited to a specific one. Finally, several true datasets from the UCI Machine Learning Repository are converted into observed datasets, on which we evaluate state populate bagging and compare it with several benchmark algorithms.
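The pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: it assumes the reverse distortion matrix is obtained from an expert prior and the distortion matrix by Bayes' rule, and that both voting layers use simple majority voting. All function names are hypothetical.

```python
import random
from collections import Counter


def reverse_distortion(prior, distortion):
    """prior[t] = P(true = t); distortion[t][o] = P(observed = o | true = t).
    Returns reverse[o][t] = P(true = t | observed = o).
    Assumption: the reverse matrix follows from Bayes' rule."""
    states = list(prior)
    reverse = {}
    for o in states:
        joint = {t: prior[t] * distortion[t].get(o, 0.0) for t in states}
        total = sum(joint.values())
        if total > 0:  # states that can never be observed are left out
            reverse[o] = {t: p / total for t, p in joint.items()}
    return reverse


def populate(observed_column, reverse, n_datasets, seed=0):
    """Generate `n_datasets` possible true versions of one attribute column
    by sampling a true state for each observed value."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_datasets):
        column = []
        for o in observed_column:
            states = list(reverse[o])
            weights = [reverse[o][t] for t in states]
            column.append(rng.choices(states, weights=weights, k=1)[0])
        datasets.append(column)
    return datasets


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def two_layer_vote(predictions):
    """predictions[d][m] is the label predicted by bootstrap model m trained
    on possible true dataset d. Layer 1 votes within each dataset,
    layer 2 votes across datasets."""
    return majority([majority(per_dataset) for per_dataset in predictions])
```

For a hypothetical attribute where 30% of respondents are truly `"smoker"` and 40% of those report `"non-smoker"`, `reverse_distortion` gives P(true = smoker | observed = non-smoker) = 0.12 / 0.82, about 0.146. Because the voting step only consumes predicted labels, any classifier can produce them, which matches the abstract's point that the method is not tied to one algorithm.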