| Field | Value |
|---|---|
| Graduate Student | 李侑倫 Lee, You-Luen |
| Thesis Title | 資料中心先知:精準預測災難性伺服器故障意外事件 (DC-Prophet: Predicting Catastrophic Machine Failures in Datacenters) |
| Advisor | 張世杰 Chang, Shih-Chieh |
| Committee Members | 周志遠 Chou, Chi-Yuan; 彭文志 Peng, Wen-Chih |
| Degree | Master |
| Department | |
| Year of Publication | 2017 |
| Academic Year | 106 |
| Language | English |
| Pages | 31 |
| Chinese Keywords (translated) | log analysis, support vector machine, random forest, anomaly detection, data center, reliability |
| Foreign Keywords | Log analysis, Datacenter |
When will a server in an industrial datacenter fail catastrophically? Is it possible to forecast these failures and take preventive measures to improve datacenter reliability? Motivated by these questions, we analyze what are probably the largest and first publicly released datacenter logs, the Google Cluster Traces, a dataset containing more than 104 million events from over 12,500 machines. Among these samples, we observe and categorize three types of machine failures, which can lead to information loss or, even worse, degrade the reliability of the datacenter. We further propose DC-Prophet, a two-stage framework based on One-Class Support Vector Machine (One-Class SVM) and Random Forest, which uncovers the patterns behind machine-failure events and accurately predicts a machine's next failure. Experimental results show that DC-Prophet achieves an AUC of 0.93 and an F3-score of 0.88 in prediction. On average, DC-Prophet improves the F3-score by about 39.45% over other classical machine learning methods.
When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest, publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss, or even worse, reliability degradation of a datacenter. We further propose a two-stage framework—DC-Prophet—based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and an F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.
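The two-stage idea — an anomaly filter followed by a classifier — can be sketched as below. This is an illustrative reconstruction with synthetic data and a hypothetical `predict_failure` helper, not the thesis's actual features or code: stage 1 (One-Class SVM, fit on healthy windows only) flags deviating machine-usage windows, and stage 2 (Random Forest) decides which flagged windows actually precede a failure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic per-machine feature windows (e.g. CPU/memory statistics):
# 480 healthy windows around 0.5, 20 pre-failure windows around 0.9.
X = np.vstack([rng.normal(0.5, 0.08, size=(480, 8)),
               rng.normal(0.9, 0.08, size=(20, 8))])
y = np.array([0] * 480 + [1] * 20)

# Stage 1: One-Class SVM fit on healthy windows only; predict() returns
# -1 for windows that deviate from normal machine behavior.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(X[y == 0])
suspicious = ocsvm.predict(X) == -1

# Stage 2: Random Forest, trained only on the suspicious windows,
# separates true pre-failure windows from benign anomalies.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X[suspicious], y[suspicious])

def predict_failure(X_new):
    """0 = healthy, 1 = predicted failure; stage 2 runs only on stage-1 flags."""
    flagged = ocsvm.predict(X_new) == -1
    out = np.zeros(len(X_new), dtype=int)
    if flagged.any():
        out[flagged] = rf.predict(X_new[flagged])
    return out
```

Routing only stage-1 outliers into the Random Forest keeps the stage-2 training set far less imbalanced than the raw trace, which is one plausible motivation for the cascade design.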
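The F3-score reported above is the F-beta measure with beta = 3, which weights recall nine times as heavily as precision (beta squared = 9) — appropriate here, since missing a catastrophic failure is costlier than raising a false alarm. A minimal worked example with made-up confusion counts:

```python
def fbeta_score(tp, fp, fn, beta=3.0):
    """F-beta from confusion counts: (1 + b^2) * P * R / (b^2 * P + R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# 3 TP, 1 FP, 1 FN -> precision = recall = 0.75, so F-beta is also 0.75.
print(fbeta_score(3, 1, 1))          # 0.75
# 8 TP, 4 FP, 1 FN -> precision = 2/3, recall = 8/9; F3 lands near recall.
print(fbeta_score(8, 4, 1, beta=3))  # ~0.860, close to recall 8/9 ~ 0.889
```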