| Graduate Student: | 林庭宇 Lin, Ting-Yu |
| --- | --- |
| Thesis Title: | 根據執行時間預測資訊對平行計算系統進行即時偵錯 Online Failures Detection for Parallel Computing Systems Based on Time Prediction Information |
| Advisor: | 周志遠 Chou, Jerry |
| Committee Members: | 李端興 Lee, Duan-Shin; 李哲榮 Lee, Che-Rung |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊系統與應用研究所 Institute of Information Systems and Applications |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 (ROC academic year) |
| Language: | English |
| Number of Pages: | 30 |
| Keywords (Chinese): | 即時故障偵測、平行計算系統、執行時間預測 |
| Keywords (English): | Online Failure Detection, Parallel Computing Systems, Execution Time Prediction |
Machine failures can cause many problems, such as job execution failures and user data loss. The problem is further aggravated for parallel computing jobs, because a single job can execute across multiple nodes. One common approach is to detect performance degradation as a hint for identifying possible machine failures. Hence, many previous attempts aim to develop more accurate job time prediction methods, based on an execution model in which jobs have a fixed execution time and run independently on a single node. However, with the trend of adopting virtualization techniques, such as virtual machines or containers, in computing clusters, not only can jobs run in parallel across multiple nodes, but machines can also be shared among jobs. Therefore, in contrast to previous work, we propose a detection algorithm that can accommodate inaccurate job time prediction in a more general and complex parallel computing environment. Our evaluations show an improvement in F1-score of 20% to 40%.
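The abstract only describes the approach at a high level, and the thesis's actual detection algorithm is not reproduced in this record. The following is a minimal, hypothetical Python sketch of the general idea it describes, flagging nodes whose jobs consistently overrun their predicted execution times while tolerating prediction error and node sharing. All names and thresholds (`detect_suspect_nodes`, `TOLERANCE`, `VOTE_THRESHOLD`, the job fields) are illustrative assumptions, not the author's implementation.

```python
from collections import defaultdict

# Hypothetical tolerance for prediction error: a job only counts as
# "overrunning" when its elapsed time exceeds the prediction by this factor.
TOLERANCE = 1.5
# A node is flagged only when more than this fraction of its jobs overrun.
VOTE_THRESHOLD = 0.5


def detect_suspect_nodes(running_jobs):
    """Flag nodes whose running jobs consistently exceed predicted runtimes.

    `running_jobs` is an iterable of dicts with (assumed) fields:
      - "nodes": list of node IDs the parallel job spans
      - "predicted_runtime": predicted execution time in seconds
      - "elapsed": time the job has been running so far, in seconds
    """
    overruns = defaultdict(list)
    for job in running_jobs:
        # A job is overrunning if it has run well past its predicted time.
        is_overrun = job["elapsed"] > TOLERANCE * job["predicted_runtime"]
        for node in job["nodes"]:
            overruns[node].append(is_overrun)

    suspects = []
    for node, flags in overruns.items():
        # Require agreement across the jobs sharing the node, so that a
        # single badly predicted job does not trigger a false alarm.
        if sum(flags) / len(flags) > VOTE_THRESHOLD:
            suspects.append(node)
    return suspects


if __name__ == "__main__":
    jobs = [
        {"nodes": ["n1", "n2"], "predicted_runtime": 100, "elapsed": 180},
        {"nodes": ["n2", "n3"], "predicted_runtime": 200, "elapsed": 350},
        {"nodes": ["n3"], "predicted_runtime": 50, "elapsed": 40},
    ]
    print(detect_suspect_nodes(jobs))  # -> ['n1', 'n2']
```

The per-node vote across jobs mirrors the setting the abstract emphasizes, where machines are shared among parallel jobs and predictions may be inaccurate, so no single mispredicted job should by itself mark a node as failed.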