LIVIVO - The Search Portal for Life Sciences

Book ; Online: Workload Failure Prediction for Data Centers

Li, Jie / Wang, Rui / Ali, Ghazanfar / Dang, Tommy / Sill, Alan / Chen, Yong

2023  

Abstract: Failed workloads that consume significant computational resources in time and space substantially reduce the efficiency of data centers and thus limit the amount of scientific work that can be achieved. While computational power has increased significantly over the years, detection and prediction of workload failures have lagged far behind and will become increasingly critical as system scale and complexity further increase. In this study, we analyze workload traces collected from a production cluster and train machine learning models on a large number of data sets to predict workload failures. Our prediction models consist of a queue-time model that estimates the probability of workload failure before execution and a runtime model that predicts failures at runtime. Evaluation results show that the queue-time model and the runtime model can predict workload failures with maximum precision scores of 90.61% and 97.75%, respectively. Integrating the runtime model with the job scheduler helps reduce CPU time and memory usage by up to 16.7% and 14.53%, respectively.
Keywords: Computer Science - Distributed, Parallel, and Cluster Computing
Subject code: 006
Publishing date: 2023-01-12
Publishing country: us
Document type: Book ; Online
Database: BASE - Bielefeld Academic Search Engine (life sciences selection)
