Machine Learning-Enhanced Predictive Maintenance for High-Performance Computing (HPC) Systems
Keywords:
Predictive Maintenance, Machine Learning, High-Performance Computing (HPC), Failure Prediction, Anomaly Detection, Time-Series Forecasting, AI-Driven System Monitoring, Fault Diagnosis, Downtime Reduction, System Reliability.
Abstract
High-Performance Computing (HPC) systems play a crucial role in scientific simulations, artificial intelligence workloads, and large-scale data processing. However, the complexity and scale of HPC infrastructure make it difficult to keep systems reliable, predict failures, and limit downtime. Predictive maintenance enhanced by machine learning (ML) addresses these challenges by using historical system logs, sensor data, and operational metrics to forecast failures before they occur. This paper explores the integration of ML techniques, such as deep learning, anomaly detection, and time-series forecasting, to improve fault prediction, optimize maintenance schedules, and reduce unplanned downtime. We propose a hybrid predictive maintenance framework that combines supervised and unsupervised learning models for anomaly classification and real-time system monitoring. Experimental evaluations on real-world HPC datasets indicate that the approach improves system reliability and resource efficiency. The findings suggest that ML-driven predictive maintenance not only enhances HPC system performance but also contributes to cost savings and energy efficiency by minimizing unnecessary maintenance interventions.