Making Control in High Performance Computing for Overload Avoidance Adaptive in Time and Job Size
Abstract
Feedback control of High-Performance Computing (HPC) systems has been explored as an application area of Control Theory because of the high variability involved in their resource management. A regulation mechanism makes it possible to soundly automate the injection of small flexible jobs into a cluster. A trade-off is needed to fill up the cluster’s computing capacity while avoiding overload of, e.g., the file server. In this work, we describe new results in this context, where the overload avoidance controller is made adaptive to the jobs’ size, which is a time-varying unknown parameter. To do so, the original PI controller is enhanced with an online estimation algorithm that allows the controller to adapt to various working conditions and avoid performance degradation. Parallel and robust estimation algorithms are designed, tackling the challenges of bursting and noise in the system. Validation and evaluation of the adaptive controller are performed on a large-scale experimental HPC platform, showing higher robustness than the state of the art in highly varying conditions. Reproducible analyses are available at doi:10.5281/zenodo.11961696.
Origin: Files produced by the author(s)