|Title||Runtime Level Failure Detection and Propagation in HPC Systems|
|Publication Type||Conference Paper|
|Year of Publication||2019|
|Authors||Zhong, D., A. Bouteiller, X. Luo, and G. Bosilca|
|Conference Name||European MPI Users' Group Meeting (EuroMPI '19)|
|Conference Location||Zürich, Switzerland|
As high-performance computing (HPC) systems continue to grow in scale, their mean time to failure (MTTF) tends to decrease. Running long computing jobs efficiently on these systems therefore makes handling system failures a prime challenge. We present the design and implementation of an efficient runtime-level failure detection and propagation strategy that targets large-scale, dynamic systems and detects both node and process failures. Multiple overlapping topologies are used to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. The framework has been implemented in a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault management capabilities to a wide range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is both generic and efficient.
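
To illustrate what "multiple overlapping topologies" can mean in practice, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, a ring in which each rank's heartbeats are watched by its nearest live successor, and a binary tree over the survivors for spreading the failure notice. The abstract does not name the topologies used, and the function names `ring_observer`, `tree_children`, and `propagate` are hypothetical.

```python
from typing import Dict, List, Set


def ring_observer(rank: int, alive: Set[int], size: int) -> int:
    """Return the nearest live successor of `rank` on a ring of `size` ranks.

    Assumption: that successor is the rank that watches `rank`'s heartbeats.
    """
    for step in range(1, size):
        candidate = (rank + step) % size
        if candidate in alive:
            return candidate
    raise RuntimeError("no live observer left on the ring")


def tree_children(position: int, count: int) -> List[int]:
    """Children of `position` in an implicit binary tree with `count` nodes."""
    return [c for c in (2 * position + 1, 2 * position + 2) if c < count]


def propagate(failed: int, alive: Set[int], size: int) -> Dict[int, int]:
    """Simulate spreading a failure notice over a binary tree of survivors.

    Returns, for every surviving rank, the number of hops after which it
    learns about the failure. The detecting rank is the tree root (0 hops).
    """
    survivors = sorted(alive - {failed})
    detector = ring_observer(failed, set(survivors), size)

    # Rotate the survivor list so that the detector sits at the tree root.
    start = survivors.index(detector)
    order = survivors[start:] + survivors[:start]

    hops = {order[0]: 0}
    frontier = [0]
    while frontier:
        next_frontier = []
        for pos in frontier:
            for child in tree_children(pos, len(order)):
                hops[order[child]] = hops[order[pos]] + 1
                next_frontier.append(child)
        frontier = next_frontier
    return hops


if __name__ == "__main__":
    notice = propagate(failed=3, alive=set(range(8)), size=8)
    for rank in sorted(notice):
        print(f"rank {rank} notified after {notice[rank]} hop(s)")
```

In this sketch, detection costs each rank a single heartbeat link, while the tree lets the notice reach all survivors in a number of hops that grows logarithmically with their count. Combining a cheap detection topology with a fast broadcast topology is one plausible reading of the overlapping design described above.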