Runtime Level Failure Detection and Propagation in HPC Systems

Submitted by scrawford on Thu, 10/10/2019 - 11:36

Title	Runtime Level Failure Detection and Propagation in HPC Systems
Publication Type	Conference Paper
Year of Publication	2019
Authors	Zhong, D., A. Bouteiller, X. Luo, and G. Bosilca
Conference Name	European MPI Users' Group Meeting (EuroMPI '19)
Date Published	2019-09
Publisher	ACM
Conference Location	Zürich, Switzerland
ISBN Number	978-1-4503-7175-9
Abstract	As the scale of high-performance computing (HPC) systems continues to grow, the mean-time-to-failure (MTTF) of these systems tends to decrease. In order to efficiently run long computing jobs on these systems, handling system failures becomes a prime challenge. We present here the design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Multiple overlapping topologies are used to optimize the detection and propagation, minimizing the incurred overheads and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in the context of a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault management capabilities to a large range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is at the same time generic and efficient.
DOI	10.1145/3343211.3343225
