Runtime Level Failure Detection and Propagation in HPC Systems

Title: Runtime Level Failure Detection and Propagation in HPC Systems
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Zhong, D., A. Bouteiller, X. Luo, and G. Bosilca
Conference Name: European MPI Users' Group Meeting (EuroMPI '19)
Date Published: 2019-09
Publisher: ACM
Conference Location: Zürich, Switzerland
ISBN Number: 978-1-4503-7175-9
Abstract

As the scale of high-performance computing (HPC) systems continues to grow, their mean time to failure (MTTF) decreases. To run long computing jobs efficiently on these systems, handling system failures becomes a prime challenge. We present the design and implementation of an efficient runtime-level failure detection and propagation strategy that targets large-scale, dynamic systems and detects both node and process failures. Multiple overlapping topologies are used to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. The resulting framework has been implemented in a system-level runtime for parallel environments, the PMIx Reference RunTime Environment (PRRTE), providing efficient and scalable fault management capabilities to a wide range of programming and execution paradigms. The experimental evaluation of the resulting software stack on different machines demonstrates that the solution is both generic and efficient.

DOI: 10.1145/3343211.3343225