Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Title: Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Losada, N., A. Bouteiller, and G. Bosilca
Conference Name: Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19)
Date Published: 2019-11
Keywords: checkpoint/restart, Fault tolerance, Message logging, MPI, ULFM, User Level Failure Mitigation
Abstract

With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high performance computing application. In many instances, failures have a localized scope, usually impacting a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility for overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solved both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.

URL: https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf