Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications

Nuria Losada; Aurelien Bouteiller; George Bosilca

Submitted by claxton on Fri, 12/27/2019 - 03:58

Title	Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
Publication Type	Conference Paper
Year of Publication	2019
Authors	Losada, N., A. Bouteiller, and G. Bosilca
Conference Name	Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19)
Date Published	2019-11
Keywords	checkpoint/restart, Fault tolerance, Message logging, MPI, ULFM, User Level Failure Mitigation
Abstract	With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting only a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and rely on global, synchronous recovery actions. Even with local rollback recovery, in which only the fault-impacted processes are restarted from a checkpoint, the consistency of further progress in the execution is achieved through the replay of communication from a message log. This theoretically sound approach encounters some practical limitations: the presence of collective operations forces a synchronous recovery that prevents survivor processes from continuing their execution, removing any possibility of overlapping further computation with the recovery; and the amount of resources required at recovering peers can be untenable. In this work, we solve both problems by implementing an asynchronous, receiver-driven replay of point-to-point and collective communications, and by exploiting remote-memory access capabilities to access the message logs. This new protocol is evaluated in an implementation of local rollback over the User Level Failure Mitigation fault-tolerant Message Passing Interface (MPI). It reduces the recovery times of the failed processes by an average of 59%, while the time spent in the recovery by the survivor processes is reduced by 95% when compared to an equivalent global rollback protocol, thus living up to the promise of a truly localized impact of recovery actions.
URL	https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf

Project Tags:

ulfm

File:

icl-utk-1313-2019.pdf
