Title | Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Losada, N., A. Bouteiller, and G. Bosilca |
Conference Name | Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19) |
Date Published | 2019-11 |
Keywords | checkpoint/restart, Fault tolerance, Message logging, MPI, ULFM, User Level Fault Mitigation |
Abstract | With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized scope, usually impacting only a subset of the resources being used, yet widely used failure recovery strategies (like checkpoint/restart) fail to take advantage of this locality and instead rely on global, synchronous recovery actions. Even with local rollback recovery, |
URL | https://sc19.supercomputing.org/proceedings/workshops/workshop_files/ws_ftxs103s2-file1.pdf |