Publications
“Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,”
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
“Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging,”
Future Generation Computer Systems, vol. 91, pp. 450-464, February 2019.
DOI: 10.1016/j.future.2018.09.041
“Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution,”
Future Generation Computer Systems, vol. 106, pp. 467-481, May 2020.
DOI: 10.1016/j.future.2020.01.026
“