User Level Failure Mitigation (ULFM) is a set of new MPI interfaces that lets message-passing applications restore MPI functionality affected by process failures. The MPI implementation is spared the expense of automatically taking protective and corrective actions against failures internally. Instead, it prevents fault-related deadlocks by reporting an error for any operation whose completion a failure has made impossible.
Using the constructs defined by ULFM, applications and libraries drive the recovery of the parallel application and of the execution environment state. Consistency issues resulting from failures are addressed according to each application’s needs, and recovery actions are limited to MPI communication objects. Many kinds of applications and middleware build on top of ULFM to deliver scalable fault tolerance; notable examples include the CoArray Fortran language and SAP databases. ULFM is available in recent versions of MPICH and Open MPI.
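The application-driven recovery described above can be illustrated with a small toy model. The sketch below does not call MPI at all: every name in it (toy_comm, toy_allreduce_sum, toy_revoke, toy_shrink, resilient_sum, and the TOY_ error codes) is hypothetical and exists only for illustration. It loosely mirrors the pattern that ULFM exposes through MPIX_Comm_revoke and MPIX_Comm_shrink, where a failed operation returns an error such as MPIX_ERR_PROC_FAILED instead of hanging, and the application then rebuilds a working communicator from the survivors.

```c
#include <stdbool.h>

#define TOY_MAX_RANKS 8

/* Hypothetical error codes, standing in for MPI_SUCCESS and the
 * process-failure / revoked error classes reported by ULFM. */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1, TOY_ERR_REVOKED = 2 };

/* Toy stand-in for a communicator: one slot per rank. */
typedef struct {
    int  size;                   /* number of ranks in the communicator */
    int  value[TOY_MAX_RANKS];   /* each rank's contribution            */
    bool alive[TOY_MAX_RANKS];   /* false once a rank has failed        */
    bool revoked;                /* true after toy_revoke()             */
} toy_comm;

/* A "collective" sum. It never blocks on a dead rank: it reports an
 * error so that the caller can decide how to recover. */
int toy_allreduce_sum(const toy_comm *comm, int *sum)
{
    if (comm->revoked)
        return TOY_ERR_REVOKED;

    int total = 0;
    for (int r = 0; r < comm->size; r++) {
        if (!comm->alive[r])
            return TOY_ERR_PROC_FAILED;
        total += comm->value[r];
    }
    *sum = total;
    return TOY_SUCCESS;
}

/* Mark the communicator unusable so that every participant learns
 * about the failure on its next operation. */
void toy_revoke(toy_comm *comm)
{
    comm->revoked = true;
}

/* Build a fresh communicator that contains only the surviving ranks,
 * preserving their relative order. */
toy_comm toy_shrink(const toy_comm *comm)
{
    toy_comm fresh = { 0 };
    for (int r = 0; r < comm->size; r++) {
        if (comm->alive[r]) {
            fresh.value[fresh.size] = comm->value[r];
            fresh.alive[fresh.size] = true;
            fresh.size++;
        }
    }
    return fresh;
}

/* Application-level recovery policy: on any error, revoke, shrink,
 * and retry the operation with the survivors. */
int resilient_sum(toy_comm *comm)
{
    int sum = 0;
    while (toy_allreduce_sum(comm, &sum) != TOY_SUCCESS) {
        toy_revoke(comm);
        *comm = toy_shrink(comm);
    }
    return sum;
}
```

For instance, with four ranks contributing 1, 2, 3, and 4 and rank 2 marked as failed, resilient_sum returns 7 and leaves a three-rank communicator behind. Retrying on the survivors is only one possible policy; an application could just as well respawn the lost rank or roll back to a checkpoint, which is the kind of choice ULFM leaves to the application.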
Find out more at https://fault-tolerance.org/