HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) is an experimental metacomputing system that aims to provide a highly dynamic, fault-tolerant computing environment for high-performance computing applications.
To make HARNESS more accessible to the user community, an MPI API for HARNESS has been developed, known as FT-MPI.
Fault Tolerant MPI (FT-MPI) is a full implementation of the MPI 1.2 specification that provides process-level fault tolerance at the MPI API level. FT-MPI is built upon the fault-tolerant HARNESS runtime system.
FT-MPI survives the crash of n-1 processes in an n-process job and, if required, can respawn (restart) them. However, it remains the responsibility of the application to recover the data structures and the data held by the crashed processes.
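
The division of labour described above (the runtime brings processes back, the application brings their data back) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not use MPI or the FT-MPI API, the class `SimulatedJob` and all of its methods are hypothetical names invented for illustration, and checkpointing to a store that survives the crash is just one possible application-level recovery strategy, not necessarily the one an FT-MPI application would choose.

```python
import copy


class SimulatedJob:
    """Toy stand-in for an n-process job. Hypothetical; not the FT-MPI API."""

    def __init__(self, n):
        # Per-rank application data that lives in each process's memory.
        self.state = {rank: {"iteration": 0, "partial_sum": 0} for rank in range(n)}
        # Assumption: storage that survives a process crash.
        self.checkpoints = {}

    def step(self):
        # One unit of application work on every rank.
        for rank, data in self.state.items():
            data["iteration"] += 1
            data["partial_sum"] += rank + 1

    def checkpoint(self):
        # Application's job: save a consistent copy of all per-rank data.
        self.checkpoints = copy.deepcopy(self.state)

    def crash(self, ranks):
        # A crashed process loses everything it held in memory.
        for rank in ranks:
            self.state[rank] = None

    def respawn(self, ranks):
        # Runtime's job: the process exists again, but with empty memory.
        for rank in ranks:
            self.state[rank] = {}

    def recover(self):
        # Application's job: roll every rank back to the last checkpoint
        # so that survivors and respawned ranks agree on the iteration.
        self.state = copy.deepcopy(self.checkpoints)


def demo():
    job = SimulatedJob(4)
    job.step()
    job.step()
    job.checkpoint()           # consistent state after two iterations
    job.step()                 # this work is lost on rollback
    job.crash([1, 2, 3])       # n-1 of the n processes fail
    job.respawn([1, 2, 3])     # processes are back, their data is not
    job.recover()              # application restores the data itself
    return job


if __name__ == "__main__":
    for rank, data in sorted(demo().state.items()):
        print(rank, data)
```

In this sketch the respawned ranks are empty until `recover()` runs, which mirrors the point made above: restarting a process does not restore what it held. Rolling all ranks back together, including the survivor, is a design choice made here to keep the job consistent.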