FT-LA is a research effort aimed at understanding Algorithm-Based Fault Tolerance (ABFT) and integrating it into major dense linear algebra kernels. The observation motivating this work is that checkpoint-based approaches do not scale: as the number of processes grows, the checkpoint overhead grows accordingly. Meanwhile, the Mean Time Between Failures (MTBF) decreases as the hardware harnesses more independent processing units, so checkpoints must be taken more often, and become even more expensive, to ensure application progress.
FT-LA proposes to replace algorithm-agnostic checkpoints with a mathematical approach to fault tolerance. By exploiting properties of the algorithm itself, FT-LA aims to provide dense linear algebra kernels that can survive failures. The project relies on the functionality provided by User Level Failure Mitigation (ULFM), an extension to the MPI Standard currently under discussion by the MPI Forum, and on a fault-tolerant MPI implementation that lets the application remain in control of failure recovery. Though not automatic, the FT-LA approach hides most of the recovery complexity inside the numerical kernels by providing ready-to-use fault-tolerant computing building blocks and a well-defined interface for failure handling.
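To make "using properties of the algorithm" concrete, here is a minimal sketch of the classic checksum idea behind ABFT, applied to a matrix product. It is an illustration only, not FT-LA's implementation: it runs in a single process with plain Python lists, and the helper names (`matmul`, `add_column_checksum`, `recover_row`) are hypothetical. The property used is that multiplying a matrix extended with a column-checksum row yields a product whose last row is the column checksum of the result, so one lost row can be rebuilt from the survivors without any checkpoint.

```python
def matmul(a, b):
    """Plain dense matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_column_checksum(a):
    """Return `a` with one extra row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def recover_row(c_full, lost):
    """Rebuild data row `lost` from the checksum row and the surviving rows.

    `c_full` is a checksummed product: its last row is the checksum.
    The content of the lost row itself is never read.
    """
    checksum = c_full[-1]
    survivors = [row for i, row in enumerate(c_full[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


# Small integer example (hypothetical data) so the arithmetic is exact.
A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]

# Multiplying the checksummed A carries the checksum through the product.
C_full = matmul(add_column_checksum(A), B)
expected = matmul(A, B)

# Simulate a failure that wipes out row 1 of the result, then repair it.
damaged = [row[:] for row in C_full]
damaged[1] = [None] * len(damaged[1])
damaged[1] = recover_row(damaged, 1)
```

In a distributed setting the rows would live on different processes and the lost row would correspond to a failed process; floating-point data would also need a tolerance when comparing recovered values. Both aspects are outside the scope of this sketch.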