The Top500 list, the list of the 500 most powerful supercomputers in the world, highlights a constant increase in the number of processors. An attentive reading shows that the top 10 of these supercomputers contain several thousands of processor each. Despite a more careful design of the components in these supercomputers, the increase in the number of components directly affects the reliability of these systems. Moreover, the remaining systems in the Top500 list are built using commodity components, which greatly affects their reliability. While the Mean Time Between Failures(MTBF) on the BlueGene/L is counted in days, the commodity clusters exhibit a usual MTBF of tens of hours. With a further expected increase in the number of processors in the next generation supercomputers, one might deduct that the probability of failures will continue to increase, leading to a drastic decrease in reliability.

Fault tolerant algorithms have a long history of research. Only recently, since the practical issue has been raising, High Performance Computing (HPC) software has been adapted to deal with failures. One of the most popular fault tolerant technique nowadays, coordinating checkpoint, builds a consistent recovery set. As today’s HPC users are facing occasional failures, they have not suffered from the slow recovery procedure, involving restarting all the computing nodes even when only one has failed. Even worst, the massive traffic generated by checkpoint data increases with the number of computing ressources. Considering future systems will endure higher fault frequency, recovery time could become another gap between the peak performance of the architecture and the effective performance users can actually harvest from the system. Clearly, there is a need for a more efficient method of dealing with failures. One can envision that relying tightly on the mathematical properties of the algorithm can yield massive improvement in checkpointing overhead, up to the point where checkpoints are not useful anymore.

Saying that develloping numerical fault tolerant algorithm is a tough task is an understatement. While the expected performance benefits are very desirable, some sort of automation is needed to keep the software engineering cost to reasonable levels. Considering that many applications ultimatelly rely on some sort of dense linear algebra computation, the FT-LA proposal is to provide fault tolerant Basic Linear Algebra Subroutines (BLAS), that can benefit from deep knowledge of the mathematical algorithm to avoid the major penalty involved by checkpointing, while still providing fault tolerant functionalities to applications in a polished package.

Proposed Approach

In the FT-LA project we have selected to investigate Algorithm Based Fault Tolerance (ABFT), which has many advantages: it does not rely on checkpoints, and it fits well with most of the standard BLAS3 kernels such as Matrix- Matrix multiplication or LU/QR decomposition. In this approach, the original matrices are preconditioned to generate a set of larger matrices including floating point redundancy data. Then, the application is spawned on more resources and the problem is solved with the usual algorithm on the redundant matrix (the above figureABFT Matrix Multiply illustrate this concept for the example of a matrix multiplication).

Should some failure occurs, the missing information can be rebuilt by applying a reduction on the data set stored on surviving processes. As the fault tolerant algorithm is similar to the original one, applied to a slightly larger data set, and as the recovery procedure can be expressed with a single reduction, the overall scalability is very good. The major drawback appears during the recovery: the reduction operation used to regenerate the lost processes data applies on floating point data. When the matrices are ill-conditioned, the numerical stability of the recovery can be poor, though generally this is not an issue. Once the data have been reconstructed, it is possible for the algorithm to weight the lost accuracy compared to a failure free run. As outlined by the above figure depicting performance of the summa algorithm with a growing number of processes, early performance results are very promising. The ABFT approach used in FT-LA is the only fault tolerant strategy benefiting from strong scalability. The more processors are used, the smaller is the fault tolerance overhead.

FT DGEMM performance exhibit strong scaling
The prototype implementation we designed this year is based on the FT-MPI library, developed at the University of Tennessee, which aims at providing a consistent MPI environment to the application after a failure. Unlike MPI implementations strictly following the standard, in FT-MPI the communicators and internal data structures are not left in a random, unusable state after a failure hits another process. Instead, several recovery policies are available for the application to select from, including shrinking the communicator size to the now available number of processes, setting the failed process ranks as “blanks” allowing for the remaining processes to exclude the failed processes from subsequent communications, or replacing the missing processes with new incarnations. Using this tool, several fault tolerant linear algebra kernels have been designed, namely dgemm (summa and pumma algorithms), dgemv, dlacpy, dnrm2 and dnrm2mat. Using this library we could then develop a fault tolerant matrix-matrix multiply application to demonstrate its operation.


Dense Factorizations

Using a similar approach, three traditional dense factorization operations have been developed and are part of the FT-LA library: Cholesky (LLT), LU, and QR transformations are provided as basic reliable routines. Before calling these routines, users extend their data with a row-wise checksum, and enable MPI to return on failures. If a failure happens during a hardened factorization, the control will be released to the user level (the factorization will return with an error code), and the user has the opportunity to fix the environment (using ULFM constructs), recover the data that is not covered by the factorization call (using user-based rollback recovery for example). Once this is done, if the user re-enters the hardened call, this operation will recover the user data using ABFT techniques, and complete the factorization.

Examples of usage are provided with the FT-LA library.


Project Handouts

Jun 29 2022 Admin Login