Title | Fine-grained Bit-Flip Protection for Relaxation Methods |
Publication Type | Journal Article |
Year of Publication | 2016 |
Authors | Anzt, H., J. Dongarra, and E. S. Quintana-Ortí |
Journal | Journal of Computational Science |
Date Published | 2016-11 |
Keywords | Bit flips, Fault tolerance, High Performance Computing, iterative solvers, Jacobi method, sparse linear systems |
Abstract | Resilience is considered a challenging, under-addressed issue that the high performance computing (HPC) community will have to face in order to produce reliable Exascale systems by the beginning of the next decade. As part of a push toward a resilient HPC ecosystem, in this paper we propose an error-resilient iterative solver for sparse linear systems based on stationary component-wise relaxation methods. Starting from a plain implementation of the Jacobi iteration, our approach introduces a low-cost component-wise technique that detects bit flips, rejects some component updates, and thereby turns the initially synchronized solver into an asynchronous iteration. Our experimental study, using sparse incomplete factorizations from a collection of real-world applications and a practical GPU implementation, exposes the convergence delay incurred by the fault-tolerant implementation and its practical performance. |
DOI | 10.1016/j.jocs.2016.11.013 |
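The idea outlined in the abstract (check each Jacobi component update and keep the old value when the update looks corrupted) can be illustrated with a minimal sketch. The abstract does not state the detection test, so the code below assumes a simple magnitude bound as a stand-in; it is not the authors' criterion, and it runs on the CPU rather than on a GPU. The names `guarded_jacobi`, `flip_bit`, `bound`, and `corrupt` are hypothetical and exist only for this illustration.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit (0..63) of a float's IEEE 754 double representation."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def guarded_jacobi(A, b, x0, sweeps=50, bound=1e6, corrupt=None):
    """Jacobi sweeps with a per-component sanity check (illustrative only).

    A is a dense list-of-lists matrix; b and x0 are lists.
    bound is an ASSUMED magnitude limit used as the rejection test.
    corrupt is an optional hook (sweep, i, value) -> value simulating a fault.
    Returns the final iterate and the number of rejected updates.
    """
    n = len(b)
    x = list(x0)
    rejected = 0
    for sweep in range(sweeps):
        x_new = list(x)
        for i in range(n):
            off_diag = sum(A[i][j] * x[j] for j in range(n) if j != i)
            candidate = (b[i] - off_diag) / A[i][i]
            if corrupt is not None:
                candidate = corrupt(sweep, i, candidate)
            if math.isfinite(candidate) and abs(candidate) <= bound:
                x_new[i] = candidate
            else:
                # Reject the update: component i keeps its previous value,
                # so it lags one sweep behind the other components.
                rejected += 1
        x = x_new
    return x, rejected


# Diagonally dominant 2x2 system with exact solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def inject(sweep, i, value):
    # Simulate one fault: flip the top exponent bit of component 0 in sweep 3.
    return flip_bit(value, 62) if (sweep, i) == (3, 0) else value


x, rejected = guarded_jacobi(A, b, [0.0, 0.0], corrupt=inject)
```

A rejected component simply reuses its stale value in the next sweep, which is why the guarded iteration no longer behaves like a strictly synchronized Jacobi method. A magnitude bound of this kind only catches flips that produce very large or non-finite values; flips in low-order mantissa bits pass through undetected in this sketch.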