Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

Zizhong Chen; Jack Dongarra


Title	Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
Publication Type	Journal Article
Year of Publication	2009
Authors	Chen, Z., and J. Dongarra
Journal	IEEE Transactions on Computers
Volume	58
Issue	11
Pagination	1512-1524
Date Published	2009-11
Abstract	As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2 \lceil \log p \rceil \cdot k ((\beta + 2\gamma) m + \alpha)$ to $(1 + O(\sqrt{p}/\sqrt{m})) \cdot 2 \cdot k (\beta + 2\gamma) m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate to perform calculations, and $m$ is the size of local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to $(1 + O(1/\sqrt{m})) \cdot k (\beta + 2\gamma) m$, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example.
DOI	10.1109/TC.2009.42
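
The weighted checksum idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it is a single-machine toy for k = 2 failures, where the function names (`encode`, `recover_two`), the integer-like weights, and the in-memory "processes" are all assumptions made for illustration. The actual framework distributes this work over FT-MPI, and its choice of weights is not stated in this record.

```python
# Toy sketch of floating-point weighted-checksum checkpointing (k = 2).
# Assumptions: each "process" is a Python list; weights are distinct reals.

def encode(local_data, weights):
    """Return two checksum vectors: a plain sum and a weighted sum."""
    m = len(local_data[0])
    c1 = [sum(x[j] for x in local_data) for j in range(m)]
    c2 = [sum(w * x[j] for w, x in zip(weights, local_data)) for j in range(m)]
    return c1, c2

def recover_two(survivors, weights, c1, c2, a, b):
    """Rebuild the checkpoints of failed ranks a and b.

    survivors maps each surviving rank to its checkpoint vector.
    For every element j we solve the 2x2 linear system
        x_a + x_b         = c1[j] - (sum over survivors)
        w_a*x_a + w_b*x_b = c2[j] - (weighted sum over survivors)
    """
    wa, wb = weights[a], weights[b]
    det = wb - wa  # nonzero because the weights are distinct
    xa, xb = [], []
    for j in range(len(c1)):
        r1 = c1[j] - sum(x[j] for x in survivors.values())
        r2 = c2[j] - sum(weights[i] * x[j] for i, x in survivors.items())
        xa.append((wb * r1 - r2) / det)
        xb.append((r2 - wa * r1) / det)
    return xa, xb

# Four processes, each holding a local checkpoint of size m = 2.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = [1.0, 2.0, 3.0, 4.0]
c1, c2 = encode(data, weights)

# Ranks 1 and 2 fail; ranks 0 and 3 survive.
survivors = {0: data[0], 3: data[3]}
xa, xb = recover_two(survivors, weights, c1, c2, 1, 2)
print(xa, xb)
```

With general floating-point data the recovered values are subject to round-off, so the conditioning of the weight matrix matters in practice; the small integer-valued inputs here were chosen so the arithmetic is exact.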

File:

icl-utk-1399-2009.pdf
