Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures

Title: Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
Publication Type: Conference Paper
Year of Publication: 2016
Authors: Jia, Y., P. Luszczek, and J. Dongarra
Conference Name: 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Date Published: 2016-05
Publisher: IEEE
Conference Location: Chicago, IL
Abstract: Graphics Processing Units (GPUs) have seen widespread adoption in scientific computing, owing to the performance gains they provide on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm that is immune to simultaneous soft errors and capable of taking advantage of hybrid GPU-CPU platforms. These soft errors are detected and corrected on the fly, preventing the error from propagating to the rest of the data. Our design sits at the intersection of several fault-tolerant techniques: it employs algorithm-based fault tolerance, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results validate our design decisions: our algorithm introduces less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
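The abstract names algorithm-based fault tolerance (ABFT) as one of the techniques used. As a rough illustration of the general idea only, the sketch below keeps row and column checksums for a small dense matrix and uses them to locate and repair a single corrupted entry. It is not the paper's scheme for Hessenberg reduction: the function names `checksums` and `detect_and_correct`, the tolerance `tol`, and the single-error assumption are all hypothetical choices made for this example.

```python
def checksums(a):
    """Return (row_sums, col_sums) for a dense matrix stored as a list of rows."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def detect_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Locate and repair at most one corrupted entry of `a` in place.

    Returns the (row, col) index that was repaired, or None if the
    matrix still agrees with its stored checksums.
    """
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        # Assumption of this sketch: only a single error can be corrected.
        raise ValueError("more than one error; this simple scheme cannot correct it")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum mismatch equals the size of the injected error.
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j


a = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [0.0, 2.0, 6.0]]
original = [row[:] for row in a]
row_sums, col_sums = checksums(a)

a[1][2] += 8.0  # simulate a soft error in one entry
repaired = detect_and_correct(a, row_sums, col_sums)
```

The mismatching row checksum and the mismatching column checksum intersect at the corrupted entry, and the size of the mismatch gives the correction. A production scheme would additionally have to keep the checksums consistent as the factorization updates the matrix, which this sketch does not attempt.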