Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

Submitted by scrawford on Thu, 12/06/2018 - 16:37

Title	Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
Publication Type	Conference Paper
Year of Publication	2018
Authors	Haidar, A., S. Tomov, J. Dongarra, and N. J. Higham
Conference Name	The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
Date Published	2018-11
Publisher	IEEE
Conference Location	Dallas, TX
Abstract	Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
DOI	10.1109/SC.2018.00050
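
The mixed-precision iterative refinement idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is plain Python with no GPU or Tensor Cores, it emulates FP16 by rounding through the standard library's `struct` half-precision format, and it rounds every operation of the factorization to FP16 (the "classical FP16" case, without the FP32 accumulation the abstract credits to FP16-TC). The function names `to_fp16`, `lu_factor_lowprec`, `lu_solve`, and `refine` are hypothetical, and the sketch assumes a small, well-conditioned, nonsingular matrix whose entries stay within FP16 range.

```python
import struct


def to_fp16(x):
    """Round a Python float (FP64) to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_factor_lowprec(A):
    """LU with partial pivoting; every stored value is rounded to FP16.

    Returns (LU, piv): L (unit lower) and U packed in one matrix, and piv
    such that row i of the factored matrix is row piv[i] of A.
    """
    n = len(A)
    LU = [[to_fp16(v) for v in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest pivot in column k and swap it into place.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] = to_fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = to_fp16(LU[i][j] - LU[i][k] * LU[k][j])
    return LU, piv


def lu_solve(LU, piv, b):
    """Solve with the packed low-precision factors, computing in FP64."""
    n = len(LU)
    y = [b[p] for p in piv]           # apply the row permutation
    for i in range(n):                # forward substitution (unit diagonal)
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):      # back substitution
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y


def refine(A, b, tol=1e-12, max_iter=50):
    """Iterative refinement: FP16 factorization, FP64 residual and update."""
    n = len(A)
    LU, piv = lu_factor_lowprec(A)    # expensive step, done once in low precision
    x = lu_solve(LU, piv, b)          # initial low-accuracy solution
    b_norm = max(abs(v) for v in b)
    for it in range(max_iter):
        # Residual r = b - A x in full FP64 using the original matrix.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, it
        d = lu_solve(LU, piv, r)      # correction from the cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


if __name__ == "__main__":
    A = [[4.1, 1.3, 2.2], [1.3, 5.7, 1.1], [2.2, 1.1, 6.3]]
    x_true = [1.0, -2.0, 3.0]
    b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
    x, iters = refine(A, b)
    print("refinement steps:", iters)
    print("max error:", max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

The factorization is the only step carried out in low precision; the residual and the solution update use FP64 and the original matrix, which is what lets the iteration recover double-precision accuracy when the matrix is sufficiently well conditioned.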
