Performance, Design, and Autotuning of Batched GEMM for GPUs

Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Jack Dongarra

Title	Performance, Design, and Autotuning of Batched GEMM for GPUs
Publication Type	Book Chapter
Year of Publication	2016
Authors	Abdelfattah, A., A. Haidar, S. Tomov, and J. Dongarra
Editor	Kunkel, J. M., P. Balaji, and J. Dongarra
Book Title	High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings
Series Volume	9697
Pagination	21–38
Publisher	Springer International Publishing
ISBN Number	978-3-319-41321-1
Abstract	The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high-performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes and to maintain high performance for realistic test cases found in higher-level LAPACK routines and in scientific computing applications in general. This paper presents a high-performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. On a K40c GPU, the proposed kernels outperform state-of-the-art approaches in most of the performance tests reported in this paper.
URL	http://dx.doi.org/10.1007/978-3-319-41321-1_2
DOI	10.1007/978-3-319-41321-1_2

File:

icl-utk-945-2016.pdf
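
Note on terminology: the abstract refers to batched GEMM with fixed and variable sizes. As a rough illustration only, the plain-Python sketch below computes C_i = alpha * A_i * B_i + beta * C_i independently for every problem i in a batch, where each problem may have its own dimensions. The function names gemm_reference and batched_gemm_reference are hypothetical and chosen for this example; this is a CPU reference for the semantics of the operation, not the GPU kernel design or the autotuning process described in the chapter.

```python
def gemm_reference(alpha, A, B, beta, C):
    """Return alpha * A * B + beta * C for matrices stored as lists of rows.

    Hypothetical helper for illustration; not taken from the chapter.
    """
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    if len(C) != m or any(len(row) != n for row in C):
        raise ValueError("C must have shape m x n")
    return [
        [
            alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


def batched_gemm_reference(alpha, As, Bs, beta, Cs):
    """Apply gemm_reference to each (A, B, C) triple in the batch.

    The problems are independent, so sizes may differ from one triple to
    the next (the "variable size" case); a fixed-size batch is simply the
    special case where all triples share the same dimensions.
    """
    if not (len(As) == len(Bs) == len(Cs)):
        raise ValueError("batch lists must have the same length")
    return [gemm_reference(alpha, A, B, beta, C) for A, B, C in zip(As, Bs, Cs)]


# A batch of two problems with different sizes: 2x2 times 2x2, and 1x3 times 3x1.
batch_A = [[[1, 2], [3, 4]], [[1, 2, 3]]]
batch_B = [[[5, 6], [7, 8]], [[1], [2], [3]]]
batch_C = [[[1, 1], [1, 1]], [[0]]]
result = batched_gemm_reference(1, batch_A, batch_B, 1, batch_C)
```

Because every problem in the batch is independent, a GPU implementation is free to process many small products concurrently; how that is best done for small sizes is the subject of the chapter itself.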