Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication

Azzam Haidar; Mark Gates; Stanimire Tomov; Jack Dongarra

Title	Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
Publication Type	Conference Paper
Year of Publication	2013
Authors	Haidar, A., M. Gates, S. Tomov, and J. Dongarra
Editor	Malony, A. D., M. Nemirovsky, and S. Midkiff
Conference Name	Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
Date Published	2013-06
Publisher	ACM Press
Conference Location	Eugene, Oregon, USA
ISBN Number	9781450321303
Keywords	eigenvalue, gpu communication, gpu computation, heterogeneous programming model, performance, reduction to tridiagonal, singular value decomposition, task parallelism
Abstract	The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
URL	http://dl.acm.org/citation.cfm?doid=2464996.2465438
DOI	10.1145/2464996.2465438

File:

icl-utk-561-2013.pdf