Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication

Title: Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Haidar, A., M. Gates, S. Tomov, and J. Dongarra
Editor: Malony, A. D., M. Nemirovsky, and S. Midkiff
Conference Name: Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
Date Published: 2013-06
Publisher: ACM Press
Conference Location: Eugene, Oregon, USA
ISBN Number: 9781450321303
Keywords: eigenvalue, gpu communication, gpu computation, heterogeneous programming model, performance, reduction to tridiagonal, singular value decomposition, task parallelism

The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology for addressing these challenges, from algorithm design through kernel optimization and tuning to the programming model, in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
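
For readers unfamiliar with the tridiagonal reduction named in the abstract, the following is a minimal sketch of the textbook one-stage Householder reduction, written in pure Python. It is not the multi-GPU algorithm of the paper, which the abstract does not spell out; the function name `householder_tridiagonalize` and the small example matrix are illustrative assumptions.

```python
import math


def householder_tridiagonalize(a):
    """Return T = Q^T A Q, where A is symmetric and T is tridiagonal.

    Textbook Householder reduction (an assumption for illustration,
    not the paper's method). `a` is a list of lists and is not modified.
    """
    n = len(a)
    t = [row[:] for row in a]  # work on a copy
    for k in range(n - 2):
        # Part of column k that lies below the diagonal.
        x = [t[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(c * c for c in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Pick the sign opposite to x[0] to avoid cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(c * c for c in v))
        v = [c / norm_v for c in v]
        m = len(v)
        # Left update: rows k+1..n-1 become H * rows, with H = I - 2 v v^T.
        for j in range(n):
            dot = sum(v[i] * t[k + 1 + i][j] for i in range(m))
            for i in range(m):
                t[k + 1 + i][j] -= 2.0 * v[i] * dot
        # Right update: columns k+1..n-1 become columns * H.
        for i in range(n):
            dot = sum(t[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                t[i][k + 1 + j] -= 2.0 * dot * v[j]
    return t


# Hypothetical 4x4 symmetric input used only to exercise the function.
EXAMPLE = [
    [4.0, 1.0, -2.0, 2.0],
    [1.0, 2.0, 0.0, 1.0],
    [-2.0, 0.0, 3.0, -2.0],
    [2.0, 1.0, -2.0, -1.0],
]

if __name__ == "__main__":
    for row in householder_tridiagonalize(EXAMPLE):
        print(["%.6f" % c for c in row])
```

Because each step is an orthogonal similarity transformation, the eigenvalues (and hence the trace) of the input are preserved, so a tridiagonal eigensolver can then work on the much sparser result. This sketch applies one reflector at a time using vector-level operations, which does little arithmetic per byte of data moved; the abstract's stated goal of raising the computational intensity of the major kernels targets that kind of limitation.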