On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures

Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Jack Dongarra


Title	On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
Publication Type	Conference Paper
Year of Publication	2016
Authors	Abdelfattah, A., A. Haidar, S. Tomov, and J. Dongarra
Conference Name	The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
Date Published	2016-05
Publisher	IEEE
Conference Location	Chicago, IL
Keywords	batched computation, GPUs, variable small sizes
Abstract	Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small, independent problems. Because no individual problem provides sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required. This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and low divergence among threads in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable-size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. To the authors' knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8 cores each) running the Intel MKL library.

Project Tags:

bblas

magma

File:

icl-utk-876-2016.pdf