Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such assumption. Consequently, an efficient solution for variable-size batched computations is required.

This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problems is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2:5 against two Sandy Bridge CPUs (8-core each) running Intel MKL library.

%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %D 2016 %T High-performance Matrix-matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Joël Falcou %A Jack Dongarra %X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32) however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present specialized for these cases algorithms and optimization techniques to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries. %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %I Springer International Publishing %C Grenoble, France %8 2016-08 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon ES-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-01 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Journal Article %J Acta Numerica %D 2016 %T Linear Algebra Software for Large-Scale Accelerated Multicore Computing %A Ahmad Abdelfattah %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A undefined %A Asim YarKhan %X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems. %B Acta Numerica %V 25 %P 1-160 %8 2016-05 %G eng %R 10.1017/S0962492916000015 %0 Generic %D 2016 %T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Ahmad Abdelfattah %A Jack Dongarra %X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM to be able to assist batched advance factorization (e.g. bi-diagonalization) and other BLAS routines (e.g. triangular solve) to achieve optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world Hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on Titan supercomputer to 4096 nodes. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2016-08 %G eng %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Generic %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X Abstract. The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, there becomes a need to have a high performance GEMM kernel for a batch of small matrices. Such kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both xed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance test reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-02 %G eng %0 Conference Paper %B The International Supercomputing Conference (ISC High Performance 2016) %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B The International Supercomputing Conference (ISC High Performance 2016) %C Frankfurt, Germany %8 2016-06 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs %A Ahmad Abdelfattah %A Hatem Ltaeif %A David Keyes %A Jack Dongarra %X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications. %B Concurrency and Computation: Practice and Experience %V 28 %P 3447 - 3465 %8 2016-05 %G eng %U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full %N 12 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.v28.1210.1002/cpe.3874 %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K Cholesky Factorization %K GPUs %K Tuning %XSolving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually.

This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2 faster.

%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Batched Matrix Computations on Hardware Accelerators Based on GPUs %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K plasma %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Journal Article %J VECPAR 2012 %D 2012 %T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators %A Ahmad Abdelfattah %A Jack Dongarra %A David Keyes %A Hatem Ltaeif %B VECPAR 2012 %C Kobe, Japan %8 2012-07 %G eng