%0 Generic
%D 2017
%T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation
%C Atlanta, GA
%8 2017-03
%G eng

%0 Generic
%D 2017
%T Small Tensor Operations on Advanced Architectures for High-Order Applications
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-04
%G eng

%0 Generic
%D 2016
%T Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Veselin Dobrev
%A Ian Karlin
%A Tzanio Kolev
%A Stanimire Tomov
%A Jack Dongarra
%I moky Mountains Computational Sciences and Engineering Conference (SMC16), Poster
%C Gatlinburg, TN
%8 2016-09
%G eng

%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many  independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon ES-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng

%0 Generic
%D 2015
%T Towards a High-Performance Tensor Algebra Package for Accelerators
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I moky Mountains Computational Sciences and Engineering Conference (SMC15)
%C Gatlinburg, TN
%8 2015-09
%G eng