%0 Conference Paper %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %D 2020 %T High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs %A Natalie Beams %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %A Tzanio Kolev %A Yohann Dudouit %K Batched linear algebra %K finite elements %K gpu %K high-order methods %K matrix-free FEM %K Tensor contractions %X We present new GPU implementations of the tensor contractions arising from basis-related computations for highorder finite element methods. We consider both tensor and nontensor bases. In the case of tensor bases, we introduce new kernels based on a series of fused device-level matrix multiplications (GEMMs), specifically designed to utilize the fast memory of the GPU. For non-tensor bases, we develop a tuned framework for choosing standard batch-BLAS GEMMs that will maximize performance across groups of elements. The implementations are included in a backend of the libCEED library. We present benchmark results for the diffusion and mass operators using libCEED integration through the MFEM finite element library and compare to those of the previously best-performing GPU backends for stand-alone basis computations. In tensor cases, we see improvements of approximately 10-30% for some cases, particularly for higher basis orders. For the non-tensor tests, the new batch-GEMMs implementation is twice as fast as what was previously available for basis function order greater than five and greater than approximately 105 degrees of freedom in the mesh; up to ten times speedup is seen for eighth-order basis functions. %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %I IEEE %8 2020-11 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. %B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng