%0 Generic
%D 2020
%T FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs
%A Stanimire Tomov
%A Alan Ayala
%A Azzam Haidar
%A Jack Dongarra
%B ECP Milestone Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-01
%G eng
%9 ECP WBS 2.3.3.13 Milestone Report

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2020)
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%K exascale
%K FFT
%K gpu
%K scalable algorithm
%X Exascale computing aspires to meet the increasing demands from large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; henceforth, it is not only important to develop well-designed software, but also make it aware of the hardware architecture and efficiently exploit its power. Currently, several and diverse applications, such as those part of the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from CPU state-of-the-art libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V-100 GPUs.
%B International Conference on Computational Science (ICCS 2020)
%C Amsterdam, Netherlands
%8 2020-06
%G eng
%R https://doi.org/10.1007/978-3-030-50371-0_19

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%I NVIDIA GPU Technology Conference (GTC2020)
%8 2020-10
%G eng

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Azzam Haidar
%A Jack Dongarra
%X Considered one of the top 10 algorithms of the 20th century, the Fast Fourier Transform (FFT) is widely used by applications in science and engineering. Large scale parallel applications targeting exascale, such as those part of the DOE Exascale Computing Project (ECP), are designed for heterogeneous architectures and, currently, more than a dozen ECP applications use FFTs in their codes. To address the applications' needs, we developed the highly efficient FFTs for exascale (heFFTe) library. The heFFTe library release features very good weak and strong scalability and performance that is close to 90% of the roofline peak performance. We present these performance results on the Summit supercomputer. heFFTe is also integrated in a number of applications and we present how the overall performance gets improved by using heFFTe. Performance model, limitations, and challenges are discussed for current and upcoming computer architectures.
%I SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20)
%C Seattle, WA
%8 2020-02
%G eng

%0 Generic
%D 2020
%T heFFTe: Highly Efficient FFT for Exascale (Poster)
%A Alan Ayala
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T MAGMA Templates for Scalable Linear Algebra on Emerging Architectures
%A Mohammed Al Farhan
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Mark Gates
%A Dalal Sukkari
%A Azzam Haidar
%A Robert Rosenberg
%A Jack Dongarra
%X With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
%B The International Journal of High Performance Computing Applications
%V 34
%P 645-658
%8 2020-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342020938421

%0 Journal Article
%J Proceedings of the Royal Society A
%D 2020
%T Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems
%A Azzam Haidar
%A Harun Bayraktar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%K GMRES
%K LU factorization
%K GPU computing
%K half precision arithmetic
%K iterative refinement
%K mixed precision solvers
%X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
%B Proceedings of the Royal Society A
%V 476
%8 2020-11
%G eng
%N 2243
%R https://doi.org/10.1098/rspa.2020.0110

%0 Generic
%D 2020
%T Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing
%A Azzam Haidar
%A Harun Bayraktar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced- and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. We achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-05
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software
%8 2020-10
%G eng

%0 Journal Article
%J Parallel Computing
%D 2019
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K HPC
%K Matrix-matrix product
%K optimization
%K Small matrices
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Parallel Computing
%V 81
%P 1–21
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2018.10.003

%0 Generic
%D 2019
%T Design and Implementation for FFT-ECP on Distributed Accelerated Systems
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Daniel Schultz
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng
%9 ECP WBS 2.3.3.09 Milestone Report

%0 Journal Article
%J International Journal of High Performance Computing and Networking
%D 2019
%T Evaluation of Directive-Based Performance Portable Programming Models
%A M. Graham Lopez
%A Wayne Joubert
%A Verónica Larrea
%A Oscar Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K OpenACC
%K OpenMP 4
%K performance portability
%K Programming models
%X We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architecture with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, and we document how much tuning might be required and what lessons we can learn from these experiences. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. To better understand fundamental compute vs. bandwidth bound characteristics, we add the compute-bound Level 3 BLAS GEMM kernel to our linear algebra evaluation. We implement the kernels of interest using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 and Power8 with attached NVIDIA GPUs, x86_64 multicores, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. We update these evaluations with the newest version of the NVIDIA Pascal architecture (P100), Intel KNL 7230, Power8+, and the newest supporting compiler implementations. Furthermore, we present in detail what factors affected the performance portability, including how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimise and target multiple platforms.
%B International Journal of High Performance Computing and Networking
%V 14
%P 165-182
%8 2019-07
%G eng
%N 2
%R http://dx.doi.org/10.1504/IJHPCN.2017.10009064

%0 Generic
%D 2019
%T FFT-ECP Fast Fourier Transform
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Daniel Schultz
%A Jack Dongarra
%I 2019 ECP Annual Meeting (Research Poster)
%C Houston, TX
%8 2019-01
%G eng

%0 Generic
%D 2019
%T FFT-ECP Implementation Optimizations and Features Phase
%A Stanimire Tomov
%A Azzam Haidar
%A Alan Ayala
%A Hejer Shaiek
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-10
%G eng

%0 Generic
%D 2019
%T GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems
%A Hejer Shaiek
%A Stanimire Tomov
%A Alan Ayala
%A Azzam Haidar
%A Jack Dongarra
%K CUDA-Aware MPI
%K ECP
%K FFT
%K FFT-ECP
%K gpu
%K GPUDirect
%X Fast Fourier transforms (FFTs) are used in applications ranging from molecular dynamics and spectrum estimation to machine learning, fast convolution and correlation, signal modulation, wireless multimedia applications, and others. However, FFTs are memory bound, and therefore, to accelerate them, it is crucial to avoid and optimize the FFTs’ communications. To this end, we present a 3-D FFT design for distributed graphics processing unit (GPU) systems that: (1) efficiently uses GPUs’ high bandwidth, (2) reduces global communications algorithmically, when possible, and (3) employs GPUDirect technologies as well as MPI optimizations in the development of high-performance FFTs for large-scale GPU-accelerated systems. We show that these developments and optimizations lead to very good strong scalability and a performance that is close to 90% of the theoretical peak.
%B EuroMPI'19 Posters, Zurich, Switzerland
%I ICL
%8 2019-09
%G eng
%9 Extended Abstract

%0 Conference Paper
%B Workshop on Exascale MPI (ExaMPI) at SC19
%D 2019
%T Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation
%A Alan Ayala
%A Stanimire Tomov
%A Xi Luo
%A Hejer Shaiek
%A Azzam Haidar
%A George Bosilca
%A Jack Dongarra
%K Collective MPI
%K Exascale applications
%K FFT
%K Heterogeneous systems
%K scalable
%B Workshop on Exascale MPI (ExaMPI) at SC19
%C Denver, CO
%8 2019-11
%G eng

%0 Generic
%D 2019
%T MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs
%A Lucien Ng
%A Sihan Chen
%A Alex Gessinger
%A Daniel Nichols
%A Sophia Cheng
%A Anu Meenasorna
%A Kwai Wong
%A Stanimire Tomov
%A Azzam Haidar
%A Eduardo D'Azevedo
%A Jack Dongarra
%I University of Tennessee
%8 2019-01
%G eng
%R 10.13140/RG.2.2.14906.64961

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491

%0 Generic
%D 2018
%T Accelerating Linear Algebra with MAGMA
%A Stanimire Tomov
%A Mark Gates
%A Azzam Haidar
%I ECP Annual Meeting 2018, Tutorial
%C Knoxville, TN
%8 2018-02
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Batched
%K Eigenvalue and singular value problems
%K hardware accelerators
%K numerical linear algebra
%K Two-sided factorization algorithms
%X The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they get targeted for optimization and used in a standard way in applications, calling them directly from highly optimized, standard numerical libraries, like (batched) BLAS and LAPACK. While most of the developments have been for one-sided factorizations and solvers, many important applications – from big data analytics to information retrieval, low-rank approximations for solvers and preconditioners – require two-sided factorizations, and most notably the SVD factorization. To address these needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subroutines (BLAS) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We propose a device functions-based methodology and big-tile setting techniques in our batched BLAS design. The different optimization techniques result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines. We illustrate our batched BLAS approach to optimize batched SVD bi-diagonalization progressively on GPUs. The progression is illustrated on an NVIDIA K40c GPU, and also, ported and presented on AMD Fiji Nano GPU, using AMD's Heterogeneous-Compute Interface for Portability (HIP) C++ runtime API. We demonstrate achieving 80% of the theoretically achievable peak performance for the overall algorithm, and significant acceleration of the Level-2 BLAS GEMV and Level-3 BLAS GEMM needed compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
%B Journal of Computational Science
%V 26
%P 237–245
%8 2018-05
%G eng
%R https://doi.org/10.1016/j.jocs.2018.01.007

%0 Generic
%D 2018
%T Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Marc Baboulin
%A Joël Falcou
%A Jack Dongarra
%X Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
%B Innovative Computing Laboratory Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Dense linear solvers
%K energy efficiency
%K GPU computing
%X Graphics Processing Units (GPUs) are widely used in accelerating dense linear solvers. The matrix factorizations, which dominate the runtime for these solvers, are often designed using a hybrid scheme, where GPUs perform trailing matrix updates, while the CPUs perform the panel factorizations. Consequently, hybrid solutions require high-end CPUs and optimized CPU software in order to deliver high performance. Furthermore, they lack the energy efficiency inherent for GPUs due to the use of less energy-efficient CPUs, as well as CPU-GPU communications. This paper presents analysis and design techniques that overcome the shortcomings of the hybrid algorithms, and allow the design of high-performance and energy-efficient dense LU and Cholesky factorizations that use GPUs only. The full GPU solution eliminates the need for a high-end CPU and optimized CPU software, which leads to a better energy efficiency. We discuss different design choices, and introduce optimized GPU kernels for panel factorizations. The developed solutions achieve 90+ percent of the performance of optimized hybrid solutions, while improving the energy efficiency by 50 percent. They outperform the vendor library by 30-50 percent in single precision, and 15-50 percent in double precision. We also show that hybrid designs trail the proposed solutions in performance when optimized CPU software is not available.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 2700–2712
%8 2018-12
%G eng
%N 12
%R 10.1109/TPDS.2018.2842785

%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng

%0 Journal Article
%J Journal of Computational Science
%D 2018
%T Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batch computation
%K GPU computing
%K matrix factorization
%X The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8×.
%B Journal of Computational Science
%V 26
%P 226–236
%8 2018-05
%G eng
%R https://doi.org/10.1016/j.jocs.2018.01.005

%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulation with regional detail possible, but limited study has been done to explore the potential benefit for the atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation.
%B Journal of Advances in Modeling Earth Systems
%V 10
%P 1952–1969
%8 2018-08
%G eng
%N 8
%R https://doi.org/10.1029/2018MS001276

%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2018)
%D 2018
%T The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Panruo Wu
%A Srikara Pranesh
%A Stanimire Tomov
%A Jack Dongarra
%X As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Exploiting both the hardware features and algorithms is an effective solution to achieve power efficiency, and to address the energy constraints in modern and future HPC systems. In this work, we present a novel design and implementation of an energy-efficient solution for dense linear systems of equations, which are at the heart of large-scale HPC applications. The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors. While most of the energy efficiency approaches aim to reduce the consumption with a minimal performance penalty, our method improves both the performance and the energy efficiency. Compared to highly-optimized linear system solvers, our kernels deliver the same accuracy solution up to 2× faster and reduce the energy consumption up to half on Intel Knights Landing (KNL) architectures. By efficiently using the Tensor Cores available in the NVIDIA V100 PCIe GPUs, the speedups can be up to 4×, with more than 80% reduction in the energy consumption.
%B International Conference on Computational Science (ICCS 2018)
%I Springer
%C Wuxi, China
%V 10860
%P 586–600
%8 2018-06
%G eng
%U https://rdcu.be/bcKSC
%R https://doi.org/10.1007/978-3-319-93698-7_45

%0 Generic
%D 2018
%T Evaluation and Design of FFT for Distributed Accelerated Systems
%A Stanimire Tomov
%A Azzam Haidar
%A Daniel Schultz
%A Jack Dongarra
%B ECP WBS 2.3.3.09 Milestone Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-10
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Stanimire Tomov
%A Jack Dongarra
%X We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6x for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 973–984
%8 2018-05
%G eng
%N 5
%R 10.1109/TPDS.2017.2783929

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%D 2018
%T Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Nicholas J. Higham
%X Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we use the general HPC problem, Ax = b, where A is a large dense matrix, and a double precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16-FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to 4× speedup. This is due to the performance boost that the FP16-TC provide as well as to the improved accuracy over the classical FP16 arithmetic that is obtained because the GEMM accumulation occurs in FP32 arithmetic.
%B The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18)
%I IEEE
%C Dallas, TX
%8 2018-11
%G eng
%R https://doi.org/10.1109/SC.2018.00050

%0 Generic
%D 2018
%T Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC), Poster
%C San Jose, CA
%8 2018-03
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2018
%T Investigating Power Capping toward Energy-Efficient Scientific Applications
%A Azzam Haidar
%A Heike Jagode
%A Phil Vaccaro
%A Asim YarKhan
%A Stanimire Tomov
%A Jack Dongarra
%K energy efficiency
%K High Performance Computing
%K Intel Xeon Phi
%K Knights landing
%K papi
%K performance analysis
%K Performance Counters
%K power efficiency
%X The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we study and investigate the practice of controlling a compute system's power usage, and we explore how different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes. Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms.
%B Concurrency and Computation: Practice and Experience
%V 2018
%P 1-14
%8 2018-04
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4485
%N e4485
%R https://doi.org/10.1002/cpe.4485

%0 Generic
%D 2018
%T MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster
%C Dallas, TX
%8 2018-11
%G eng

%0 Generic
%D 2018
%T MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Ichitaro Yamazaki
%A Jack Dongarra
%I NSF PI Meeting, Poster
%C Washington, DC
%8 2018-04
%G eng
%R https://doi.org/10.6084/m9.figshare.6174143.v3

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC’18)
%D 2018
%T Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X This paper introduces several frameworks for the design and implementation of high performance GPU kernels that target batch workloads with irregular sizes. Such workloads are ubiquitous in many scientific applications, including sparse direct solvers, astrophysics, and quantum chemistry. The paper addresses two main categories of frameworks, taking the Cholesky factorization as a case study. The first uses host-side kernel launches, and the second uses device-side launches. Within each category, different design options are introduced, with an emphasis on the advantages and the disadvantages of each approach. Our best performing design outperforms the state-of-the-art CPU implementation, scoring up to 4.7× speedup in double precision on a Pascal P100 GPU.
%B IEEE High Performance Extreme Computing Conference (HPEC’18)
%I IEEE
%C Waltham, MA
%8 2018-09
%G eng

%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732

%0 Generic
%D 2018
%T Tensor Contractions using Optimized Batch GEMM Routines
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC), Poster
%C San Jose, CA
%8 2018-03
%G eng

%0 Conference Paper
%B ISC High Performance (ISC'18), Best Poster
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%B ISC High Performance (ISC'18), Best Poster
%C Frankfurt, Germany
%8 2018-06
%G eng

%0 Generic
%D 2018
%T Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
%A Azzam Haidar
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Mawussi Zounon
%A Jack Dongarra
%I ISC High Performance (ISC18), Best Poster Award
%C Frankfurt, Germany
%8 2018-06
%G eng

%0 Generic
%D 2017
%T Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation
%C Atlanta, GA
%8 2017-03
%G eng

%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 2017-12
%G eng
%1 04

%0 Journal Article
%J Procedia Computer Science
%D 2017
%T Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X This paper presents new algorithmic approaches and optimization techniques for LU factorization and matrix inversion of millions of very small matrices using GPUs. These problems appear in many scientific applications including astrophysics and generation of block-Jacobi preconditioners. We show that, for very small problem sizes, design and optimization of GPU kernels require a mindset different from the one usually used when designing LAPACK algorithms for GPUs. Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. We also take advantage of the small matrix sizes to eliminate the intermediate row interchanges in both the factorization and inversion kernels. The proposed GPU kernels achieve performance speedups vs. CUBLAS of up to 6× for the factorization, and 14× for the inversion, using double precision arithmetic on a Pascal P100 GPU.
%B Procedia Computer Science
%V 108
%P 606–615
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.250

%0 Journal Article
%J Journal of Computational Science
%D 2017
%T Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K GPU computing
%K Cholesky factorization
%K Batched execution
%X This paper presents a GPU-accelerated Cholesky factorization for two different modes of operation. The first one is the batch mode, where many independent factorizations on small matrices can be performed concurrently. This mode supports fixed size and variable size problems, and is found in many scientific applications. The second mode is the native mode, where one factorization is performed on a large matrix without any CPU involvement, which allows the CPU to do other useful work. We show that, despite the different workloads, both modes of operation share a common code-base that uses the GPU only. We also show that the developed routines achieve significant speedups against a multicore CPU using the MKL library, and against a GPU implementation by cuSOLVER. This work is part of the MAGMA library.
%B Journal of Computational Science
%V 20
%P 85–93
%8 2017-05
%G eng
%R https://doi.org/10.1016/j.jocs.2016.12.009

%0 Journal Article
%J ISC High Performance 2017
%D 2017
%T A Framework for Out of Memory SVD Algorithms
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Aurelien Bouteiller
%A Jack Dongarra
%X Many important applications – from big data analytics to information retrieval, gene expression analysis, and numerical weather prediction – require the solution of large dense singular value decompositions (SVD). In many cases the problems are too large to fit into the computer’s main memory, and thus require specialized out-of-core algorithms that use disk storage. In this paper, we analyze the SVD communications, as related to hierarchical memories, and design a class of algorithms that minimizes them. This class includes out-of-core SVDs but can also be applied between other consecutive levels of the memory hierarchy, e.g., GPU SVD using the CPU memory for large problems. We call these out-of-memory (OOM) algorithms. To design OOM SVDs, we first study the communications for both classical one-stage blocked SVD and two-stage tiled SVD. We present the theoretical analysis and strategies to design, as well as implement, these communication avoiding OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models.
%B ISC High Performance 2017
%P 158–178
%8 2017-06
%G eng
%R https://doi.org/10.1007/978-3-319-58667-0_9

%0 Conference Paper
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%D 2017
%T High-performance Cholesky Factorization for GPU-only Execution
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for execution entirely on the GPU. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 Gflop/s in double precision on a P100 GPU, which is about 7-8× faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
%B Proceedings of the General Purpose GPUs (GPGPU-10)
%I ACM
%C Austin, TX
%8 2017-02
%G eng
%R https://doi.org/10.1145/3038228.3038237

%0 Conference Paper
%B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2017
%T Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers
%A Azzam Haidar
%A Panruo Wu
%A Stanimire Tomov
%A Jack Dongarra
%X The use of low-precision arithmetic in mixed-precision computing methods has been a powerful tool to accelerate numerous scientific computing applications. Artificial intelligence (AI) in particular has pushed this to current extremes, making use of half-precision floating-point arithmetic (FP16) in approaches based on neural networks. The appeal of FP16 is in the high performance that can be achieved using it on today’s powerful manycore GPU accelerators, e.g., the NVIDIA V100, which can provide 120 TeraFLOPS in FP16 alone. We present an investigation showing that other HPC applications can harness this power too, and in particular, the general HPC problem of solving Ax = b, where A is a large dense matrix, and the solution is needed in FP32 or FP64 accuracy. Our approach is based on the mixed-precision iterative refinement technique – we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly-tuned implementations that resolve the main computational challenges of efficiently parallelizing, scaling, and using FP16 arithmetic in the approach on high-end GPUs. Subsequently, we show for the first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers. Our results are reproducible and the developments will be made available through the MAGMA library. We quantify in practice the performance and limitations of the approach.
%B ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Denver, CO
%8 2017-11
%G eng

%0 Generic
%D 2017
%T MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs
%A Stanimire Tomov
%A Azzam Haidar
%I GPU Technology Conference (GTC17), Presentation in Session S7728
%C San Jose, CA
%8 2017-05
%G eng

%0 Generic
%D 2017
%T MagmaDNN – High-Performance Data Analytics for Manycore GPUs and CPUs
%A Lucien Ng
%A Kwai Wong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I 2017 Summer Research Experiences for Undergraduate (REU), Presentation
%C Knoxville, TN
%8 2017-12
%G eng

%0 Conference Paper
%B International Conference on Supercomputing (ICS '17)
%D 2017
%T Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%B International Conference on Supercomputing (ICS '17)
%I ACM
%C Chicago, Illinois
%8 2017-06
%G eng
%U http://dl.acm.org/citation.cfm?id=3079103
%R 10.1145/3079079.3079103

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X A challenging class of problems arising in many GPU applications, called batched problems, involves linear algebra operations on many small-sized matrices. We designed batched BLAS (Basic Linear Algebra Subroutines) routines, and in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We proposed device functions and big-tile settings in our batched BLAS design. We adopted auto-tuning to optimize different instances of GEMV routines. We illustrated our batched BLAS approach to optimize batched bi-diagonalization progressively on a K40c GPU. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%8 2017-06
%G eng
%U http://www.sciencedirect.com/science/article/pii/S1877050917308645
%R https://doi.org/10.1016/j.procs.2017.05.237

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17)
%D 2017
%T Out of Memory SVD Solver for Big Data
%A Azzam Haidar
%A Khairul Kabir
%A Diana Fayad
%A Stanimire Tomov
%A Jack Dongarra
%X Many applications – from data compression to numerical weather prediction and information retrieval – need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computer’s main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is when trying to analyze a large data set through tools like MATLAB or Octave, but the data is just too large to be loaded. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce communication, as well as overlap it with computation. Of particular interest are OOM algorithms for matrices of size m×n, where m >> n or m << n, e.g., corresponding to cases of too many variables, or too many observations. To design OOM SVDs, we first study the communications cost for the SVD techniques as well as for the QR/LQ factorization followed by SVD. We present the theoretical analysis of the data movement cost and strategies to design OOM SVD algorithms. We show performance results for multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng

%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Generic
%D 2017
%T POMPEI: Programming with OpenMP4 for Exascale Investigations
%A Jack Dongarra
%A Azzam Haidar
%A Oscar Hernandez
%A Stanimire Tomov
%A Manjunath Gorentla Venkata
%X The objective of the Programming with OpenMP4 for Exascale Investigations (POMPEI) project is to explore new task-based programming techniques together with data structure centric programming for scientific applications to harness the potential of extreme-scale systems. Tasking is by now a well-established approach on such systems, as it has been used successfully to handle their large-scale parallelism and heterogeneity, which are leading challenges on the way to exascale computing. The approach is to harness the latest features of OpenMP4.5 and OpenACC2.5 to design abstractions shared among tasks and mapped efficiently to data-structure driven programming paradigms. This technical report describes the approach, along with its reference implementation and results for dense linear algebra algorithms.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-12
%G eng

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%D 2017
%T Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%X The emergence of power efficiency as a primary constraint in processor and system designs poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, in particular for peta- and exa-scale systems. Understanding and improving the energy efficiency of numerical simulation becomes crucial. We present a detailed study and investigation toward controlling power usage and exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determine the impact and correlation with performance of scientific applications. Our analysis is performed using a set of representative kernels, as well as many highly used scientific benchmarks. We quantify a number of power and performance measurements, and draw observations and conclusions that can be viewed as a roadmap toward achieving energy-efficient computing algorithms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091085

%0 Generic
%D 2017
%T Power-Aware HPC on Intel Xeon Phi KNL Processors
%A Azzam Haidar
%A Heike Jagode
%A Asim YarKhan
%A Phil Vaccaro
%A Stanimire Tomov
%A Jack Dongarra
%I ISC High Performance (ISC17), Intel Booth Presentation
%C Frankfurt, Germany
%8 2017-06
%G eng

%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%1 01

%0 Generic
%D 2017
%T Small Tensor Operations on Advanced Architectures for High-Order Applications
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%B University of Tennessee Computer Science Technical Report
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-04
%G eng

%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48

%0 Generic
%D 2016
%T Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Veselin Dobrev
%A Ian Karlin
%A Tzanio Kolev
%A Stanimire Tomov
%A Jack Dongarra
%I Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster
%C Gatlinburg, TN
%8 2016-09
%G eng

%0 Generic
%D 2016
%T Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%I GPU Technology Conference (GTC16), Poster
%C San Jose, CA
%8 2016-04
%G eng

%0 Conference Paper
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%D 2016
%T On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K GPUs
%K variable small sizes
%X Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required. This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. According to the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running Intel MKL library.
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%K plasma
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams, and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%D 2016
%T High-performance Matrix-matrix Multiplications of Very Small Matrices
%A Ian Masliah
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Joël Falcou
%A Jack Dongarra
%X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)
%I Springer International Publishing
%C Grenoble, France
%8 2016-08
%G eng

%0 Generic
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-01
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T High-Performance Tensor Contractions for GPUs
%A Ahmad Abdelfattah
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%K Applications
%K Batched linear algebra
%K FEM
%K gpu
%K Tensor contractions
%K Tensor HPC
%X We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng

%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015

%0 Conference Paper
%B IEEE High Performance Extreme Computing Conference (HPEC'16)
%D 2016
%T LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi
%A Azzam Haidar
%A Stanimire Tomov
%A Konstantin Arturov
%A Murat Guney
%A Shane Story
%A Jack Dongarra
%X A wide variety of heterogeneous compute resources, ranging from multicore CPUs to GPUs and coprocessors, are available to modern computers, making it challenging to design unified numerical libraries that efficiently and productively use all these varied resources. For example, in order to efficiently use Intel’s Knights Landing (KNL) processor, the next generation of Xeon Phi architectures, one must design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance. We propose a productive and portable programming model that allows us to write serial-looking code, which, however, achieves parallelism and scalability by using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and the parallel execution. This is done through multiple techniques ranging from multi-level data partitioning to adaptive task grain sizes, and dynamic task scheduling. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. Finally, we outline the strengths and the effectiveness of this approach, especially with regard to hardware trends and ease of programming high-performance numerical software that current applications need, in order to motivate current work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B IEEE High Performance Extreme Computing Conference (HPEC'16)
%I IEEE
%C Waltham, MA
%8 2016-09
%G eng

%0 Generic
%D 2016
%T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
%A Tingxing Dong
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Ahmad Abdelfattah
%A Jack Dongarra
%X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM can assist batched advanced factorizations (e.g. bi-diagonalization) and other BLAS routines (e.g. triangular solve) in achieving optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on the Titan supercomputer to 4096 nodes.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2016-08
%G eng

%0 Conference Paper
%B 2016 IEEE High Performance Extreme Computing Conference (HPEC '16)
%D 2016
%T Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations
%A Azzam Haidar
%A Benjamin Brock
%A Stanimire Tomov
%A Michael Guidry
%A Jay Jay Billings
%A Daniel Shyles
%A Jack Dongarra
%X We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that efficiently solve N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that, similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, which we call a batched routine, in order to saturate the hardware with enough work.
%B 2016 IEEE High Performance Extreme Computing Conference (HPEC '16)
%I IEEE
%C Waltham, MA
%8 2016-09
%G eng

%0 Conference Paper
%B The International Supercomputing Conference (ISC High Performance 2016)
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K GEMM
%K GPU computing
%K HPC
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B The International Supercomputing Conference (ISC High Performance 2016)
%C Frankfurt, Germany
%8 2016-06
%G eng

%0 Book Section
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%E Julian M. Kunkel
%E Pavan Balaji
%E Jack Dongarra
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings
%I Springer International Publishing
%P 21–38
%@ 978-3-319-41321-1
%G eng
%U http://dx.doi.org/10.1007/978-3-319-41321-1_2
%R 10.1007/978-3-319-41321-1_2

%0 Generic
%D 2016
%T Performance, Design, and Autotuning of Batched GEMM for GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Autotuning
%K Batched GEMM
%K GEMM
%K GPU computing
%K HPC
%X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for a batch of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
%B University of Tennessee Computer Science Technical Report
%I University of Tennessee
%8 2016-02
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K Cholesky Factorization
%K GPUs
%K Tuning
%X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.
%B International Conference on Computational Science (ICCS'16)
%C San Diego, CA
%8 2016-06
%G eng

%0 Generic
%D 2016
%T A Standard for Batched BLAS Routines
%A Pedro Valero-Lara
%A Jack Dongarra
%A Azzam Haidar
%A Samuel D. Relton
%A Stanimire Tomov
%A Mawussi Zounon
%I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16)
%C Paris, France
%8 2016-04
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%D 2016
%T Towards Achieving Performance Portability Using Directives for Accelerators
%A M. Graham Lopez
%A V. Larrea
%A W. Joubert
%A O. Hernandez
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required, and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms, including x86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, and an x86_64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability, such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target multiple platforms.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)
%I Innovative Computing Laboratory, University of Tennessee
%C Salt Lake City, Utah
%8 2016-11
%G eng

%0 Conference Paper
%B EuroMPI/Asia 2015 Workshop
%D 2015
%T Batched Matrix Computations on Hardware Accelerators
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library.
The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared to the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as a 2.5× speedup on the NVIDIA K40 GPU.
%B EuroMPI/Asia 2015 Workshop
%C Bordeaux, France
%8 2015-09
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Batched Matrix Computations on Hardware Accelerators Based on GPUs
%A Azzam Haidar
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Jack Dongarra
%X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2015
%T Batched matrix computations on hardware accelerators based on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on the MKL library.
The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as a 2.5× speedup on the NVIDIA K40 GPU.
%B International Journal of High Performance Computing Applications
%8 2015-02
%G eng
%R 10.1177/1094342014567546

%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%D 2015
%T Cholesky Across Accelerators
%A Asim YarKhan
%A Azzam Haidar
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)
%I IEEE
%C Elizabeth, NJ
%8 2015-08
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra
%A Mark Gates
%A Stanimire Tomov
%A Azzam Haidar
%X Accelerating dense linear algebra using GPUs admits two models: hybrid CPU-GPU and GPU-only. The hybrid model factors the panel on the CPU while updating the trailing matrix on the GPU, concentrating the GPU on high-performance matrix multiplies. The GPU-only model performs the entire computation on the GPU, avoiding costly data transfers to the CPU. We compare these two approaches for three QR-based algorithms: QR factorization, rank revealing QR, and reduction to Hessenberg.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T A Data Flow Divide and Conquer Algorithm for Multicore Architecture
%A Azzam Haidar
%A Jakub Kurzak
%A Gregoire Pichon
%A Mathieu Faverge
%K Eigensolver
%K lapack
%K Multicore
%K plasma
%K task-based programming
%X Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-element computation for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to tune task granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Generic
%D 2015
%T On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation
%A Michael Guidry
%A Azzam Haidar
%I Joint Institute for Computational Sciences Seminar Series, Presentation
%C Oak Ridge, TN
%8 2015-09
%G eng

%0 Conference Paper
%B ISC High Performance 2015
%D 2015
%T On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X The dramatic change in computer architecture due to the manycore paradigm shift has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting from our algorithm design, performance analysis and programming model, to kernel optimization. Our goal, by targeting low-level, easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for use on manycore coprocessors, and to show significant performance improvements compared to the existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications.
%B ISC High Performance 2015
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%D 2015
%T Efficient Eigensolver Algorithms on Accelerator Based Architectures
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges, starting from our algorithm design, kernel optimization and tuning, to our programming model, in the development of a scalable high-performance symmetric eigenvalue and singular value solver.
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA)
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems
%A Raffaele Solcà
%A Anton Kozhevnikov
%A Azzam Haidar
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multicore CPU only systems for such complex applications.
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B 17th IEEE International Conference on High Performance Computing and Communications
%D 2015
%T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization
%A Azzam Haidar
%A Asim YarKhan
%A Chongxiao Cao
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with a degree of parallelism between that of GPUs and multicore-CPUs. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets an appropriately sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.
%B 17th IEEE International Conference on High Performance Computing and Communications
%C Newark, NJ
%8 2015-08
%G eng

%0 Conference Paper
%B ISC High Performance
%D 2015
%T Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations
%A Azzam Haidar
%A Tingxing Dong
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%B ISC High Performance
%I Springer
%C Frankfurt, Germany
%8 2015-07
%G eng

%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404

%0 Conference Paper
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%D 2015
%T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
%A Azzam Haidar
%A Stanimire Tomov
%A Piotr Luszczek
%A Jack Dongarra
%X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints for low power consumption and small sizes, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded system aware algorithms that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries.
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award)
%I IEEE
%C Waltham, MA
%8 2015-09
%G eng

%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng

%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%D 2015
%T Optimization for Performance and Energy for Batched Matrix Computations on GPUs
%A Azzam Haidar
%A Tingxing Dong
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5-fold speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8)
%I ACM
%C San Francisco, CA
%8 2015-02
%G eng
%R 10.1145/2716282.2716288

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K plasma
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504

%0 Conference Paper
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award
%D 2015
%T Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K Eigenvalues problem
%K Hessenberg reduction
%K Multi/Many-core
%K Stabilized Elementary Transformations
%X The solution of nonsymmetric eigenvalue problems, Ax = λx, can be accelerated substantially by first reducing A to an upper Hessenberg matrix H that has the same eigenvalues as A. This can be done using Householder orthogonal transformations, which is a well established standard, or stabilized elementary transformations. The latter approach, although having half the flops of the former, has been used less in practice, e.g., on computer architectures with well developed hierarchical memories, because of its memory-bound operations and the complexity in stabilizing it. In this paper we revisit the stabilized elementary transformations approach in the context of new architectures – both multicore CPUs and Xeon Phi coprocessors. We derive for the first time a blocked version of the algorithm. The blocked version reduces the memory-bound operations and we analyze its performance. A performance model is developed that shows the limitations of both approaches. The competitiveness of using stabilized elementary transformations has been quantified, highlighting that it can be 20 to 30% faster on current high-end multicore CPUs and Xeon Phi coprocessors.
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award
%C Alexandria, VA
%8 2015-04
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS 2015)
%D 2015
%T Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform
%A Khairul Kabir
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%B International Conference on Computational Science (ICCS 2015)
%C Reykjavík, Iceland
%8 2015-06
%G eng

%0 Generic
%D 2015
%T Towards a High-Performance Tensor Algebra Package for Accelerators
%A Marc Baboulin
%A Veselin Dobrev
%A Jack Dongarra
%A Christopher Earl
%A Joël Falcou
%A Azzam Haidar
%A Ian Karlin
%A Tzanio Kolev
%A Ian Masliah
%A Stanimire Tomov
%I Smoky Mountains Computational Sciences and Engineering Conference (SMC15)
%C Gatlinburg, TN
%8 2015-09
%G eng

%0 Conference Paper
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%D 2015
%T Towards Batched Linear Solvers on Accelerated Hardware Platforms
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K batched factorization
%K hardware accelerators
%K numerical linear algebra
%K numerical software libraries
%K one-sided factorization algorithms
%X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed to work on a set of small dense matrices in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of the GPU’s symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA’s CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU.
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015
%I ACM
%C San Francisco, CA
%8 2015-02
%G eng

%0 Conference Proceedings
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%D 2015
%T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators
%A Azzam Haidar
%A Yulu Jia
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%A Jack Dongarra
%K dataflow scheduling
%K hardware accelerators
%K multi-grain parallelism
%X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications.
%B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)
%I ACM
%C Austin, TX
%V No. 5
%8 2015-11
%G eng

%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem
%A Mark Gates
%A Azzam Haidar
%A Jack Dongarra
%X In the nonsymmetric eigenvalue problem, work has focused on the Hessenberg reduction and QR iteration, using efficient algorithms and fast, Level 3 BLAS routines. Comparatively, computation of eigenvectors performs poorly, limited to slow, Level 2 BLAS performance with little speedup on multi-core systems. It has thus become a dominant cost in the eigenvalue problem. To address this, we present improvements for the eigenvector computation to use Level 3 BLAS where applicable and parallelize the remaining triangular solves, achieving good parallel scaling and accelerating the overall eigenvalue problem more than three-fold.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng

%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1

%0 Conference Paper
%B International Conference on Parallel Processing (ICPP-2014)
%D 2014
%T A Fast Batched Cholesky Factorization on a GPU
%A Tingxing Dong
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%X Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we propose a batched Cholesky factorization on a GPU. Three algorithms – nonblocked, blocked, and recursive blocked – were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. Our batched Cholesky achieves up to 1.8× speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMA by 1.5× in performance-per-watt for large matrices.
%B International Conference on Parallel Processing (ICPP-2014)
%C Minneapolis, MN
%8 2014-09
%G eng

%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments
%A Azzam Haidar
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%K Computer science
%K factorization
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given for the algorithms that form the basis for solving linear systems – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. Task execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng

%0 Conference Paper
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%D 2014
%T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU
%A Tingxing Dong
%A Azzam Haidar
%A Piotr Luszczek
%A James Harris
%A Stanimire Tomov
%A Jack Dongarra
%X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of a few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation.
%B 16th IEEE International Conference on High Performance Computing and Communications (HPCC)
%I IEEE
%C Paris, France
%8 2014-08
%G eng

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2014
%T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems
%A Jack Dongarra
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%K dense linear algebra
%K hardware accelerators
%K task superscalar scheduling
%X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead has become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly, in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.
%B Supercomputing Frontiers and Innovations
%V 1
%G eng
%N 1
%R http://dx.doi.org/10.14529/jsfi1401

%0 Conference Paper
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%D 2014
%T New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem
%A Azzam Haidar
%A Piotr Luszczek
%A Jack Dongarra
%X We describe a design and implementation of a multi-stage algorithm for computing eigenvectors of a dense symmetric matrix. We show that reformulating the existing algorithms is beneficial in terms of performance even if that doubles the computational complexity. Through detailed analysis, we show that the effect of the increase in the asymptotic operation count may be compensated by a much improved performance rate. Our performance results indicate that using our approach achieves very good speedup and scalability even when directly compared with the existing state-of-the-art software.
%B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper)
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng
%R 10.1109/IPDPSW.2014.130

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2014
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Azzam Haidar
%A Raffaele Solcà
%A Mark Gates
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%K Eigensolver
%K electronic structure calculations
%K generalized eigensolver
%K gpu
%K high performance
%K hybrid
%K Multicore
%K two-stage
%X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code.
%B International Journal of High Performance Computing Applications
%V 28
%P 196-209
%8 2014-05
%G eng
%N 2
%& 196
%R 10.1177/1094342013502097

%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%R 10.1109/ScalA.2014.8

%0 Conference Paper
%B IPDPS 2014
%D 2014
%T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment
%A Azzam Haidar
%A Chongxiao Cao
%A Jack Dongarra
%A Piotr Luszczek
%A Stanimire Tomov
%K algorithms
%K Computer science
%K CUDA
%K Heterogeneous systems
%K Intel Xeon Phi
%K linear algebra
%K nVidia
%K Tesla K20
%K Tesla M2090
%X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
%B IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Generic
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K lapack
%K plasma
%K scalapack
%B University of Tennessee Computer Science Technical Report (also LAWN 283)
%I University of Tennessee
%8 2013-10
%G eng

%0 Conference Paper
%B Supercomputing 2013
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%B Supercomputing 2013
%C Denver, CO
%8 2013-11
%G eng

%0 Conference Proceedings
%B International Supercomputing Conference (ISC)
%D 2013
%T Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Raffaele Solcà
%A Thomas C. Schulthess
%X Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe and analyze a successful methodology to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance generalized eigenvalue solver in the context of electronic structure calculations in materials science applications. We developed a set of leading edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine grained memory aware kernels, a task based approach and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. The algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions, using a single node with multiple GPUs.
%B International Supercomputing Conference (ISC)
%7 Lecture Notes in Computer Science
%I Springer Berlin Heidelberg
%C Leipzig, Germany
%V 7905
%P 67-80
%8 2013-06
%@ 978-3-642-38750-0
%G eng
%R 10.1007/978-3-642-38750-0_6

%0 Conference Paper
%B PPAM 2013
%D 2013
%T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K mic
%K xeon phi
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B PPAM 2013
%C Warsaw, Poland
%8 2013-09
%G eng

%0 Conference Paper
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%D 2013
%T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%E Allen D. Malony
%E Nemirovsky, Mario
%E Midkiff, Sam
%K eigenvalue
%K gpu communication
%K gpu computation
%K heterogeneous programming model
%K performance
%K reduction to tridiagonal
%K singular value decomposition
%K task parallelism
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges---starting from our algorithm design, kernel optimization and tuning, to our programming model---in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%I ACM Press
%C Eugene, Oregon, USA
%8 2013-06
%@ 9781450321303
%G eng
%U http://dl.acm.org/citation.cfm?doid=2464996.2465438
%R 10.1145/2464996.2465438

%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
%A Azzam Haidar
%A Hatem Ltaief
%A Piotr Luszczek
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 2012-05
%G eng

%0 Generic
%D 2012
%T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
%A Stanimire Tomov
%A Jack Dongarra
%A Azzam Haidar
%A Ichitaro Yamazaki
%A Tingxing Dong
%A Thomas Schulthess
%A Raffaele Solcà
%I GPU Technology Conference (GTC12), Presentation
%C San Jose, CA
%8 2012-05
%G eng

%0 Generic
%D 2012
%T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
%A Jack Dongarra
%A Tingxing Dong
%A Mark Gates
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation
%C Salt Lake City, UT
%8 2012-11
%G eng

%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Raffaele Solcà
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%A Thomas C. Schulthess
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 2012-11
%G eng

%0 Journal Article
%J SIAM Journal on Scientific Computing (Accepted)
%D 2012
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B SIAM Journal on Scientific Computing (Accepted)
%8 2012-07
%G eng

%0 Conference Proceedings
%B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May
%D 2011
%T 3-D parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver
%A Azzam Haidar
%A Luc Giraud
%A Hafedh Ben-Hadj-Ali
%A Florent Sourbier
%A Stéphane Operto
%A Jean Virieux
%B 73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May
%8 2011-00
%G eng

%0 Conference Proceedings
%B The Twentieth International Conference on Domain Decomposition Methods
%D 2011
%T Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Stephane Lanteri
%A Jean Roman
%B The Twentieth International Conference on Domain Decomposition Methods
%C La Jolla, California
%8 2011-02
%G eng
%U http://hal.inria.fr/inria-00577639

%0 Generic
%D 2011
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaief
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report, UT-CS-11-666 (also LAWN 243)
%8 2011-03
%G eng

%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 2011-05
%G eng

%0 Journal Article
%J Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%D 2011
%T Parallel algebraic domain decomposition solver for the solution of augmented systems
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April
%8 2011-00
%G eng

%0 Conference Proceedings
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%C Seattle, WA
%8 2011-11
%G eng

%0 Generic
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-11-677 (also LAWN 254)
%8 2011-08
%G eng

%0 Journal Article
%J To appear in Geophysical Prospecting journal.
%D 2011
%T Three-dimensional parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver
%A Florent Sourbier
%A Azzam Haidar
%A Luc Giraud
%A Hafedh Ben-Hadj-Ali
%A Stéphane Operto
%A Jean Virieux
%B To appear in Geophysical Prospecting journal.
%8 2011-00
%G eng

%0 Journal Article
%J Submitted to SIAM Journal on Scientific Computing (SISC)
%D 2011
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaief
%A Jack Dongarra
%B Submitted to SIAM Journal on Scientific Computing (SISC)
%8 2011-00
%G eng

%0 Journal Article
%J Submitted to Concurrency and Computation: Practice and Experience
%D 2010
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaief
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B Submitted to Concurrency and Computation: Practice and Experience
%8 2010-11
%G eng

%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng

%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaief
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng

%0 Journal Article
%J Sparse Days 2010 Meeting at CERFACS
%D 2010
%T MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%A Yohan Lee-Tin-Yien
%B Sparse Days 2010 Meeting at CERFACS
%C Toulouse, France
%8 2010-06
%G eng

%0 Journal Article
%J Numerical Mathematics: Theory, Methods and Applications
%D 2010
%T Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D
%A Luc Giraud
%A Azzam Haidar
%A Yousef Saad
%E C. Zhiming
%B Numerical Mathematics: Theory, Methods and Applications
%I Global Science Press
%C Beijing
%V 3
%P 64-82
%8 2010-00
%G eng

%0 Journal Article
%J PARA 2010
%D 2010
%T Towards a Complexity Analysis of Sparse Hybrid Linear Solvers
%A Emmanuel Agullo
%A Luc Giraud
%A Amina Guermouche
%A Azzam Haidar
%A Jean Roman
%B PARA 2010
%C Reykjavik, Iceland
%8 2010-06
%G eng

%0 Journal Article
%J Parallel Computing
%D 2010
%T Using multiple levels of parallelism to enhance the performance of domain decomposition solvers
%A Luc Giraud
%A Azzam Haidar
%A Stephane Pralet
%E Costas Bekas
%E Pasqua D’Ambra
%E Ananth Grama
%E Yousef Saad
%E Petko Yanev
%B Parallel Computing
%I Elsevier journals
%V 36
%P 285-296
%8 2010-00
%G eng