%0 Generic
%D 2022
%T Communication Avoiding LU with Tournament Pivoting in SLATE
%A Rabab Alomairy
%A Mark Gates
%A Sebastien Cayrols
%A Dalal Sukkari
%A Kadir Akbudak
%A Asim YarKhan
%A Paul Bagwell
%A Jack Dongarra
%B SLATE Working Notes
%8 2022-01
%G eng
%0 Generic
%D 2022
%T PAQR: Pivoting Avoiding QR factorization
%A Wissam M. Sid-Lakhdar
%A Sebastien Cayrols
%A Daniel Bielich
%A Ahmad Abdelfattah
%A Piotr Luszczek
%A Mark Gates
%A Stanimire Tomov
%A Hans Johansen
%A David Williams-Young
%A Timothy A. Davis
%A Jack Dongarra
%X The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient. On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR. The aim of this paper is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems. After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems.
%B ICL Technical Report
%8 2022-06
%G eng
%0 Conference Paper
%B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems
%D 2022
%T Threshold Pivoting for Dense LU Factorization
%A Neil Lindquist
%A Mark Gates
%A Piotr Luszczek
%A Jack Dongarra
%X LU factorization is a key approach for solving large, dense systems of linear equations. Partial row pivoting is commonly used to ensure numerical stability; however, the data movement needed for the row interchanges can reduce performance. To improve this, we propose using threshold pivoting to find pivots almost as good as those selected by partial pivoting but that result in less data movement. Our theoretical analysis bounds the element growth similarly to partial pivoting; however, it also shows that the growth of threshold pivoting for a given matrix cannot be bounded by that of partial pivoting and vice versa. Additionally, we experimentally tested the approach on the Summit supercomputer. Threshold pivoting improved performance by up to 32% without a significant effect on accuracy. For a more aggressive configuration with up to one digit of accuracy lost, the improvement was as high as 44%.
%B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems
%I IEEE
%C Dallas, Texas
%8 2022-11
%G eng
%R 10.1109/ScalAH56622.2022.00010
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2021
%T A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
%A Abdelfattah, Ahmad
%A Costa, Timothy
%A Jack Dongarra
%A Mark Gates
%A Haidar, Azzam
%A Hammarling, Sven
%A Higham, Nicholas J
%A Kurzak, Jakub
%A Piotr Luszczek
%A Stanimire Tomov
%A others
%K Computations on matrices
%K Mathematical analysis
%K Mathematics of computing
%K Numerical analysis
%X This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software (TOMS)
%V 47
%P 1–23
%G eng
%R 10.1145/3431921
%0 Generic
%D 2021
%T SLATE Performance Improvements: QR and Eigenvalues
%A Kadir Akbudak
%A Paul Bagwell
%A Sebastien Cayrols
%A Mark Gates
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%8 2021-04
%G eng
%0 Generic
%D 2021
%T SLATE Port to AMD and Intel Platforms
%A Ahmad Abdelfattah
%A Mohammed Al Farhan
%A Cade Brown
%A Mark Gates
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%8 2021-04
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2021
%T A survey of numerical linear algebra methods utilizing mixed-precision arithmetic
%A Abdelfattah, Ahmad
%A Anzt, Hartwig
%A Boman, Erik G
%A Carson, Erin
%A Cojean, Terry
%A Jack Dongarra
%A Fox, Alyson
%A Mark Gates
%A Higham, Nicholas J
%A Li, Xiaoye S
%A others
%K GPUs
%K High-performance computing
%K linear algebra
%K Mixed-precision arithmetic
%K numerical mathematics
%X The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems.
%B The International Journal of High Performance Computing Applications
%V 35
%P 344–369
%G eng
%R 10.1177/10943420211003313
%0 Conference Proceedings
%B Proceedings of the ACM International Conference on Supercomputing
%D 2021
%T Task-graph scheduling extensions for efficient synchronization and communication
%A Bak, Seonmyeong
%A Hernandez, Oscar
%A Mark Gates
%A Piotr Luszczek
%A Sarkar, Vivek
%K Compilers
%K Computing methodologies
%K Parallel computing methodologies
%K Parallel programming languages
%K Runtime environments
%K Software and its engineering
%K Software notations and tools
%X Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
%B Proceedings of the ACM International Conference on Supercomputing
%P 88–101
%G eng
%R 10.1145/3447818.3461616
%0 Journal Article
%J Journal of Computational Science
%D 2021
%T Translational process: Mathematical software perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K DATAFLOW scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Journal of Computational Science
%V 52
%P 101216
%G eng
%R 10.1016/j.jocs.2020.101216
%0 Generic
%D 2020
%T Clover: Computational Libraries Optimized via Exascale Research
%A Mark Gates
%A Stanimire Tomov
%A Hartwig Anzt
%A Piotr Luszczek
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2020
%T MAGMA Templates for Scalable Linear Algebra on Emerging Architectures
%A Mohammed Al Farhan
%A Ahmad Abdelfattah
%A Stanimire Tomov
%A Mark Gates
%A Dalal Sukkari
%A Azzam Haidar
%A Robert Rosenberg
%A Jack Dongarra
%X With the acquisition and widespread use of more resources that rely on accelerator/wide vector–based computing, there has been a strong demand for science and engineering applications to take advantage of these latest assets. This, however, has been extremely challenging due to the diversity of systems to support their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in the development of a new MAGMA Templates library that delivers high-performance scalable linear algebra portable on current and emerging architectures. MAGMA Templates derives its performance and portability by (1) building on existing state-of-the-art linear algebra libraries, like MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) providing access (seamlessly to the users) to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
%B The International Journal of High Performance Computing Applications
%V 34
%P 645-658
%8 2020-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342020938421
%0 Generic
%D 2020
%T Performance Tuning SLATE
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-01
%G eng
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software
%8 2020-10
%G eng
%0 Generic
%D 2020
%T SLATE Performance Report: Updates to Cholesky and LU Factorizations
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Mark Gates
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2020-10
%G eng
%0 Generic
%D 2020
%T SLATE: Software for Linear Algebra Targeting Exascale (POSTER)
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Generic
%D 2020
%T SLATE Tutorial
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Ali Charara
%A Jamie Finney
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Jack Dongarra
%I 2020 ECP Annual Meeting
%C Houston, TX
%8 2020-02
%G eng
%0 Generic
%D 2020
%T SLATE Users' Guide
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Urike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes
%0 Journal Article
%J Journal of Computational Science
%D 2020
%T Translational Process: Mathematical Software Perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K DATAFLOW scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Journal of Computational Science
%8 2020-09
%G eng
%R https://doi.org/10.1016/j.jocs.2020.101216
%0 Generic
%D 2020
%T Translational Process: Mathematical Software Perspective
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%K communication avoiding algorithms
%K data flow scheduling runtimes
%K hardware accelerators
%X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to worldclass high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack.
%B Innovative Computing Laboratory Technical Report
%8 2020-08
%G eng
%0 Conference Proceedings
%B ACM International Conference on Supercomputing (ICS '19)
%D 2019
%T Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%Y Rudolf Eigenmann
%Y Chen Ding
%Y Sally A. McKee
%B ACM International Conference on Supercomputing (ICS '19)
%I ACM
%C Phoenix, Arizona
%P 117–126
%8 2019-06
%@ 9781450360791
%G eng
%R https://dl.acm.org/doi/abs/10.1145/3330345.3330356
%0 Conference Proceedings
%B Euro-Par 2019: Parallel Processing
%D 2019
%T Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators
%A Kurzak, Jakub
%A Mark Gates
%A Charara, Ali
%A Asim YarKhan
%A Yamazaki, Ichitaro
%A Jack Dongarra
%E Yahyapour, Ramin
%B Euro-Par 2019: Parallel Processing
%I Springer
%V 11725
%P 495–506
%8 2019-08
%@ 978-3-030-29399-4
%G eng
%U https://link.springer.com/chapter/10.1007/978-3-030-29400-7_35
%R https://doi.org/10.1007/978-3-030-29400-7_35
%0 Conference Paper
%B 48th International Conference on Parallel Processing (ICPP 2019)
%D 2019
%T Massively Parallel Automated Software Tuning
%A Jakub Kurzak
%A Yaohung Tsai
%A Mark Gates
%A Ahmad Abdelfattah
%A Jack Dongarra
%X This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.
%B 48th International Conference on Parallel Processing (ICPP 2019)
%I ACM Press
%C Kyoto, Japan
%8 2019-08
%G eng
%R https://doi.org/10.1145/3337821.3337908
%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491
%0 Generic
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%I International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%C Denver, CO
%8 2019-11
%G eng
%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%X The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1145/3295500.3356223
%0 Generic
%D 2019
%T SLATE Developers' Guide
%A Ali Charara
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-12
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2019
%T SLATE Mixed Precision Performance Report
%A Ali Charara
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng
%0 Generic
%D 2019
%T SLATE Working Note 12: Implementing Matrix Inversions
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-06
%G eng
%0 Generic
%D 2019
%T SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers
%A Mark Gates
%A Mohammed Al Farhan
%A Ali Charara
%A Jakub Kurzak
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-09
%G eng
%9 SLATE Working Notes
%0 Generic
%D 2018
%T Accelerating Linear Algebra with MAGMA
%A Stanimire Tomov
%A Mark Gates
%A Azzam Haidar
%I ECP Annual Meeting 2018, Tutorial
%C Knoxville, TN
%8 2018-02
%G eng
%0 Journal Article
%J Parallel Computing
%D 2018
%T Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K 2-stage
%K accelerator
%K Divide and conquer
%K gpu
%K Singular value decomposition
%K SVD
%X The increasing gap between memory bandwidth and computation speed motivates the choice of algorithms to take full advantage of today’s high performance computers. For dense matrices, the classic algorithm for the singular value decomposition (SVD) uses a one stage reduction to bidiagonal form, which is limited in performance by the memory bandwidth. To overcome this limitation, a two stage reduction to bidiagonal has been gaining popularity. It first reduces the matrix to band form using high performance Level 3 BLAS, then reduces the band matrix to bidiagonal form. As accelerators such as GPUs and co-processors are becoming increasingly widespread in high-performance computing, a question of great interest to many SVD users is how much the employment of a two stage reduction, as well as other current best practices in GPU computing, can accelerate this important routine. To fulfill this interest, we have developed an accelerated SVD employing a two stage reduction to bidiagonal and a number of other algorithms that are highly optimized for GPUs. Notably, we also parallelize and accelerate the divide and conquer algorithm used to solve the subsequent bidiagonal SVD. By accelerating all phases of the SVD algorithm, we provide a significant speedup compared to existing multi-core and GPU-based SVD implementations. In particular, using a P100 GPU, we illustrate a performance of up to 804 Gflop/s in double precision arithmetic to compute the full SVD of a 20k × 20k matrix in 90 seconds, which is 8.9 × faster than MKL on two 10 core Intel Haswell E5-2650 v3 CPUs, 3.7 × over the multi-core PLASMA two stage version, and 2.6 × over the previously accelerated one stage MAGMA version.
%B Parallel Computing
%V 74
%P 3–18
%8 2018-05
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819117301758
%! Parallel Computing
%R 10.1016/j.parco.2017.10.004
%0 Journal Article
%J Proceedings of the IEEE
%D 2018
%T Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yaohung Tsai
%K Dense numerical linear algebra
%K performance autotuning
%X Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, which are called batched solves. In this paper, we focus on the important variants of both batch Cholesky factorization and subsequent substitution. The former requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of these kernels that implement the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for solvers of linear systems of equations. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among the available processors. However, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of exactly the same size. The primary focus of our techniques is on the individual matrices in the batch that have dimensions ranging from 5-by-5 up to 100-by-100. We compare our autotuned solvers against the state-of-the-art solvers such as those provided through NVIDIA channels and publicly available in the optimized MAGMA library. The observed performance is competitive and many times superior for many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architecture platforms.
%B Proceedings of the IEEE
%V 106
%P 2040–2055
%8 2018-11
%G eng
%N 11
%R 10.1109/JPROC.2018.2868961
%0 Report
%D 2018
%T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
%A Jack Dongarra
%A Iain Duff
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jonathan Hogg
%A Pedro Valero Lara
%A Piotr Luszczek
%A Mawussi Zounon
%A Samuel D. Relton
%A Stanimire Tomov
%A Timothy Costa
%A Sarah Knepper
%X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility.
%8 2018-07
%G eng
%0 Journal Article
%J Journal of Advances in Modeling Earth Systems
%D 2018
%T Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling
%A Jian Sun
%A Joshua Fu
%A John Drake
%A Qingzhao Zhu
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%K compiler
%K CUDA
%K data transfer
%K gpu
%K hybrid
%K memory layout
%X Global chemistry‐climate models are computationally burdened as the chemical mechanisms become more complex and realistic. Optimization for graphics processing units (GPU) may make longer global simulation with regional detail possible, but limited study has been done to explore the potential benefit for the atmospheric chemistry modeling. Hence, in this study, the second‐order Rosenbrock solver of the chemistry module of CAM4‐Chem is ported to the GPU to gauge potential speed‐up. We find that on the CPU, the fastest performance is achieved using the Intel compiler with a block interleaved memory layout. Different combinations of compiler and memory layout lead to ~11.02× difference in the computational time. In contrast, the GPU version performs the best when using a combination of fully interleaved memory layout with block size equal to the warp size, CUDA streams for independent kernels, and constant memory. Moreover, the most efficient data transfer between CPU and GPU is gained by allocating the memory contiguously during the data initialization on the GPU. Compared to one CPU core, the speed‐up of using one GPU alone reaches a factor of ~11.7× for the computation alone and ~3.82× when the data transfer between CPU and GPU is considered. Using one GPU alone is also generally faster than the multithreaded implementation for 16 CPU cores in a compute node and the single‐source solution (OpenACC). The best performance is achieved by the implementation of the hybrid CPU/GPU version, but rescheduling the workload among the CPU cores is required before the practical CAM4‐Chem simulation.
%B Journal of Advances in Modeling Earth Systems
%V 10
%P 1952–1969
%8 2018-08
%G eng
%N 8
%R https://doi.org/10.1029/2018MS001276
%0 Generic
%D 2018
%T Implementation of the C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 07
%0 Generic
%D 2018
%T Least Squares Performance Report
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-12
%G eng
%9 SLATE Working Notes
%1 09
%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%1 08
%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 2018-04
%G eng
%1 05
%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 06
%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732
%0 Conference Paper
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2017
%T Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yu Pei
%A Jack Dongarra
%K batch computation
%K Cholesky Factorization
%K data layout
%K GPU computing
%K numerical linear algebra
%X Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Orlando, FL
%8 2017-06
%G eng
%R 10.1109/IPDPSW.2017.18
%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4
%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 2017-12
%G eng
%1 04
%0 Generic
%D 2017
%T C++ API for BLAS and LAPACK
%A Mark Gates
%A Piotr Luszczek
%A Ahmad Abdelfattah
%A Jakub Kurzak
%A Jack Dongarra
%A Konstantin Arturov
%A Cris Cecka
%A Chip Freitag
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%1 02
%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03
%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report
%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng
%0 Journal Article
%J Parallel Computing
%D 2017
%T Preconditioned Krylov Solvers on GPUs
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%A Moritz Kreutzer
%A Gerhard Wellein
%A Martin Kohler
%K gpu
%K ILU
%K Jacobi
%K Krylov solvers
%K Preconditioning
%X In this paper, we study the effect of enhancing GPU-accelerated Krylov solvers with preconditioners. We consider the BiCGSTAB, CGS, QMR, and IDR(s) Krylov solvers. For a large set of test matrices, we assess the impact of Jacobi and incomplete factorization preconditioning on the solvers’ numerical stability and time-to-solution performance. We also analyze how the use of a preconditioner impacts the choice of the fastest solver.
%B Parallel Computing
%8 2017-06
%G eng
%U http://www.sciencedirect.com/science/article/pii/S0167819117300777
%! Parallel Computing
%R 10.1016/j.parco.2017.05.006
%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%1 01
%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48
%0 Conference Paper
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%D 2016
%T Heterogeneous Streaming
%A Chris J. Newburn
%A Gaurav Bansal
%A Michael Wood
%A Luis Crivelli
%A Judit Planas
%A Alejandro Duran
%A Paulo Souza
%A Leonardo Borges
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%A Hartwig Anzt
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Ichitaro Yamazaki
%A Jesus Labarta
%K plasma
%X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application.
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A undefined
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015
%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Search Space Generation and Pruning System for Autotuners
%A Piotr Luszczek
%A Mark Gates
%A Jakub Kurzak
%A Anthony Danalis
%A Jack Dongarra
%X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to a standard C code for fast and multithreaded, as necessary, evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constrains. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng
%0 Conference Paper
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%D 2015
%T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs
%A Mark Gates
%A Hartwig Anzt
%A Jakub Kurzak
%A Jack Dongarra
%X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation.
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%I IEEE
%C Santa Clara, CA
%8 2015-11
%G eng
%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra
%A Mark Gates
%A Stanimire Tomov
%A Azzam Haidar
%X Accelerating dense linear algebra using GPUs admits two models: hybrid CPU-GPU and GPU-only. The hybrid model factors the panel on the CPU while updating the trailing matrix on the GPU, concentrating the GPU on high-performance matrix multiplies. The GPU-only model performs the entire computation on the GPU, avoiding costly data transfers to the CPU. We compare these two approaches for three QR-based algorithms: QR factorization, rank revealing QR, and reduction to Hessenberg.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng
%0 Journal Article
%J Scientific Programming
%D 2015
%T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Azzam Haidar
%A Jack Dongarra
%A Khairul Kabir
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%A Yulu Jia
%K communication and computation overlap
%K dynamic runtime scheduling using dataflow dependences
%K hardware accelerators and coprocessors
%K Intel Xeon Phi processor
%K Many Integrated Cores
%K numerical linear algebra
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B Scientific Programming
%V 23
%8 2015-01
%G eng
%N 1
%R 10.3233/SPR-140404
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2015
%T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
%A Jakub Kurzak
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%8 2015-11
%G eng
%0 Generic
%D 2015
%T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I ISC High Performance (ISC15), Intel Booth Presentation
%C Frankfurt, Germany
%8 2015-06
%G eng
%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K plasma
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K plasma
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306
%0 Conference Paper
%B VECPAR 2014
%D 2014
%T Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem
%A Mark Gates
%A Azzam Haidar
%A Jack Dongarra
%X In the nonsymmetric eigenvalue problem, work has focused on the Hessenberg reduction and QR iteration, using efficient algorithms and fast, Level 3 BLAS routines. Comparatively, computation of eigenvectors performs poorly, limited to slow, Level 2 BLAS performance with little speedup on multi-core systems. It has thus become a dominant cost in the eigenvalue problem. To address this, we present improvements for the eigenvector computation to use Level 3 BLAS where applicable and parallelize the remaining triangular solves, achieving good parallel scaling and accelerating the overall eigenvalue problem more than three-fold.
%B VECPAR 2014
%C Eugene, OR
%8 2014-06
%G eng
%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1
%0 Conference Paper
%B International Workshop on OpenCL
%D 2014
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B International Workshop on OpenCL
%C Bristol University, England
%8 2014-05
%G eng
%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2014
%T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks
%A Azzam Haidar
%A Raffaele Solcà
%A Mark Gates
%A Stanimire Tomov
%A Thomas C. Schulthess
%A Jack Dongarra
%K Eigensolver
%K electronic structure calculations
%K generalized eigensolver
%K gpu
%K high performance
%K hybrid
%K Multicore
%K two-stage
%X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code.
%B International Journal of High Performance Computing Applications
%V 28
%P 196-209
%8 2014-05
%G eng
%N 2
%& 196
%R 10.1177/1094342013502097
%0 Conference Paper
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%D 2014
%T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors
%A Azzam Haidar
%A Chongxiao Cao
%A Ichitaro Yamazaki
%A Jack Dongarra
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
%B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14)
%I IEEE
%C New Orleans, LA
%8 2014-11
%G eng
%R 10.1109/ScalA.2014.8
%0 Generic
%D 2013
%T clMAGMA: High Performance Dense Linear Algebra with OpenCL
%A Chongxiao Cao
%A Jack Dongarra
%A Peng Du
%A Mark Gates
%A Piotr Luszczek
%A Stanimire Tomov
%X This paper presents the design and implementation of sev- eral fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments pre- sented, and in general provides to heterogeneous architec- tures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is ob- tained through use of the high-performance OpenCL BLAS, hardware and OpenCL-speci c tuning, and a hybridization methodology where we split the algorithm into computa- tional tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components.
%B University of Tennessee Technical Report (Lawn 275)
%I University of Tennessee
%8 2013-03
%G eng
%0 Conference Paper
%B PPAM 2013
%D 2013
%T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K mic
%K xeon phi
%X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.
%B PPAM 2013
%C Warsaw, Poland
%8 2013-09
%G eng
%0 Conference Paper
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%D 2013
%T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication
%A Azzam Haidar
%A Mark Gates
%A Stanimire Tomov
%A Jack Dongarra
%E Allen D. Malony
%E Nemirovsky, Mario
%E Midkiff, Sam
%K eigenvalue
%K gpu communication
%K gpu computation
%K heterogeneous programming model
%K performance
%K reduction to tridiagonal
%K singular value decomposiiton
%K task parallelism
%X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges---starting from our algorithm design, kernel optimization and tuning, to our programming model---in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores.
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13)
%I ACM Press
%C Eugene, Oregon, USA
%8 2013-06
%@ 9781450321303
%G eng
%U http://dl.acm.org/citation.cfm?doid=2464996.2465438
%R 10.1145/2464996.2465438
%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119
%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng
%0 Journal Article
%J ICCS 2012
%D 2012
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%B ICCS 2012
%C Omaha, NE
%8 2012-06
%G eng
%0 Generic
%D 2012
%T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
%A Jack Dongarra
%A Tingxing Dong
%A Mark Gates
%A Azzam Haidar
%A Stanimire Tomov
%A Ichitaro Yamazaki
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation
%C Salt Lake City, UT
%8 2012-11
%G eng
%0 Generic
%D 2012
%T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
%A Jack Dongarra
%A Mark Gates
%A Yulu Jia
%A Khairul Kabir
%A Piotr Luszczek
%A Stanimire Tomov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12)
%C Salt Lake City, UT
%8 2012-11
%G eng
%0 Generic
%D 2012
%T MAGMA Tutorial
%A Mark Gates
%I Keeneland Workshop
%C Atlanta, GA
%8 2012-02
%G eng
%0 Journal Article
%D 2011
%T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
%A Hartwig Anzt
%A Stanimire Tomov
%A Mark Gates
%A Jack Dongarra
%A Vincent Heuveline
%K magma
%8 2011-12
%G eng