%0 Journal Article
%J ACM Transactions on Parallel Computing
%D 2020
%T Load-Balancing Sparse Matrix Vector Product Kernels on GPUs
%A Hartwig Anzt
%A Terry Cojean
%A Chen Yen-Chen
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Stanimire Tomov
%A Yuhsiang M. Tsai
%A Weichung Wang
%X Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format.
%B ACM Transactions on Parallel Computing
%V 7
%8 2020-03
%G eng
%N 1
%R https://doi.org/10.1145/3380930

%0 Conference Paper
%B European Conference on Parallel Processing (Euro-Par 2020)
%D 2020
%T Multiprecision Block-Jacobi for Iterative Triangular Solves
%A Fritz Goebel
%A Hartwig Anzt
%A Terry Cojean
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Block-Jacobi
%K graphics processing units (GPUs)
%K incomplete factorization preconditioning
%K multiprecision
%K sparse linear algebra
%X Recent research efforts have shown that Jacobi and block-Jacobi relaxation methods can be used as an effective and highly parallel approach for the solution of sparse triangular linear systems arising in the application of ILU-type preconditioners. Simultaneously, a few independent works have focused on designing efficient high performance adaptive-precision block-Jacobi preconditioning (block-diagonal scaling), in the context of the iterative solution of sparse linear systems, on manycore architectures. In this paper, we bridge the gap between relaxation methods based on regular splittings and preconditioners by demonstrating that iterative refinement can be leveraged to construct a relaxation method from the preconditioner. In addition, we exploit this insight to construct a highly-efficient sparse triangular system solver for graphics processors that combines iterative refinement with the block-Jacobi preconditioner available in the Ginkgo library.
%B European Conference on Parallel Processing (Euro-Par 2020)
%I Springer
%8 2020-08
%G eng
%R https://doi.org/10.1007/978-3-030-57675-2_34

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T Adaptive Precision in Block-Jacobi Preconditioning for Iterative Sparse Linear System Solvers
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Nicholas J. Higham
%A Enrique S. Quintana-Orti
%K adaptive precision
%K block-Jacobi preconditioning
%K communication reduction
%K energy efficiency
%K Krylov subspace methods
%K sparse linear systems
%X Summary We propose an adaptive scheme to reduce communication overhead caused by data movement by selectively storing the diagonal blocks of a block-Jacobi preconditioner in different precision formats (half, single, or double). This specialized preconditioner can then be combined with any Krylov subspace method for the solution of sparse linear systems to perform all arithmetic in double precision. We assess the effects of the adaptive precision preconditioner on the iteration count and data transfer cost of a preconditioned conjugate gradient solver. A preconditioned conjugate gradient method is, in general, a memory bandwidth-bound algorithm, and therefore its execution time and energy consumption are largely dominated by the costs of accessing the problem's data in memory. Given this observation, we propose a model that quantifies the time and energy savings of our approach based on the assumption that these two costs depend linearly on the bit length of a floating point number. Furthermore, we use a number of test problems from the SuiteSparse matrix collection to estimate the potential benefits of the adaptive block-Jacobi preconditioning scheme.
%B Concurrency and Computation: Practice and Experience
%V 31
%P e4460
%8 2019-03
%G eng
%U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4460
%R https://doi.org/10.1002/cpe.4460

%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2019
%T Are we Doing the Right Thing? – A Critical Analysis of the Academic HPC Community
%A Hartwig Anzt
%A Goran Flegar
%X Like in any other research field, academically surviving in the High Performance Computing (HPC) community generally requires to publish papers, in the bast case many of them and in high-ranked journals or at top-tier conferences. As a result, the number of scientific papers published each year in this relatively small community easily outnumbers what a single researcher can read. At the same time, many of the proposed and analyzed strategies, algorithms, and hardware-optimized implementations never make it beyond the prototype stage, as they are abandoned once they served the single purpose of yielding (another) publication. In a time and field where high-quality manpower is a scarce resource, this is extremely inefficient. In this position paper we promote a radical paradigm shift towards accepting high-quality software patches to community software packages as legitimate conference contributions. In consequence, the reputation and appointability of researchers is no longer based on the classical scientific metrics, but on the quality and documentation of open source software contributions - effectively improving and accelerating the collaborative development of community software.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R 10.1109/IPDPSW.2019.00122

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2019
%T A Customized Precision Format Based on Mantissa Segmentation for Accelerating Sparse Linear Algebra
%A Thomas Gruetzmacher
%A Terry Cojean
%A Goran Flegar
%A Fritz Göbel
%A Hartwig Anzt
%B Concurrency and Computation: Practice and Experience
%V 40319
%8 2019-01
%G eng
%N 262
%R https://doi.org/10.1002/cpe.5418

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Toward a Modular Precision Ecosystem for High-Performance Computing
%A Hartwig Anzt
%A Goran Flegar
%A Thomas Gruetzmacher
%A Enrique S. Quintana-Orti
%K conjugate gradient
%K GPUs
%K Jacobi method
%K Modular precision
%K multicore processors
%K PageRank
%K parallel numerical linear algebra
%X With the memory bandwidth of current computer architectures being significantly slower than the (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today’s high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a “modular precision ecosystem” that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm—two applications that are popular in the communities and at the same characteristic representatives for the field of numerical linear algebra and data analytics, respectively.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1069-1078
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019846547

%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%D 2019
%T Towards Continuous Benchmarking
%A Hartwig Anzt
%A Yen Chen Chen
%A Terry Cojean
%A Jack Dongarra
%A Goran Flegar
%A Pratik Nayak
%A Enrique S. Quintana-Orti
%A Yuhsiang M. Tsai
%A Weichung Wang
%X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository, and visualizes it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics.
%B Platform for Advanced Scientific Computing Conference (PASC 2019)
%I ACM Press
%C Zurich, Switzerland
%8 2019-06
%@ 9781450367707
%G eng
%R https://doi.org/10.1145/3324989.3325719

%0 Journal Article
%J Parallel Computing
%D 2019
%T Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K Batched algorithms
%K Block-Jacobi
%K Gauss–Jordan elimination
%K Graphics processor
%K matrix inversion
%K sparse linear systems
%X In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix–vector multiplication kernel that transforms the linear systems’ right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA’s K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver.
%B Parallel Computing
%V 81
%P 131-146
%8 2019-01
%G eng
%R https://doi.org/10.1016/j.parco.2017.12.006

%0 Conference Paper
%B SBAC-PAD
%D 2018
%T Variable-Size Batched Condition Number Calculation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Thomas Gruetzmacher
%B SBAC-PAD
%C Lyon, France
%8 2018-09
%G eng
%U https://ieeexplore.ieee.org/document/8645907

%0 Conference Proceedings
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%D 2017
%T Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K block-Jacobi preconditioner
%K Gauss-Jordan elimination
%K graphics processing units (GPUs)
%K iterative methods
%K matrix inversion
%K sparse linear systems
%X In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan elimination CUDA kernel for matrix inversion that embeds an implicit pivoting technique and handles the entire inversion process in the GPU registers. In addition, we integrate extraction and insertion CUDA kernels to rapidly set up the block-Jacobi preconditioner.    Our experiments compare the performance of our implementation against a sequence of batched routines from the MAGMA library realizing the inversion via the LU factorization with partial pivoting. Furthermore, we evaluate the costs of different strategies for the block-Jacobi extraction and insertion steps, using a variety of sparse matrices from the SuiteSparse matrix collection. Finally, we assess the efficiency of the complete block-Jacobi preconditioner generation in the context of an iterative solver applied to a set of computational science problems, and quantify its benefits over a scalar Jacobi preconditioner.
%B Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores
%S PMAM'17
%I ACM
%C New York, NY, USA
%P 1–10
%8 2017-02
%@ 978-1-4503-4883-6
%G eng
%U http://doi.acm.org/10.1145/3026937.3026940
%R 10.1145/3026937.3026940

%0 Generic
%D 2017
%T Flexible Batched Sparse Matrix Vector Product on GPUs
%A Hartwig Anzt
%A Collins, Gary
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%I ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%C Denver, Colorado
%8 2017-11
%G eng

%0 Conference Paper
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%D 2017
%T Flexible Batched Sparse Matrix-Vector Product on GPUs
%A Hartwig Anzt
%A Gary Collins
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%X We propose a variety of batched routines for concurrently processing a large collection of small-size, independent sparse matrix-vector products (SpMV) on graphics processing units (GPUs). These batched SpMV kernels are designed to be flexible in order to handle a batch of matrices which differ in size, nonzero count, and nonzero distribution. Furthermore, they support three most commonly used sparse storage formats: CSR, COO and ELL. Our experimental results on a state-of-the-art GPU reveal performance improvements of up to 25X compared to non-batched SpMV routines.
%B 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17)
%I ACM Press
%C Denver, CO
%8 2017-11
%G eng
%R http://dx.doi.org/10.1145/3148226.3148230

%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library  to allow for easy integration as third-party library into a top-level software ecosystem. The  design choices are based on extensive consultation with other software library developers, in  particular the Trilinos software development team. The interface documentation is at this point  not exhaustive, but a first proposal for setting a standard. Although the interface description  targets the MAGMA-sparse software module, we hope that the design choices carry beyond this  specific library, and are attractive for adoption in other packages.  This report is not intended as static document, but will be updated over time to reflect the agile  software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report

%0 Conference Proceedings
%B International Conference on Computational Science (ICCS 2017)
%D 2017
%T Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%A Andres E. Thomas
%X In this work we present new kernels for the generation and application of block-Jacobi precon-ditioners that accelerate the iterative solution of sparse linear systems on graphics processing units (GPUs). Our approach departs from the conventional LU factorization and decomposes the diagonal blocks of the matrix using the Gauss-Huard method. When enhanced with column pivoting, this method is as stable as LU with partial/row pivoting. Due to extensive use of GPU registers and integration of implicit pivoting, our variable size batched Gauss-Huard implementation outperforms the batched version of LU factorization. In addition, the application kernel combines the conventional two-stage triangular solve procedure, consisting of a backward solve followed by a forward solve, into a single stage that performs both operations simultaneously.
%B International Conference on Computational Science (ICCS 2017)
%I Procedia Computer Science
%C Zurich, Switzerland
%V 108
%P 1783-1792
%8 2017-06
%G eng
%R https://doi.org/10.1016/j.procs.2017.05.186

%0 Conference Paper
%B 46th International Conference on Parallel Processing (ICPP)
%D 2017
%T Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning
%A Hartwig Anzt
%A Jack Dongarra
%A Goran Flegar
%A Enrique S. Quintana-Orti
%K graphics processing units
%K Jacobian matrices
%K Kernel
%K linear systems
%K Parallel processing
%K Sparse matrices
%X We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrassingly parallel scenario in the context of block-Jacobi preconditioning that is relevant for the iterative solution of sparse linear systems.
%B 46th International Conference on Parallel Processing (ICPP)
%I IEEE
%C Bristol, United Kingdom
%8 2017-08
%G eng
%U http://ieeexplore.ieee.org/abstract/document/8025283/?reload=true
%R 10.1109/ICPP.2017.18