%0 Conference Paper
%B Parallel Processing and Applied Mathematics (PPAM 2022)
%D 2023
%T Mixed Precision Algebraic Multigrid on GPUs
%A Tsai, Yu-Hsiang Mike
%A Beams, Natalie
%A Anzt, Hartwig
%E Wyrzykowski, Roman
%E Dongarra, Jack
%E Deelman, Ewa
%E Karczewski, Konrad
%K Algebraic multigrid
%K GPUs
%K mixed precision
%K Portability
%X In this paper, we present the first GPU-native platform-portable algebraic multigrid (AMG) implementation that allows the user to use different precision formats for the distinct multigrid levels. The AMG we present uses an aggregation size 2 parallel graph match as the AMG coarsening strategy. The implementation provides a high level of flexibility in terms of configuring the bottom-level solver and the precision format for the distinct levels. We present convergence and performance results on the GPUs from AMD, Intel, and NVIDIA, and compare against corresponding functionality available in other libraries.
%I Springer International Publishing
%C Cham
%V 13826
%8 2023-04
%@ 978-3-031-30441-5
%G eng
%U https://link.springer.com/10.1007/978-3-031-30442-2
%R 10.1007/978-3-031-30442-2_9

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2021
%T A survey of numerical linear algebra methods utilizing mixed-precision arithmetic
%A Abdelfattah, Ahmad
%A Anzt, Hartwig
%A Boman, Erik G
%A Carson, Erin
%A Cojean, Terry
%A Dongarra, Jack
%A Fox, Alyson
%A Gates, Mark
%A Higham, Nicholas J
%A Li, Xiaoye S
%A others
%K GPUs
%K High-performance computing
%K linear algebra
%K Mixed-precision arithmetic
%K numerical mathematics
%X The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to scientific computing applications. Especially with the hardware integration of low-precision special-function units designed for machine learning applications, the traditional numerical algorithms community urgently needs to reconsider the floating point formats used in the distinct operations to efficiently leverage the available compute power. In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra problems.
%V 35
%P 344–369
%G eng
%R 10.1177/10943420211003313

%0 Conference Paper
%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016
%D 2016
%T On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K GPUs
%K variable small sizes
%X Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required. This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. To the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running the Intel MKL library.
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B International Conference on Computational Science (ICCS'16)
%D 2016
%T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs
%A Ahmad Abdelfattah
%A Azzam Haidar
%A Stanimire Tomov
%A Jack Dongarra
%K batched computation
%K Cholesky Factorization
%K GPUs
%K Tuning
%X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.
%C San Diego, CA
%8 2016-06
%G eng