%0 Generic
%D 2021
%T Ginkgo: A Sparse Linear Algebra Library for HPC
%A Hartwig Anzt
%A Natalie Beams
%A Terry Cojean
%A Fritz Göbel
%A Thomas Grützmacher
%A Aditya Kashi
%A Pratik Nayak
%A Tobias Ribizel
%A Yuhsiang M. Tsai
%I 2021 ECP Annual Meeting
%8 2021-04
%G eng

%0 Journal Article
%J Journal of Open Source Software
%D 2020
%T Ginkgo: A High Performance Numerical Linear Algebra Library
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%X Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures, offering a high level of performance portability and a focus on software sustainability. The library centers on solving sparse linear systems and accommodates a large variety of matrix formats and state-of-the-art iterative (Krylov) solvers and preconditioners, which makes it suitable for a wide range of scientific applications. Ginkgo supports many architectures, including multi-threaded CPUs, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence, there has been a significant amount of effort in this direction, with software libraries such as UMFPACK (Davis, 2004) and CHOLMOD (Chen, Davis, Hager, & Rajamanickam, 2008) for solving linear systems with direct methods, and PETSc (Balay et al., 2020), Trilinos ("The Trilinos Project Website," 2020), Eigen (Guennebaud, Jacob, & others, 2010), and many more for solving linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide highly efficient low-level kernels optimized for different architectures and separate these kernels from the algorithms, thereby ensuring extensibility and ease of use. Ginkgo is also part of the xSDK effort (Bartlett et al., 2017) and is available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for, and interoperability between, a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using high performance computing. Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of its features.
%B Journal of Open Source Software
%V 5
%8 2020-08
%G eng
%N 52
%R https://doi.org/10.21105/joss.02260

%0 Generic
%D 2020
%T Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
%A Hartwig Anzt
%A Terry Cojean
%A Yen-Chen Chen
%A Fritz Goebel
%A Thomas Gruetzmacher
%A Pratik Nayak
%A Tobias Ribizel
%A Yu-Hsiang Tsai
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Erik Boman
%A Erin Carson
%A Terry Cojean
%A Jack Dongarra
%A Mark Gates
%A Thomas Gruetzmacher
%A Nicholas J. Higham
%A Sherry Li
%A Neil Lindquist
%A Yang Liu
%A Jennifer Loe
%A Piotr Luszczek
%A Pratik Nayak
%A Sri Pranesh
%A Siva Rajamanickam
%A Tobias Ribizel
%A Barry Smith
%A Kasia Swirydowicz
%A Stephen Thomas
%A Stanimire Tomov
%A Yaohung Tsai
%A Ichitaro Yamazaki
%A Ulrike Meier Yang
%B SLATE Working Notes
%I University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes

%0 Conference Paper
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%D 2019
%T Approximate and Exact Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%X We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution and has a much lower recursion depth than many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to that of specialized algorithms designed for, and exploiting the characteristics of, "pleasant" data distributions. At the same time, as SampleSelect operates not on the actual values but only on the ranks of the elements, it is robust to the input data and can complete significantly faster for adversarial data distributions. In addition to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPSW.2019.00088

%0 Journal Article
%J Parallel Computing
%D 2019
%T Parallel Selection on GPUs
%A Tobias Ribizel
%A Hartwig Anzt
%K approximate selection
%K gpu
%K kth order statistics
%K multiselection
%K parallel selection algorithm
%X We present a novel parallel selection algorithm for GPUs capable of handling single rank selection (single selection) and multiple rank selection (multiselection). The algorithm requires no assumptions on the input data distribution and has a much lower recursion depth than many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always leveraging the respectively available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to that of specialized algorithms designed for, and exploiting the characteristics of, "pleasant" data distributions. At the same time, as the proposed SampleSelect algorithm operates not on the actual element values but only on the element ranks, it is robust to the input data and can complete significantly faster for adversarial data distributions. We also address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.
%B Parallel Computing
%V 91
%8 2020-03
%G eng
%U https://www.sciencedirect.com/science/article/pii/S0167819119301796
%! Parallel Computing
%R https://doi.org/10.1016/j.parco.2019.102588

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2019
%T ParILUT – A Parallel Threshold ILU for GPUs
%A Hartwig Anzt
%A Tobias Ribizel
%A Goran Flegar
%A Edmond Chow
%A Jack Dongarra
%X In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored to GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations and outperform existing multithreaded CPU implementations.
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%I IEEE
%C Rio de Janeiro, Brazil
%8 2019-05
%G eng
%R https://doi.org/10.1109/IPDPS.2019.00033