%0 Journal Article %J ACM Transactions on Parallel Computing %D 2020 %T Load-Balancing Sparse Matrix Vector Product Kernels on GPUs %A Hartwig Anzt %A Terry Cojean %A Chen Yen-Chen %A Jack Dongarra %A Goran Flegar %A Pratik Nayak %A Stanimire Tomov %A Yuhsiang M. Tsai %A Weichung Wang %X Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads. To this end, in this article, we address the challenge of designing high performance sparse matrix-vector product (SpMV) kernels designed for Nvidia Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in SIMD-friendly Ellpack format (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the Suite Sparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format. %B ACM Transactions on Parallel Computing %V 7 %8 2020-03 %G eng %N 1 %R https://doi.org/10.1145/3380930 %0 Conference Paper %B Platform for Advanced Scientific Computing Conference (PASC 2019) %D 2019 %T Towards Continuous Benchmarking %A Hartwig Anzt %A Yen Chen Chen %A Terry Cojean %A Jack Dongarra %A Goran Flegar %A Pratik Nayak %A Enrique S. Quintana-Orti %A Yuhsiang M. Tsai %A Weichung Wang %X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository, and visualizes it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics. %B Platform for Advanced Scientific Computing Conference (PASC 2019) %I ACM Press %C Zurich, Switzerland %8 2019-06 %@ 9781450367707 %G eng %R https://doi.org/10.1145/3324989.3325719