Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs

Tingxing Dong; Azzam Haidar; Stanimire Tomov; Jack Dongarra

Submitted by scrawford on Thu, 12/06/2018 - 16:49

Title	Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs
Publication Type	Journal Article
Year of Publication	2018
Authors	Dong, T., A. Haidar, S. Tomov, and J. Dongarra
Journal	Journal of Computational Science
Volume	26
Pagination	237–245
Date Published	2018-05
Keywords	Batched, Eigenvalue and singular value problems, hardware accelerators, numerical linear algebra, Two-sided factorization algorithms
Abstract	The acceleration of many small-sized linear algebra problems has become extremely challenging for current many-core architectures, and in particular GPUs. Standard interfaces have been proposed for some of these problems, called batched problems, so that they can be targeted for optimization and used in a standard way in applications, which call them directly from highly optimized, standard numerical libraries such as (batched) BLAS and LAPACK. While most of the developments have been for one-sided factorizations and solvers, many important applications – from big data analytics to information retrieval, low-rank approximations for solvers and preconditioners – require two-sided factorizations, most notably the SVD factorization. To address these needs and the parallelization challenges related to them, we developed a number of new batched computing techniques and designed batched Basic Linear Algebra Subprograms (BLAS) routines, in particular the Level-2 BLAS GEMV and the Level-3 BLAS GEMM routines, to solve them. We propose a device functions-based methodology and big-tile setting techniques in our batched BLAS design. The different optimization techniques result in many software versions that must be tuned, for which we adopt an auto-tuning strategy to automatically derive the optimized instances of the routines. We illustrate our batched BLAS approach by progressively optimizing batched SVD bi-diagonalization on GPUs. The progression is illustrated on an NVIDIA K40c GPU and is also ported to, and presented on, an AMD Fiji Nano GPU using AMD's Heterogeneous-Compute Interface for Portability (HIP) C++ runtime API. We demonstrate achieving 80% of the theoretically achievable peak performance for the overall algorithm, and significant acceleration of the required Level-2 BLAS GEMV and Level-3 BLAS GEMM routines compared to vendor-optimized libraries on GPUs and multicore CPUs. The optimization techniques in this paper are applicable to the other two-sided factorizations as well.
DOI	10.1016/j.jocs.2018.01.007
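
For readers unfamiliar with the term, the "batched" GEMV named in the abstract can be pictured as the same small matrix-vector update applied independently to every problem in a batch. The following is only a minimal CPU reference sketch of those semantics in pure Python, not the paper's GPU implementation; the function names `gemv` and `batched_gemv` and the row-major list-of-lists layout are assumptions made for illustration.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # assumed layout: row-major, m rows by n columns
Vector = Sequence[float]


def gemv(alpha: float, a: Matrix, x: Vector, beta: float, y: Vector) -> List[float]:
    """Return alpha * A * x + beta * y for a single small matrix A."""
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch between A, x, and y")
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(a, y)
    ]


def batched_gemv(
    alpha: float,
    a_batch: Sequence[Matrix],
    x_batch: Sequence[Vector],
    beta: float,
    y_batch: Sequence[Vector],
) -> List[List[float]]:
    """Apply gemv to each (A_i, x_i, y_i) triple in the batch.

    The problems are independent of one another, which is what lets a GPU
    implementation process them concurrently instead of in a sequential loop.
    """
    if not (len(a_batch) == len(x_batch) == len(y_batch)):
        raise ValueError("all batch arguments must contain the same number of problems")
    return [gemv(alpha, a, x, beta, y) for a, x, y in zip(a_batch, x_batch, y_batch)]
```

Calling `batched_gemv(2.0, a_batch, x_batch, 1.0, y_batch)` with two 2-by-2 problems returns two length-2 result vectors, one per problem.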

Project Tags: magma, matedor

File: icl-utk-1239-2018.pdf