Innovative Computing Laboratory

SC17 BoF Session

November 14, 2017 in Denver, Colorado

Historically, most design efforts in the HPC community have been aimed at solving large linear algebra problems handled by the original set of Basic Linear Algebra Subprograms (BLAS). In recent years, however, the state-of-the-art approaches for addressing large-scale problems have been undergoing a tremendous change. It is becoming increasingly common in many scientific fields to decompose very large-scale simulation problems into a multitude of very small linear algebra operations that can be computed in parallel. Representative applications from a variety of scientific fields that exhibit this computing pattern include tensor contraction codes for the quantum Hall effect, astrophysics calculations, metabolic network applications, CFD and the resulting PDE solvers that use direct and multifrontal methods, high-order FEM schemes for hydrodynamics, mixed direct-iterative preconditioned solvers, quantum chemistry calculations, image analysis, and signal processing.

Unfortunately, applications with many small matrix or tensor operations can exhibit very poor performance when using the standard optimized vendor linear algebra libraries. Different strategies, including the use of compiler technologies and autotuning schemes, have been investigated to adapt the existing libraries to small matrix problems, but without satisfactory performance. Individually, these problems are too small to use modern HPC systems and their optimized libraries at full efficiency. Nevertheless, the fact that thousands of such problems have to be solved independently suggests that it is worth designing new linear algebra libraries for them. Consequently, batched BLAS algorithms have been introduced to perform thousands of small BLAS operations with a single function call.

The computational science community and commercial outfits focused on intensive data analysis are actively working on implementations that fulfill the need for optimized batched BLAS-like kernels. The Intel Math Kernel Library (MKL) has released batched matrix-matrix multiplication (batched GEMM) as well as a batched triangular solver (batched TRSM). NVIDIA cuBLAS includes the same batched GEMM and triangular solve, along with batched versions of more advanced numerical linear algebra routines. The MAGMA library provides open-source implementations of a number of batched BLAS routines for GPU accelerators. At the same time, some application developers design and implement their own batched BLAS-like kernels. The gradual introduction of batched BLAS routines into vendor libraries and important research software demonstrates awareness of the need for batched BLAS functionality, which is very encouraging.

To fully empower batched BLAS-based applications, the community needs to make an effort toward standardizing the batched BLAS routines. The batched BLAS interfaces currently provided by Intel MKL, NVIDIA cuBLAS, MAGMA, and other libraries differ significantly from each other, which results in a serious portability issue. The increasing gap between modern GPU architectures, co-processors, and regular multi-core CPUs further complicates the effort of providing a standard interface for batched BLAS functions: the calling interfaces and the data layout of a batch of small matrices needed for good performance vary with the architecture. To propose an objective standard without a severe performance penalty on any architecture, a first attempt was made by analyzing the benefits and drawbacks of the existing batched BLAS interfaces.
This BoF continued these efforts.
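To make the batching idea and the interface-divergence point above concrete, the following C sketch (an illustration only, not part of the BoF material) computes the same batch of small double-precision matrix multiplications, C_i = A_i * B_i, first with a loop of standard cblas_dgemm calls and then with a single call to Intel MKL's group-based cblas_dgemm_batch. The matrix size, batch count, and initialization are arbitrary choices, and error checking is omitted.

/* Minimal sketch: many small GEMMs, looped vs. batched (Intel MKL).
 * C_i = A_i * B_i for i = 0 .. BATCH-1, each matrix N x N.
 * Requires Intel MKL; link per the MKL Link Line Advisor. */
#include <stdlib.h>
#include <mkl.h>

#define BATCH 1000   /* number of independent small problems */
#define N     8      /* each matrix is N x N                 */

int main(void)
{
    /* One contiguous allocation per operand; the batch API only needs
       an array of pointers to the individual matrices. */
    double *A = malloc((size_t)BATCH * N * N * sizeof *A);
    double *B = malloc((size_t)BATCH * N * N * sizeof *B);
    double *C = malloc((size_t)BATCH * N * N * sizeof *C);
    const double **a_array = malloc(BATCH * sizeof *a_array);
    const double **b_array = malloc(BATCH * sizeof *b_array);
    double       **c_array = malloc(BATCH * sizeof *c_array);

    for (int i = 0; i < BATCH * N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
    for (int i = 0; i < BATCH; i++) {
        a_array[i] = A + (size_t)i * N * N;
        b_array[i] = B + (size_t)i * N * N;
        c_array[i] = C + (size_t)i * N * N;
    }

    /* Naive approach: one library call per tiny matrix. Each call is far
       too small to use the machine efficiently. */
    for (int i = 0; i < BATCH; i++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a_array[i], N, b_array[i], N,
                    0.0, c_array[i], N);

    /* Batched approach: one call describes the whole batch. MKL groups
       problems that share the same sizes and parameters; here there is a
       single group of BATCH identical problems. */
    CBLAS_TRANSPOSE trans = CblasNoTrans;
    MKL_INT m = N, n = N, k = N, lda = N, ldb = N, ldc = N;
    double alpha = 1.0, beta = 0.0;
    MKL_INT group_count = 1;
    MKL_INT group_size  = BATCH;

    cblas_dgemm_batch(CblasColMajor, &trans, &trans,
                      &m, &n, &k, &alpha,
                      a_array, &lda, b_array, &ldb, &beta,
                      c_array, &ldc, group_count, &group_size);

    /* By contrast, NVIDIA's cublasDgemmBatched takes a single (m, n, k)
       for the whole batch together with arrays of device pointers, and
       MAGMA's batched GEMM additionally takes a queue argument; this is
       one reason portable applications currently need per-vendor code. */

    free(a_array); free(b_array); free(c_array);
    free(A); free(B); free(C);
    return 0;
}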

Session format

The session opened with the vendors describing what they propose in terms of hardware and mathematical software for their HPC systems. This was followed by summaries of standardization and specification reports from hardware and software vendors, as well as from open-source library and application developers, on what they need in terms of numerical linear algebra software for today's and future systems. The authors of the Batched, Reproducible, and Reduced Precision BLAS proposal presented its current state and various implementations (reference ones as well as more hardware-specific ones). We discussed various aspects of the current state of the standard and the plans for its future. Specific periods of time were allotted for interaction between the audience (the community) and the presenters and leaders, and a moderated discussion also took place.

Presenter | Title | File
Jack Dongarra | Welcome and introduction |
Sven Hammerling | BLAS Standardization Process | Download
Azzam Haidar | Batched BLAS Standard | Download
Siva Rajamanickam | Batch BLAS Applications |
Jason Riedy | A Proposal for a Next-Generation BLAS | Download
Murat (Efe) Guney, Intel | Batch APIs and DNN Use Case | Download
Cris Cecka, NVIDIA | Batched and Tensor Computations | Download