A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems, which can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware.
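The idea can be sketched in a few lines of Python. This is a minimal illustration only, not the proposed batched BLAS interface: the names `gemm` and `batched_gemm` and the `workers` parameter are hypothetical, matrices are plain lists of rows, and the thread pool merely stands in for whatever parallel scheduling a real implementation would use (Python threads will not actually speed up this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def gemm(alpha, A, B, beta, C):
    """Return alpha * A * B + beta * C for small dense matrices (lists of rows)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [
            alpha * sum(A[i][p] * B[p][j] for p in range(inner)) + beta * C[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def batched_gemm(alpha, As, Bs, beta, Cs, workers=4):
    """Apply gemm to each (A, B, C) triple in the batch.

    Hypothetical helper: the problems share no data, so they can be
    handed to workers in any order; pool.map returns results in batch order.
    """
    def solve(problem):
        A, B, C = problem
        return gemm(alpha, A, B, beta, C)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(solve, zip(As, Bs, Cs)))


# A batch of 1000 independent 2x2 problems, collated into one result list.
batch = 1000
As = [[[k, 0], [0, k]] for k in range(batch)]
Bs = [[[1, 2], [3, 4]] for _ in range(batch)]
Cs = [[[0, 0], [0, 0]] for _ in range(batch)]
results = batched_gemm(1, As, Bs, 0, Cs)
```

Because no problem in the batch depends on another, a single batched call gives the library freedom to group, reorder, and schedule the small operations to suit the hardware, which is the motivation for a standardized interface.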


Jack Dongarra, Iain Duff, Mark Gates, Azzam Haidar, Sven Hammarling, Nicholas J. Higham, Jonathan Hogg, Pedro Valero-Lara, Mawussi Zounon, Samuel D. Relton, and Stanimire Tomov, "A Proposed API for Batched Basic Linear Algebra Subprograms", Draft Report, May 2016.
Samuel D. Relton, Pedro Valero-Lara, and Mawussi Zounon, "A Comparison of Potential Interfaces for Batched BLAS Computations", NLAFET Working Note 5, August 2016.
Peter Ahrens, Hong Diep Nguyen, and James Demmel, "Efficient Reproducible Floating Point Summation and BLAS", Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2015-229, December 2015.

Batched BLAS SC18 Handout

Batched BLAS Poster

Batched BLAS Slides