A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, and then to collate the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), which lets users perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware.
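The batching idea can be illustrated with a minimal sketch in pure Python. It is not the batched BLAS API: the names `small_gemm` and `batched_gemm` and the `max_workers` parameter are hypothetical and chosen only for illustration. The sketch shows a single call that takes a whole batch of small, independent matrix products, hands each one to a worker, and collates the results in order.

```python
from concurrent.futures import ThreadPoolExecutor


def small_gemm(a, b):
    """Multiply two small dense matrices stored as lists of rows."""
    n_inner = len(b)
    n_cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(n_inner)) for j in range(n_cols)]
        for row in a
    ]


def batched_gemm(a_batch, b_batch, max_workers=4):
    """Hypothetical batched interface: one call, many independent products.

    Each pair (a_batch[i], b_batch[i]) is an independent problem, so the
    pairs can be dispatched to workers in any order. pool.map returns the
    results in input order, which is the "collate" step.
    """
    if len(a_batch) != len(b_batch):
        raise ValueError("batches must have the same length")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(small_gemm, a_batch, b_batch))


# Four independent 2x2 problems: (i * identity) times a fixed matrix.
a_batch = [[[i, 0], [0, i]] for i in range(1, 5)]
b_batch = [[[1, 2], [3, 4]]] * 4
c_batch = batched_gemm(a_batch, b_batch)
```

Python threads are used here only to make the independence of the problems explicit; they are not expected to give a real speedup. An optimized batched BLAS implementation would instead map the batch onto the cores or accelerator of the target hardware.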
“Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification”, July 2018.
“A Comparison of Potential Interfaces for Batched BLAS Computations”, NLAFET Working Note 5, August 2016.
“A Proposed API for Batched Basic Linear Algebra Subprograms”, Draft Report, May 2016.
“Efficient Reproducible Floating Point Summation and BLAS”, Electrical Engineering and Computer Sciences, University of California at Berkeley, Technical Report No. UCB/EECS-2015-229, December 2015.