Methodology

One-sided matrix factorizations
=================================

The one-sided LU, Cholesky, and QR factorizations form the basis for solving linear systems. The general recommendation is to use LU for general n-by-n matrices, Cholesky for symmetric or Hermitian positive definite (SPD/HPD) matrices, and QR for solving least squares problems,

min_x || A x - b ||_2

for general m-by-n matrices with m > n.
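For illustration, the sketch below shows this choice in terms of MAGMA's driver routines for the LU case. It assumes MAGMA 1.x-style CPU-interface signatures (magma_dgesv; magma_dposv and magma_dgels_gpu are the SPD and least squares counterparts), which may differ between MAGMA versions, and error handling is reduced to checking info.

```c
#include <stdio.h>
#include <stdlib.h>
#include "magma.h"   /* MAGMA 1.x header; MAGMA 2.x uses "magma_v2.h" */

int main(void)
{
    magma_int_t n = 1000, nrhs = 1, info;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A, *b;

    magma_init();
    magma_dmalloc_cpu(&A, n * n);
    magma_dmalloc_cpu(&b, n);

    /* A simple diagonally dominant test matrix and right-hand side. */
    for (magma_int_t j = 0; j < n; ++j) {
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j*n] = (i == j) ? 2.0 * n : 1.0 / (1.0 + i + j);
        b[j] = 1.0;
    }

    /* General square A: the LU-based driver. For an SPD matrix one would
       call magma_dposv instead, and for an m > n least squares problem a
       QR-based driver such as magma_dgels_gpu. */
    magma_dgesv(n, nrhs, A, n, ipiv, b, n, &info);
    if (info != 0)
        printf("magma_dgesv returned info = %d\n", (int) info);

    magma_free_cpu(A);
    magma_free_cpu(b);
    free(ipiv);
    magma_finalize();
    return 0;
}
```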

We use hybrid algorithms in which the computation is split between the GPU and the CPU. In general, for the one-sided factorizations, the panels are factored on the CPU and the trailing sub-matrix is updated on the GPU. Look-ahead techniques are used to overlap the CPU and GPU work and some of the communication.
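The look-ahead schedule can be illustrated with a toy, CPU-only sketch. The helper functions below are hypothetical placeholders for the real kernels (a LAPACK panel factorization, GPU BLAS trailing updates, and asynchronous CPU-GPU panel copies); in MAGMA the GPU calls are issued asynchronously on CUDA streams, so the CPU panel work and the GPU trailing update genuinely run at the same time.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real kernels (LAPACK panel factorization,
   GPU BLAS trailing updates, asynchronous CPU<->GPU panel copies). */
static void factor_panel_on_cpu(int j)        { printf("CPU: factor panel %d\n", j); }
static void update_panel_on_gpu(int k, int j) { printf("GPU: apply step %d to panel %d\n", k, j); }
static void update_trailing_on_gpu(int k)     { printf("GPU: update trailing matrix of step %d\n", k); }

/* One-step look-ahead schedule: at step k, the next panel is updated and
   handed to the CPU before the bulk trailing update, so the CPU factors
   panel k+1 while the GPU works on the rest of the trailing matrix. */
static void hybrid_factor(int npanels)
{
    factor_panel_on_cpu(0);                 /* first panel: CPU only        */
    for (int k = 0; k < npanels - 1; ++k) {
        update_panel_on_gpu(k, k + 1);      /* look-ahead: next panel first */
        factor_panel_on_cpu(k + 1);         /* CPU factors it ...           */
        update_trailing_on_gpu(k);          /* ... overlapped with this     */
    }
}

int main(void) { hybrid_factor(4); return 0; }
```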

In both the CPU and GPU interfaces, the matrix to be factored resides in GPU memory, and CPU-GPU transfers are associated only with the panels. The resulting factorization is accumulated (on the CPU or the GPU, according to the interface) as the computation proceeds, as a byproduct of the algorithm, rather than by sending the entire matrix when needed. In the CPU interface, the initial transfer of the matrix to the GPU is overlapped with the factorization of the first panel. In this sense the CPU and GPU interfaces, although similar, are not derivatives of each other, as they have different communication patterns.
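For LU, the two interfaces can be sketched as below. The signatures follow the MAGMA 1.x style (MAGMA 2.x adds a queue argument to the copy routines), and padding the leading dimension to a multiple of 32 is merely a common performance convention.

```c
#include "magma.h"   /* MAGMA 1.x header; MAGMA 2.x uses "magma_v2.h" */

/* CPU interface: the matrix starts in host memory; MAGMA moves it to the
   GPU internally, overlapping the transfer with the first panel. */
void lu_cpu_interface(magma_int_t n, double *hA, magma_int_t *ipiv)
{
    magma_int_t info;
    magma_dgetrf(n, n, hA, n, ipiv, &info);
}

/* GPU interface: the caller stages the matrix in GPU memory up front;
   during the factorization only panels travel between CPU and GPU. */
void lu_gpu_interface(magma_int_t n, double *hA, magma_int_t *ipiv)
{
    magma_int_t info, ldda = ((n + 31) / 32) * 32;  /* padded leading dim */
    magmaDouble_ptr dA;
    magma_dmalloc(&dA, (size_t) ldda * n);
    magma_dsetmatrix(n, n, hA, n, dA, ldda);        /* host -> device  */
    magma_dgetrf_gpu(n, n, dA, ldda, ipiv, &info);
    magma_dgetmatrix(n, n, dA, ldda, hA, n);        /* factors -> host */
    magma_free(dA);
}
```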

Although the solution step requires O(n) times fewer floating-point operations than the factorization, it is still important to optimize it. Solving a triangular system of equations can be very slow because the computation is bandwidth-limited and inherently sequential. Various approaches have been proposed in the past. We use an approach in which diagonal blocks of A are explicitly inverted and used in a block algorithm. This results in a high-performance, numerically stable algorithm, especially when used with triangular matrices coming from numerically stable factorization algorithms (e.g., as in LAPACK and MAGMA).
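A minimal, CPU-only sketch of this blocked scheme follows; it is not MAGMA's actual kernel (which inverts the diagonal blocks and applies the updates on the GPU), but it shows how explicit inversion turns each per-block triangular solve of L x = b into a small matrix-vector product, the operation that parallelizes well:

```c
#include <stdlib.h>
#include <string.h>

/* Invert the m x m lower triangular block T (leading dimension ldt) into
   Tinv (leading dimension m) by forward substitution on identity columns. */
static void invert_lower_block(int m, const double *T, int ldt, double *Tinv)
{
    for (int j = 0; j < m; ++j) {
        double *x = &Tinv[j * m];
        for (int i = 0; i < m; ++i) x[i] = (i == j) ? 1.0 : 0.0;
        for (int i = 0; i < m; ++i) {
            x[i] /= T[i + i*ldt];
            for (int r = i + 1; r < m; ++r)
                x[r] -= T[r + i*ldt] * x[i];
        }
    }
}

/* Solve L x = b (x holds b on entry, the solution on exit). Each diagonal
   block is inverted once, so the per-block solve becomes a matrix-vector
   product -- the operation that maps well onto a GPU. */
void blocked_trsv_lower(int n, int nb, const double *L, int ldl, double *x)
{
    double *Tinv = malloc((size_t) nb * nb * sizeof *Tinv);
    double *tmp  = malloc((size_t) nb * sizeof *tmp);

    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb <= n) ? nb : n - k;
        invert_lower_block(kb, &L[k + k*ldl], ldl, Tinv);

        /* x_k = inv(L_kk) * b_k */
        for (int i = 0; i < kb; ++i) {
            double s = 0.0;
            for (int j = 0; j < kb; ++j)
                s += Tinv[i + j*kb] * x[k + j];
            tmp[i] = s;
        }
        memcpy(&x[k], tmp, kb * sizeof *tmp);

        /* b_rest -= L(rest, k-block) * x_k */
        for (int j = 0; j < kb; ++j)
            for (int i = k + kb; i < n; ++i)
                x[i] -= L[i + (k + j)*ldl] * x[k + j];
    }
    free(Tinv);
    free(tmp);
}
```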

For cases where the GPU's single precision performance is much higher than its double precision performance, MAGMA provides a second set of solvers based on the mixed precision iterative refinement technique. These solvers are again based on the LU, QR, and Cholesky factorizations, respectively, and are designed to solve linear systems to double precision accuracy at a speed characteristic of the much faster single precision arithmetic. The idea is to use single precision for the bulk of the computation, namely the factorization step, and then use that factorization as a preconditioner in a simple iterative refinement process carried out in double precision arithmetic. This often results in solvers that combine high performance with double precision accuracy.
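The refinement loop itself is simple enough to sketch in self-contained form. The toy LU below is unpivoted, which is safe only because the demo matrix is diagonally dominant, and stands in for the fast single precision factorization done on the GPU; MAGMA's actual mixed precision solvers (e.g., magma_dsgesv_gpu and magma_dsposv_gpu) add pivoting, convergence safeguards, and a fallback to a full double precision solve.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy unpivoted LU in single precision (stand-in for the fast GPU
   factorization; safe here only because A is diagonally dominant). */
static void sgetrf_nopiv(int n, float *A)
{
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            A[i + k*n] /= A[k + k*n];
            for (int j = k + 1; j < n; ++j)
                A[i + j*n] -= A[i + k*n] * A[k + j*n];
        }
}

/* Solve LU x = r in single precision (unit lower L, upper U). */
static void sgetrs_nopiv(int n, const float *LU, float *x)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j) x[i] -= LU[i + j*n] * x[j];
    for (int i = n - 1; i >= 0; --i) {
        for (int j = i + 1; j < n; ++j) x[i] -= LU[i + j*n] * x[j];
        x[i] /= LU[i + i*n];
    }
}

int main(void)
{
    int n = 100;
    double *A = malloc((size_t) n*n * sizeof *A);
    double *b = malloc(n * sizeof *b), *r = malloc(n * sizeof *r);
    double *x = calloc(n, sizeof *x);
    float  *LU = malloc((size_t) n*n * sizeof *LU);
    float  *c  = malloc(n * sizeof *c);

    for (int j = 0; j < n; ++j) {        /* diagonally dominant test matrix */
        for (int i = 0; i < n; ++i)
            A[i + j*n] = (i == j) ? 2.0 * n : 1.0 / (1.0 + i + j);
        b[j] = 1.0;
    }

    for (int i = 0; i < n*n; ++i) LU[i] = (float) A[i];  /* demote to single */
    sgetrf_nopiv(n, LU);                 /* the O(n^3) work, in single */

    for (int it = 0; it < 10; ++it) {
        double norm = 0.0;
        for (int i = 0; i < n; ++i) {    /* residual r = b - A x, in double */
            r[i] = b[i];
            for (int j = 0; j < n; ++j) r[i] -= A[i + j*n] * x[j];
            if (fabs(r[i]) > norm) norm = fabs(r[i]);
        }
        printf("iteration %d: residual %.2e\n", it, norm);
        if (norm < 1e-12) break;
        for (int i = 0; i < n; ++i) c[i] = (float) r[i];
        sgetrs_nopiv(n, LU, c);          /* correction solved in single */
        for (int i = 0; i < n; ++i) x[i] += (double) c[i];
    }
    free(A); free(b); free(r); free(x); free(LU); free(c);
    return 0;
}
```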

Two-sided matrix factorizations
=================================

As the one-sided matrix factorizations are the basis for various linear solvers, the two-sided matrix factorizations are the basis for eigensolvers, and therefore form an important class of dense linear algebra routines. The two-sided factorizations have traditionally been more difficult to accelerate. The reason is that they involve large matrix-vector products, which are memory bound; as the gap between compute power and memory bandwidth keeps widening, these memory-bound operations become an increasingly difficult bottleneck. GPUs, however, offer an attractive possibility for accelerating them. With memory bandwidth that is, e.g., 10 times larger than current CPU bus bandwidths, GPUs can accelerate matrix-vector products significantly (10 to 30 times). Here, the panel factorization itself is hybrid, while the trailing matrix update is performed on the GPU.
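As an illustration, the sketch below pushes the dominant operation of such a panel step, the large matrix-vector product with the trailing matrix, to the GPU, while the small Householder computations stay on the CPU. The function itself and its argument arrangement are hypothetical; the BLAS and copy calls use MAGMA 1.x-style signatures (MAGMA 2.x versions take an additional queue argument).

```c
#include "magma.h"   /* MAGMA 1.x header; MAGMA 2.x uses "magma_v2.h" */

/* Hypothetical hybrid panel step of a two-sided reduction: dA is the
   trailing matrix kept in GPU memory, v/y are host work vectors, and
   dv/dy are preallocated device buffers of length n. */
void panel_step(magma_int_t n, magmaDouble_ptr dA, magma_int_t ldda,
                double *v, double *y, magmaDouble_ptr dv, magmaDouble_ptr dy)
{
    /* ... CPU: form the Householder reflector v from the current column ... */

    magma_dsetvector(n, v, 1, dv, 1);           /* send v to the GPU         */
    magma_dgemv(MagmaNoTrans, n, n, 1.0,        /* y = A v: the memory-bound */
                dA, ldda, dv, 1, 0.0, dy, 1);   /* product, on the GPU       */
    magma_dgetvector(n, dy, 1, y, 1);           /* bring y back to the CPU   */

    /* ... CPU: use y for the small Householder update of the panel ... */
}
```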

