MAGMA  2.0.0
Matrix Algebra for GPU and Multicore Architectures
Routine names

The interface for MAGMA is similar to LAPACK, to facilitate porting existing codes.

Many routines have the same base names and the same arguments as LAPACK. In some cases, MAGMA needs larger workspaces or some additional arguments in order to implement an efficient algorithm.

There are several classes of routines in MAGMA:

  1. Driver routines – Solve an entire problem.
  2. Computational routines – Solve one piece of a problem.
  3. BLAS routines – Basic Linear Algebra Subroutines. These form the basis for linear algebra algorithms.
  4. Auxiliary routines – Additional BLAS-like routines, many originally defined in LAPACK.
  5. Utility routines – Additional routines, many specific to GPU programming.

A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section.

Driver & computational routines have a magma_ prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done.

Suffix   Example             Description
none     magma_dgetrf        hybrid CPU/GPU routine where the matrix is initially in CPU host memory.
_m       magma_dgetrf_m      hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory.
_gpu     magma_dgetrf_gpu    hybrid CPU/GPU routine where the matrix is initially in GPU device memory.
_mgpu    magma_dgetrf_mgpu   hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories.

In general, MAGMA follows LAPACK's naming conventions. The base name of each routine has a one letter precision (occasionally two letters), two letter matrix type, and usually a 2-3 letter routine name. For example, DGETRF is D (double-precision), GE (general matrix), TRF (triangular factorization).

Precision   Description
s           single real precision (float)
d           double real precision (double)
c           single-complex precision (magmaFloatComplex)
z           double-complex precision (magmaDoubleComplex)
sc          single-complex input with single precision result (e.g., scnrm2)
dz          double-complex input with double precision result (e.g., dznrm2)
ds          mixed-precision algorithm (double and single, e.g., dsgesv)
zc          mixed-precision algorithm (double-complex and single-complex, e.g., zcgesv)

Matrix type   Description
ge            general matrix
sy            symmetric matrix, can be real or complex
he            Hermitian (complex) matrix
po            positive definite, symmetric (real) or Hermitian (complex) matrix
tr            triangular matrix
or            orthogonal (real) matrix
un            unitary (complex) matrix
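
The naming scheme above can be made concrete with a small parser. The following is a minimal sketch in plain C and is not part of MAGMA: the routine_name struct and decode_routine_name function are hypothetical helpers that split a name such as magma_dgetrf_gpu into precision, matrix type, operation, and suffix. It assumes only the precisions s, d, c, z, ds, zc and the matrix types listed above, so BLAS-1 names such as dznrm2 (which have no matrix type) are rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: the pieces of a MAGMA-style routine name. */
typedef struct {
    char precision[3];    /* "s", "d", "c", "z", "ds", or "zc" */
    char matrix_type[3];  /* "ge", "sy", "he", "po", "tr", "or", "un" */
    char operation[8];    /* e.g. "trf", "sv", "evd" */
    char suffix[8];       /* "", "m", "gpu", or "mgpu" */
} routine_name;

/* Returns 1 if the first two characters of s are a known matrix type. */
static int is_matrix_type(const char *s)
{
    static const char *const types[] = {"ge", "sy", "he", "po", "tr", "or", "un"};
    for (size_t i = 0; i < sizeof types / sizeof types[0]; ++i)
        if (strncmp(s, types[i], 2) == 0)
            return 1;
    return 0;
}

/* Splits "magma_<precision><type><operation>[_<suffix>]".
   Returns 0 on success, -1 if the name does not fit the scheme. */
int decode_routine_name(const char *name, routine_name *out)
{
    assert(name != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    if (strncmp(name, "magma_", 6) != 0)
        return -1;
    const char *base = name + 6;
    const char *underscore = strchr(base, '_');
    size_t base_len = underscore ? (size_t)(underscore - base) : strlen(base);
    const char *suffix = underscore ? underscore + 1 : "";

    /* Try a one-letter precision first, so that "dsyev" parses as d + sy + ev
       rather than as the mixed-precision prefix "ds". */
    size_t plen;
    if (base_len >= 3 && strchr("sdcz", base[0]) != NULL &&
        is_matrix_type(base + 1))
        plen = 1;
    else if (base_len >= 4 &&
             (strncmp(base, "ds", 2) == 0 || strncmp(base, "zc", 2) == 0) &&
             is_matrix_type(base + 2))
        plen = 2;
    else
        return -1;

    size_t op_len = base_len - plen - 2;
    if (op_len == 0 || op_len >= sizeof out->operation ||
        strlen(suffix) >= sizeof out->suffix)
        return -1;

    memcpy(out->precision, base, plen);
    memcpy(out->matrix_type, base + plen, 2);
    memcpy(out->operation, base + plen + 2, op_len);
    strcpy(out->suffix, suffix);
    return 0;
}
```

For instance, magma_zcgesv would decode to precision zc, matrix type ge, operation sv, and an empty suffix.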

Driver routines

Driver routines solve an entire problem.

Name                 Description
gesv, posv           solve linear system, AX = B
gels                 least squares solve, AX = B
geev, syev, heev     eigenvalue solver, AX = X Lambda
syevd, heevd         eigenvalue solver using divide & conquer
sygvd, hegvd         generalized eigenvalue solver, AX = BX Lambda
gesvd                singular value decomposition (SVD), A = U Sigma V^H
gesdd                SVD using divide & conquer

Computational routines

Computational routines solve one piece of a problem. Typically, driver routines call several computational routines to solve the entire problem. Here, curly braces { } group similar routines. Starred * routines are not yet implemented in MAGMA.

Name                         Description

Triangular factorizations
getrf, potrf                 triangular factorization (LU, Cholesky)
getrs, potrs                 triangular forward and back solve
getri, potri                 triangular inverse
getf2, potf2                 triangular panel factorization (BLAS-2)

Orthogonal factorizations
ge{qrf, qlf, lqf, rqf*}      QR, QL, LQ, RQ factorization
geqp3                        QR with column pivoting (BLAS-3)
or{mqr, mql, mlq, mrq*}      multiply by Q after factorization (real)
un{mqr, mql, mlq, mrq*}      multiply by Q after factorization (complex)
or{gqr, gql*, glq*, grq*}    generate Q after factorization (real)
un{gqr, gql*, glq*, grq*}    generate Q after factorization (complex)
geqr2                        QR panel factorization (BLAS-2)

Eigenvalue & SVD
gehrd                        Hessenberg reduction (in geev)
sytrd, hetrd                 tridiagonal reduction (in syev, heev)
gebrd                        bidiagonal reduction (in gesvd)

There are many other computational routines that are mostly internal to MAGMA and LAPACK, and not commonly called by end users.

BLAS routines

BLAS routines follow a similar naming scheme: precision, matrix type (for level 2 & 3), routine name. For BLAS routines, the magma_ prefix indicates a wrapper around CUBLAS (e.g., magma_zgemm calls cublasZgemm), while the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines are GPU native and take the matrix in GPU memory. The descriptions here are simplified, omitting scalars (alpha & beta) and transposes.

BLAS-1: vector operations

These do O(n) operations on O(n) data and are memory-bound.

Name    Description
copy    copy vector, y = x
scal    scale vector, y = alpha*y
swap    swap two vectors, y <-> x
axpy    y = alpha*x + y
nrm2    vector 2-norm
amax    vector max-norm
asum    vector one-norm
dot     dot product (real), x^T y
dotu    dot product (complex), unconjugated, x^T y
dotc    dot product (complex), conjugated, x^H y
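
To make the O(n) cost concrete, here is a minimal plain-C sketch of what axpy and dot compute. These are reference loops running on the CPU, not MAGMA's GPU implementations; the names ref_daxpy and ref_ddot are hypothetical, and unit stride is assumed (the real BLAS routines also take increment arguments).

```c
#include <assert.h>
#include <stddef.h>

/* Reference y = alpha*x + y: one multiply and one add per element,
   so 2n flops on 2n input values. */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Reference real dot product x^T y. */
double ref_ddot(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Each element is loaded once and used for only one or two flops, which is why these operations are limited by memory bandwidth rather than arithmetic throughput.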

BLAS-2: matrix-vector operations

These do O(n^2) operations on O(n^2) data and are memory-bound.

Name          Description
gemv          general matrix-vector product, y = A*x
symv, hemv    symmetric/Hermitian matrix-vector product, y = A*x
syr, her      symmetric/Hermitian rank-1 update, A = A + x*x^H
syr2, her2    symmetric/Hermitian rank-2 update, A = A + x*y^H + y*x^H
trmv          triangular matrix-vector product, x = A*x (in place)
trsv          triangular solve, one right-hand side (RHS), solve Ax = b
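
As an illustration of the full operation behind the simplified description of gemv, the sketch below computes y = alpha*A*x + beta*y in plain C. It is a CPU reference loop, not MAGMA's code; ref_dgemv is a hypothetical name, and the sketch assumes LAPACK-style column-major storage with leading dimension lda, no transpose, and unit-stride vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y = alpha*A*x + beta*y for an m-by-n column-major matrix A,
   where element (i,j) is stored at A[i + j*lda]. */
void ref_dgemv(size_t m, size_t n, double alpha, const double *A, size_t lda,
               const double *x, double beta, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    assert(lda >= m);

    for (size_t i = 0; i < m; ++i)
        y[i] *= beta;

    /* Walk A one column at a time so memory is read contiguously. */
    for (size_t j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += t * A[i + j * lda];
    }
}
```

Every entry of A is read once and used for two flops, so the work and the data are both O(n^2).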

BLAS-3: matrix-matrix operations

These do O(n^3) operations on O(n^2) data and are compute-bound. Level 3 BLAS are significantly more efficient than the memory-bound level 1 and level 2 BLAS.

Name            Description
gemm            general matrix-matrix multiply, C = C + A*B
symm, hemm      symmetric/Hermitian matrix-matrix multiply, C = C + A*B, A is symmetric/Hermitian
syrk, herk      symmetric/Hermitian rank-k update, C = C + A*A^H, C is symmetric/Hermitian
syr2k, her2k    symmetric/Hermitian rank-2k update, C = C + A*B^H + B*A^H, C is symmetric/Hermitian
trmm            triangular matrix-matrix multiply, B = A*B or B*A, A is triangular
trsm            triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular
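
The memory-bound versus compute-bound distinction can be estimated with a rough flops-per-element ratio. The sketch below is an illustrative model only: the function names are hypothetical, matrices are assumed square of order n, and "data" counts each distinct vector or matrix element once (x and y for axpy; A, x, and y for gemv; A, B, and C for gemm).

```c
#include <assert.h>

/* axpy: 2n flops on 2n elements. */
double axpy_intensity(double n)
{
    assert(n > 0.0);
    return (2.0 * n) / (2.0 * n);
}

/* gemv: 2n^2 flops on n^2 + 2n elements. */
double gemv_intensity(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n) / (n * n + 2.0 * n);
}

/* gemm: 2n^3 flops on 3n^2 elements, i.e. 2n/3 flops per element. */
double gemm_intensity(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

Under this model the ratio stays below 2 for level 1 and level 2 regardless of n, while for gemm it grows linearly with n, so each element fetched from memory can be reused for many flops.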

Auxiliary routines

Additional BLAS-like routines, many originally defined in LAPACK. These follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements the common ones on the GPU and adds a few of its own, such as symmetrize and transpose.

For auxiliary routines, the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory.

Name          Description
geadd         add general matrices (like axpy), B = alpha*A + B
laswp         swap rows (in getrf)
laset         set matrix to constant
lacpy         copy matrix
lascl         scale matrix
lange         norm, general matrix
lansy         norm, symmetric matrix
lanhe         norm, Hermitian matrix
lantr         norm, triangular matrix
lag2          convert general matrix from one precision to another (e.g., dlag2s is double to single)
lat2          convert triangular matrix from one precision to another
larf          apply Householder elementary reflector
larfg         generate Householder elementary reflector
larfb         apply block Householder elementary reflector
larft         form T for block Householder elementary reflector
symmetrize    copy lower triangle to upper triangle, or vice-versa
transpose     transpose matrix
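
As an example of what one of the MAGMA-specific auxiliary operations does, the following plain-C sketch copies the lower triangle of a real matrix into its upper triangle. It is a CPU reference loop, not MAGMA's GPU kernel; ref_dsymmetrize_lower is a hypothetical name, and column-major storage with leading dimension lda is assumed. A complex Hermitian version would also conjugate each copied entry.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the strictly lower triangle of an n-by-n column-major matrix
   into the strictly upper triangle, making A symmetric. */
void ref_dsymmetrize_lower(size_t n, double *A, size_t lda)
{
    assert(A != NULL && lda >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j + 1; i < n; ++i)
            A[j + i * lda] = A[i + j * lda];  /* A(j,i) = A(i,j) */
}
```

This is useful after routines that reference or update only one triangle of a symmetric matrix, when the full matrix is needed afterwards.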

Utility routines

Memory Allocation

MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. The allocation routines come in typed versions (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof.

Name                    Description
magma_*malloc_cpu       allocate CPU memory that is aligned for better performance & reproducibility
magma_free_cpu          free CPU memory allocated with malloc_cpu
magma_*malloc_pinned    allocate CPU memory that is pinned (page-locked)
magma_free_pinned       free CPU memory allocated with malloc_pinned
magma_*malloc           allocate GPU memory
magma_free              free GPU memory

where * is one of the four precisions, s d c z, or i for magma_int_t, or none for an un-typed version.
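
The difference between typed and un-typed allocators can be illustrated with stand-ins written in standard C. The functions below (example_malloc_cpu, example_dmalloc_cpu, example_free_cpu) are hypothetical and are not MAGMA's routines; they only mimic the calling pattern, using C11 aligned_alloc and an assumed 64-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EXAMPLE_ALIGN 64  /* assumed alignment, chosen for illustration */

/* Un-typed pattern: the caller passes a void** and a size in bytes.
   Returns 0 on success, -1 on failure. */
int example_malloc_cpu(void **ptr, size_t bytes)
{
    assert(ptr != NULL);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (bytes + EXAMPLE_ALIGN - 1) / EXAMPLE_ALIGN * EXAMPLE_ALIGN;
    if (rounded == 0)
        rounded = EXAMPLE_ALIGN;
    *ptr = aligned_alloc(EXAMPLE_ALIGN, rounded);
    return *ptr != NULL ? 0 : -1;
}

/* Typed pattern: the caller passes a double** and an element count,
   so no cast or sizeof is needed at the call site. */
int example_dmalloc_cpu(double **ptr, size_t n)
{
    return example_malloc_cpu((void **)ptr, n * sizeof(double));
}

void example_free_cpu(void *ptr)
{
    free(ptr);
}
```

With the typed wrapper a call reads example_dmalloc_cpu(&a, n), whereas the un-typed form needs example_malloc_cpu((void **)&a, n * sizeof(double)).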

Communication

Communication routines are named from the CPU's point of view: "set" sends data from the CPU to the GPU, and "get" retrieves data from the GPU to the CPU.

Name         Description
setmatrix    send matrix to GPU
setvector    send vector to GPU
getmatrix    get matrix from GPU
getvector    get vector from GPU
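
A matrix transfer moves an m-by-n block between two buffers that may have different leading dimensions. The sketch below models that data movement in plain C with both buffers in host memory; it is a stand-in for illustration only and does not call MAGMA or touch a GPU. The name example_copy_matrix is hypothetical, and column-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy an m-by-n column-major matrix from src (leading dimension ld_src)
   to dst (leading dimension ld_dst), one contiguous column at a time. */
void example_copy_matrix(size_t m, size_t n,
                         const double *src, size_t ld_src,
                         double *dst, size_t ld_dst)
{
    assert(src != NULL && dst != NULL);
    assert(ld_src >= m && ld_dst >= m);
    for (size_t j = 0; j < n; ++j)
        memcpy(dst + j * ld_dst, src + j * ld_src, m * sizeof(double));
}
```

In this model, a "set" corresponds to copying from the CPU buffer into a (possibly padded) device-side buffer, and a "get" is the same copy with source and destination swapped.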