MAGMA 1.6.3: Matrix Algebra for GPU and Multicore Architectures
The interface for MAGMA is similar to LAPACK, to facilitate porting existing codes.
Many routines have the same base names and the same arguments as LAPACK. In some cases, MAGMA needs larger workspaces or some additional arguments in order to implement an efficient algorithm.
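MAGMA provides several classes of routines: driver routines, computational routines, BLAS routines, auxiliary routines, memory allocation routines, and communication routines.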
A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section.
Driver and computational routines have a magma_ prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done.
Suffix | Example | Description |
---|---|---|
none | magma_dgetrf | hybrid CPU/GPU routine where the matrix is initially in CPU host memory. |
_m | magma_dgetrf_m | hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory. |
_gpu | magma_dgetrf_gpu | hybrid CPU/GPU routine where the matrix is initially in GPU device memory. |
_mgpu | magma_dgetrf_mgpu | hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories. |
In general, MAGMA follows LAPACK's naming conventions. The base name of each routine consists of a one-letter precision (occasionally two letters), a two-letter matrix type, and usually a two- or three-letter routine name. For example, DGETRF is D (double-precision), GE (general matrix), TRF (triangular factorization).
Precision | Description |
---|---|
s | single real precision (float) |
d | double real precision (double) |
c | single-complex precision (magmaFloatComplex) |
z | double-complex precision (magmaDoubleComplex) |
sc | single-complex input with single precision result (e.g., scnrm2) |
dz | double-complex input with double precision result (e.g., dznrm2) |
ds | mixed-precision algorithm (double and single, e.g., dsgesv) |
zc | mixed-precision algorithm (double-complex and single-complex, e.g., zcgesv) |
Matrix type | Description |
---|---|
ge | general matrix |
sy | symmetric matrix, can be real or complex |
he | Hermitian (complex) matrix |
po | positive definite, symmetric (real) or Hermitian (complex) matrix |
tr | triangular matrix |
or | orthogonal (real) matrix |
un | unitary (complex) matrix |
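The naming scheme above can be applied mechanically. The following is a minimal sketch, not part of MAGMA: the `routine_name` struct and `parse_routine_name` function are hypothetical helpers that split a LAPACK-style MAGMA name into the precision, matrix type, routine, and suffix listed in the tables. It assumes only the precisions, matrix types, and suffixes shown here, and it does not handle BLAS level 1 names such as scnrm2, which have no matrix type.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper type: the pieces of a MAGMA LAPACK-style routine name. */
typedef struct {
    char precision[3];  /* "s", "d", "c", "z", or a two-letter code such as "ds" */
    char type[3];       /* two-letter matrix type, e.g. "ge" */
    char routine[8];    /* routine name, e.g. "trf" */
    char suffix[8];     /* "", "m", "gpu", or "mgpu" */
} routine_name;

/* Returns 1 if the first len characters of s equal one of the list entries. */
static int in_list(const char *s, size_t len, const char *const *list, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        if (strlen(list[i]) == len && strncmp(s, list[i], len) == 0)
            return 1;
    return 0;
}

/* Splits e.g. "magma_dgetrf_gpu" into d / ge / trf / gpu.
 * Returns 1 on success, 0 if the name does not fit the scheme. */
int parse_routine_name(const char *name, routine_name *out)
{
    static const char *const precisions[] = { "s", "d", "c", "z", "sc", "dz", "ds", "zc" };
    static const char *const types[] = { "ge", "sy", "he", "po", "tr", "or", "un" };
    static const char *const suffixes[] = { "", "m", "gpu", "mgpu" };
    static const char prefix[] = "magma_";
    const size_t prefix_len = sizeof prefix - 1;

    assert(name != NULL && out != NULL);
    if (strncmp(name, prefix, prefix_len) != 0)
        return 0;

    const char *base = name + prefix_len;
    const char *underscore = strchr(base, '_');
    size_t base_len = underscore ? (size_t)(underscore - base) : strlen(base);
    const char *suffix = underscore ? underscore + 1 : "";
    if (!in_list(suffix, strlen(suffix), suffixes, sizeof suffixes / sizeof suffixes[0]))
        return 0;

    /* Try a one-letter precision first, then a two-letter one. The matrix
     * type check resolves the ambiguity between "d" + "sy" and "ds" + "ge". */
    for (size_t p = 1; p <= 2; ++p) {
        if (base_len <= p + 2)
            continue;
        if (!in_list(base, p, precisions, sizeof precisions / sizeof precisions[0]))
            continue;
        if (!in_list(base + p, 2, types, sizeof types / sizeof types[0]))
            continue;
        size_t routine_len = base_len - p - 2;
        if (routine_len >= sizeof out->routine)
            return 0;
        memset(out, 0, sizeof *out);
        memcpy(out->precision, base, p);
        memcpy(out->type, base + p, 2);
        memcpy(out->routine, base + p + 2, routine_len);
        memcpy(out->suffix, suffix, strlen(suffix));
        return 1;
    }
    return 0;
}
```

Checking the matrix type before accepting a precision is what lets the sketch read magma_dsyevd as d, sy, evd while reading magma_dsgesv as ds, ge, sv.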
Driver routines solve an entire problem.
Name | Description |
---|---|
gesv, posv | solve linear system, AX = B |
gels | least squares solve, AX = B |
geev, syev, heev | eigenvalue solver, AX = X Lambda |
syevd, heevd | eigenvalue solver using divide & conquer |
sygvd, hegvd | generalized eigenvalue solver, AX = BX Lambda |
gesvd | singular value decomposition (SVD), A = U Sigma V^H |
gesdd | SVD using divide & conquer |
Computational routines solve one piece of a problem. Typically, driver routines call several computational routines to solve the entire problem. Here, curly braces { } group similar routines. Starred * routines are not yet implemented in MAGMA.
Name | Description |
---|---|
Triangular factorizations | Description |
getrf, potrf | triangular factorization (LU, Cholesky) |
getrs, potrs | triangular forward and back solve |
getri, potri | triangular inverse |
getf2, potf2 | triangular panel factorization (BLAS-2) |
Orthogonal factorizations | Description |
ge{qrf, qlf, lqf, rqf*} | QR, QL, LQ, RQ factorization |
geqp3 | QR with column pivoting (BLAS-3) |
or{mqr, mql, mlq, mrq*} | multiply by Q after factorization (real) |
un{mqr, mql, mlq, mrq*} | multiply by Q after factorization (complex) |
or{gqr, gql*, glq*, grq*} | generate Q after factorization (real) |
un{gqr, gql*, glq*, grq*} | generate Q after factorization (complex) |
geqr2 | QR panel factorization (BLAS-2) |
Eigenvalue & SVD | Description |
gehrd | Hessenberg reduction (in geev) |
sytrd, hetrd | tridiagonal reduction (in syev, heev) |
gebrd | bidiagonal reduction (in gesvd) |
There are many other computational routines that are mostly internal to MAGMA and LAPACK, and not commonly called by end users.
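To illustrate how a driver routine is built from computational routines, here is a CPU-only sketch in which a gesv-like driver calls a getrf-like factorization followed by a getrs-like solve. The `toy_` functions are hypothetical and unblocked; they are not MAGMA's hybrid implementations, and they use 0-based pivot indices where LAPACK uses 1-based ones.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major indexing, as in LAPACK: element (i,j) with leading dimension lda. */
#define AT(A, lda, i, j) ((A)[(size_t)(j) * (size_t)(lda) + (size_t)(i)])

static double abs_val(double x) { return x < 0.0 ? -x : x; }

/* getrf-like step: in-place LU factorization with partial pivoting.
 * Returns 0 on success, or k+1 if the k-th pivot is exactly zero. */
int toy_getrf(int n, double *A, int lda, int *ipiv)
{
    assert(A != NULL && ipiv != NULL && lda >= n);
    for (int k = 0; k < n; ++k) {
        int p = k;                               /* row of the largest entry in column k */
        for (int i = k + 1; i < n; ++i)
            if (abs_val(AT(A, lda, i, k)) > abs_val(AT(A, lda, p, k)))
                p = i;
        ipiv[k] = p;
        if (AT(A, lda, p, k) == 0.0)
            return k + 1;
        if (p != k)                              /* swap rows k and p */
            for (int j = 0; j < n; ++j) {
                double t = AT(A, lda, k, j);
                AT(A, lda, k, j) = AT(A, lda, p, j);
                AT(A, lda, p, j) = t;
            }
        for (int i = k + 1; i < n; ++i) {        /* eliminate below the pivot */
            AT(A, lda, i, k) /= AT(A, lda, k, k);
            for (int j = k + 1; j < n; ++j)
                AT(A, lda, i, j) -= AT(A, lda, i, k) * AT(A, lda, k, j);
        }
    }
    return 0;
}

/* getrs-like step: solve A*X = B using the factors from toy_getrf.
 * B (n x nrhs) is overwritten by X. */
void toy_getrs(int n, int nrhs, const double *A, int lda, const int *ipiv,
               double *B, int ldb)
{
    assert(A != NULL && ipiv != NULL && B != NULL && ldb >= n);
    for (int c = 0; c < nrhs; ++c) {
        double *b = B + (size_t)c * (size_t)ldb;
        for (int k = 0; k < n; ++k) {            /* apply the row interchanges */
            double t = b[k];
            b[k] = b[ipiv[k]];
            b[ipiv[k]] = t;
        }
        for (int i = 0; i < n; ++i)              /* forward solve, unit lower L */
            for (int j = 0; j < i; ++j)
                b[i] -= AT(A, lda, i, j) * b[j];
        for (int i = n - 1; i >= 0; --i) {       /* back solve with upper U */
            for (int j = i + 1; j < n; ++j)
                b[i] -= AT(A, lda, i, j) * b[j];
            b[i] /= AT(A, lda, i, i);
        }
    }
}

/* gesv-like driver: solves the whole problem by chaining the two steps. */
int toy_gesv(int n, int nrhs, double *A, int lda, int *ipiv, double *B, int ldb)
{
    int info = toy_getrf(n, A, lda, ipiv);
    if (info == 0)
        toy_getrs(n, nrhs, A, lda, ipiv, B, ldb);
    return info;
}
```

A caller that needs to solve several systems with the same matrix would call the factorization once and the solve repeatedly, which is the main reason the computational routines are exposed separately from the driver.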
BLAS routines follow a similar naming scheme: precision, matrix type (for level 2 & 3), routine name. For BLAS routines, the magma_ prefix indicates a wrapper around CUBLAS (e.g., magma_zgemm calls cublasZgemm), while the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines are GPU native and take the matrix in GPU memory. The descriptions here are simplified, omitting scalars (alpha & beta) and transposes.
Level 1 BLAS routines do O(n) operations on O(n) data and are memory-bound.
Name | Description |
---|---|
copy | copy vector, y = x |
scal | scale vector, y = alpha*y |
swap | swap two vectors, y <-> x |
axpy | y = alpha*x + y |
nrm2 | vector 2-norm |
amax | vector max-norm |
asum | vector one-norm |
dot | dot product (real), x^T y |
dotu | dot product (complex), unconjugated, x^T y |
dotc | dot product (complex), conjugated, x^H y |
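As a reference for what two of the level 1 operations compute, here is a plain CPU sketch of axpy and dot. The `toy_` names are hypothetical, and the stride arguments (incx, incy) of the real BLAS interface are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* axpy-like: y = alpha*x + y. One multiply and one add per element. */
void toy_daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* dot-like: returns x^T y for real vectors. */
double toy_ddot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Each element is loaded once and used in only one or two floating-point operations, which is why these routines are limited by memory bandwidth rather than arithmetic throughput.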
Level 2 BLAS routines do O(n^2) operations on O(n^2) data and are memory-bound.
Name | Description |
---|---|
gemv | general matrix-vector product, y = A*x |
symv, hemv | symmetric/Hermitian matrix-vector product, y = A*x |
syr, her | symmetric/Hermitian rank-1 update, A = A + x*x^H |
syr2, her2 | symmetric/Hermitian rank-2 update, A = A + x*y^H + y*x^H |
trmv | triangular matrix-vector product, y = A*x |
trsv | triangular solve, one right-hand side (RHS), solve Ax = b |
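The simplified gemv description, y = A*x, can be sketched on the CPU as follows. `toy_dgemv` is a hypothetical name; like the table, it omits the alpha and beta scalars and the transpose option, and it assumes column-major storage with leading dimension lda, as in LAPACK.

```c
#include <assert.h>
#include <stddef.h>

/* gemv-like: y = A*x, where A is m x n in column-major storage. */
void toy_dgemv(size_t m, size_t n, const double *A, size_t lda,
               const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL && lda >= m);
    for (size_t i = 0; i < m; ++i)
        y[i] = 0.0;
    /* Walk A one column at a time so memory is read contiguously. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            y[i] += A[j * lda + i] * x[j];
}
```

The leading dimension may be larger than the number of rows; the extra entries in each column are padding and are never read.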
Level 3 BLAS routines do O(n^3) operations on O(n^2) data and are compute-bound. They are significantly more efficient than the memory-bound level 1 and level 2 BLAS.
Name | Description |
---|---|
gemm | general matrix-matrix multiply, C = C + A*B |
symm, hemm | symmetric/Hermitian matrix-matrix multiply, C = C + A*B, A is symmetric |
syrk, herk | symmetric/Hermitian rank-k update, C = C + A*A^H, C is symmetric |
syr2k, her2k | symmetric/Hermitian rank-2k update, C = C + A*B^H + B*A^H, C is symmetric |
trmm | triangular matrix-matrix multiply, B = A*B or B*A, A is triangular |
trsm | triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular |
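The following CPU sketch shows the simplified gemm update, C = C + A*B, together with a small helper that makes the compute-bound claim concrete. Both `toy_` functions are hypothetical; the helper assumes square n x n matrices, 2n^3 floating-point operations, and 3n^2 matrix entries touched.

```c
#include <assert.h>
#include <stddef.h>

/* gemm-like: C = C + A*B, with A m x k, B k x n, C m x n, all column-major. */
void toy_dgemm(size_t m, size_t n, size_t k,
               const double *A, size_t lda,
               const double *B, size_t ldb,
               double *C, size_t ldc)
{
    assert(A != NULL && B != NULL && C != NULL);
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; ++j)
        for (size_t p = 0; p < k; ++p) {
            double b = B[j * ldb + p];
            for (size_t i = 0; i < m; ++i)
                C[j * ldc + i] += A[p * lda + i] * b;
        }
}

/* Floating-point operations per matrix entry for square gemm:
 * 2n^3 operations over 3n^2 entries, which simplifies to 2n/3. */
double toy_gemm_flops_per_entry(double n)
{
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

Because the ratio grows linearly with n, each entry loaded from memory can be reused many times, whereas for level 1 and level 2 routines the ratio stays constant regardless of problem size.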
MAGMA also provides additional BLAS-like auxiliary routines, many originally defined in LAPACK. These follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements the common ones on the GPU and adds a few of its own, such as symmetrize and transpose.
For auxiliary routines, the magmablas_ prefix indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory.
Name | Description |
---|---|
geadd | add general matrices (like axpy), B = alpha*A + B |
laswp | swap rows (in getrf) |
laset | set matrix to constant |
lacpy | copy matrix |
lascl | scale matrix |
lange | norm, general matrix |
lansy | norm, symmetric matrix |
lanhe | norm, Hermitian matrix |
lantr | norm, triangular matrix |
lag2 | convert general matrix from one precision to another (e.g., dlag2s is double to single) |
lat2 | convert triangular matrix from one precision to another |
larf | apply Householder elementary reflector |
larfg | generate Householder elementary reflector |
larfb | apply block Householder elementary reflector |
larft | form T for block Householder elementary reflector |
symmetrize | copy lower triangle to upper triangle, or vice-versa |
transpose | transpose matrix |
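For the two routines that MAGMA adds beyond LAPACK, the following CPU sketch shows what the operations do for real matrices. The `toy_` names are hypothetical stand-ins; the actual MAGMA routines operate on matrices in GPU memory, and for complex Hermitian matrices a symmetrize operation would also need to conjugate the copied entries.

```c
#include <assert.h>
#include <stddef.h>

/* symmetrize-like: copy the strictly lower triangle of an n x n
 * column-major matrix into the upper triangle, A(j,i) = A(i,j) for i > j. */
void toy_dsymmetrize_lower(size_t n, double *A, size_t lda)
{
    assert(A != NULL && lda >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j + 1; i < n; ++i)
            A[i * lda + j] = A[j * lda + i];
}

/* transpose-like: out-of-place B = A^T, with A m x n and B n x m. */
void toy_dtranspose(size_t m, size_t n, const double *A, size_t lda,
                    double *B, size_t ldb)
{
    assert(A != NULL && B != NULL && lda >= m && ldb >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            B[i * ldb + j] = A[j * lda + i];
}
```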
MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. There are typed versions of these (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof.
Name | Description |
---|---|
magma_*malloc_cpu | allocate CPU memory that is aligned for better performance & reproducibility |
magma_free_cpu | free CPU memory allocated with malloc_cpu |
magma_*malloc_pinned | allocate CPU memory that is pinned (page-locked) |
magma_free_pinned | free CPU memory allocated with malloc_pinned |
magma_*malloc | allocate GPU memory |
magma_free | free GPU memory |
where * is one of the four precisions, s d c z, or i for magma_int_t, or none for an un-typed version.
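The difference between the typed and un-typed allocation interfaces can be sketched with ordinary malloc. The `toy_` functions below are hypothetical stand-ins that only mirror the calling pattern described above; they do not align or pin memory, and the return convention (0 on success, nonzero on failure) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Un-typed allocation: the caller passes a void** and a size in bytes. */
int toy_malloc_cpu(void **ptr, size_t bytes)
{
    assert(ptr != NULL);
    *ptr = malloc(bytes ? bytes : 1);
    return *ptr ? 0 : 1;
}

/* Typed allocation for double: the caller passes an element count,
 * so no cast and no sizeof are needed at the call site. */
int toy_dmalloc_cpu(double **ptr, size_t n)
{
    assert(ptr != NULL);
    if (n > SIZE_MAX / sizeof(double)) {   /* guard against size overflow */
        *ptr = NULL;
        return 1;
    }
    return toy_malloc_cpu((void **)ptr, n * sizeof(double));
}

/* Matching free for both allocation routines. */
void toy_free_cpu(void *ptr)
{
    free(ptr);
}
```

With the un-typed version a caller writes `toy_malloc_cpu((void **)&a, n * sizeof(double))`, while the typed version reduces this to `toy_dmalloc_cpu(&a, n)`.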
Communication routines are named from the CPU's point of view: set routines send data to the GPU, and get routines retrieve data from the GPU.
Name | Description |
---|---|
setmatrix | send matrix to GPU |
setvector | send vector to GPU |
getmatrix | get matrix from GPU |
getvector | get vector from GPU |
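The set and get directions can be illustrated with a CPU-only sketch in which an ordinary buffer stands in for GPU memory. The `toy_` functions are hypothetical; the parameter order (source first, then destination, each with its own leading dimension) is an assumption of this sketch rather than a statement of MAGMA's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* setmatrix-like: copy an m x n column-major matrix from the host buffer hA
 * (leading dimension lda) to the "device" buffer dA (leading dimension ldda). */
void toy_dsetmatrix(size_t m, size_t n, const double *hA, size_t lda,
                    double *dA, size_t ldda)
{
    assert(hA != NULL && dA != NULL && lda >= m && ldda >= m);
    for (size_t j = 0; j < n; ++j)
        memcpy(dA + j * ldda, hA + j * lda, m * sizeof(double));
}

/* getmatrix-like: copy in the opposite direction, "device" to host. */
void toy_dgetmatrix(size_t m, size_t n, const double *dA, size_t ldda,
                    double *hA, size_t lda)
{
    assert(dA != NULL && hA != NULL && ldda >= m && lda >= m);
    for (size_t j = 0; j < n; ++j)
        memcpy(hA + j * lda, dA + j * ldda, m * sizeof(double));
}
```

Copying column by column lets the two sides use different leading dimensions, which matters when the device copy is padded for alignment.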