The interface for MAGMA is similar to LAPACK, to facilitate porting existing codes.
Many routines have the same base names and the same arguments as LAPACK. In some cases, MAGMA needs larger workspaces or some additional arguments in order to implement an efficient algorithm.
There are several classes of routines in MAGMA:
1. driver -- Solve an entire problem.
2. comp -- Solve one piece of a problem.
3. blas -- Basic Linear Algebra Subroutines. These form the basis for linear algebra algorithms.
4. aux -- Additional BLAS-like routines, many originally defined in LAPACK.
5. util -- Additional routines, many specific to GPU programming.
A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section.
Driver & computational routines have a `magma_` prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done.

Suffix      | Example           | Description
----------- | -----------       | -----------
none        | magma_dgetrf      | hybrid CPU/GPU routine where the matrix is initially in CPU host memory.
_m          | magma_dgetrf_m    | hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory.
_gpu        | magma_dgetrf_gpu  | hybrid CPU/GPU routine where the matrix is initially in GPU device memory.
_mgpu       | magma_dgetrf_mgpu | hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories.

In general, MAGMA follows LAPACK's naming conventions. The base name of each routine has a one-letter precision (occasionally two letters), a two-letter matrix type, and usually a two- or three-letter routine name. For example, DGETRF is D (double precision), GE (general matrix), TRF (triangular factorization).

Precision   | Description
----------- | -----------
s           | single real precision (float)
d           | double real precision (double)
c           | single-complex precision (magmaFloatComplex)
z           | double-complex precision (magmaDoubleComplex)
sc          | single-complex input with single precision result (e.g., scnrm2)
dz          | double-complex input with double precision result (e.g., dznrm2)
ds          | mixed-precision algorithm (double and single, e.g., dsgesv)
zc          | mixed-precision algorithm (double-complex and single-complex, e.g., zcgesv)

Matrix type | Description
----------- | -----------
ge          | general matrix
sy          | symmetric matrix, can be real or complex
he          | Hermitian (complex) matrix
po          | positive definite, symmetric (real) or Hermitian (complex) matrix
tr          | triangular matrix
or          | orthogonal (real) matrix
un          | unitary (complex) matrix

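
As a rough illustration of the naming scheme, the sketch below splits a name such as `magma_dgetrf_gpu` into precision, matrix type, routine name, and suffix. The `routine_name` struct and `decode_name` function are hypothetical helpers written for this example; they are not part of MAGMA. The sketch assumes a one-letter precision and a two-letter matrix type, so it does not handle two-letter precisions (e.g., dsgesv) or BLAS-1 names that have no matrix type (e.g., dnrm2).

```c
#include <assert.h>
#include <string.h>

/* Parts of a routine name such as "magma_dgetrf_gpu".
   Illustrative only; not a MAGMA type. */
typedef struct {
    char precision;      /* 's', 'd', 'c', or 'z' */
    char matrix_type[3]; /* e.g., "ge", "po" */
    char routine[16];    /* e.g., "trf" */
    char suffix[8];      /* "", "m", "gpu", or "mgpu" */
} routine_name;

/* Returns 0 on success, or -1 if the name does not match the simple
   pattern magma_<p><tt><routine>[_<suffix>] with a one-letter precision. */
int decode_name(const char *name, routine_name *out)
{
    assert(name != NULL && out != NULL);

    const char *prefix = "magma_";
    size_t plen = strlen(prefix);
    if (strncmp(name, prefix, plen) != 0)
        return -1;

    const char *base = name + plen;
    const char *us = strchr(base, '_');   /* start of optional suffix */
    size_t base_len = us ? (size_t)(us - base) : strlen(base);

    /* Need 1 precision letter + 2 type letters + at least 1 routine letter. */
    if (base_len < 4 || base_len - 3 >= sizeof out->routine)
        return -1;
    if (strchr("sdcz", base[0]) == NULL)
        return -1;

    out->precision = base[0];
    memcpy(out->matrix_type, base + 1, 2);
    out->matrix_type[2] = '\0';
    memcpy(out->routine, base + 3, base_len - 3);
    out->routine[base_len - 3] = '\0';

    const char *suf = us ? us + 1 : "";
    if (strlen(suf) >= sizeof out->suffix)
        return -1;
    strcpy(out->suffix, suf);
    return 0;
}
```

Calling `decode_name("magma_dgetrf_gpu", &r)` fills `r` with precision `d`, matrix type `ge`, routine `trf`, and suffix `gpu`; a name without the `magma_` prefix is rejected.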

Driver routines {driver}
=================================

Driver routines solve an entire problem.

Name             | Description
-----------      | -----------
gesv, posv       | solve linear system, AX = B
gels             | least squares solve, AX = B
geev, syev, heev | eigenvalue solver, AX = X Lambda
syevd, heevd     | eigenvalue solver using divide & conquer
sygvd, hegvd     | generalized eigenvalue solver, AX = BX Lambda
gesvd            | singular value decomposition (SVD), A = U Sigma V^H
gesdd            | SVD using divide & conquer


Computational routines {comp}
=================================

Computational routines solve one piece of a problem. Typically, driver routines call several computational routines to solve the entire problem. Here, curly braces { } group similar routines. Starred * routines are not yet implemented in MAGMA.

Name                            | Description
-----------                     | -----------
**Triangular factorizations**   | **Description**
getrf, potrf                    | triangular factorization (LU, Cholesky)
getrs, potrs                    | triangular forward and back solve
getri, potri                    | triangular inverse
getf2, potf2                    | triangular panel factorization (BLAS-2)
**Orthogonal factorizations**   | **Description**
ge{qrf, qlf, lqf, rqf*}         | QR, QL, LQ, RQ factorization
geqp3                           | QR with column pivoting (BLAS-3)
or{mqr, mql, mlq, mrq*}         | multiply by Q after factorization (real)
un{mqr, mql, mlq, mrq*}         | multiply by Q after factorization (complex)
or{gqr, gql*, glq*, grq*}       | generate Q after factorization (real)
un{gqr, gql*, glq*, grq*}       | generate Q after factorization (complex)
geqr2                           | QR panel factorization (BLAS-2)
**Eigenvalue & SVD**            | **Description**
gehrd                           | Hessenberg reduction (in geev)
sytrd, hetrd                    | tridiagonal reduction (in syev, heev)
gebrd                           | bidiagonal reduction (in gesvd)

There are many other computational routines that are mostly internal to MAGMA and LAPACK, and not commonly called by end users.

BLAS routines {blas}
=================================

BLAS routines follow a similar naming scheme: precision, matrix type (for level 2 & 3), routine name. For BLAS routines, the **magma_ prefix** indicates a wrapper around CUBLAS (e.g., magma_zgemm calls cublasZgemm), while the **magmablas_ prefix** indicates our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines are GPU native and take the matrix in GPU memory. The descriptions here are simplified, omitting scalars (alpha & beta) and transposes.

BLAS-1: vector operations
---------------------------------

These do O(n) operations on O(n) data and are memory-bound.

Name        | Description
----------- | -----------
copy        | copy vector, y = x
scal        | scale vector, y = alpha*y
swap        | swap two vectors, y <---> x
axpy        | y = alpha*x + y
nrm2        | vector 2-norm
amax        | index of the element with largest absolute value (the element that attains the vector max-norm)
asum        | vector one-norm
dot         | dot product (real), x^T y
dotu        | dot product (complex), unconjugated, x^T y
dotc        | dot product (complex), conjugated, x^H y


BLAS-2: matrix-vector operations
---------------------------------

These do O(n^2) operations on O(n^2) data and are memory-bound.

Name        | Description
----------- | -----------
gemv        | general matrix-vector product, y = A*x
symv, hemv  | symmetric/Hermitian matrix-vector product, y = A*x
syr, her    | symmetric/Hermitian rank-1 update, A = A + x*x^H
syr2, her2  | symmetric/Hermitian rank-2 update, A = A + x*y^H + y*x^H
trmv        | triangular matrix-vector product, computed in place, x = A*x
trsv        | triangular solve, one right-hand side (RHS), solve Ax = b

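
Because the table omits the scalars, it may help to see the full gemv operation, y = alpha*A*x + beta*y, written out. The sketch below is a plain, unoptimized CPU reference under the usual LAPACK assumption of column-major storage with a leading dimension `lda`; it ignores the transpose option. `ref_dgemv` is a hypothetical name chosen for this example and is not a MAGMA or BLAS routine.

```c
#include <assert.h>
#include <stddef.h>

/* Reference y = alpha*A*x + beta*y for an m-by-n column-major matrix A
   with leading dimension lda >= max(1, m). CPU-only and unoptimized;
   ref_dgemv is a hypothetical name, not a MAGMA or BLAS routine. */
void ref_dgemv(int m, int n, double alpha, const double *A, int lda,
               const double *x, double beta, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 1 ? m : 1));

    /* Scale y first; beta == 0 overwrites y rather than multiplying it. */
    for (int i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Accumulate one column of A at a time, so A is read contiguously. */
    for (int j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (int i = 0; i < m; ++i)
            y[i] += t * A[i + (size_t)j * lda];
    }
}
```

Setting alpha = 1 and beta = 0 recovers the simplified form y = A*x shown in the table.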

BLAS-3: matrix-matrix operations
---------------------------------

These do O(n^3) operations on O(n^2) data and are compute-bound. Level 3 BLAS are significantly more efficient than the memory-bound level 1 and level 2 BLAS.

Name         | Description
-----------  | -----------
gemm         | general matrix-matrix multiply, C = C + A*B
symm, hemm   | symmetric/Hermitian matrix-matrix multiply, C = C + A*B, A is symmetric
syrk, herk   | symmetric/Hermitian rank-k update, C = C + A*A^H, C is symmetric
syr2k, her2k | symmetric/Hermitian rank-2k update, C = C + A*B^H + B*A^H, C is symmetric
trmm         | triangular matrix-matrix multiply, B = A*B or B*A, A is triangular
trsm         | triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular

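
To make the memory-bound versus compute-bound distinction concrete, a rough count of floating-point operations per word of memory traffic can be worked out for one representative routine per level. The counts below are an idealized estimate, not measured MAGMA performance: they assume vectors of length n and n-by-n real matrices, assume each operand is read or written exactly once, and ignore the alpha and beta scalars.

```latex
\begin{aligned}
\text{axpy } (y = \alpha x + y):\quad
  & \frac{\text{flops}}{\text{words}} = \frac{2n}{3n} = \frac{2}{3}, \\
\text{gemv } (y = A x + y):\quad
  & \frac{\text{flops}}{\text{words}} = \frac{2n^2}{n^2 + 3n} \approx 2, \\
\text{gemm } (C = C + A B):\quad
  & \frac{\text{flops}}{\text{words}} = \frac{2n^3}{4n^2} = \frac{n}{2}.
\end{aligned}
```

Under these assumptions the ratio stays constant for levels 1 and 2 but grows with n for level 3, so only level 3 routines can keep the arithmetic units busy while waiting on memory.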

Auxiliary routines {aux}
=================================

Additional BLAS-like routines, many originally defined in LAPACK. These follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements the common ones on the GPU and adds a few of its own, such as symmetrize and transpose.
For auxiliary routines, the **magmablas_ prefix** indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory.

Name        | Description
----------- | -----------
geadd       | add general matrices (like axpy), B = alpha*A + B
laswp       | swap rows (in getrf)
laset       | set matrix to constant
lacpy       | copy matrix
lascl       | scale matrix
lange       | norm, general matrix
lansy       | norm, symmetric matrix
lanhe       | norm, Hermitian matrix
lantr       | norm, triangular matrix
lag2        | convert general matrix from one precision to another (e.g., dlag2s is double to single)
lat2        | convert triangular matrix from one precision to another
larf        | apply Householder elementary reflector
larfg       | generate Householder elementary reflector
larfb       | apply block Householder elementary reflector
larft       | form T for block Householder elementary reflector
symmetrize  | copy lower triangle to upper triangle, or vice-versa
transpose   | transpose matrix


Utility routines {util}
=================================

Memory Allocation
---------------------------------

MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. There are typed versions of these (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof.

Name                 | Description
-----------          | -----------
magma_*malloc_cpu    | allocate CPU memory that is aligned for better performance & reproducibility
magma_free_cpu       | free CPU memory allocated with malloc_cpu
magma_*malloc_pinned | allocate CPU memory that is pinned (page-locked)
magma_free_pinned    | free CPU memory allocated with malloc_pinned
magma_*malloc        | allocate GPU memory
magma_free           | free GPU memory

where * is one of the four precisions, s d c z, or i for magma_int_t, or none for an un-typed version.
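
The difference between the typed and un-typed calling patterns can be sketched with ordinary host memory. The two functions below are hypothetical stand-ins built on plain `malloc`; they are not the MAGMA routines and allocate no GPU, aligned, or pinned memory. They only mimic the shape described above: the un-typed version takes a `void**` and a size in bytes, while the typed version takes a `double**` and an element count. Consult the MAGMA headers for the actual signatures and return codes.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins that mimic the calling pattern using plain malloc.
   They are NOT the MAGMA routines. Both return 0 on success, -1 on failure. */

/* Un-typed: the caller supplies a void** and a size in bytes. */
int my_malloc(void **ptr, size_t bytes)
{
    assert(ptr != NULL);
    *ptr = malloc(bytes);
    return (*ptr != NULL || bytes == 0) ? 0 : -1;
}

/* Typed: the caller supplies a double** and a number of elements;
   the sizeof multiplication is hidden inside the wrapper. */
int my_dmalloc(double **ptr, size_t n)
{
    void *p = NULL;
    int err = my_malloc(&p, n * sizeof(double));
    *ptr = p;
    return err;
}
```

With the typed wrapper, a call looks like `my_dmalloc(&A, n)`; with the un-typed one, the caller writes the byte count itself, as in `my_malloc(&p, n * sizeof(double))`.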

Communication
---------------------------------

Communication routines are named from the CPU's point of view: set routines send data to the GPU, and get routines retrieve data from the GPU.

Name        | Description
----------- | -----------
setmatrix   | send matrix to GPU
setvector   | send vector to GPU
getmatrix   | get matrix from GPU
getvector   | get vector from GPU