This file is processed with Doxygen. See docs/html/index.html for formatted version. /** ******************************************************************************** @mainpage MAGMA Users' Guide Univ. of Tennessee, Knoxville \n Univ. of California, Berkeley \n Univ. of Colorado, Denver @date November 2016 The goal of the MAGMA project is to create a new generation of linear algebra libraries that achieves the fastest possible time to an accurate solution on heterogeneous architectures, starting with current multicore + multi-GPU systems. To address the complex challenges stemming from these systems' heterogeneity, massive parallelism, and the gap between compute speed and CPU-GPU communication speed, MAGMA's research is based on the idea that optimal software solutions will themselves have to hybridize, combining the strengths of different algorithms within a single framework. Building on this idea, the goal is to design linear algebra algorithms and frameworks for hybrid multicore and multi-GPU systems that can enable applications to fully exploit the power that each of the hybrid components offers. Designed to be similar to LAPACK in functionality, data storage, and interface, the MAGMA library allows scientists to easily port their existing software components from LAPACK to MAGMA, to take advantage of the new hybrid architectures. MAGMA users do not have to know CUDA in order to use the library. There are two types of LAPACK-style interfaces. The first one, referred to as the *CPU interface*, takes the input and produces the result in the CPU's memory. The second, referred to as the *GPU interface*, takes the input and produces the result in the GPU's memory. In both cases, a hybrid CPU/GPU algorithm is used. Also included is MAGMA BLAS, a complementary to CUBLAS routines. ******************************************************************************** @page authors Collaborators - A. Abdelfattah (UT Knoxville) - H. Anzt (UT Knoxville) - M. Baboulin (U Paris-Sud) - C. Cao (UT Knoxville) - J. Demmel (UC Berkeley) - T. Dong (UT Knoxville) - J. Dongarra (UT Knoxville) - P. Du (UT Knoxville) - M. Faverge (INRIA) - M. Gates (UT Knoxville) - A. Haidar (UT Knoxville) - M. Horton (UT Knoxville) - J. Kurzak (UT Knoxville) - J. Langou (UC Denver) - H. Ltaief (KAUST) - P. Luszczek (UT Knoxville) - T. Mary (IRIT, Team APO) - R. Nath (UT Knoxville) - R. Solca (ETH Zurich) - S. Tomov (UT Knoxville) - V. Volkov (UC Berkeley) - I. Yamazaki (UT Knoxville) ******************************************************************************** @page installing Installing MAGMA First, create a file make.inc, using one of the examples as a template. Set environment variables for where external packages are installed, either in your .cshrc/.bashrc file, or in the make.inc file itself. For instance, to specify where CUDA and Intel MKL are installed, using csh: setenv CUDADIR /usr/local/cuda # sourcing this script sets MKLROOT source /opt/intel/composer/bin/compilervars.csh intel64 using sh/bash: CUDADIR=/usr/local/cuda source /opt/intel/composer/bin/compilervars.sh intel64 If you do not have a Fortran compiler, comment out FORT. MAGMA's Fortran 90 interface and Fortran testers will not be built. Also, many testers will not be able to check their results -- they will print an error message, e.g.: magma/testing> ./testing_dgehrd -N 100 -c ... Cannot check results: dhst01_ unavailable, since there was no Fortran compiler. 100 --- ( --- ) 0.70 ( 0.00) 0.00e+00 0.00e+00 ok To compile shared library, first add -fPIC to CFLAGS, FFLAGS, etc. in make.inc; see example in make.inc.mkl-shared. Then: make shared make test make install prefix=/usr/local/magma To compile static library: make lib make test make install prefix=/usr/local/magma This installs libmagma.a and libmagma.so (if compiled) in \$(prefix)/lib, and the MAGMA header files in \$(prefix)/include. It also installs \$(prefix)/lib/pkgconfig/magma.pc for pkg-config. ******************************************************************************** @page testing Running tests The testing directory includes tests of most MAGMA functions. These are useful as examples, though they contain additional testing features that your application does not need, or would do differently. The \ref example "example" directory has a simple example without all this additional framework. See the testing/run_tests.py file to run a standard set of test sizes and options. Currentyl, nearly all the tests pass with --tol 100. With the default tolerance, usually 30, some tests will signal false negatives, where the test is okay but just slightly above the accuracy tolerance check. ******************************************************************************** @page example Example The example directory shows a simple, standalone example. This shows how to use MAGMA, apart from the MAGMA Makefiles and other special framework that we've developed for the tests. ******************************************************************************** @page routines Overview The interface for MAGMA is similar to LAPACK, to facilitate porting existing codes. Many routines have the same base names and the same arguments as LAPACK. In some cases, MAGMA needs larger workspaces or some additional arguments in order to implement an efficient algorithm. There are several classes of routines in MAGMA: 1. \ref driver -- Solve an entire problem. 2. \ref comp -- Solve one piece of a problem. 3. \ref blas -- Basic Linear Algebra Subroutines. These form the basis for linear algebra algorithms. 4. \ref aux -- Additional BLAS-like routines, many originally defined in LAPACK. 5. \ref util -- Additional routines, many specific to GPU programming. A brief summary of routines is given here. Full descriptions of individual routines are given in the Modules section. Driver & computational routines have a `magma_` prefix. These are generally hybrid CPU/GPU algorithms. A suffix indicates in what memory the matrix starts and ends, not where the computation is done. Suffix | Example | Description ----------- | ----------- | ----------- none | magma_dgetrf | hybrid CPU/GPU routine where the matrix is initially in CPU host memory. _m | magma_dgetrf_m | hybrid CPU/multiple-GPU routine where the matrix is initially in CPU host memory. _gpu | magma_dgetrf_gpu | hybrid CPU/GPU routine where the matrix is initially in GPU device memory. _mgpu | magma_dgetrf_mgpu | hybrid CPU/multiple-GPU routine where the matrix is distributed across multiple GPUs' device memories. In general, MAGMA follows LAPACK's naming conventions. The base name of each routine has a one letter precision (occasionally two letters), two letter matrix type, and usually a 2-3 letter routine name. For example, DGETRF is D (double-precision), GE (general matrix), TRF (triangular factorization). Precision | Description ----------- | ----------- s | single real precision (float) d | double real precision (double) c | single-complex precision (magmaFloatComplex) z | double-complex precision (magmaDoubleComplex) sc | single-complex input with single precision result (e.g., scnrm2) dz | double-complex input with double precision result (e.g., dznrm2) ds | mixed-precision algorithm (double and single, e.g., dsgesv) zc | mixed-precision algorithm (double-complex and single-complex, e.g., zcgesv) Matrix type | Description ----------- | ----------- ge | general matrix sy | symmetric matrix, can be real or complex he | Hermitian (complex) matrix po | positive definite, symmetric (real) or Hermitian (complex) matrix tr | triangular matrix or | orthogonal (real) matrix un | unitary (complex) matrix Driver routines {#driver} ================================= Driver routines solve an entire problem. Name | Description ----------- | ----------- \ref magma_gesv "gesv" | solve linear system, AX = B, A is general (non-symmetric) \ref magma_posv "posv" | solve linear system, AX = B, A is symmetric/Hermitian positive definite \ref magma_hesv "hesv" | solve linear system, AX = B, A is symmetric indefinite \ref magma_gels "gels" | least squares solve, AX = B, A is rectangular \ref magma_geev "geev" | non-symmetric eigenvalue solver, AX = X Lambda \ref magma_heev "syev/heev" | symmetric eigenvalue solver, AX = X Lambda \ref magma_heev "syevd/heevd" | symmetric eigenvalue solver, AX = X Lambda, using divide & conquer \ref magma_hegv "sygvd/hegvd" | symmetric generalized eigenvalue solver, AX = BX Lambda \ref magma_gesvd "gesvd" | singular value decomposition (SVD), A = U Sigma V^H \ref magma_gesvd "gesdd" | singular value decomposition (SVD), A = U Sigma V^H, using divide & conquer Computational routines {#comp} ================================= Computational routines solve one piece of a problem. Typically, driver routines call several computational routines to solve the entire problem. Here, curly braces { } group similar routines. Starred * routines are not yet implemented in MAGMA. Name | Description ----------- | ----------- : **Triangular factorizations** : | **Description** \ref magma_getrf "getrf", \ref magma_potrf "potrf", \ref magma_hetrf "hetrf" | triangular factorization (LU, Cholesky, Indefinite) \ref magma_getrs "getrs", \ref magma_potrs "potrs", \ref magma_hetrs "hetrs" | triangular forward and back solve \ref magma_getri "getri", \ref magma_potri "potri" | triangular inverse \ref magma_getf2 "getf2", \ref magma_potf2 "potf2" | triangular panel factorization (BLAS-2) . **Orthogonal factorizations** | **Description** ge{\ref magma_geqrf "qrf", \ref magma_geqlf "qlf", \ref magma_gelqf "lqf", rqf*} | QR, QL, LQ, RQ factorization \ref magma_geqp3 "geqp3" | QR with column pivoting (BLAS-3) or{\ref magma_unmqr "mqr", \ref magma_unmql "mql", \ref magma_unmlq "mlq", mrq*} | multiply by Q after factorization (real) un{\ref magma_unmqr "mqr", \ref magma_unmql "mql", \ref magma_unmlq "mlq", mrq*} | multiply by Q after factorization (complex) or{\ref magma_ungqr "gqr", gql*, glq*, grq*} | generate Q after factorization (real) un{\ref magma_ungqr "gqr", gql*, glq*, grq*} | generate Q after factorization (complex) . **Eigenvalue & SVD** | **Description** \ref magma_gehrd "gehrd" | Hessenberg reduction (in geev) \ref magma_hetrd "sytrd/hetrd" | tridiagonal reduction (in syev, heev) \ref magma_gebrd "gebrd" | bidiagonal reduction (in gesvd) There are many other computational routines that are mostly internal to MAGMA and LAPACK, and not commonly called by end users. BLAS routines {#blas} ================================= BLAS routines follow a similar naming scheme: precision, matrix type (for level 2 & 3), routine name. For BLAS routines, the **magma_ prefix** indicates a wrapper around CUBLAS (e.g., magma_zgemm calls cublasZgemm), while the **magmablas_ prefix** indicates our own MAGMA implementation (e.g., magmablas_zgemm). All MAGMA BLAS routines are GPU native and take the matrix in GPU memory. The descriptions here are simplified, omitting scalars (alpha & beta) and transposes. BLAS-1: vector operations --------------------------------- These do O(n) operations on O(n) data and are memory-bound. Name | Description ----------- | ----------- \ref magma_copy "copy" | copy vector, y = x \ref magma_scal "scal" | scale vector, y = alpha*y \ref magma_swap "swap" | swap two vectors, y <---> x \ref magma_axpy "axpy" | y = alpha*x + y \ref magma_nrm2 "nrm2" | vector 2-norm \ref magma_iamax "amax" | vector max-norm \ref magma_asum "asum" | vector one-norm \ref magma__dot "dot" | dot product (real), x^T y \ref magma__dot "dotu" | dot product (complex), unconjugated, x^T y \ref magma__dot "dotc" | dot product (complex), conjugated, x^H y BLAS-2: matrix-vector operations --------------------------------- These do O(n^2) operations on O(n^2) data and are memory-bound. Name | Description ----------- | ----------- \ref magma_gemv "gemv" | general matrix-vector product, y = A*x \ref magma_symv "symv", \ref magma_hemv "hemv" | symmetric/Hermitian matrix-vector product, y = A*x \ref magma_syr "syr", \ref magma_her "her" | symmetric/Hermitian rank-1 update, A = A + x*x^H \ref magma_syr2 "syr2", \ref magma_her2 "her2" | symmetric/Hermitian rank-2 update, A = A + x*y^H + y*x^H \ref magma_trmv "trmv" | triangular matrix-vector product, y = A*x \ref magma_trsv "trsv" | triangular solve, one right-hand side (RHS), solve Ax = b BLAS-3: matrix-matrix operations --------------------------------- These do O(n^3) operations on O(n^2) data and are compute-bound. Level 3 BLAS are significantly more efficient than the memory-bound level 1 and level 2 BLAS. Name | Description ----------- | ----------- \ref magma_gemm "gemm" | general matrix-matrix multiply, C = C + A*B \ref magma_symm "symm", \ref magma_hemm "hemm" | symmetric/Hermitian matrix-matrix multiply, C = C + A*B, A is symmetric \ref magma_syrk "syrk", \ref magma_herk "herk" | symmetric/Hermitian rank-k update, C = C + A*A^H, C is symmetric \ref magma_syr2k "syr2k", \ref magma_her2k "her2k" | symmetric/Hermitian rank-2k update, C = C + A*B^H + B*A^H, C is symmetric \ref magma_trmm "trmm" | triangular matrix-matrix multiply, B = A*B or B*A, A is triangular \ref magma_trsm "trsm" | triangular solve, multiple RHS, solve A*X = B or X*A = B, A is triangular Auxiliary routines {#aux} ================================= Additional BLAS-like routines, many originally defined in LAPACK. These follow a similar naming scheme: precision, then "la", then the routine name. MAGMA implements these common ones on the GPU, plus adds a few such as symmetrize and transpose. For auxiliary routines, the **magmablas_ prefix** indicates our own MAGMA implementation (e.g., magmablas_zlaswp). All MAGMA auxiliary routines are GPU native and take the matrix in GPU memory. Name | Description ----------- | ----------- \ref magma_geadd "geadd" | add general matrices (like axpy), B = alpha*A + B \ref magma_laswp "laswp" | swap rows (in getrf) \ref magma_laset "laset" | set matrix to constant \ref magma_lacpy "lacpy" | copy matrix \ref magma_lascl "lascl" | scale matrix \ref magma_lange "lange" | norm, general matrix \ref magma_lanhe "lansy/lanhe" | norm, symmetric/Hermitian matrix \ref magma_lantr "lantr" | norm, triangular matrix \ref magma_lag2 "lag2" | convert general matrix from one precision to another (e.g., dlag2s is double to single) \ref magma_lat2 "lat2" | convert triangular matrix from one precision to another \ref magma_larf "larf" | apply Householder elementary reflector \ref magma_larfg "larfg" | generate Householder elementary reflector \ref magma_larfb "larfb" | apply block Householder elementary reflector \ref magma_larft "larft" | form T for block Householder elementary reflector \ref magma_symmetrize "symmetrize" | copy lower triangle to upper triangle, or vice-versa \ref magma_transpose "transpose" | transpose matrix Utility routines {#util} ================================= Memory Allocation --------------------------------- MAGMA can use regular CPU memory allocated with malloc or new, but it may achieve better performance using aligned and, especially, pinned memory. There are typed versions of these (e.g., magma_zmalloc) that avoid the need to cast and use sizeof, and un-typed versions (e.g., magma_malloc) that are more flexible but require a (void**) cast and multiplying the number of elements by sizeof. Name | Description ----------- | ----------- magma_malloc_cpu() | allocate CPU memory that is aligned for better performance & reproducibility magma_malloc_pinned() | allocate CPU memory that is pinned (page-locked) magma_malloc() | allocate GPU memory magma_free_cpu() | free CPU memory allocated with magma_malloc_cpu magma_free_pinned() | free CPU memory allocated with magma_malloc_pinned magma_free() | free GPU memory allocated with magma_malloc where * is one of the four precisions, s d c z, or i for magma_int_t, or none for an un-typed version. Communication --------------------------------- The name of communication routines is from the CPU's point of view. Name | Description ----------- | ----------- \ref magma_setmatrix "setmatrix" | send matrix to GPU \ref magma_setvector "setvector" | send vector to GPU \ref magma_getmatrix "getmatrix" | get matrix from GPU \ref magma_getvector "getvector" | get vector from GPU Data types & complex numbers {#types} ================================= Integers --------------------------------- MAGMA uses **magma_int_t** for integers. Normally, this is mapped to the C/C++ int type. Most systems today use the LP64 convention, meaning long and pointers are 64-bit, while int is 32-bit. MAGMA also supports the ILP64 convention as an alternative, where int, long, and pointers are all 64-bit. To use this, we typedef magma_int_t to be long long. To use ILP64, define MAGMA_ILP64 or MKL_ILP64 when compiling, and link with an ILP64 BLAS and LAPACK library; see make.inc.mkl-ilp64 for an example. Complex numbers --------------------------------- MAGMA supports complex numbers. Unfortunately, there is not a single standard for how to implement complex numbers in C/C++. Fortunately, most implementations are identical on a binary level, so you can freely cast from one to another. The MAGMA types are: **magmaFloatComplex**, which in CUDA MAGMA is a typedef of cuFloatComplex, and **magmaDoubleComplex**, which in CUDA MAGMA is a typedef of cuDoubleComplex. For C, we provide macros to manipulate complex numbers. For C++ support, include the magma_operators.h header, which provides overloaded C++ operators and functions. C macro | C++ operator | Description ----------- | ----------- | ----------- c = MAGMA_*_MAKE(r,i) | | create complex number from real & imaginary parts r = MAGMA_*_REAL(a) | r = real(a) | return real part i = MAGMA_*_IMAG(a) | i = imag(a) | return imaginary part c = MAGMA_*_NEGATE(a) | c = -a; | negate c = MAGMA_*_ADD(a,b) | c = a + b; | add c = MAGMA_*_SUB(a,b) | c = a - b; | subtract c = MAGMA_*_MUL(a,b) | c = a * b; | multiply c = MAGMA_*_DIV(a,b) | c = a / b; | divide c = MAGMA_*_CNJG(a) | c = conj(a) | conjugate r = MAGMA_*_ABS(a) | r = fabs(a) | 2-norm, sqrt( real(a)^2 + imag(a)^2 ) r = MAGMA_*_ABS1(a) | r = abs1(a) | 1-norm, abs(real(a)) + abs(imag(a)) . **Constants** | | **Description** c = MAGMA_*_ZERO | | zero c = MAGMA_*_ONE | | one c = MAGMA_*_NAN | | not-a-number (e.g., 0/0) c = MAGMA_*_INF | | infinity (e.g., 1/0, overflow) where * is one of the four precisions, S D C Z. Conventions for variables {#variables} ================================= Here are general guidelines for variable names; there are of course exceptions to these. - Uppercase letters indicate matrices: A, B, C, X. - Lowercase letters indicate vectors: b, x, y, z. - "d" prefix indicates matrix or vector on GPU device: dA, dB, dC, dX; db, dx, dy, dz. - Greek words indicate scalars: alpha, beta. - m, n, k are matrix dimensions. Typically, the order of arguments is: - options (uplo, etc.) - matrix sizes (m, n, k, etc.), - input matrices & vectors (A, lda, x, incx, etc.) - output matrices & vectors - workspaces (work, lwork, etc.) - info error code LAPACK and MAGMA use column-major matrices. For matrix X with dimension (lda,n), element X(i, j) is X[ i + j*lda ]. For symmetric, Hermitian, and triangular matrices, only the lower or upper triangle is accessed, as specified by the uplo argument; the other triangle is ignored. lda is the leading dimension of matrix A; similarly ldb for B, ldda for dA, etc. It should immediately follow the matrix pointer in the argument list. The leading dimension can be the number of rows, or if A is a sub-matrix of a larger parent matrix, lda is the leading dimension (e.g., rows) of the parent matrix. On the GPU, it is often beneficial to round the leading dimension up to a multiple of 32, to provide better performance. This aligns memory reads so they are coalesced. This is provided by the magma_roundup function: ldda = magma_roundup( m, 32 ); The formula ((m + 31)/32)*32 also works, relying on floored integer division, but the roundup function is clearer to use. On the CPU, it is often beneficial to ensure that the leading dimension is **not** a multiple of the page size (often 4 KiB), to minimize TLB misses. For vectors, incx is the increment or stride between elements of vector x. In all cases, incx != 0. In most cases, if incx < 0, then the vector is indexed in reverse order, for instance, using Matlab notation, incx = 1 means x( 1 : 1 : n ) incx = 2 means x( 1 : 2 : 2*n-1 ) while incx = -1 means x( n : -1 : 1 ) incx = -2 means x( 2*n-1 : -2 : 1 ) For several routines (amax, amin, asum, nrm2, scal), the order is irrelevant, so negative incx are not allowed; incx > 0. Constants {#constants} ================================= MAGMA defines a few constant parameters, such as `MagmaTrans, MagmaNoTrans`, that are equivalent of CBLAS and LAPACK parameters. The naming and numbering of these parameters follow that of [CBLAS from Netlib](http://www.netlib.org/blas/blast-forum/cblas.tgz) and the [C Interface to LAPACK from Netlib](http://www.netlib.org/lapack/lapwrapc/), and [PLASMA](http://icl.utk.edu/plasma/). MAGMA includes functions, `lapack_xyz_const()`, which take MAGMA's integer constants and return LAPACK's string constants, where `xyz` is a MAGMA type such as `uplo`, `trans`, etc. From the standpoint of LAPACK, only the first letter of each string is significant. Nevertheless, the functions return meaningful strings, such as "No transpose", "Transpose", "Upper", "Lower", etc. Similarly, there are functions to go from MAGMA's integer constants to CBLAS, OpenCL's clBLAS, and CUDA's cuBLAS integer constants. There are also functions, `magma_xyz_const()`, to go in the opposite direction, from LAPACK's string constants to MAGMA's integer constants. The most common constants are those defined for BLAS routines: - enum { MagmaNoTrans, MagmaTrans, MagmaConjTrans } magma_order_t Whether a matrix is not transposed, transposed, or conjugate-transposed. For a real matrix, Trans and ConjTrans have the same meaning. - enum { MagmaLower, MagmaUpper, MagmaFull } magma_uplo_t Whether the lower or upper triangle of a matrix is given, or the full matrix. - enum { MagmaLeft, MagmaRight } magma_side_t Whether the matrix is on the left or right. - enum { MagmaUnit, MagmaNonUnit } magma_diag_t Whether the diagonal is assumed to be unit (all ones) or not. Additional constants for specific routines are defined in the documentation for the routines. Because MAGMA, CBLAS, LAPACK, CUBLAS, and clBlas use potentially different constants, converters between them are provided. These convert LAPACK constants to MAGMA constants. Note that the meaning of LAPACK constants depends on the context: 'N' can mean False, NoTrans, NonUnit, NoVec, etc. Here, curly braces { } group similar constants. . | Function | . | Description ----------- | ---- | ---- | ----------- magma_bool_t | magma_bool_const | ( character ) | Map 'N', 'Y' \n to MagmaTrue, MagmaFalse magma_order_t | magma_order_const | ( character ) | Map 'R', 'C' \n to MagmaRowMajor, MagmaColMajor magma_trans_t | magma_trans_const | ( character ) | Map 'N', 'T', 'C' \n to MagmaNoTrans, MagmaTrans, MagmaConjTrans magma_uplo_t | magma_uplo_const | ( character ) | Map 'L', 'U' \n to MagmaLower, MagmaUpper magma_diag_t | magma_diag_const | ( character ) | Map 'N', 'U' \n to MagmaNonUnit, MagmaUnit magma_side_t | magma_side_const | ( character ) | Map 'L', 'R' \n to MagmaLeft, MagmaRight magma_norm_t | magma_norm_const | ( character ) | Map 'O', '1', '2', 'F', 'E', 'I', 'M' \n to Magma{One, Two, Frobenius, Inf, Max}Norm magma_dist_t | magma_dist_const | ( character ) | Map 'U', 'S', 'N' \n to MagmaDist{Uniform, Symmetric, Normal} magma_vec_t | magma_vec_const | ( character ) | Map 'V', 'N', 'I', 'A', 'S', 'O' \n to MagmaVec, Magma{No, I, All, Some, Overwrite}Vec magma_range_t | magma_range_const | ( character ) | Map 'A', 'V', 'I' \n to MagmaRange{All, V, I} magma_vect_t | magma_vect_const | ( character ) | Map 'Q', 'P' \n to MagmaQ, MagmaP magma_direct_t | magma_direct_const | ( character ) | Map 'F', 'B' \n to MagmaForward, MagmaBackward magma_storev_t | magma_storev_const | ( character ) | Map 'C', 'R' \n to MagmaColumnwise, MagmaRowwise These do the inverse map, converting MAGMA to LAPACK constants. From the standpoint of LAPACK, only the first letter of each string is significant. Nevertheless, the functions return meaningful strings, such as "No transpose", "Transpose". Substitute `lapacke` for `lapack` to get version that returns single char instead of string (const char*). . | Function | . | Description ----------- | ---- | ---- | ----------- const char* | lapack_bool_const | ( magma_bool_t ) | Inverse of magma_bool_const() const char* | lapack_order_const | ( magma_order_t ) | Inverse of magma_order_const() const char* | lapack_trans_const | ( magma_trans_t ) | Inverse of magma_trans_const() const char* | lapack_uplo_const | ( magma_uplo_t ) | Inverse of magma_uplo_const() const char* | lapack_diag_const | ( magma_diag_t ) | Inverse of magma_diag_const() const char* | lapack_side_const | ( magma_side_t ) | Inverse of magma_side_const() const char* | lapack_norm_const | ( magma_norm_t ) | Inverse of magma_norm_const() const char* | lapack_dist_const | ( magma_dist_t ) | Inverse of magma_dist_const() const char* | lapack_vec_const | ( magma_vec_t ) | Inverse of magma_vec_const() const char* | lapack_range_const | ( magma_range_t ) | Inverse of magma_range_const() const char* | lapack_vect_const | ( magma_vect_t ) | Inverse of magma_vect_const() const char* | lapack_direct_const | ( magma_direct_t ) | Inverse of magma_direct_const() const char* | lapack_storev_const | ( magma_storev_t ) | Inverse of magma_storev_const() const char* | lapack_const | ( constant ) | Map any MAGMA constant, Magma*, to an LAPACK string constant char | lapacke_const | ( constant ) | Map any MAGMA constant, Magma*, to an LAPACKE character To convert MAGMA to Nvidia's CUBLAS constants: . | Function | . | Description ----------- | ---- | ---- | ----------- cublasOperation_t | cublas_trans_const | ( trans ) | Map MagmaNoTrans, MagmaTrans, MagmaConjTrans \n to CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C cublasFillMode_t | cublas_uplo_const | ( uplo ) | Map MagmaLower, MagmaUpper \n to CUBLAS_FILL_MODE_LOWER, CUBLAS_FILL_MODE_UPPER cublasDiagType_t | cublas_diag_const | ( diag ) | Map MagmaNonUnit, MagmaUnit \n to CUBLAS_DIAG_NON_UNIT, CUBLAS_DIAG_UNIT cublasSideMode_t | cublas_side_const | ( side ) | Map MagmaLeft, MagmaRight \n to CUBLAS_SIDE_LEFT, CUBLAS_SIDE_Right To convert MAGMA to AMD's clBlas constants: . | Function | . | Description ----------- | ---- | ---- | ----------- clblasOrder | clblas_order_const | ( order ) | Map MagmaRowMajor, MagmaColMajor \n to clAmdBlasRowMajor, clAmdBlasColumnMajor clblasTranspose | clblas_trans_const | ( trans ) | Map MagmaNoTrans, MagmaTrans, MagmaConjTrans \n to clAmdBlasNoTrans, clAmdBlasTrans, clAmdBlasConjTrans clblasUplo | clblas_uplo_const | ( uplo ) | Map MagmaLower, MagmaUpper \n to clAmdBlasLower, clAmdBlasUpper clblasDiag | clblas_diag_const | ( diag ) | Map MagmaNonUnit, MagmaUnit \n to clAmdBlasNonUnit, clAmdBlasUnit clblasSide | clblas_side_const | ( side ) | Map MagmaLeft, MagmaRight \n to clAmdBlasLeft, clAmdBlasRight To convert MAGMA to CBLAS constants: . | Function | . | Description ----------- | ---- | ---- | ----------- enum CBLAS_ORDER | cblas_order_const | ( order ) | Map MagmaRowMajor, MagmaColMajor \n to CblasRowMajor, CblasColMajor enum CBLAS_TRANSPOSE | cblas_trans_const | ( trans ) | Map MagmaNoTrans, MagmaTrans, MagmaConjTrans \n to CblasNoTrans, CblasTrans, CblasConjTrans enum CBLAS_UPLO | cblas_uplo_const | ( uplo ) | Map MagmaLower, MagmaUpper \n to CblasLower, CblasUpper enum CBLAS_DIAG | cblas_diag_const | ( diag ) | Map MagmaNonUnit, MagmaUnit \n to CblasNonUnit, CblasUnit enum CBLAS_SIDE | cblas_side_const | ( side ) | Map MagmaLeft, MagmaRight \n to CblasLeft, CblasRight Errors {#errors} ================================= Driver and computational routines, and a few BLAS/auxiliary routines, currently return errors both as a return value and in the info argument. The return value and info should always be identical. In general, the meaning is as given in this table. Predefined error codes are large negative numbers. Using the symbolic constants below is preferred, but the numeric values can be found in include/magma_types.h. Info | Description ----------- | ----------- info = 0 (MAGMA_SUCCESS) | Successful exit info < 0, but small | For info = -i, the i-th argument had an illegal value info > 0 | Function-specific error such as singular matrix MAGMA_ERR_DEVICE_ALLOC | Could not allocate GPU device memory MAGMA_ERR_HOST_ALLOC | Could not allocate CPU host memory MAGMA_ERR_ILLEGAL_VALUE | An argument had an illegal value (deprecated; instead it should return -i to say the i-th argument was bad) MAGMA_ERR_INVALID_PTR | Can't free pointer MAGMA_ERR_NOT_IMPLEMENTED | Function or option not implemented MAGMA_ERR_NOT_SUPPORTED | Function or option not supported on the current architecture magma_xerbla() is called to report errors (mostly bad arguments) to the user. magma_strerror() returns string description of an error code. Methodology {#methodology} ================================= One-sided matrix factorizations --------------------------------- The one-sided LU, Cholesky, and QR factorizations form a basis for solving linear systems. A general recommendation is to use LU for general n-by-n matrices, Cholesky for symmetric/Hermitian positive definite (SPD) matrices, and QR for solving least squares problems, min || A x - b || for general m-by-n, m > n matrices. We use hybrid algorithms where the computation is split between the GPU and and the CPU. In general for the one-sided factorizations, the panels are factored on the CPU and the trailing sub-matrix updates on the GPU. Look-ahead techniques are used to overlap the CPU and GPU work and some communications. In both the CPU and GPU interfaces the matrix to be factored resides in the GPU memory, and CPU-GPU transfers are associated only with the panels. The resulting matrix is accumulated (on the CPU or GPU according to the interface) along the computation, as a byproduct of the algorithm, vs. sending the the entire matrix when needed. In the CPU interface, the original transfer of the matrix to the GPU is overlapped with the factorization of the first panel. In this sense the CPU and GPU interfaces, although similar, are not derivatives of each other as they have different communication patterns. Although the solution step has O(n) times less floating point operations than the factorization, it is still very important to optimize it. Solving a triangular system of equations can be very slow because the computation is bandwidth limited and naturally not parallel. Various approaches have been proposed in the past. We use an approach where diagonal blocks of A are explicitly inverted and used in a block algorithm. This results in a high performance, numerically stable algorithm, especially when used with triangular matrices coming from numerically stable factorization algorithms (e.g., as in LAPACK and MAGMA). For instances when the GPU's single precision performance is much higher than its double precision performance, MAGMA provides a second set of solvers, based on the mixed precision iterative refinement technique. The solvers are based again on correspondingly the LU, QR, and Cholesky factorizations, and are designed to solve linear problems in double precision accuracy but at a speed that is characteristic for the much faster single precision computations. The idea is to use single precision for the bulk of the computation, namely the factorization step, and than use that factorization as a preconditioner in a simple iterative refinement process in double precision arithmetic. This often results in the desired high performance and high accuracy solvers. Two-sided matrix factorizations --------------------------------- As the one-sided matrix factorizations are the basis for various linear solvers, the two-sided matrix factorizations are the basis for eigen-solvers, and therefore form an important class of dense linear algebra routines. The two-sided factorizations have been traditionally more difficult to achieve high performance. The reason is that the two-sided factorizations involve large matrix-vector products which are memory bound, and as the gap between compute and communication power increases exponentially, these memory bound operations become an increasingly more difficult to handle bottleneck. GPUs though offer an attractive possibility to accelerate them. Indeed, having a high bandwidth (e.g. 10 times larger than current CPU bus bandwidths), GPUs can accelerate matrix-vector products significantly (10 to 30 times). Here, the panel factorization itself is hybrid, while the trailing matrix update is performed on the GPU. */