MAGMA  2.7.1
Matrix Algebra for GPU and Multicore Architectures
her2k: Hermitian rank 2k update

\( C = \alpha A B^H + \overline{\alpha} B A^H + \beta C \) where \( C \) is Hermitian.

Functions

void magma_cher2k (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, magmaFloatComplex alpha, magmaFloatComplex_const_ptr dA, magma_int_t ldda, magmaFloatComplex_const_ptr dB, magma_int_t lddb, float beta, magmaFloatComplex_ptr dC, magma_int_t lddc, magma_queue_t queue)
 Perform Hermitian rank-2k update.
 
void magma_zher2k (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, magmaDoubleComplex alpha, magmaDoubleComplex_const_ptr dA, magma_int_t ldda, magmaDoubleComplex_const_ptr dB, magma_int_t lddb, double beta, magmaDoubleComplex_ptr dC, magma_int_t lddc, magma_queue_t queue)
 Perform Hermitian rank-2k update.
 
void magmablas_cher2k_mgpu2 (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, magmaFloatComplex alpha, magmaFloatComplex_ptr dA[], magma_int_t ldda, magma_int_t a_offset, magmaFloatComplex_ptr dB[], magma_int_t lddb, magma_int_t b_offset, float beta, magmaFloatComplex_ptr dC[], magma_int_t lddc, magma_int_t c_offset, magma_int_t ngpu, magma_int_t nb, magma_queue_t queues[][20], magma_int_t nqueue)
 CHER2K performs one of the Hermitian rank 2k operations.
 
void magmablas_dsyr2k_mgpu2 (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, double alpha, magmaDouble_ptr dA[], magma_int_t ldda, magma_int_t a_offset, magmaDouble_ptr dB[], magma_int_t lddb, magma_int_t b_offset, double beta, magmaDouble_ptr dC[], magma_int_t lddc, magma_int_t c_offset, magma_int_t ngpu, magma_int_t nb, magma_queue_t queues[][20], magma_int_t nqueue)
 DSYR2K performs one of the symmetric rank 2k operations.
 
void magmablas_ssyr2k_mgpu2 (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, float alpha, magmaFloat_ptr dA[], magma_int_t ldda, magma_int_t a_offset, magmaFloat_ptr dB[], magma_int_t lddb, magma_int_t b_offset, float beta, magmaFloat_ptr dC[], magma_int_t lddc, magma_int_t c_offset, magma_int_t ngpu, magma_int_t nb, magma_queue_t queues[][20], magma_int_t nqueue)
 SSYR2K performs one of the symmetric rank 2k operations.
 
void magmablas_zher2k_mgpu2 (magma_uplo_t uplo, magma_trans_t trans, magma_int_t n, magma_int_t k, magmaDoubleComplex alpha, magmaDoubleComplex_ptr dA[], magma_int_t ldda, magma_int_t a_offset, magmaDoubleComplex_ptr dB[], magma_int_t lddb, magma_int_t b_offset, double beta, magmaDoubleComplex_ptr dC[], magma_int_t lddc, magma_int_t c_offset, magma_int_t ngpu, magma_int_t nb, magma_queue_t queues[][20], magma_int_t nqueue)
 ZHER2K performs one of the Hermitian rank 2k operations.
 

Detailed Description

\( C = \alpha A B^H + \overline{\alpha} B A^H + \beta C \) where \( C \) is Hermitian.

Function Documentation

void magma_cher2k ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
magmaFloatComplex  alpha,
magmaFloatComplex_const_ptr  dA,
magma_int_t  ldda,
magmaFloatComplex_const_ptr  dB,
magma_int_t  lddb,
float  beta,
magmaFloatComplex_ptr  dC,
magma_int_t  lddc,
magma_queue_t  queue 
)

Perform Hermitian rank-2k update.

\( C = \alpha A B^H + \overline{\alpha} B A^H + \beta C \) (trans == MagmaNoTrans), or
\( C = \alpha A^H B + \overline{\alpha} B^H A + \beta C \) (trans == MagmaConjTrans),
where \( C \) is Hermitian.
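
As an illustration of the trans == MagmaNoTrans formula, the following is a minimal CPU sketch in C99, not MAGMA code. The helper name `ref_cher2k_lower` is hypothetical, and forcing the diagonal to be real follows the reference BLAS convention, which is an assumption here. It uses column-major storage and touches only the lower triangle of C.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical CPU reference (not part of MAGMA) for
 * C = alpha*A*B^H + conj(alpha)*B*A^H + beta*C, lower triangle only.
 * A and B are n-by-k, C is n-by-n, all column-major. */
void ref_cher2k_lower(int n, int k, float complex alpha,
                      const float complex *A, int lda,
                      const float complex *B, int ldb,
                      float beta, float complex *C, int ldc)
{
    int min_ld = (n > 1) ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            float complex sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += alpha * A[i + l * lda] * conjf(B[j + l * ldb])
                     + conjf(alpha) * B[i + l * ldb] * conjf(A[j + l * lda]);
            }
            /* Assumption: the imaginary part of the diagonal is ignored on input. */
            float complex cij = (i == j) ? crealf(C[i + j * ldc]) : C[i + j * ldc];
            C[i + j * ldc] = sum + beta * cij;
            if (i == j) {
                /* A Hermitian matrix has a real diagonal. */
                C[i + j * ldc] = crealf(C[i + j * ldc]);
            }
        }
    }
}
```

A sketch like this is mainly useful for checking GPU results on small matrices; the strictly upper triangle of C is left untouched.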

Parameters
[in] uplo  Whether the upper or lower triangle of C is referenced.
[in] trans  Operation to perform on A and B.
[in] n  Number of rows and columns of C. n >= 0.
[in] k  Number of columns of A and B (for MagmaNoTrans) or rows of A and B (for MagmaConjTrans). k >= 0.
[in] alpha  Scalar \( \alpha \).
[in] dA  COMPLEX array on GPU device. If trans == MagmaNoTrans, the n-by-k matrix A of dimension (ldda,k), ldda >= max(1,n); otherwise, the k-by-n matrix A of dimension (ldda,n), ldda >= max(1,k).
[in] ldda  Leading dimension of dA.
[in] dB  COMPLEX array on GPU device. If trans == MagmaNoTrans, the n-by-k matrix B of dimension (lddb,k), lddb >= max(1,n); otherwise, the k-by-n matrix B of dimension (lddb,n), lddb >= max(1,k).
[in] lddb  Leading dimension of dB.
[in] beta  Scalar \( \beta \).
[in,out] dC  COMPLEX array on GPU device. The n-by-n Hermitian matrix C of dimension (lddc,n), lddc >= max(1,n).
[in] lddc  Leading dimension of dC.
[in] queue  magma_queue_t. Queue to execute in.

void magma_zher2k ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
magmaDoubleComplex  alpha,
magmaDoubleComplex_const_ptr  dA,
magma_int_t  ldda,
magmaDoubleComplex_const_ptr  dB,
magma_int_t  lddb,
double  beta,
magmaDoubleComplex_ptr  dC,
magma_int_t  lddc,
magma_queue_t  queue 
)

Perform Hermitian rank-2k update.

\( C = \alpha A B^H + \overline{\alpha} B A^H + \beta C \) (trans == MagmaNoTrans), or
\( C = \alpha A^H B + \overline{\alpha} B^H A + \beta C \) (trans == MagmaConjTrans),
where \( C \) is Hermitian.
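
A short derivation shows why the second term carries the conjugate of \( \alpha \) and why \( \beta \) is real. Taking the conjugate transpose of the update term for trans == MagmaNoTrans gives the same expression back:

```latex
\[
\begin{aligned}
\left( \alpha A B^H + \overline{\alpha} B A^H \right)^H
  &= \overline{\alpha} \, (A B^H)^H + \alpha \, (B A^H)^H \\
  &= \overline{\alpha} \, B A^H + \alpha \, A B^H .
\end{aligned}
\]
```

Since \( \beta \) is real and \( C = C^H \), \( (\beta C)^H = \beta C \) as well, so the updated \( C \) remains Hermitian. The MagmaConjTrans case follows in the same way with \( A^H B \) and \( B^H A \).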

Parameters
[in] uplo  Whether the upper or lower triangle of C is referenced.
[in] trans  Operation to perform on A and B.
[in] n  Number of rows and columns of C. n >= 0.
[in] k  Number of columns of A and B (for MagmaNoTrans) or rows of A and B (for MagmaConjTrans). k >= 0.
[in] alpha  Scalar \( \alpha \).
[in] dA  COMPLEX_16 array on GPU device. If trans == MagmaNoTrans, the n-by-k matrix A of dimension (ldda,k), ldda >= max(1,n); otherwise, the k-by-n matrix A of dimension (ldda,n), ldda >= max(1,k).
[in] ldda  Leading dimension of dA.
[in] dB  COMPLEX_16 array on GPU device. If trans == MagmaNoTrans, the n-by-k matrix B of dimension (lddb,k), lddb >= max(1,n); otherwise, the k-by-n matrix B of dimension (lddb,n), lddb >= max(1,k).
[in] lddb  Leading dimension of dB.
[in] beta  Scalar \( \beta \).
[in,out] dC  COMPLEX_16 array on GPU device. The n-by-n Hermitian matrix C of dimension (lddc,n), lddc >= max(1,n).
[in] lddc  Leading dimension of dC.
[in] queue  magma_queue_t. Queue to execute in.

void magmablas_cher2k_mgpu2 ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
magmaFloatComplex  alpha,
magmaFloatComplex_ptr  dA[],
magma_int_t  ldda,
magma_int_t  a_offset,
magmaFloatComplex_ptr  dB[],
magma_int_t  lddb,
magma_int_t  b_offset,
float  beta,
magmaFloatComplex_ptr  dC[],
magma_int_t  lddc,
magma_int_t  c_offset,
magma_int_t  ngpu,
magma_int_t  nb,
magma_queue_t  queues[][20],
magma_int_t  nqueue 
)

CHER2K performs one of the Hermitian rank 2k operations.

C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C,

or

C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C,

where alpha and beta are scalars with beta real, C is an n by n Hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.

Parameters
[in] uplo  magma_uplo_t. Specifies whether the upper or lower triangular part of the array C is referenced:
  • = MagmaUpper: Only the upper triangular part of C is referenced.
  • = MagmaLower: Only the lower triangular part of C is referenced.
  Currently only the MagmaLower case is implemented.
[in] trans  magma_trans_t. Specifies the operation to be performed:
  • = MagmaNoTrans: C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C.
  • = MagmaConjTrans: C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C.
  Currently only the MagmaNoTrans case is implemented.
[in] n  INTEGER. The order of the matrix C. n >= 0.
[in] k  INTEGER. With trans = MagmaNoTrans, the number of columns of the matrices A and B; with trans = MagmaConjTrans, the number of rows of the matrices A and B. k >= 0.
[in] alpha  COMPLEX. The scalar alpha.
[in] dA  COMPLEX array of dimension ( ldda, ka ), where ka is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix A; otherwise the leading k by n part of the array must contain the matrix A.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] ldda  INTEGER. The leading dimension of dA. When trans = MagmaNoTrans, ldda >= max( 1, n ); otherwise ldda >= max( 1, k ).
[in] a_offset  INTEGER. Row offset to start sub-matrix of dA. Uses dA(a_offset:a_offset+n, :). 0 <= a_offset < ldda.
[in] dB  COMPLEX array of dimension ( lddb, kb ), where kb is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix B; otherwise the leading k by n part of the array must contain the matrix B.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] lddb  INTEGER. The leading dimension of dB. When trans = MagmaNoTrans, lddb >= max( 1, n ); otherwise lddb >= max( 1, k ).
[in] b_offset  INTEGER. Row offset to start sub-matrix of dB. Uses dB(b_offset:b_offset+n, :). 0 <= b_offset < lddb.
[in] beta  REAL. The scalar beta.
[in,out] dC  COMPLEX array of dimension ( lddc, n ).
  With uplo = MagmaUpper, the leading n by n upper triangular part of the array must contain the upper triangular part of the Hermitian matrix on entry, and the strictly lower triangular part is not referenced. On exit, the upper triangular part is overwritten by the upper triangular part of the updated matrix.
  With uplo = MagmaLower, the leading n by n lower triangular part of the array must contain the lower triangular part of the Hermitian matrix on entry, and the strictly upper triangular part is not referenced. On exit, the lower triangular part is overwritten by the lower triangular part of the updated matrix.
  Note that the imaginary parts of the diagonal elements need not be set; they are assumed to be zero, and on exit they are set to zero. [TODO: verify]
  [TODO: describe distribution: 1D column block-cyclic across GPUs.]
[in] lddc  INTEGER. The leading dimension of dC. lddc >= max( 1, n ).
[in] c_offset  INTEGER. Row and column offset to start sub-matrix of dC. Uses dC(c_offset:c_offset+n, c_offset:c_offset+n). 0 <= c_offset < lddc.
[in] ngpu  INTEGER. Number of GPUs over which matrix C is distributed.
[in] nb  INTEGER. Block size used for distribution of C.
[in] queues  Array of CUDA queues, of dimension ngpu by 20. Queues to use for running multiple GEMMs in parallel. Only up to nqueue queues are used on each GPU.
[in] nqueue  INTEGER. Number of queues to use on each device.

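
The 1D column block-cyclic distribution of C over ngpu GPUs with block size nb can be sketched as an index mapping. This is a minimal illustration assuming the conventional round-robin layout that starts at GPU 0 and ignores c_offset; the type `col_location` and the function `block_cyclic_owner` are hypothetical names, not MAGMA API.

```c
#include <assert.h>

/* Where a global column lives under a 1D column block-cyclic layout.
 * Hypothetical helper type for illustration; not part of MAGMA. */
typedef struct {
    int dev;        /* index of the GPU that owns the column */
    int local_col;  /* column index inside that GPU's local array */
} col_location;

/* Map global column j (0-based) to its owning GPU and local column.
 * Assumes blocks of nb columns are dealt round-robin starting at GPU 0. */
col_location block_cyclic_owner(int j, int nb, int ngpu)
{
    assert(j >= 0 && nb > 0 && ngpu > 0);
    int block = j / nb;                  /* global block index */
    col_location loc;
    loc.dev = block % ngpu;              /* owner of that block */
    loc.local_col = (block / ngpu) * nb  /* full local blocks before it */
                  + j % nb;              /* position inside the block */
    return loc;
}
```

For example, with nb = 2 and ngpu = 3, global column 7 belongs to block 3, which wraps around to GPU 0 as its second local block, so it is local column 3 there.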
void magmablas_dsyr2k_mgpu2 ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
double  alpha,
magmaDouble_ptr  dA[],
magma_int_t  ldda,
magma_int_t  a_offset,
magmaDouble_ptr  dB[],
magma_int_t  lddb,
magma_int_t  b_offset,
double  beta,
magmaDouble_ptr  dC[],
magma_int_t  lddc,
magma_int_t  c_offset,
magma_int_t  ngpu,
magma_int_t  nb,
magma_queue_t  queues[][20],
magma_int_t  nqueue 
)

DSYR2K performs one of the symmetric rank 2k operations.

C := alpha*A*B**T + alpha*B*A**T + beta*C,

or

C := alpha*A**T*B + alpha*B**T*A + beta*C,

where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.
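
For real data the conjugate transpose reduces to a plain transpose and conjg( alpha ) to alpha, so the first operation is C := alpha*(A*B**T + B*A**T) + beta*C. The following CPU sketch illustrates that case for the lower triangle in column-major storage. The helper name `ref_dsyr2k_lower` is hypothetical and is not part of MAGMA.

```c
#include <assert.h>

/* Hypothetical CPU reference (not part of MAGMA) for the real symmetric
 * rank-2k update C = alpha*(A*B^T + B*A^T) + beta*C, lower triangle only.
 * A and B are n-by-k, C is n-by-n, all column-major. */
void ref_dsyr2k_lower(int n, int k, double alpha,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double beta, double *C, int ldc)
{
    int min_ld = (n > 1) ? n : 1;
    assert(n >= 0 && k >= 0);
    assert(lda >= min_ld && ldb >= min_ld && ldc >= min_ld);

    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l) {
                sum += A[i + l * lda] * B[j + l * ldb]
                     + B[i + l * ldb] * A[j + l * lda];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}
```

Swapping i and j leaves the inner sum unchanged, which is why only one triangle of C needs to be stored and updated.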

Parameters
[in] uplo  magma_uplo_t. Specifies whether the upper or lower triangular part of the array C is referenced:
  • = MagmaUpper: Only the upper triangular part of C is referenced.
  • = MagmaLower: Only the lower triangular part of C is referenced.
  Currently only the MagmaLower case is implemented.
[in] trans  magma_trans_t. Specifies the operation to be performed:
  • = MagmaNoTrans: C := alpha*A*B**T + alpha*B*A**T + beta*C.
  • = MagmaTrans: C := alpha*A**T*B + alpha*B**T*A + beta*C.
  Currently only the MagmaNoTrans case is implemented.
[in] n  INTEGER. The order of the matrix C. n >= 0.
[in] k  INTEGER. With trans = MagmaNoTrans, the number of columns of the matrices A and B; with trans = MagmaTrans, the number of rows of the matrices A and B. k >= 0.
[in] alpha  DOUBLE PRECISION. The scalar alpha.
[in] dA  DOUBLE PRECISION array of dimension ( ldda, ka ), where ka is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix A; otherwise the leading k by n part of the array must contain the matrix A.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] ldda  INTEGER. The leading dimension of dA. When trans = MagmaNoTrans, ldda >= max( 1, n ); otherwise ldda >= max( 1, k ).
[in] a_offset  INTEGER. Row offset to start sub-matrix of dA. Uses dA(a_offset:a_offset+n, :). 0 <= a_offset < ldda.
[in] dB  DOUBLE PRECISION array of dimension ( lddb, kb ), where kb is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix B; otherwise the leading k by n part of the array must contain the matrix B.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] lddb  INTEGER. The leading dimension of dB. When trans = MagmaNoTrans, lddb >= max( 1, n ); otherwise lddb >= max( 1, k ).
[in] b_offset  INTEGER. Row offset to start sub-matrix of dB. Uses dB(b_offset:b_offset+n, :). 0 <= b_offset < lddb.
[in] beta  DOUBLE PRECISION. The scalar beta.
[in,out] dC  DOUBLE PRECISION array of dimension ( lddc, n ).
  With uplo = MagmaUpper, the leading n by n upper triangular part of the array must contain the upper triangular part of the symmetric matrix on entry, and the strictly lower triangular part is not referenced. On exit, the upper triangular part is overwritten by the upper triangular part of the updated matrix.
  With uplo = MagmaLower, the leading n by n lower triangular part of the array must contain the lower triangular part of the symmetric matrix on entry, and the strictly upper triangular part is not referenced. On exit, the lower triangular part is overwritten by the lower triangular part of the updated matrix.
  [TODO: describe distribution: 1D column block-cyclic across GPUs.]
[in] lddc  INTEGER. The leading dimension of dC. lddc >= max( 1, n ).
[in] c_offset  INTEGER. Row and column offset to start sub-matrix of dC. Uses dC(c_offset:c_offset+n, c_offset:c_offset+n). 0 <= c_offset < lddc.
[in] ngpu  INTEGER. Number of GPUs over which matrix C is distributed.
[in] nb  INTEGER. Block size used for distribution of C.
[in] queues  Array of CUDA queues, of dimension ngpu by 20. Queues to use for running multiple GEMMs in parallel. Only up to nqueue queues are used on each GPU.
[in] nqueue  INTEGER. Number of queues to use on each device.

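
The row offset a_offset selects a sub-matrix without copying: in column-major storage, advancing the base pointer by a_offset elements moves down a_offset rows while the leading dimension stays ldda. The sketch below shows this on a host array. The function names are hypothetical, and reading `a_offset:a_offset+n` as n rows starting at row a_offset is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to row a_offset, column 0 of a column-major array with
 * leading dimension ldda. Hypothetical helper; not part of MAGMA. */
const double *submatrix_rows(const double *A, int ldda, int a_offset)
{
    assert(a_offset >= 0 && a_offset < ldda);
    return A + a_offset;
}

/* Element (i, l) of the sub-matrix that starts at row a_offset.
 * The leading dimension of the sub-matrix is still ldda. */
double submatrix_elem(const double *A, int ldda, int a_offset, int i, int l)
{
    const double *sub = submatrix_rows(A, ldda, a_offset);
    return sub[(size_t)i + (size_t)l * (size_t)ldda];
}
```

The same arithmetic applies to b_offset and dB. For dC, c_offset applies to both rows and columns, and the column part additionally depends on how C is distributed across GPUs.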
void magmablas_ssyr2k_mgpu2 ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
float  alpha,
magmaFloat_ptr  dA[],
magma_int_t  ldda,
magma_int_t  a_offset,
magmaFloat_ptr  dB[],
magma_int_t  lddb,
magma_int_t  b_offset,
float  beta,
magmaFloat_ptr  dC[],
magma_int_t  lddc,
magma_int_t  c_offset,
magma_int_t  ngpu,
magma_int_t  nb,
magma_queue_t  queues[][20],
magma_int_t  nqueue 
)

SSYR2K performs one of the symmetric rank 2k operations.

C := alpha*A*B**T + alpha*B*A**T + beta*C,

or

C := alpha*A**T*B + alpha*B**T*A + beta*C,

where alpha and beta are scalars, C is an n by n symmetric matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.

Parameters
[in] uplo  magma_uplo_t. Specifies whether the upper or lower triangular part of the array C is referenced:
  • = MagmaUpper: Only the upper triangular part of C is referenced.
  • = MagmaLower: Only the lower triangular part of C is referenced.
  Currently only the MagmaLower case is implemented.
[in] trans  magma_trans_t. Specifies the operation to be performed:
  • = MagmaNoTrans: C := alpha*A*B**T + alpha*B*A**T + beta*C.
  • = MagmaTrans: C := alpha*A**T*B + alpha*B**T*A + beta*C.
  Currently only the MagmaNoTrans case is implemented.
[in] n  INTEGER. The order of the matrix C. n >= 0.
[in] k  INTEGER. With trans = MagmaNoTrans, the number of columns of the matrices A and B; with trans = MagmaTrans, the number of rows of the matrices A and B. k >= 0.
[in] alpha  REAL. The scalar alpha.
[in] dA  REAL array of dimension ( ldda, ka ), where ka is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix A; otherwise the leading k by n part of the array must contain the matrix A.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] ldda  INTEGER. The leading dimension of dA. When trans = MagmaNoTrans, ldda >= max( 1, n ); otherwise ldda >= max( 1, k ).
[in] a_offset  INTEGER. Row offset to start sub-matrix of dA. Uses dA(a_offset:a_offset+n, :). 0 <= a_offset < ldda.
[in] dB  REAL array of dimension ( lddb, kb ), where kb is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix B; otherwise the leading k by n part of the array must contain the matrix B.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] lddb  INTEGER. The leading dimension of dB. When trans = MagmaNoTrans, lddb >= max( 1, n ); otherwise lddb >= max( 1, k ).
[in] b_offset  INTEGER. Row offset to start sub-matrix of dB. Uses dB(b_offset:b_offset+n, :). 0 <= b_offset < lddb.
[in] beta  REAL. The scalar beta.
[in,out] dC  REAL array of dimension ( lddc, n ).
  With uplo = MagmaUpper, the leading n by n upper triangular part of the array must contain the upper triangular part of the symmetric matrix on entry, and the strictly lower triangular part is not referenced. On exit, the upper triangular part is overwritten by the upper triangular part of the updated matrix.
  With uplo = MagmaLower, the leading n by n lower triangular part of the array must contain the lower triangular part of the symmetric matrix on entry, and the strictly upper triangular part is not referenced. On exit, the lower triangular part is overwritten by the lower triangular part of the updated matrix.
  [TODO: describe distribution: 1D column block-cyclic across GPUs.]
[in] lddc  INTEGER. The leading dimension of dC. lddc >= max( 1, n ).
[in] c_offset  INTEGER. Row and column offset to start sub-matrix of dC. Uses dC(c_offset:c_offset+n, c_offset:c_offset+n). 0 <= c_offset < lddc.
[in] ngpu  INTEGER. Number of GPUs over which matrix C is distributed.
[in] nb  INTEGER. Block size used for distribution of C.
[in] queues  Array of CUDA queues, of dimension ngpu by 20. Queues to use for running multiple GEMMs in parallel. Only up to nqueue queues are used on each GPU.
[in] nqueue  INTEGER. Number of queues to use on each device.

void magmablas_zher2k_mgpu2 ( magma_uplo_t  uplo,
magma_trans_t  trans,
magma_int_t  n,
magma_int_t  k,
magmaDoubleComplex  alpha,
magmaDoubleComplex_ptr  dA[],
magma_int_t  ldda,
magma_int_t  a_offset,
magmaDoubleComplex_ptr  dB[],
magma_int_t  lddb,
magma_int_t  b_offset,
double  beta,
magmaDoubleComplex_ptr  dC[],
magma_int_t  lddc,
magma_int_t  c_offset,
magma_int_t  ngpu,
magma_int_t  nb,
magma_queue_t  queues[][20],
magma_int_t  nqueue 
)

ZHER2K performs one of the Hermitian rank 2k operations.

C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C,

or

C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C,

where alpha and beta are scalars with beta real, C is an n by n Hermitian matrix and A and B are n by k matrices in the first case and k by n matrices in the second case.

Parameters
[in] uplo  magma_uplo_t. Specifies whether the upper or lower triangular part of the array C is referenced:
  • = MagmaUpper: Only the upper triangular part of C is referenced.
  • = MagmaLower: Only the lower triangular part of C is referenced.
  Currently only the MagmaLower case is implemented.
[in] trans  magma_trans_t. Specifies the operation to be performed:
  • = MagmaNoTrans: C := alpha*A*B**H + conjg( alpha )*B*A**H + beta*C.
  • = MagmaConjTrans: C := alpha*A**H*B + conjg( alpha )*B**H*A + beta*C.
  Currently only the MagmaNoTrans case is implemented.
[in] n  INTEGER. The order of the matrix C. n >= 0.
[in] k  INTEGER. With trans = MagmaNoTrans, the number of columns of the matrices A and B; with trans = MagmaConjTrans, the number of rows of the matrices A and B. k >= 0.
[in] alpha  COMPLEX*16. The scalar alpha.
[in] dA  COMPLEX*16 array of dimension ( ldda, ka ), where ka is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix A; otherwise the leading k by n part of the array must contain the matrix A.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] ldda  INTEGER. The leading dimension of dA. When trans = MagmaNoTrans, ldda >= max( 1, n ); otherwise ldda >= max( 1, k ).
[in] a_offset  INTEGER. Row offset to start sub-matrix of dA. Uses dA(a_offset:a_offset+n, :). 0 <= a_offset < ldda.
[in] dB  COMPLEX*16 array of dimension ( lddb, kb ), where kb is k when trans = MagmaNoTrans, and is n otherwise. With trans = MagmaNoTrans, the leading n by k part of the array must contain the matrix B; otherwise the leading k by n part of the array must contain the matrix B.
  [TODO: describe distribution: duplicated on all GPUs.]
[in] lddb  INTEGER. The leading dimension of dB. When trans = MagmaNoTrans, lddb >= max( 1, n ); otherwise lddb >= max( 1, k ).
[in] b_offset  INTEGER. Row offset to start sub-matrix of dB. Uses dB(b_offset:b_offset+n, :). 0 <= b_offset < lddb.
[in] beta  DOUBLE PRECISION. The scalar beta.
[in,out] dC  COMPLEX*16 array of dimension ( lddc, n ).
  With uplo = MagmaUpper, the leading n by n upper triangular part of the array must contain the upper triangular part of the Hermitian matrix on entry, and the strictly lower triangular part is not referenced. On exit, the upper triangular part is overwritten by the upper triangular part of the updated matrix.
  With uplo = MagmaLower, the leading n by n lower triangular part of the array must contain the lower triangular part of the Hermitian matrix on entry, and the strictly upper triangular part is not referenced. On exit, the lower triangular part is overwritten by the lower triangular part of the updated matrix.
  Note that the imaginary parts of the diagonal elements need not be set; they are assumed to be zero, and on exit they are set to zero. [TODO: verify]
  [TODO: describe distribution: 1D column block-cyclic across GPUs.]
[in] lddc  INTEGER. The leading dimension of dC. lddc >= max( 1, n ).
[in] c_offset  INTEGER. Row and column offset to start sub-matrix of dC. Uses dC(c_offset:c_offset+n, c_offset:c_offset+n). 0 <= c_offset < lddc.
[in] ngpu  INTEGER. Number of GPUs over which matrix C is distributed.
[in] nb  INTEGER. Block size used for distribution of C.
[in] queues  Array of CUDA queues, of dimension ngpu by 20. Queues to use for running multiple GEMMs in parallel. Only up to nqueue queues are used on each GPU.
[in] nqueue  INTEGER. Number of queues to use on each device.
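
When allocating the per-GPU pieces of dC, it helps to know how many columns each GPU holds under a 1D column block-cyclic layout. The following sketch computes that count, assuming blocks of nb columns are dealt round-robin starting at GPU 0 and c_offset is zero. The function name `local_cols` is hypothetical and is not a MAGMA routine.

```c
#include <assert.h>

/* Number of columns of an n-column matrix stored on GPU dev when blocks
 * of nb columns are dealt round-robin over ngpu GPUs, starting at GPU 0.
 * Hypothetical helper for illustration; not part of MAGMA. */
int local_cols(int n, int nb, int ngpu, int dev)
{
    assert(n >= 0 && nb > 0 && ngpu > 0);
    assert(dev >= 0 && dev < ngpu);
    int nblocks = n / nb;               /* number of full blocks */
    int cols = (nblocks / ngpu) * nb;   /* full rounds: one block per GPU */
    int extra = nblocks % ngpu;         /* full blocks in the last partial round */
    if (dev < extra) {
        cols += nb;                     /* one more full block */
    } else if (dev == extra) {
        cols += n % nb;                 /* the trailing partial block, if any */
    }
    return cols;
}
```

With n = 10, nb = 3 and ngpu = 2, GPU 0 holds blocks 0 and 2 (6 columns) and GPU 1 holds block 1 plus the single trailing column (4 columns).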