Create distributed matrix on GPUs with no CPU-to-GPU copy

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
dsambit
Posts: 5
Joined: Sun Oct 14, 2018 5:56 pm

Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by dsambit » Sun Oct 14, 2018 6:45 pm

Hi,

I want to use the _mgpu functions in MAGMA, as we need to work with matrix sizes of N = 40,000 and more, whose memory cannot fit in a single GPU.
Let's say I am working with k GPUs, and I am able to compute N/k different column blocks of the matrix independently, directly on the k GPUs.
Next, let's say I want to use magma_int_t magma_dpotrf_mgpu( magma_int_t ngpu, magma_uplo_t uplo, magma_int_t n, magmaDouble_ptr d_lA[], magma_int_t ldda, magma_int_t * info ).
I am wondering if there is a way to pass the distributed matrix memory that is already on the GPUs through the magmaDouble_ptr argument, without the need to copy the memory to the CPU.
Currently, if I understand correctly, the magmaDouble_ptr array is obtained using magma_dsetmatrix_1D_col_bcyclic(...), which uses the full matrix already present in CPU memory.
I would like to avoid copying to the CPU and back again to the GPUs, as this involves additional cost, and the full matrix might be beyond the CPU memory as well.

Thanks a lot in advance,

Best,
Sambit

Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by Stan Tomov » Sun Oct 14, 2018 11:53 pm

The magma_dpotrf_mgpu function requires that the matrix is distributed among the GPUs in a 1D block-cyclic way, where nb is obtained from magma_get_dpotrf_nb(n). The magma_dsetmatrix_1D_col_bcyclic function is just one example of how one can get to this distribution starting from CPU memory. If you already have the matrix in the required form, you can use magma_dpotrf_mgpu directly; otherwise some modifications would be required. For example, if you have it in 1D block-cyclic form but with a different nb, you can modify line 89 of magma_dpotrf_mgpu to take the correct nb.
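Concretely, with this distribution column block j (of width nb) lives on GPU (j/nb) mod ngpu. A small sketch of the index arithmetic (the helper function here is illustrative, not a MAGMA API):

Code:

#include "magma_v2.h"

// Illustrative helper (not a MAGMA API): for an n x n matrix stored
// 1D block-cyclically by columns with block width nb over ngpu GPUs,
// find which device owns global column j and where it sits locally.
static void locate_column( magma_int_t j, magma_int_t nb, magma_int_t ngpu,
                           magma_int_t *dev, magma_int_t *jlocal )
{
    magma_int_t bj = j / nb;                // global block index
    *dev    = bj % ngpu;                    // owning GPU
    *jlocal = (bj / ngpu) * nb + (j % nb);  // column within d_lA[*dev]
}

If your GPUs already hold the column blocks laid out this way, with nb = magma_get_dpotrf_nb(n), no data movement is needed.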

We are thinking of updating the auxiliary functions for these types of routines, so we are interested in what formats may be of interest, e.g., how do you have the matrix distributed and how do you describe the distribution? We can provide direct support for different distributions, or at least auxiliary routines that efficiently move data from a user-specified format to the 1D block-cyclic one taken by the routine, compute, and redistribute back to the user-specified format.

Stan

dsambit
Posts: 5
Joined: Sun Oct 14, 2018 5:56 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by dsambit » Mon Oct 15, 2018 12:41 am

Hi Stan,

Thanks for your reply.

I think I can work with the nb provided by magma_get_dpotrf_nb(n).

Stan Tomov wrote: "how do you have the matrix distributed and how do you describe the distribution?"

This is what I am really confused about. Let's say I have the nb-wide column blocks in memory across k GPUs, but it seems I somehow have to gather the array of pointers to these memory blocks on the different GPUs and communicate them to the GPU that will subsequently call magma_dpotrf_mgpu. I am very new to CUDA programming; after a little digging I found that this seems to require CUDA unified memory across multiple GPUs, using functions like cudaDeviceEnablePeerAccess and cudaMemsetAsync.

Stan Tomov wrote: "We can provide direct support for different distributions, or at least auxiliary routines that efficiently move data from a user-specified format to the 1D block-cyclic one taken by the routine, compute, and redistribute back to the user-specified format."

Thanks, this would be very helpful indeed. I am wondering whether an auxiliary function could be written that takes nb-wide memory blocks on different GPUs, allocated normally with cudaMalloc (without unified memory), and copies them into a CUDA unified-memory format that can then be used with any _mgpu function.

Just an aside question: are there _mgpu functions for (a) symmetric eigendecomposition and (b) gemm, or any plans to add them in the future? I couldn't find any in the documentation.

Best,
Sambit

mgates3
Posts: 918
Joined: Fri Jan 06, 2012 2:13 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by mgates3 » Mon Oct 15, 2018 4:56 pm

Sorry, the docs are a little confusing about what is on the CPU and what is on the GPU. The array of pointers, d_lA, is itself on the CPU, e.g.,

Code:

    double* d_lA[ MagmaMaxGPUs ];
(With CUDA, magmaDouble_ptr is just an alias for double*; the typedef exists for OpenCL compatibility.)

The pointers in d_lA point to data that is allocated on the multiple GPUs, e.g.,

Code:

    for (int dev = 0; dev < ngpu; ++dev) {
        magma_setdevice( dev );
        magma_dmalloc( &d_lA[ dev ], max_size );
    }
See the code in magma/testing/testing_dpotrf_mgpu.cpp for an example of setting up the matrices. The CPU calls magma_dpotrf_mgpu, not the GPU. There is no need for unified memory.
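Putting the pieces together, a minimal sketch (error checks omitted; the 4-GPU count and the MagmaLower choice are assumptions) that allocates the local panels, fills them directly on each device, and factors in place with no CPU copy:

Code:

#include "magma_v2.h"

int main( void )
{
    magma_init();

    magma_int_t n    = 40000;
    magma_int_t ngpu = 4;                        // assumption: 4 GPUs
    magma_int_t nb   = magma_get_dpotrf_nb( n );
    magma_int_t ldda = magma_roundup( n, 32 );   // padded leading dimension
    // columns per GPU: ceil( n/(nb*ngpu) ) blocks of width nb
    magma_int_t nlocal = magma_roundup( n, nb*ngpu ) / ngpu;

    double* d_lA[ MagmaMaxGPUs ];
    for (magma_int_t dev = 0; dev < ngpu; ++dev) {
        magma_setdevice( dev );
        magma_dmalloc( &d_lA[ dev ], ldda * nlocal );
        // ... compute this GPU's column blocks in place, no CPU copy ...
    }

    magma_int_t info;
    magma_dpotrf_mgpu( ngpu, MagmaLower, n, d_lA, ldda, &info );

    for (magma_int_t dev = 0; dev < ngpu; ++dev) {
        magma_setdevice( dev );
        magma_free( d_lA[ dev ] );
    }
    magma_finalize();
    return 0;
}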

Presently, the multi-GPU eigenvalue codes (e.g., dsyevd_m) are all CPU interfaces where the input and output matrices are on the CPU.
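For reference, a sketch of calling one of them; I'm assuming magma_dsyevd_m mirrors LAPACK dsyevd with a leading GPU count and supports the usual lwork = -1 workspace query:

Code:

// Sketch: multi-GPU symmetric eigensolver, A (n x n) and results on the CPU.
magma_int_t n = 10000, ngpu = 2, info;
double *A, *w, *work;
magma_int_t *iwork, lwork, liwork;
magma_dmalloc_cpu( &A, (size_t)n*n );
magma_dmalloc_cpu( &w, n );
// ... fill symmetric A ...
double qwork;  magma_int_t qiwork;
magma_dsyevd_m( ngpu, MagmaVec, MagmaLower, n, A, n, w,
                &qwork, -1, &qiwork, -1, &info );    // workspace query
lwork  = (magma_int_t) qwork;
liwork = qiwork;
magma_dmalloc_cpu( &work, lwork );
magma_imalloc_cpu( &iwork, liwork );
magma_dsyevd_m( ngpu, MagmaVec, MagmaLower, n, A, n, w,
                work, lwork, iwork, liwork, &info ); // eigenvalues in w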

We're also working on a multi-GPU gemm that has a CPU interface, similar to cublasXt but optimized for certain gemm shapes. A multi-GPU interface for gemm is a bit more complicated because good distributions of A, B, and C depend on their relative sizes.

-mark

dsambit
Posts: 5
Joined: Sun Oct 14, 2018 5:56 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by dsambit » Tue Oct 16, 2018 1:50 pm

Hi Mark,
mgates3 wrote: "See the code in magma/testing/testing_dpotrf_mgpu.cpp for an example of setting up the matrices. The CPU calls magma_dpotrf_mgpu, not the GPU. There is no need for unified memory."
Thanks for the clarification and directing me to the appropriate example code.
mgates3 wrote: "We're also working on a multi-GPU gemm that has a CPU interface, similar to cublasXt but optimized for certain gemm shapes. A multi-GPU interface for gemm is a bit more complicated because good distributions of A, B, and C depend on their relative sizes."

I think for now I can use cublasXt for out-of-GPU-memory matrix-matrix multiplication.
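For reference, a minimal cublasXt sketch (host pointers throughout; the two-GPU device list and the wrapper name are assumptions) — cublasXt tiles the host arrays across the selected GPUs, so A, B, and C may exceed any single GPU's memory:

Code:

#include <cublasXt.h>

// Hedged sketch: C = A*B with host-resident A (m x k), B (k x n), C (m x n).
void xt_dgemm( int m, int n, int k,
               const double *A, size_t lda,
               const double *B, size_t ldb,
               double *C, size_t ldc )
{
    cublasXtHandle_t handle;
    cublasXtCreate( &handle );
    int devices[2] = { 0, 1 };                 // assumption: GPUs 0 and 1
    cublasXtDeviceSelect( handle, 2, devices );

    const double alpha = 1.0, beta = 0.0;
    cublasXtDgemm( handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                   &alpha, A, lda, B, ldb, &beta, C, ldc );
    cublasXtDestroy( handle );
}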

In my last post I missed asking about one more function, potri, for which an out-of-GPU-memory multi-GPU implementation would be very beneficial in our application.
What we are trying to do is orthogonalize the N column vectors of a matrix X (size M x N), where N can be 40,000 and more. This is the most computationally expensive step in our electronic-structure code when N is very large.
Our orthogonalization algorithm is as follows:
Step 1) Compute the overlap matrix S = X^{T} * X (no out-of-GPU-memory implementation needed here, as we plan to do this in a blocked manner on the GPUs)
Step 2) Compute the Cholesky factorization S = L * L^{T} (this can be done out of GPU memory using magma_dpotrf_mgpu)
Step 3) Compute L^{-1} (out of GPU memory for potri?)
Step 4) Compute X = X * L^{-1} (no out-of-GPU-memory implementation needed here, as we plan to do this in a blocked manner on the GPUs)

I am trying to find a path to do all four of the above steps without copying data back to the CPU, and I am wondering if this is currently possible using MAGMA.

Thank you,
Sambit

mgates3
Posts: 918
Joined: Fri Jan 06, 2012 2:13 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by mgates3 » Tue Oct 16, 2018 2:12 pm

I don't see a reason to create an explicit inverse here. Inverses are usually slower and less accurate than doing a solve. Just do a triangular solve (trsm) with L on the right. I.e., solve

Xhat L = X

for Xhat, instead of Xhat = X L^{-1}. MAGMA has trsm_m (a CPU interface for multi-GPU), but no out-of-GPU-memory version.

Note that the accuracy and orthogonality of the Cholesky QR method may be worse than doing an actual QR factorization (geqrf). See the Matlab example below. Ichitaro Yamazaki also has a paper about Cholesky QR.

Code:

>> X = rand( m, n );
>> S = X' * X;
>> L = chol(S);
>> Xhat = X / L;  % triangular solve (trsm) with L on right
>> norm( Xhat' * Xhat - eye(n, n) )
ans =
   1.4121e-14

>> [Q, R] = qr( X, 0 );
>> norm( Q' * Q - eye(n, n) )
ans =
   1.1641e-15
-mark

Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by Stan Tomov » Tue Oct 16, 2018 3:15 pm

The routine described is actually in MAGMA: dgegqr_gpu, version 4 (a few other versions are documented there as well). You can test them and see how they are called, e.g., with

Code:

./testing_dgegqr_gpu --version 4 -N 10000,64 -c 
The implementation itself is very simple, using the MAGMA building blocks:

Code:

// ================== Cholesky QR       ================================
magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
             dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dgetmatrix( n, n, dwork, n, work, n, queue );
lapackf77_dpotrf( "u", &n, work, &n, info );
magma_dsetmatrix( n, n, work, n, dwork, n, queue );
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
             m, n, c_one, dwork, n, dA, ldda, queue );
// ================== end of ikind == 4 ================================
Note that this case was targeting tall-and-skinny matrices, so we explicitly copy the overlap matrix to the CPU and do the Cholesky there. As N is large in your case, you can call magma_dpotrf_gpu directly, i.e., something like this:

Code:

magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
             dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dpotrf_gpu( MagmaUpper, n, dwork, n, &info );
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
             m, n, c_one, dwork, n, dA, ldda, queue );
Stan

dsambit
Posts: 5
Joined: Sun Oct 14, 2018 5:56 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by dsambit » Tue Oct 16, 2018 3:36 pm

Hi Mark,
mgates3 wrote: "I don't see a reason to create an explicit inverse here. Inverses are usually slower and less accurate than doing a solve. Just do a triangular solve (trsm) with L on the right."

We do not use a triangular solve currently, as our X matrix is a parallel distributed-memory matrix (M_global x N), where M_global can go beyond 1e+6 degrees of freedom. The parallel layout of X comes from the finite-element discretization partitioning. We could possibly copy X into a ScaLAPACK block-cyclic layout and do the triangular solve there, but I am not sure that would be efficient.

mgates3 wrote: "Note that the accuracy and orthogonality of the Cholesky QR method may be worse than doing an actual QR factorization (geqrf)."

We have benchmarked the accuracy and robustness of Cholesky QR in our electronic-structure calculations against an actual QR. The reason we use Cholesky QR is that its prefactor is much lower than that of an actual QR in a parallel implementation.

-Sambit

mgates3
Posts: 918
Joined: Fri Jan 06, 2012 2:13 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by mgates3 » Tue Oct 16, 2018 3:54 pm

You could simply compute trsm in chunks (potentially in parallel). E.g., in Matlab:

Code:

% compute nb rows at a time
nb = 100;
for i = 1 : nb : m
    ib = min( nb, m - i + 1 );
    Xhat( i:i+ib-1, : ) = X( i:i+ib-1, : ) / L;
end
I guess it depends on your distribution. It seems trsm (X / L) wouldn't be harder than trmm (X * L^{-1}).
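In MAGMA terms the same chunking might look like this (a sketch; the chunk size, queue handling, and helper name are my assumptions, with the n x n upper-triangular factor resident on one GPU):

Code:

#include "magma_v2.h"

// Hedged sketch: solve Xhat*L = X in row chunks, so X (m x n, on the CPU)
// never has to fit on the GPU; only L (n x n) and one chunk do.
void chunked_trsm( magma_int_t m, magma_int_t n,
                   double *hX, magma_int_t ldx,
                   magmaDouble_ptr dL, magma_int_t lddl,
                   magma_queue_t queue )
{
    magma_int_t nb = 1024;                    // rows per chunk (tunable)
    magmaDouble_ptr dX;
    magma_dmalloc( &dX, nb * n );             // workspace for one chunk

    for (magma_int_t i = 0; i < m; i += nb) {
        magma_int_t ib = (m - i < nb ? m - i : nb);
        magma_dsetmatrix( ib, n, &hX[i], ldx, dX, nb, queue );   // chunk in
        magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
                     ib, n, 1.0, dL, lddl, dX, nb, queue );      // solve
        magma_dgetmatrix( ib, n, dX, nb, &hX[i], ldx, queue );   // chunk out
    }
    magma_free( dX );
}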

-mark
Last edited by mgates3 on Tue Oct 16, 2018 3:58 pm, edited 1 time in total.
Reason: gemm => trmm for L^{-1}

dsambit
Posts: 5
Joined: Sun Oct 14, 2018 5:56 pm

Re: Create distributed matrix on GPUs with no CPU-to-GPU copy

Post by dsambit » Tue Oct 16, 2018 10:51 pm

Thank you for your suggestions.
I now have enough information to get started with something similar to Stan's suggestion:

Code:

magma_dgemm( MagmaConjTrans, MagmaNoTrans, n, n, m, c_one,
             dA, ldda, dA, ldda, c_zero, dwork, n, queue );
magma_dpotrf_gpu( MagmaUpper, n, dwork, n, &info );
magma_dtrsm( MagmaRight, MagmaUpper, MagmaNoTrans, MagmaNonUnit,
             m, n, c_one, dwork, n, dA, ldda, queue );
mgates3 wrote: "I guess it depends on your distribution. It seems trsm (X / L) wouldn't be harder than trmm (X * L^{-1})."

Thanks, I will think about this approach as well.

Best,
Sambit
