Fortran Subarrays on GPU in RC4

fletchjp · Post by **fletchjp** » Sun Mar 13, 2011 6:50 am

I have some FORTRAN code which I am porting to work with MAGMA. I am using magma_dgetrf_gpu and magma_dgetrs_gpu which work fine. I am now working on speeding up the building of the matrix.

I am calculating some values on the GPU which form one row of the matrix. At the moment I copy them back a row at a time to the matrix on the CPU and then copy the whole matrix back to the GPU. This is clearly wasteful.

The device pointers are defined as in testing_dgetrf_gpu_f.f in RC4:

Code: Select all

      real, dimension(4)            :: devptrA, devptrB

My code to transfer one row looks like this (I am storing the transpose as the elements are then adjacent):

Code: Select all

      call cublas_get_matrix(n, 1, size_of_elt, devptrD, n, 
     $                       G(1,jrow),n)

G is an array on the CPU. This is followed later by the following:

Code: Select all

!---- devPtrA = G
      call cublas_set_matrix(n, n, size_of_elt, G, ldda, devptrA, ldda)

What I would like to do is something like this:

Code: Select all

      call cublas_dcopy(n,devptrD,1,devptrXXX,1)

where devptrXXX needs to point to the correct location in devptrA. I have been looking around for an example of this and cannot find one.

If I can crack this I can save two complete matrix transfers and the memory of the array on the CPU.

It would also help to have some explanation of the design decision to change the type of these pointers from RC3 to RC4.

Please help if you can.

Thanks

John

fletchjp · Post by **fletchjp** » Mon Mar 14, 2011 4:49 am

I have been working on a solution to my own problem and here is one which works with gfortran (4.4.3) on Ubuntu Linux 10.04 (64 bit).

I could not find a way to add anything to the pointers as defined in the MAGMA examples. So I looked at the code in cuda/src/fortran.c and added a new routine of my own. I spotted that all the interface routines did was to cast the FORTRAN pointer to a C pointer, so I have added the offset at that point. The function name has a trailing underscore for FORTRAN linkage.

Code: Select all

/* dcopy_offset.c
   This is a special version of the cublas_dcopy routine with an extra
   argument for the offset.  
*/

#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#if defined(__GNUC__)
#include <stdint.h>
#endif /* __GNUC__ */
#include "cublas.h"   /* CUBLAS public header file  */

#include "fortran_common.h"
#include "fortran.h"

/* Note _ at end of name for linking with gfortran.
   The offset is in bytes: it is added to the integer device handle
   before the cast to double *. */
void cublas_dcopy_offset_ (const int *n, const devptr_t *devPtrx, const int *incx, 
                     const devptr_t *devPtry, const int *incy, const int *offset)
{
    double *x = (double *)(*devPtrx);
    double *y = (double *)(*devPtry+*offset);
    cublasDcopy (*n, x, *incx, y, *incy);
}

Compiling requires access to the directories cuda/include and cuda/src:

Code: Select all

	gcc -O3 -DADD_ -DGPUSHMEM=200 -I$(CUDA_INCLUDE) -I$(CUDA_SRC) -c dcopy_offset.c

I found this reference: http://www.gsic.titech.ac.jp/~ccwww/teb ... la5_e.html
which helped with the calculation of the offset. The calculation is implemented as a FORTRAN function.

Code: Select all

      INTEGER FUNCTION IDX2F(i,j,ld)
        implicit none
        integer i, j, ld
      ! Zero-based element offset of entry (i,j), column-major storage
        IDX2F = ((((j)-1)*(ld))+((i)-1))
      end

The usage is like this:

Code: Select all

      call cublas_dcopy_offset(n,devptrD,1,devptrA,1,
     &                         IDX2F(1,jrow,ldda)*size_of_elt)
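The offset arithmetic in this call can be checked on the host without a GPU. The following is only a minimal sketch in C, not part of CUBLAS or MAGMA: `idx2f` mirrors the Fortran IDX2F function, and `byte_offset` is a hypothetical helper that multiplies by the element size, as the call does with size_of_elt.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based element offset of the 1-based entry (i, j) in a
   column-major array with leading dimension ld (mirrors IDX2F). */
size_t idx2f(int i, int j, int ld)
{
    assert(i >= 1 && j >= 1 && ld >= i);
    return (size_t)(j - 1) * (size_t)ld + (size_t)(i - 1);
}

/* Hypothetical helper: the same offset in bytes, which is the unit
   the offset argument of cublas_dcopy_offset is given in. */
size_t byte_offset(int i, int j, int ld, size_t size_of_elt)
{
    return idx2f(i, j, ld) * size_of_elt;
}
```

With a leading dimension of 8, for instance, column 3 starts 16 elements past the base pointer, which is 128 bytes when each element is an 8-byte double.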

I have tested this and it works in my environment. In some other environments the C function name will not need the trailing underscore.

I plan to extend this to allow the offset to be on the first parameter, or both.

I hope this helps in the developments.

Please let me have any comments on whether there is a neater way.

John

mateo70 · Post by **mateo70** » Tue Mar 15, 2011 5:08 pm

John,

Thanks for this; we were also looking for a way to handle that. We were thinking about providing the user with a function that computes the correct pointer on the device, but it will probably work for MAGMA and not for cublas.

Mathieu

fletchjp · Post by **fletchjp** » Wed Mar 16, 2011 8:57 am

Transfers have a big setup time. I have an application where I transfer 6 blocks 1 by N (N=8500 in this case).

I would like to transfer one block 6 by N and set up pointers to 6 contiguous blocks of N. That would save 5 setups per occurrence. I have to do this 4250 times, so it is a big saving.

One way to solve this would be to have a FORTRAN subroutine which took a pointer and an offset and gave back a new pointer which was the old pointer plus the offset.

John

fletchjp · Post by **fletchjp** » Sat Mar 19, 2011 7:57 pm

I have now written the following C function to handle pointer arithmetic from MAGMA FORTRAN. It uses the same headers as in my previous posting.

Code: Select all

void devptr_offset_(devptr_t *devPtr2,const devptr_t *devPtr1,const int *offset, const int *size)
{
  *devPtr2 = (*devPtr1 + (*offset)*(*size));
}

Usage is like this from a FORTRAN program:

Code: Select all

      stat = cublas_alloc(n*2, size_of_elt, devPtrD)
      call devptr_offset(devPtrD1,devPtrD,n,size_of_elt)

Then devPtrD points to the start of the array and devPtrD1 to the second block of n elements.

Note the need to give the size of each element.

I have applied this to my problem with the six vectors which I discussed above and find that the gain in time is smaller than I expected. I needed to put the arrays in a common block on the CPU so that I could control their relative location. Each vector is about 6000 double precision variables.
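The six-vector layout can be sketched on the host in C as follows. This is an illustration under assumptions, not the code used in the application: `devptr_t` is stood in for by `uintptr_t` (the real type comes from the CUDA fortran_common.h header), `devptr_offset` repeats the arithmetic of devptr_offset_ without the Fortran underscore, and `split_blocks` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an integer type wide enough to hold an address. */
typedef uintptr_t devptr_t;

/* Same arithmetic as devptr_offset_: offset counts elements,
   size is the number of bytes per element. */
void devptr_offset(devptr_t *new_ptr, const devptr_t *old_ptr,
                   const int *offset, const int *size)
{
    assert(*offset >= 0 && *size > 0);
    *new_ptr = *old_ptr + (devptr_t)(*offset) * (devptr_t)(*size);
}

/* Hypothetical helper: fill ptrs[0..nblocks-1] with the start of each
   block of n elements inside one contiguous allocation. */
void split_blocks(devptr_t base, int n, int size, int nblocks,
                  devptr_t *ptrs)
{
    for (int k = 0; k < nblocks; ++k) {
        int offset = k * n;   /* block k starts k*n elements in */
        devptr_offset(&ptrs[k], &base, &offset, &size);
    }
}
```

One allocation of 6*n elements plus one call to `split_blocks` then yields six handles, so a single transfer of the whole block can replace six separate ones.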

John

mateo70 · Post by **mateo70** » Mon Mar 21, 2011 12:07 pm

Thanks John,

That's close to what we planned to include in the next release with the Fortran interface. I'm busy with other projects right now, so I don't have a date for this final release.
The prototype we were thinking about is:

Code: Select all

 magma_[zcds]offset( NewPtr, OldPtr, LDA, I, J)

Mathieu

fletchjp · Post by **fletchjp** » Mon Mar 21, 2011 1:11 pm

I guess you will use the type letter to deduce the size of the elements.

Could you do a one dimensional version as well?
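To make the two requests concrete, here is a rough sketch in C of how a type letter could select the element size for both a two-dimensional and a one-dimensional offset. Everything in it is hypothetical: the names `elt_size`, `sketch_offset_2d` and `sketch_offset_1d`, the argument order, and the use of `uintptr_t` for `devptr_t` are assumptions for illustration and are not the MAGMA interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an integer type wide enough to hold an address. */
typedef uintptr_t devptr_t;

/* Hypothetical: bytes per element implied by the BLAS type letter. */
size_t elt_size(char type)
{
    switch (type) {
    case 's': return sizeof(float);        /* single real    */
    case 'd': return sizeof(double);       /* double real    */
    case 'c': return 2 * sizeof(float);    /* single complex */
    case 'z': return 2 * sizeof(double);   /* double complex */
    default:  return 0;                    /* unknown letter */
    }
}

/* Hypothetical 2-D version: handle of the 1-based entry (i, j) in a
   column-major array with leading dimension lda. */
void sketch_offset_2d(char type, devptr_t *new_ptr,
                      const devptr_t *old_ptr,
                      const int *lda, const int *i, const int *j)
{
    size_t size = elt_size(type);
    assert(size > 0 && *i >= 1 && *j >= 1 && *lda >= *i);
    *new_ptr = *old_ptr
             + ((devptr_t)(*j - 1) * (devptr_t)(*lda)
                + (devptr_t)(*i - 1)) * size;
}

/* Hypothetical 1-D version: handle of the 1-based element i. */
void sketch_offset_1d(char type, devptr_t *new_ptr,
                      const devptr_t *old_ptr, const int *i)
{
    size_t size = elt_size(type);
    assert(size > 0 && *i >= 1);
    *new_ptr = *old_ptr + (devptr_t)(*i - 1) * size;
}
```

Passing the letter as an argument keeps the sketch short; a family of four separately named routines, one per letter, would compute the same offsets.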

John

mateo70 · Post by **mateo70** » Wed Apr 06, 2011 2:46 pm

Yes, I will do that.
Fortunately, I looked at this post before doing it.

Mathieu

MAGMA Forum
