MAGMA + pycuda + my own CUDA kernels

mrader1248 · Post by **mrader1248** » Thu Sep 28, 2017 5:09 am

Hey all,

currently I have a problem when using MAGMA, pycuda and my own CUDA kernels together and I am not sure whether this is an issue with MAGMA, pycuda or my knowledge of CUDA.

In the code attached below, I am using CUDA via pycuda and also load MAGMA to perform a QR decomposition. In the end I also call a kernel tril to obtain R (of the QR decomposition). However, when I call the kernel, the program crashes with

Code: Select all

pycuda._driver.LogicError: cuFuncSetBlockShape failed: invalid resource handle

If I swap the code parts 1 and 2, the program gives the correct result. So my question is: Is there a specific sequence of calling MAGMA functions and compiling/calling my own kernels? Or is this maybe a pycuda issue?

Thank you in advance!

Code: Select all

import ctypes
import numpy as np

import pycuda.autoinit
import pycuda.compiler
import pycuda.gpuarray as gpuarray

c_int_type = ctypes.c_longlong
c_ptr_type = ctypes.c_void_p

_libmagma = ctypes.cdll.LoadLibrary("libmagma.so")

magma_init = _libmagma.magma_init
magma_init.restype = int
magma_init.argtypes = []

magma_finalize = _libmagma.magma_finalize
magma_finalize.restype = int
magma_finalize.argtypes = []

_magma_get_zgeqrf_nb = _libmagma.magma_get_zgeqrf_nb
_magma_get_zgeqrf_nb.restype = int
_magma_get_zgeqrf_nb.argtypes = [
    c_int_type, c_int_type # m, n
]

_magma_zgeqrf_gpu = _libmagma.magma_zgeqrf_gpu
_magma_zgeqrf_gpu.restype = int
_magma_zgeqrf_gpu.argtypes = [
    c_int_type, c_int_type, # m, n,
    c_ptr_type, c_int_type, # dA, ldda,
    c_ptr_type, c_ptr_type, # tau, dT
    c_ptr_type              # info
]

magma_init()

### PART 1 ###

cuda_source_module = pycuda.compiler.SourceModule("""
#include <pycuda-complex.hpp>

__global__ void tril(pycuda::complex<double> * a, int m, int mn) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    int kx = k / m;
    int ky = k % m;
    if (k < mn) {
        if (kx < ky) {
            a[k] = 0.0;
        }
    }
}
""")

_tril_raw = cuda_source_module.get_function("tril")

def _tril(a, blocksize=1024):
    gridsize = int(np.ceil(a.size/(1.*blocksize)))
    m, n = a.shape
    _tril_raw(a, np.int32(m), np.int32(m*n), 
              block=(blocksize,1,1), grid=(gridsize,1))

### PART 1 END ###

### PART 2 ###

m = 3
n = 2
a = np.random.rand(m,n) + 1j*np.random.rand(m,n)
a = np.asfortranarray(a)
a_gpu = gpuarray.to_gpu(a)

k = min(m, n)
nb = _magma_get_zgeqrf_nb(m, n)
tau = np.empty(k, dtype=np.complex128)
Tsize = (2*k + int(np.ceil(n / 32.))*32)*nb
T = gpuarray.empty(Tsize, dtype=np.complex128)
info = c_int_type()

_magma_zgeqrf_gpu(m, n, int(a_gpu.gpudata), m, 
                  int(tau.ctypes.data), int(T.gpudata), 
                  ctypes.byref(info))

### PART 2 END ###

_tril(a_gpu)

print a_gpu

magma_finalize()

mgates3 · Post by **mgates3** » Thu Sep 28, 2017 10:07 am

I haven't tried MAGMA with Python, so I'm not sure what might be going on here. One oddity I noticed is that you set c_int_type to long long, which I presume is 64-bit. Commonly, MAGMA is compiled with 32-bit int. Only if you compiled MAGMA with ILP64 and linked with an ILP64 BLAS & LAPACK library, then long long would be correct.

Some side comments:

You can use magmablas_zlaset to set the upper triangle, lower triangle, or whole matrix to zero or any other value. It can set the diagonal and off-diagonal differently.

You probably want to use magma_zgeqrf2_gpu, which has LAPACK-complaint arguments, instead of magma_zgeqrf_gpu. The main difference is that in magma_zgeqrf_gpu the diagonal blocks of the R matrix are stored in dT, so it's harder to extract R. (The reason is somewhat historical, but was done to make solves using QR faster.)

-mark

mrader1248 · Post by **mrader1248** » Fri Sep 29, 2017 4:37 am

Thank you very much for your comments!

Regarding the int32/int64 thing: I actually took the long long type from the scikit-cuda binding. However, I also tested int32 and int64 explicitly with the same result.
https://github.com/lebedov/scikit-cuda/ ... a/magma.py

Thank you for pointing out the magma functions I could use instead. Although, the tril kernel I used was just for giving a minimal example (I am using other kernels as well in my application).

Maybe the real problem is addressed here:
https://github.com/inducer/pycuda/issues/153
Andreas Klöckner (the pycuda developer) pointed out, that pycuda uses the "driver API" and MAGMA uses the "runtime API" and this might cause my problems. As I am not a CUDA expert, I am still thinking of how to solve my problems, but is it possible to use MAGMA also via the "driver API"?

mgates3 · Post by **mgates3** » Sun Oct 01, 2017 2:16 pm

That's interesting to know about the driver vs. runtime API. Yes, we always use the runtime API. If I recall correctly it avoids issues with running on different machines that have different (e.g., newer) driver versions installed. (One of our clients has this requirement.)

It could be that always running some dummy MAGMA kernel first, before a PyCUDA kernel (or vice-versa) during initialization may solve the problem.

-mark

mrader1248 · Post by **mrader1248** » Mon Oct 02, 2017 4:58 am

The answers seems to be given here:
https://stackoverflow.com/a/20553411/8683723

This is how I am interpreting it:
1. import pycuda.autoinit
2. Call magma_setdevice and don't call magma_init.
3. Now you are free to do whatever you like.

But now I'm wondering if MAGMA will work without calling magma_init? And, which parameters to use for calling magma_setdevice (depending on single-/multi-GPU setup)?

mrader1248 · Post by **mrader1248** » Mon Oct 02, 2017 8:27 am

I guess, I finally managed to get it working (for details see my example below). I took the final idea from this post: https://stackoverflow.com/a/10415966/8683723

Apparently one also has to call cudaFree(0) right after magma_setdevice.

Code: Select all

import ctypes
import numpy as np
import sys

import pycuda.autoinit
import pycuda.compiler
import pycuda.gpuarray as gpuarray

c_int_type = ctypes.c_int
c_ptr_type = ctypes.c_void_p

libcudart = ctypes.cdll.LoadLibrary("libcudart.so")

cudaFree = libcudart.cudaFree
cudaFree.restype = int
cudaFree.argtypes = [c_ptr_type]

libmagma = ctypes.cdll.LoadLibrary("libmagma.so")

magma_init = libmagma.magma_init
magma_init.restype = int
magma_init.argtypes = []

_magma_getdevices = libmagma.magma_getdevices
_magma_getdevices.restype = None
_magma_getdevices.argtypes = [c_ptr_type, c_int_type, c_ptr_type]

def magma_getdevices():
    size = 32
    devices = np.empty(size, dtype=c_int_type)
    num_devices = c_int_type()
    _magma_getdevices(devices.ctypes.data, size, ctypes.byref(num_devices))
    if num_devices.value == size:
        print "warning: maybe not all devices were detected!"
    return devices[:num_devices.value].copy()

magma_setdevice = libmagma.magma_setdevice
magma_setdevice.restype = None
magma_setdevice.argtypes = [c_int_type]

magma_get_zgeqrf_nb = libmagma.magma_get_zgeqrf_nb
magma_get_zgeqrf_nb.restype = int
magma_get_zgeqrf_nb.argtypes = [
    c_int_type, c_int_type # m, n
]

magma_zgeqrf_gpu = libmagma.magma_zgeqrf_gpu
magma_zgeqrf_gpu.restype = int
magma_zgeqrf_gpu.argtypes = [
    c_int_type, c_int_type, # m, n,
    c_ptr_type, c_int_type, # dA, ldda,
    c_ptr_type, c_ptr_type, # tau, dT
    c_ptr_type              # info
]

status = magma_init()
print "magma_init, status:", status

device_ids = magma_getdevices()
print "magma_getdevices, device_ids =", device_ids

magma_setdevice(device_ids[0])
status = cudaFree(0)
print "cudaFree, status:", status

cuda_source = """
#include <pycuda-complex.hpp>

__global__ void tril(pycuda::complex<double> * a, int m, int mn) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    int kx = k / m;
    int ky = k % m;
    if (k < mn) {
        if (kx < ky) {
            a[k] = 0.0;
        }
    }
}
"""
cuda_source_module = pycuda.compiler.SourceModule(cuda_source)

tril_raw = cuda_source_module.get_function("tril")

def tril(a, blocksize=1024):
    gridsize = int(np.ceil(a.size/(1.*blocksize)))
    m, n = a.shape
    tril_raw(a, np.int32(m), np.int32(m*n), 
             block=(blocksize,1,1), grid=(gridsize,1))

m = 3
n = 2
a = np.random.rand(m,n) + 1j*np.random.rand(m,n)
a = np.asfortranarray(a)
a_gpu = gpuarray.to_gpu(a)

k = min(m, n)
nb = magma_get_zgeqrf_nb(m, n)
tau = np.empty(k, dtype=np.complex128)
Tsize = (2*k + int(np.ceil(n / 32.))*32)*nb
T = gpuarray.empty(Tsize, dtype=np.complex128)
info = c_int_type()

magma_zgeqrf_gpu(m, n, int(a_gpu.gpudata), m, 
                  int(tau.ctypes.data), int(T.gpudata), 
                  ctypes.byref(info))

tril(a_gpu)

print a_gpu

mgates3 · Post by **mgates3** » Mon Oct 02, 2017 2:39 pm

magma_init() caches what CUDA devices are installed. Not calling magma_init() may cause some MAGMA kernels that depend on the CUDA architecture version to fail. However, they should print a warning like:

Code: Select all

Error in xyz: MAGMA not initialized (call magma_init() first) or bad device

Further, if you use the old magma.h instead of the new magma_v2.h, you may run into issues with queues that weren't allocated. I suggest using magma_v2.h and adding MAGMA_NO_V1 to CFLAGS in your make.inc.

As for calling magma_setdevice, use the index of the GPU you want to use. The first device, 0, is often fine. If you are using multi-GPUs, then currently all the codes use GPUs 0 to the number of GPUs requested (e.g., $MAGMA_NUM_GPUS); the codes set the devices themselves. You may need to initialize a CUDA driver context on each GPU, though.

The magma_getdevices is a bit of old code, originally intended to help with portability to OpenCL. It should just return an array with 0 to the number of GPUs - 1, so you can skip it and just use the normal CUDA device indices.

-mark

mrader1248 · Post by **mrader1248** » Wed Oct 04, 2017 5:12 am

Thank you very much for your help! Finally everything works. :)

mrader1248 · Post by **mrader1248** » Fri Oct 06, 2017 6:15 am

After some testing, I finally understood, what you meant in the beginning with "You probably want to use magma_zgeqrf2_gpu". When I increase the dimensions of my matrices, R is no longer stored in the memory containing A initally.

I want to be able to restore R (and also Q). In
viewtopic.php?f=2&t=321&p=1136&hilit=ma ... _gpu#p1136
I read that I should use magma_zgeqrf3_gpu for this task and a code like the one that follows.

Code: Select all

// get nb - in all functions has to be the same
int nb = magma_get_dgeqrf_nb(m);

// Do the QR factorization
magma_dgeqrf3_gpu( m, n, dA, ldda, tau, dT, info );

// copy dA to dQ to generate Q in dQ
cudaMemcpy2D(dQ, m*sizeof(double), dA, ldda*sizeof(double), m*sizeof(double), n, cudaMemcpyDeviceToDevice);

// copy dA to dR to generate R in dR
cudaMemcpy2D(dR, m*sizeof(double), dA, ldda*sizeof(double), m*sizeof(double), n, cudaMemcpyDeviceToDevice);

// generate Q
magma_dorgqr_gpu(m, m, min(m, n), dQ, lda, tau, dT, nb, &info);

// generate R
magmablas_dswapdblk(min(m, n), nb, dR, lda, 1, dT+min(m,n)*nb, nb, 0);

Is this still the best thing I can do?

What if I use magma_zgeqrf2_gpu instead? Then I don't have this T workspace. So I cannot restore R with a magma function on the GPU?

The code example copied from the linked post has only 8 parameters for the call of magmablas_dswapdblk. The documentation says that this function has 9 parameters. Do I just have to append the queue parameter? Is the last parameter (value 0) correct?

Thank you already in advance for your help!

mgates3 · Post by **mgates3** » Sat Oct 14, 2017 3:43 am

With magma_zgeqrf2_gpu, the arguments are LAPACK-complaint. This is the easiest way to get the R matrix. Basically, R overwrites the upper triangle of dA (including the diagonal), and Householder vectors for Q overwrite the lower triangle of dA. You don't need to mess with a dT array.

Using magma_zgeqrf3_gpu is more complicated because diagonal tiles of the R matrix are stored separately in the dT array, and everything has to be reconstructed. That is possible, but I don't recommend it. (testing_zgeqrf_gpu.cpp reconstructs R for version 3.) We use it because then multiplying by Q is little bit faster (the block-column V's that define Q have their upper triangle already set to the identity).

-mark

MAGMA Forum

MAGMA + pycuda + my own CUDA kernels

MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels

Re: MAGMA + pycuda + my own CUDA kernels