Best option to solve many small linear systems (batched)

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
lukase
Posts: 2
Joined: Tue May 26, 2015 3:42 pm

Best option to solve many small linear systems (batched)

Post by lukase » Tue May 26, 2015 4:00 pm

What is the best option to solve many (1000-5000) small linear systems of size approximately 200?

I have tried the magma_sgesv_batched routine and achieved approximately 40 GFlop/s on a K80. I use MAGMA 1.6.1 built with the Intel compiler 15.0.0, with Intel MKL as the CPU LAPACK interface. Performance is excellent for large matrices.

On the other hand, using Intel MKL on a dual-socket E5-2630, I get approximately 500 GFlop/s for the same problem. Incidentally, the performance of Intel MKL on the Xeon Phi is similar to the performance on the GPU (approximately 40 GFlop/s), which is also somewhat disappointing.

Are there any other options? If not, what is actually the limiting factor in this problem? I realize that the flop/byte ratio is not as favorable as for large matrices. Nevertheless, the difference between the CPU and GPU seems too large, especially since the batch size is quite large.
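
To put a rough number on the flop/byte ratio mentioned above, here is a minimal sketch. It assumes the leading-order LU flop count of 2/3 n^3 and a best case in which every matrix entry is read once and written once; neither assumption comes from the measurements in this thread, and real batched kernels may move the data more often. The function names are made up for illustration.

```python
# Best-case flop/byte estimate for LU-factoring one n x n matrix.
# Assumptions (illustrative only): leading-order flop count 2/3 * n^3,
# single precision (4 bytes per entry), one read and one write per entry.

def lu_flops(n: int) -> float:
    """Leading-order flop count of an LU factorization."""
    return 2.0 * n**3 / 3.0

def lu_min_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Minimum memory traffic: read the matrix once, write it once."""
    return 2 * n * n * bytes_per_entry

def flops_per_byte(n: int, bytes_per_entry: int = 4) -> float:
    """Best-case arithmetic intensity; simplifies to n / 12 for floats."""
    return lu_flops(n) / lu_min_bytes(n, bytes_per_entry)

if __name__ == "__main__":
    for n in (200, 2000):
        print(f"n = {n:5d}: at most {flops_per_byte(n):7.2f} flop/byte")
```

Under these assumptions the ratio grows linearly with n, so a 200x200 system has at most one tenth of the arithmetic intensity of a 2000x2000 one.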

Thank you,
Lukas

haidar
Posts: 22
Joined: Fri Sep 19, 2014 3:43 pm

Re: Best option to solve many small linear systems (batched)

Post by haidar » Tue May 26, 2015 4:30 pm

lukase wrote:
> What is the best option to solve many (1000-5000) small linear systems of size approximately 200?
>
> I have tried the magma_sgesv_batched routine and achieved approximately 40 GFlop/s on a K80. I use MAGMA 1.6.1 built with the Intel compiler 15.0.0, with Intel MKL as the CPU LAPACK interface. Performance is excellent for large matrices.

Well, there are several issues here.
1- The K80 should be viewed as two K40s, meaning you should send half of your batch to device 0 and the other half to device 1, and run an independent batched call on each.
2- We have a problem in our sgemm_batched routine and we are working on fixing it (this is the biggest problem today).
3- The current version does extra allocation and extra work for nothing when the matrix size is less than 256. We are working on cleaning this up.
4- The sgetrs_batched routine still uses strsm_batched instead of strsv_batched, and this is about 3-4 times slower.
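
As an illustration of point 1, the following sketch shows only the index bookkeeping for dividing a batch between the two halves of a K80. The helper name `split_batch` is hypothetical and is not part of MAGMA; the actual device selection and the two independent batched calls are not shown.

```python
# Sketch of the bookkeeping for point 1: divide a batch of matrices into
# contiguous index ranges, one per GPU. `split_batch` is a hypothetical
# helper for illustration, not a MAGMA function.

def split_batch(batch_count: int, n_devices: int = 2) -> list[tuple[int, int]]:
    """Return half-open (start, stop) ranges, one per device.

    Any remainder is spread over the first devices so that the range
    sizes differ by at most one matrix.
    """
    base, extra = divmod(batch_count, n_devices)
    ranges = []
    start = 0
    for dev in range(n_devices):
        stop = start + base + (1 if dev < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

if __name__ == "__main__":
    for dev, (start, stop) in enumerate(split_batch(5000)):
        print(f"device {dev}: matrices [{start}, {stop})")
```

Each range would then be copied to its own device and solved with a separate batched call, for example after selecting the device with `cudaSetDevice`; how the two calls are best overlapped is left open here.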

We are working on fixing all of these issues one by one.
How critical is this for you? I can provide you with a patch for each fix, step by step.

By the way, are you using our tester?
I would also suggest reporting the performance of the sgetrf_batched routine.

lukase wrote:
> On the other hand, using Intel MKL on a dual-socket E5-2630, I get approximately 500 GFlop/s for the same problem. Incidentally, the performance of Intel MKL on the Xeon Phi is similar to the performance on the GPU (approximately 40 GFlop/s), which is also somewhat disappointing.

I am surprised by the performance of the CPU.
Is it the 16-core Sandy Bridge machine?

Thanks
Azzam

lukase
Posts: 2
Joined: Tue May 26, 2015 3:42 pm

Re: Best option to solve many small linear systems (batched)

Post by lukase » Tue May 26, 2015 5:34 pm

Thanks for the fast response.

Regarding the CPU, I made a copy-and-paste error (sorry for that). The two CPUs used are E5-2630 v3.

Here are the benchmarks using the MAGMA test examples, as requested:

./testing_sgesv_batched  -N 200 --batch 20000 --lapack 
MAGMA 1.6.1  compiled for CUDA capability >= 3.0
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.0, MKL threads 16. 
ndevices 2
device 0: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
device 1: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
Usage: ./testing_sgesv_batched [options] [-h|--help]

BatchCount    N  NRHS   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
     20000      200     1    265.03 (   0.41)     56.49 (   1.91)   2.35e-09   ok


./testing_sgetrf_batched  -N 200 --batch 20000 --lapack  
MAGMA 1.6.1  compiled for CUDA capability >= 3.0
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.0, MKL threads 16. 
ndevices 2
device 0: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
device 1: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
Usage: ./testing_sgetrf_batched [options] [-h|--help]

BatchCount      M     N     CPU GFlop/s (ms)    MAGMA GFlop/s (ms)  CUBLAS GFlop/s (ms)  ||PA-LU||/(||A||*N)
=========================================================================
     20000     200    200       20.32 (5229.93)     80.55 (1319.29)      37.52 (2832.14)     ---

In my code, I use a loop parallelized with OpenMP and call MKL from the body of that loop. This seems to give somewhat better performance than the MAGMA test example (375 GFlop/s for 200x200 matrices).
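
The GFlop/s figures in the tester output can be cross-checked from the reported times. The sketch below uses the standard LAPACK operation counts for LU factorization and for the triangular solves; that the testers use exactly these counts is an assumption, although the results agree with the printed values. The function names are illustrative.

```python
# Cross-check of the tester output above from the reported run times.
# Assumption: the testers use the standard LAPACK operation counts.

def getrf_flops(n: int) -> float:
    """LU factorization of an n x n matrix: 2/3 n^3 - 1/2 n^2 + 5/6 n."""
    return 2.0 * n**3 / 3.0 - 0.5 * n**2 + 5.0 * n / 6.0

def getrs_flops(n: int, nrhs: int) -> float:
    """Forward and backward substitution: nrhs * (2 n^2 - n)."""
    return nrhs * (2.0 * n**2 - n)

def gflops(flops_per_matrix: float, batch: int, seconds: float) -> float:
    """Aggregate rate over the whole batch, in GFlop/s."""
    return flops_per_matrix * batch / seconds / 1e9

n, batch = 200, 20000

# Times taken from the sgetrf_batched output (milliseconds converted to seconds).
magma_getrf = gflops(getrf_flops(n), batch, 1.31929)
cublas_getrf = gflops(getrf_flops(n), batch, 2.83214)

# Time taken from the sgesv_batched output (printed with only two decimals).
gesv_flops = getrf_flops(n) + getrs_flops(n, 1)
magma_gesv = gflops(gesv_flops, batch, 1.91)

# Fraction of the total solve work that belongs to the triangular solves.
solve_share = getrs_flops(n, 1) / gesv_flops

if __name__ == "__main__":
    print(f"MAGMA  sgetrf: {magma_getrf:6.2f} GFlop/s")
    print(f"CUBLAS sgetrf: {cublas_getrf:6.2f} GFlop/s")
    print(f"MAGMA  sgesv : {magma_gesv:6.2f} GFlop/s")
    print(f"triangular solves: {100 * solve_share:.1f}% of the sgesv flops")
```

With one right-hand side the triangular solves account for only about 1.5% of the flops, yet the sgesv run took 1.91 s against 1.32 s for sgetrf alone. If the two runs are comparable, which is an assumption, this is consistent with the slow triangular-solve path mentioned earlier in the thread.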

My code actually consists of an extremely compute-bound problem (assembling the matrix) and the linear solves. Currently I do the assembly on the GPU, interleaved with the solves on the CPU. However, the GPU is underutilized, so it would be nice to move some of the linear solves to the GPU (which would also reduce data transfer).

Lukas
