What is the best option to solve many (1000-5000) small linear systems of size approximately 200?
I have tried the magma_sgesv_batched routine and achieve a performance of approximately 40 GFlop/s (on a K80). I use MAGMA 1.6.1, built with the Intel compiler 15.0.0 and with Intel MKL as the CPU LAPACK interface. Performance is excellent for large matrices.
On the other hand, using Intel MKL on a dual-socket E5-2630 I get approximately 500 GFlop/s for the same problem. Incidentally, the performance of Intel MKL on the Xeon Phi is similar to the performance on the GPU (approximately 40 GFlop/s, which is also somewhat disappointing).
Are there any other options? If not, what is actually the limiting factor in this problem? I realize that the flop/byte ratio is not as favorable as for large matrices. Nevertheless, the difference between the CPU and the GPU seems too large, especially since the batch size is quite large.
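To put a rough number on the flop/byte remark, here is a minimal sketch in C. It assumes an idealized model that is not from the thread: each matrix and right-hand side is read from memory once and written back once, and only the leading terms of the LU and triangular-solve flop counts are kept. The function names are made up for illustration.

```c
#include <assert.h>

/* Approximate flops for an LU-based solve of one n x n system with nrhs
   right-hand sides: 2/3 n^3 for the factorization plus 2 n^2 per
   right-hand side for the two triangular solves (leading terms only). */
double solve_flops(int n, int nrhs)
{
    assert(n > 0 && nrhs > 0);
    double dn = (double)n;
    return 2.0 / 3.0 * dn * dn * dn + 2.0 * dn * dn * (double)nrhs;
}

/* Idealized memory traffic in single precision (an assumption): the
   matrix and the right-hand sides are read once and written once. */
double solve_bytes(int n, int nrhs)
{
    assert(n > 0 && nrhs > 0);
    double elems = (double)n * (double)n + (double)n * (double)nrhs;
    return 2.0 * elems * (double)sizeof(float);
}

/* Arithmetic intensity of one solve under the model above. */
double flops_per_byte(int n, int nrhs)
{
    return solve_flops(n, nrhs) / solve_bytes(n, nrhs);
}
```

Under this model flops_per_byte(200, 1) comes out on the order of 17 flops per byte, and the ratio grows roughly linearly with n, so a 2000x2000 system has about ten times the intensity. Real kernels move more data than this lower bound, so the numbers are only an upper limit on the achievable intensity.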
Thank you,
Lukas
Best option to solve many small linear systems (batched)
Re: Best option to solve many small linear systems (batched)
Well, there are several issues here.
1- The K80 should be viewed as two K40s, meaning you should send half of your batch of matrices to device 0 and half to device 1, and run an independent batched call on each device.
2- There is a problem in our sgemm_batched routine that we are working on fixing (this is the biggest problem today).
3- The current version does an extra allocation and extra work for nothing when the matrix size is less than 256. We are working on cleaning this up.
4- The sgetrs_batched routine still uses strsm_batched instead of strsv_batched, and this is about 3-4 times slower.
We are working on fixing all of these issues one by one.
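For point 1, the bookkeeping of splitting a batch over the two devices of a K80 could look like the sketch below. It only computes which contiguous slice of the batch each device gets; the type and function names are hypothetical and are not part of MAGMA. In a real code one would then select each device in turn (MAGMA provides magma_setdevice for this) and issue an independent batched call on that device's slice.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous slice of a batch assigned to one device (hypothetical type). */
typedef struct {
    size_t first;  /* index of the first matrix in the slice */
    size_t count;  /* number of matrices in the slice */
} batch_slice;

/* Split batch_count matrices as evenly as possible over ndev devices.
   The first (batch_count % ndev) devices get one extra matrix. */
batch_slice split_batch(size_t batch_count, int ndev, int dev)
{
    assert(ndev > 0 && dev >= 0 && dev < ndev);
    size_t base  = batch_count / (size_t)ndev;
    size_t extra = batch_count % (size_t)ndev;
    size_t d     = (size_t)dev;

    batch_slice s;
    s.count = base + (d < extra ? 1 : 0);
    s.first = d * base + (d < extra ? d : extra);
    return s;
}
```

With a batch of 20000 and two devices, device 0 would handle matrices 0 to 9999 and device 1 matrices 10000 to 19999. Each device needs its own copy of its slice in its own memory, so the host-to-device transfers have to be split the same way.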
How critical is this for you? I can provide you with a patch for each fix, step by step.
By the way, are you using our tester?
I would also suggest reporting the performance of the sgetrf_batched routine.
I am surprised by the performance of the CPU.
Is it the 16-core Sandy Bridge machine?
Thanks
Azzam
Re: Best option to solve many small linear systems (batched)
Thanks for the fast response.
In my own code I use a loop parallelized with OpenMP and call MKL from the loop body. This seems to perform somewhat better than the MAGMA test example (375 GFlop/s for 200x200 matrices).
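The structure of such an OpenMP loop over independent small systems might look like the following sketch. The per-system solver here is a plain Gaussian elimination with partial pivoting written out by hand, used only as a stand-in so that the example is self-contained; the code described above calls MKL at that point instead. The names solve_one and solve_batch and the back-to-back storage layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

static float absf(float x) { return x < 0.0f ? -x : x; }

/* Stand-in for the per-system LAPACK call: solve A x = b in place.
   A is n x n, column-major, and is overwritten; b is overwritten by x.
   Returns 0 on success, 1 if an exactly zero pivot is found. */
int solve_one(int n, float *A, float *b)
{
    assert(n > 0 && A != NULL && b != NULL);
    for (int k = 0; k < n; ++k) {
        int p = k;                       /* partial pivoting: largest entry in column k */
        for (int i = k + 1; i < n; ++i)
            if (absf(A[i + k * n]) > absf(A[p + k * n])) p = i;
        if (A[p + k * n] == 0.0f) return 1;
        if (p != k) {                    /* swap rows k and p of A and b */
            for (int j = 0; j < n; ++j) {
                float t = A[k + j * n]; A[k + j * n] = A[p + j * n]; A[p + j * n] = t;
            }
            float t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (int i = k + 1; i < n; ++i) {  /* eliminate below the pivot */
            float l = A[i + k * n] / A[k + k * n];
            for (int j = k; j < n; ++j) A[i + j * n] -= l * A[k + j * n];
            b[i] -= l * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {     /* back substitution */
        float s = b[i];
        for (int j = i + 1; j < n; ++j) s -= A[i + j * n] * b[j];
        b[i] = s / A[i + i * n];
    }
    return 0;
}

/* Solve `batch` independent systems stored back to back; one system per
   loop iteration, iterations distributed over the OpenMP threads.
   Returns the number of systems that failed. */
int solve_batch(int n, int batch, float *A, float *b)
{
    int failures = 0;
    #pragma omp parallel for schedule(static) reduction(+:failures)
    for (int s = 0; s < batch; ++s)
        failures += solve_one(n, A + (size_t)s * n * n, b + (size_t)s * n);
    return failures;
}
```

When each iteration calls a threaded library rather than a hand-written solver, the library is usually restricted to one thread per call so that the parallelism comes from the loop only; whether that matches the setup described above is not stated in the post.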
Regarding the CPU, I made a copy-and-paste error (sorry about that). The two CPUs used are E5-2630 v3.
The benchmarks using the MAGMA test examples, as requested:
./testing_sgesv_batched -N 200 --batch 20000 --lapack
MAGMA 1.6.1 compiled for CUDA capability >= 3.0
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.0, MKL threads 16.
ndevices 2
device 0: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
device 1: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
Usage: ./testing_sgesv_batched [options] [-h|--help]
BatchCount    N  NRHS   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
     20000  200     1     265.03 ( 0.41)      56.49 ( 1.91)     2.35e-09   ok
./testing_sgetrf_batched -N 200 --batch 20000 --lapack
MAGMA 1.6.1 compiled for CUDA capability >= 3.0
CUDA runtime 7000, driver 7000. OpenMP threads 16. MKL 11.2.0, MKL threads 16.
ndevices 2
device 0: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
device 1: Tesla K80, 823.5 MHz clock, 11519.6 MB memory, capability 3.7
Usage: ./testing_sgetrf_batched [options] [-h|--help]
BatchCount    M    N   CPU GFlop/s (ms)   MAGMA GFlop/s (ms)   CUBLAS GFlop/s (ms)   ||PA-LU||/(||A||*N)
=========================================================================
     20000  200  200    20.32 (5229.93)     80.55 (1319.29)      37.52 (2832.14)     ---
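As a sanity check on how the GFlop/s columns relate to the timings, the sketch below recomputes the rates from the batch size and the elapsed time. It assumes the testers use the standard LAPACK operation counts for getrf and getrs on square matrices; that assumption is not stated in the output above, and the function names are made up for illustration.

```c
#include <assert.h>

/* Standard LAPACK operation count for an n x n LU factorization:
   2/3 n^3 - 1/2 n^2 + 5/6 n. */
double getrf_flops(double n)
{
    assert(n > 0.0);
    return 2.0 / 3.0 * n * n * n - 0.5 * n * n + 5.0 / 6.0 * n;
}

/* Standard LAPACK operation count for the triangular solves that follow:
   nrhs * (2 n^2 - n). */
double getrs_flops(double n, double nrhs)
{
    assert(n > 0.0 && nrhs > 0.0);
    return nrhs * (2.0 * n * n - n);
}

/* Aggregate rate in GFlop/s for a batch of identical problems. */
double batched_gflops(double flops_per_matrix, double batch, double seconds)
{
    assert(seconds > 0.0);
    return flops_per_matrix * batch / seconds / 1e9;
}
```

For example, batched_gflops(getrf_flops(200), 20000, 1.31929) gives about 80.55, which matches the MAGMA column of the sgetrf run, and batched_gflops(getrf_flops(200) + getrs_flops(200, 1), 20000, 1.91) gives roughly 56.5, close to the GPU column of the sgesv run (the printed time is rounded to two decimals).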
My code consists of an extremely compute-bound part (assembling the matrices) and the linear solves. Currently I do the assembly on the GPU, interleaved with the solves on the CPU. However, the GPU is underutilized, so it would be nice to move some of the linear solves to the GPU (which would also reduce data transfer).
Lukas