Page 1 of 1

Best solution for solving hundreds of small linear systems

Posted: Wed Oct 02, 2019 7:30 am
by al12rs
Hi,
I would like to ask what would be the best approach for my particular problem:

I need to solve 100-1000 small (2x2 - 100x100) linear systems in the most efficient way possible as part of bigger HPC problem.
The most common case will have small 2x2, 3x3 or 4x4 liner systems while bigger systems are possible but less and less probable.
The number of linear systems on the other hand can be expected to be around 500 in the 100-1000 range in a consistent way on a specific installation.

I was looking for a high performing GPU based solution to compare to traditional CPU cluster based solutions, but I was not sure if MAGMA offered something efficient for batches of very small matrices under 100x100.

Re: Best solution for solving hundreds of small linear systems

Posted: Thu Oct 03, 2019 12:54 pm
by abdelfattah83
There is a batch routine for solving many small linear systems (magma_Xgesv_batched), where X specifies the precision (s, d, c, z).

The routine applies LU factorization with partial pivoting to the input matrices, followed by a row interchanges step and triangular solves. For very small matrices, though, there is definitely a benefit in fusing all of these steps into one GPU kernel, which MAGMA does not currently implement.

Here are sample results for magma_dgesv_batched against a CPU-based solution (MKL dgesv + OpenMP). The CPU is a 20-core Haswell architecture. The GPU is a Tesla V100 PCIe.

Code: Select all

./testing_dgesv_batched -N 10:100:10 --nrhs 1 --batch 500 --niter 1 -l
% MAGMA 2.5.0 svn compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10010, driver 10010. OpenMP threads 40. MKL 2018.0.1, MKL threads 20. 
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Thu Oct  3 12:44:41 2019
% Usage: ./testing_dgesv_batched [options] [-h|--help]

% BatchCount   N  NRHS   CPU Gflop/s (sec)   GPU Gflop/s (sec)   ||B - AX|| / N*||A||*||X||
%=======================================================================
       500     10     1      0.21 (   0.00)      2.61 (   0.00)   1.03e-17   ok
       500     20     1      4.24 (   0.00)     25.75 (   0.00)   5.10e-18   ok
       500     30     1      5.35 (   0.00)     78.02 (   0.00)   3.38e-18   ok
       500     40     1     28.56 (   0.00)     53.75 (   0.00)   3.08e-18   ok
       500     50     1     38.36 (   0.00)     84.90 (   0.00)   2.21e-18   ok
       500     60     1     55.46 (   0.00)    129.25 (   0.00)   2.23e-18   ok
       500     70     1     68.17 (   0.00)    100.43 (   0.00)   2.12e-18   ok
       500     80     1     60.25 (   0.00)    114.92 (   0.00)   1.69e-18   ok
       500     90     1     68.59 (   0.00)    128.80 (   0.00)   1.63e-18   ok
       500     100     1     84.74 (   0.00)    158.36 (   0.00)   1.65e-18   ok

Re: Best solution for solving hundreds of small linear systems

Posted: Fri Oct 04, 2019 7:55 pm
by al12rs
Thank you very much fir the detailed reply, do I see correctly that all the timings are 0.00 on both GPU and CPU? If so maybe increasing the batch size to 10^3 - 10^5 might actually be more of a workload?

Re: Best solution for solving hundreds of small linear systems

Posted: Mon Oct 07, 2019 4:01 pm
by mgates3
Yes, unfortunately here the time gets rounded down. But the performance is reflected in the Gflop/s rate. You can compute the approximate time using the formula:

2/3 n^3 * batch_count / (gflop/s)

For instance

2/3 * 100^3 * 500 / 158.36e9 = 0.0021 sec.

-mark