Hi,
I would like to ask what would be the best approach for my particular problem:
I need to solve 100-1000 small (2x2 to 100x100) linear systems as efficiently as possible, as part of a bigger HPC problem.
The most common cases will be small 2x2, 3x3, or 4x4 linear systems; bigger systems are possible but increasingly unlikely.
The number of linear systems, on the other hand, can be expected to stay consistent on a specific installation, somewhere around 500 and within the 100-1000 range.
I was looking for a high-performance GPU-based solution to compare against traditional CPU-cluster-based solutions, but I was not sure whether MAGMA offers something efficient for batches of very small matrices under 100x100.
Best solution for solving hundreds of small linear systems
abdelfattah83
Re: Best solution for solving hundreds of small linear systems
There is a batch routine for solving many small linear systems (magma_Xgesv_batched), where X specifies the precision (s, d, c, z).
The routine applies LU factorization with partial pivoting to the input matrices, followed by a row-interchange step on the right-hand sides and triangular solves. For very small matrices, though, there is definitely a benefit in fusing all of these steps into one GPU kernel, which MAGMA does not currently implement.
Here are sample results for magma_dgesv_batched against a CPU-based solution (MKL dgesv + OpenMP). The CPU is a 20-core Haswell; the GPU is a Tesla V100 PCIe.
./testing_dgesv_batched -N 10:100:10 --nrhs 1 --batch 500 --niter 1 -l
% MAGMA 2.5.0 svn compiled for CUDA capability >= 6.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10010, driver 10010. OpenMP threads 40. MKL 2018.0.1, MKL threads 20.
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Thu Oct 3 12:44:41 2019
% Usage: ./testing_dgesv_batched [options] [-h|--help]
% BatchCount N NRHS CPU Gflop/s (sec) GPU Gflop/s (sec) ||B - AX|| / N*||A||*||X||
%=======================================================================
500 10 1 0.21 ( 0.00) 2.61 ( 0.00) 1.03e-17 ok
500 20 1 4.24 ( 0.00) 25.75 ( 0.00) 5.10e-18 ok
500 30 1 5.35 ( 0.00) 78.02 ( 0.00) 3.38e-18 ok
500 40 1 28.56 ( 0.00) 53.75 ( 0.00) 3.08e-18 ok
500 50 1 38.36 ( 0.00) 84.90 ( 0.00) 2.21e-18 ok
500 60 1 55.46 ( 0.00) 129.25 ( 0.00) 2.23e-18 ok
500 70 1 68.17 ( 0.00) 100.43 ( 0.00) 2.12e-18 ok
500 80 1 60.25 ( 0.00) 114.92 ( 0.00) 1.69e-18 ok
500 90 1 68.59 ( 0.00) 128.80 ( 0.00) 1.63e-18 ok
500 100 1 84.74 ( 0.00) 158.36 ( 0.00) 1.65e-18 ok
Re: Best solution for solving hundreds of small linear systems
Thank you very much for the detailed reply. Do I see correctly that all the timings are 0.00 on both GPU and CPU? If so, maybe increasing the batch count to 10^3 - 10^5 would make for more of a workload?
Re: Best solution for solving hundreds of small linear systems
Yes, unfortunately the times here get rounded down to 0.00. But the performance is reflected in the Gflop/s rate, so you can compute the approximate time using the formula:
2/3 n^3 * batch_count / (gflop/s)
For instance
2/3 * 100^3 * 500 / 158.36e9 = 0.0021 sec.
-mark