Symmetric and Hermitian solvers

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
mcalderara
Posts: 4
Joined: Mon Jan 26, 2015 10:20 am

Symmetric and Hermitian solvers

Post by mcalderara » Thu Mar 05, 2015 5:47 am

Hi everyone,

I'm evaluating the new DSYSV and ZHESV routines for solving medium-sized symmetric or Hermitian systems, with the number of rows/columns ranging from 3000 to about 10000. For the tests I used the supplied test code in testing/ on a K20x GPU, and I have some questions about the results:

1. Since the pivoting variants of DSYSV and ZHESV perform much slower than their general counterparts DGESV/ZGESV, I tried the non-pivoting methods. Those seem to perform very well, but the reported numerical error increases by more than 10 orders of magnitude when adding a second right-hand side (see below). It then stays at that high level when solving for more columns in the RHS. Obviously the non-pivoting variants conceptually sacrifice precision for speed, and at that system size one probably has to expect a decrease in accuracy, but why does the inaccuracy in the LDL decomposition not show when solving for a single RHS while it does when adding more? Somehow this feels like less of a numerical problem than an implementation/measurement problem, but I might be wrong. Can anybody explain?

2. The routines also report performance beyond the theoretical peak of the device for larger systems, starting at about 5000 rows/columns. This seems to be due to DSYTRF and ZHETRF reporting wrong performance. I tried to find the error but haven't succeeded yet (I didn't look that hard, though). Does anybody know what's wrong here?

If these issues are unknown, I'll start investigating and try to do a proper bug report with more details. To reproduce (on my platform: MKL/CUDA5/MAGMA 1.6.1/K20x), it suffices to run testing/testing_zhesv_nopiv_gpu --nrhs 2 and compare to --nrhs 1.

-mauro

One RHS

MAGMA 1.6.1  compiled for CUDA capability >= 3.0
CUDA runtime 5050, driver 5050. OpenMP threads 1. MKL 11.1.2, MKL threads 1. 
ndevices 1
device 0: Tesla K20X, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_zhesv_nopiv_gpu [options] [-h|--help]

    N  NRHS   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
 1088     1     ---   (  ---  )    128.66 (   0.03)   1.96e-18   ok
 2112     1     ---   (  ---  )    436.21 (   0.06)   1.60e-18   ok
 3136     1     ---   (  ---  )    800.54 (   0.10)   1.35e-18   ok
 4160     1     ---   (  ---  )   1085.10 (   0.18)   1.15e-18   ok
 5184     1     ---   (  ---  )   1302.22 (   0.29)   1.25e-18   ok
 6208     1     ---   (  ---  )   1465.26 (   0.44)   1.15e-18   ok
 7232     1     ---   (  ---  )   1584.66 (   0.64)   1.03e-18   ok
 8256     1     ---   (  ---  )   1671.11 (   0.90)   9.78e-19   ok
 9280     1     ---   (  ---  )   1745.97 (   1.22)   8.53e-19   ok
10304     1     ---   (  ---  )   1801.52 (   1.62)   8.74e-19   ok

Two (or more) RHS

MAGMA 1.6.1  compiled for CUDA capability >= 3.0
CUDA runtime 5050, driver 5050. OpenMP threads 1. MKL 11.1.2, MKL threads 1. 
ndevices 1
device 0: Tesla K20X, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_zhesv_nopiv_gpu [options] [-h|--help]

    N  NRHS   CPU GFlop/s (sec)   GPU GFlop/s (sec)   ||B - AX|| / N*||A||*||X||
================================================================================
 1088     2     ---   (  ---  )    129.07 (   0.03)   6.04e-04   failed
 2112     2     ---   (  ---  )    437.15 (   0.06)   3.08e-04   failed
 3136     2     ---   (  ---  )    799.11 (   0.10)   2.06e-04   failed
 4160     2     ---   (  ---  )   1084.44 (   0.18)   1.56e-04   failed
 5184     2     ---   (  ---  )   1301.66 (   0.29)   1.26e-04   failed
 6208     2     ---   (  ---  )   1464.16 (   0.44)   1.04e-04   failed
 7232     2     ---   (  ---  )   1582.70 (   0.64)   8.96e-05   failed
 8256     2     ---   (  ---  )   1671.41 (   0.90)   7.90e-05   failed
 9280     2     ---   (  ---  )   1744.08 (   1.22)   7.00e-05   failed
10304     2     ---   (  ---  )   1801.10 (   1.62)   6.29e-05   failed

ichitaro
Posts: 9
Joined: Fri Jul 12, 2013 11:11 am

Re: Symmetric and Hermitian solvers

Post by ichitaro » Thu Mar 05, 2015 1:09 pm

Hello, Mauro,

1. I also need to take a look at this numerical issue. Since we use a random matrix and random vectors, I would think the error should stay about the same for multiple right-hand sides (unless we get very unlucky). I will let you know if I find something.

2. That is a bug. We use the LU flop count for ZHESV (we probably use that to compare against LU).

Thank you for the comments!!
Ichi
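
The effect of using the LU flop count can be checked against the numbers posted above. The sketch below is only an approximation: it assumes leading-order counts (about 8/3 n^3 real flops for a complex LU factorization and about 4/3 n^3 for a complex LDL^H factorization) and ignores lower-order terms and the solve phase. The function names are made up for this illustration and are not part of MAGMA.

```python
# Leading-order flop counts only; lower-order terms and the solve phase are ignored.
# Function names are hypothetical helpers, not MAGMA routines.

def complex_lu_flops(n):
    """Approximate real flops for a complex LU factorization (about 8/3 n^3)."""
    return 8.0 / 3.0 * n**3


def complex_ldlt_flops(n):
    """Approximate real flops for a complex LDL^H factorization (about 4/3 n^3)."""
    return 4.0 / 3.0 * n**3


def gflops(flops, seconds):
    """Convert a flop count and a runtime in seconds to GFlop/s."""
    return flops / seconds / 1e9


# N and GPU time taken from the last row of the tables in the first post.
n, seconds = 10304, 1.62

rate_with_lu_count = gflops(complex_lu_flops(n), seconds)
rate_with_ldlt_count = gflops(complex_ldlt_flops(n), seconds)

print(f"rate using LU flop count:    {rate_with_lu_count:7.1f} GFlop/s")
print(f"rate using LDL^H flop count: {rate_with_ldlt_count:7.1f} GFlop/s")
```

With the LU count the rate comes out at roughly 1800 GFlop/s, close to the value the tester printed for N = 10304, while the LDL^H count gives half of that. Under these assumptions the corrected figure would sit below the roughly 1.3 TFlop/s double-precision peak usually quoted for the K20X.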

ichitaro
Posts: 9
Joined: Fri Jul 12, 2013 11:11 am

Re: Symmetric and Hermitian solvers

Post by ichitaro » Thu Mar 05, 2015 1:23 pm

Hi, Mauro,

Actually, there was a bug in ZHESV for multiple right-hand-sides, which has been fixed in our SVN. I attached the patch.

Please let us know if you have more comments!!
Ichi
Attachments
xsytrs_nopiv_gpu.tar
Patch to fix sytrs with multiple rhs by Adrien.
(20 KiB)

mcalderara
Posts: 4
Joined: Mon Jan 26, 2015 10:20 am

Re: Symmetric and Hermitian solvers

Post by mcalderara » Sun Mar 08, 2015 4:42 pm

Hi Ichitaro,

Thanks for the patch! It does solve the problem with the drop in precision, but it introduces a performance degradation when supplying more RHS. The previous code ran at roughly the same speed (in GFlop/s) irrespective of the number of RHS; the new one drops to around 15 GFlop/s when solving a 4000x4000 system with 4000 RHS and then takes about 40 seconds to solve the system.

Best,
mauro

mcalderara
Posts: 4
Joined: Mon Jan 26, 2015 10:20 am

Re: Symmetric and Hermitian solvers

Post by mcalderara » Tue Mar 10, 2015 12:24 pm

After looking at the code, I have the impression that the slowdown is due to something that could potentially be optimized a bit. The fix for the precision involves scaling the matrix with the elements of each column of the RHS using zscal_diag(). This results in a number of divisions that grows linearly in both the system size and the number of RHS.

Since divisions tend to be slow, would it not make sense to compute the product of each RHS row's values, invert it once and then scale by multiplying the elements in the system with that inverse? Assuming the same number n of RHS as rows and columns in the system, this would result in 1.5*n multiplications and one division instead of 1.5 n*n divisions. Or would that be numerically unstable?

mauro
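
One possible reading of the suggestion above, sketched in plain Python rather than in the actual MAGMA kernel: when every entry in row i of the right-hand sides is divided by the same diagonal value d[i], the reciprocal 1/d[i] can be computed once per row and then applied by multiplication. This is an assumption about what the scaling step does, not a description of zscal_diag(); the function names and the data are hypothetical.

```python
# Illustrative sketch only; not the MAGMA implementation.
# d is a hypothetical real diagonal, B a hypothetical block of right-hand sides.

def scale_rows_by_division(B, d):
    """Divide row i of B by d[i]: one division per entry (n * nrhs in total)."""
    return [[b / d_i for b in row] for row, d_i in zip(B, d)]


def scale_rows_by_reciprocal(B, d):
    """One division per row (1 / d[i]), then one multiplication per entry."""
    out = []
    for row, d_i in zip(B, d):
        inv = 1.0 / d_i  # the only division for this row
        out.append([b * inv for b in row])
    return out


d = [2.0, -0.5, 3.0]
B = [[1 + 2j, 4 - 1j],
     [0.5j, 2.0 + 0j],
     [3.0 + 0j, -6j]]

X_div = scale_rows_by_division(B, d)
X_rec = scale_rows_by_reciprocal(B, d)

# Largest entrywise difference between the two variants.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(X_div, X_rec)
               for a, b in zip(row_a, row_b))
print("max difference:", max_diff)
```

The reciprocal variant needs n divisions plus n * nrhs multiplications instead of n * nrhs divisions. Multiplying by a rounded reciprocal adds one extra rounding per entry, so the two results can differ in the last bits, and a very small d[i] can overflow when inverted.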

ichitaro
Posts: 9
Joined: Fri Jul 12, 2013 11:11 am

Re: Symmetric and Hermitian solvers

Post by ichitaro » Thu Mar 12, 2015 1:25 pm

Thank you for the post, again,

Apparently, the previous patch was a quick-fix to address the multiple rhs. Here is another version, which should perform better. Please let us know if you have more issues!!

Best,
Ichi
Attachments
xsytrs_nopiv_gpu.tar
(20 KiB)
