I'm evaluating the new DSYSV and ZHESV routines for solving medium-sized symmetric or Hermitian systems with about 3000 to 10000 rows/columns. For the tests I used the supplied test code in testing/ on a K20X GPU, and I have some questions about the results:
1. Since the pivoting variants of DSYSV and ZHESV are much slower than their general counterparts DGESV/ZGESV, I tried the non-pivoting methods. Those perform very well, but the reported numerical error increases by more than 10 orders of magnitude when I add a second right-hand side (see below). It then stays at that high level when solving for more columns in the RHS. Obviously the non-pivoting variants sacrifice precision for speed, and at that system size one probably has to expect some loss of accuracy. But why does the inaccuracy in the LDL decomposition not show up when solving for a single RHS, while it does when more are added? This feels less like a numerical problem than an implementation or measurement problem, but I might be wrong. Can anybody explain?
2. The routines also report performance beyond the theoretical peak of the device for larger systems, starting at about 5000 rows/columns. This seems to be due to DSYTRF and ZHETRF reporting wrong performance. I tried to find the error but haven't succeeded yet (I didn't look that hard, though). Does anybody know what's wrong here?
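One quick check on the second point is to recompute the rate from N and the reported time. The sketch below does that for the last row of the output (N = 10304, 1.62 s) under two leading-order operation counts: 8/3 N^3 real flops for a complex LU factorization and half of that, 4/3 N^3, for a Hermitian LDL-type factorization. These counts, and the helper names, are my own assumptions for illustration; I have not confirmed which count the tester actually uses, and the solve phase is ignored because it is negligible for a few right-hand sides.

```python
def gflops(flops, seconds):
    """Convert an operation count and a wall time into GFlop/s."""
    return flops / seconds / 1e9

def complex_lu_flops(n):
    # Assumed leading-order count: n^3/3 complex multiplies (6 real flops each)
    # plus n^3/3 complex additions (2 real flops each).
    return (6 + 2) * n**3 / 3

def complex_ldl_flops(n):
    # Assumed leading-order count for a Hermitian factorization: half of LU.
    return (6 + 2) * n**3 / 6

# Values taken from the last row of the tester output.
n, seconds = 10304, 1.62

lu_rate = gflops(complex_lu_flops(n), seconds)
ldl_rate = gflops(complex_ldl_flops(n), seconds)

print(f"rate with LU count:  {lu_rate:8.1f} GFlop/s")
print(f"rate with LDL count: {ldl_rate:8.1f} GFlop/s")
```

If the rate computed with the LU count matches the reported figure while the LDL count gives about half of it, the timing would be fine and only the flop count used for reporting would be off by a factor of two. The halved figure would also sit below the roughly 1.3 TFlop/s double-precision peak usually quoted for the K20X.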
If these issues are unknown, I'll start investigating and try to file a proper bug report with more details. To reproduce (on my platform: MKL/CUDA 5/MAGMA 1.6.1/K20X), it suffices to run testing/testing_zhesv_nopiv_gpu --nrhs 2 and compare to --nrhs 1.
-mauro
One RHS
MAGMA 1.6.1 compiled for CUDA capability >= 3.0
CUDA runtime 5050, driver 5050. OpenMP threads 1. MKL 11.1.2, MKL threads 1.
ndevices 1
device 0: Tesla K20X, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_zhesv_nopiv_gpu [options] [-h|--help]
N NRHS CPU GFlop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1088 1 --- ( --- ) 128.66 ( 0.03) 1.96e-18 ok
2112 1 --- ( --- ) 436.21 ( 0.06) 1.60e-18 ok
3136 1 --- ( --- ) 800.54 ( 0.10) 1.35e-18 ok
4160 1 --- ( --- ) 1085.10 ( 0.18) 1.15e-18 ok
5184 1 --- ( --- ) 1302.22 ( 0.29) 1.25e-18 ok
6208 1 --- ( --- ) 1465.26 ( 0.44) 1.15e-18 ok
7232 1 --- ( --- ) 1584.66 ( 0.64) 1.03e-18 ok
8256 1 --- ( --- ) 1671.11 ( 0.90) 9.78e-19 ok
9280 1 --- ( --- ) 1745.97 ( 1.22) 8.53e-19 ok
10304 1 --- ( --- ) 1801.52 ( 1.62) 8.74e-19 ok
Two RHS
MAGMA 1.6.1 compiled for CUDA capability >= 3.0
CUDA runtime 5050, driver 5050. OpenMP threads 1. MKL 11.1.2, MKL threads 1.
ndevices 1
device 0: Tesla K20X, 732.0 MHz clock, 5759.6 MB memory, capability 3.5
Usage: ./testing_zhesv_nopiv_gpu [options] [-h|--help]
N NRHS CPU GFlop/s (sec) GPU GFlop/s (sec) ||B - AX|| / N*||A||*||X||
================================================================================
1088 2 --- ( --- ) 129.07 ( 0.03) 6.04e-04 failed
2112 2 --- ( --- ) 437.15 ( 0.06) 3.08e-04 failed
3136 2 --- ( --- ) 799.11 ( 0.10) 2.06e-04 failed
4160 2 --- ( --- ) 1084.44 ( 0.18) 1.56e-04 failed
5184 2 --- ( --- ) 1301.66 ( 0.29) 1.26e-04 failed
6208 2 --- ( --- ) 1464.16 ( 0.44) 1.04e-04 failed
7232 2 --- ( --- ) 1582.70 ( 0.64) 8.96e-05 failed
8256 2 --- ( --- ) 1671.41 ( 0.90) 7.90e-05 failed
9280 2 --- ( --- ) 1744.08 ( 1.22) 7.00e-05 failed
10304 2 --- ( --- ) 1801.10 ( 1.62) 6.29e-05 failed
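To tell a numerical problem from a measurement problem, it may help to evaluate the error column, ||B - AX|| / (N*||A||*||X||), separately for each right-hand side instead of once for the whole block. The following is a minimal pure-Python sketch of that idea on a tiny Hermitian example. The function names are hypothetical, and the choice of the matrix 1-norm is an assumption; the tester may use a different norm.

```python
def one_norm(M):
    """Matrix 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

def matmul(A, X):
    """Dense matrix product for lists of lists (fine for tiny examples)."""
    return [[sum(A[i][p] * X[p][j] for p in range(len(X)))
             for j in range(len(X[0]))] for i in range(len(A))]

def scaled_residual(A, X, B):
    """||B - A X|| / (N * ||A|| * ||X||), using the 1-norm (an assumption)."""
    n = len(A)
    AX = matmul(A, X)
    R = [[B[i][j] - AX[i][j] for j in range(len(B[0]))] for i in range(n)]
    return one_norm(R) / (n * one_norm(A) * one_norm(X))

def per_column_residuals(A, X, B):
    """Scaled residual of every right-hand side taken on its own."""
    result = []
    for j in range(len(B[0])):
        xj = [[row[j]] for row in X]   # j-th solution column as an N x 1 matrix
        bj = [[row[j]] for row in B]   # j-th right-hand side as an N x 1 matrix
        result.append(scaled_residual(A, xj, bj))
    return result

# Tiny Hermitian test matrix and an exact solution with two columns.
A = [[2, 1 - 1j], [1 + 1j, 3]]
X_exact = [[1, 1j], [0, 2]]
B = matmul(A, X_exact)

# Corrupt only the second solution column to mimic a solver that gets
# the first right-hand side right and the second one wrong.
X_bad = [row[:] for row in X_exact]
X_bad[1][1] = 2.5

cols = per_column_residuals(A, X_bad, B)
print(cols)
```

If the first column keeps a residual near machine precision while only the later columns are large, the factorization itself is probably fine and the solve or the residual check for additional columns deserves a closer look.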