Limitations on precision
Posted: Mon Mar 02, 2020 1:28 am
by generalzod
Hello
I want to compute the eigenvalues and eigenvectors of a very large, dense matrix (n = 100k).
The error (|A-USU^H|) seems to accumulate as the matrix gets larger.
Is this an inherent limitation of double-precision arithmetic, or can it be mitigated somehow, e.g., with some other iterative method?
Thank you
Re: Limitations on precision
Posted: Mon Mar 02, 2020 11:08 am
by mgates3
Are you using MAGMA's testers to test these, e.g., testing/testing_zheevd?
Which specific routine are you using?
If using MAGMA's tester, can you share the complete input & output that is concerning you?
We generally check the relative backwards error,
|| A - U S U^H ||_1 / ( || A ||_1 N )
MAGMA's tester abbreviates that as |A-USU^H| in the output header, but actually computes the above quantity.
The absolute error || A - U S U^H ||_1 does grow with the matrix size, since more values are accumulated into the norm. E.g., if every element of a vector x of length n has some small error tau, then the 1-norm of the error over the whole vector is n*tau.
Mark
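To make the difference between the absolute and the scaled error concrete, here is a minimal sketch in plain Python. It is not MAGMA's tester code: the matrix E merely stands in for the residual A - U S U^H, tau is a hypothetical per-entry error, and the function names are made up for illustration. It prints, for a few sizes, the absolute 1-norm of the error and the scaled quantity || E ||_1 / ( || A ||_1 N ).

```python
# Illustrative sketch only: E stands in for the residual A - U S U^H, and
# tau is a hypothetical per-entry error, not a value measured by MAGMA.

def one_norm(M):
    """Matrix 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in M) for j in range(len(M[0])))

def scaled_residual(A, E):
    """Return ||E||_1 / (||A||_1 * N) for an N-by-N matrix A."""
    n = len(A)
    return one_norm(E) / (one_norm(A) * n)

tau = 1e-16
for n in (10, 100, 1000):
    A = [[1.0] * n for _ in range(n)]   # toy matrix with entries of size 1
    E = [[tau] * n for _ in range(n)]   # every entry off by tau
    print(n, one_norm(E), scaled_residual(A, E))
```

In this toy setup the absolute norm grows in proportion to n, while the scaled quantity does not grow, which is why the scaled form is the one to compare against machine precision.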
Re: Limitations on precision
Posted: Wed Mar 11, 2020 2:25 am
by generalzod
Hi
This is what I get with testing_dsyevd:
% MAGMA 2.5.2 compiled for CUDA capability >= 6.0, 64-bit magma_int_t, 64-bit pointer.
% CUDA runtime 10020, driver 10020. OpenMP threads 32.
% device 0: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 1: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 2: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% device 3: Tesla P100-SXM2-16GB, 1480.5 MHz clock, 16280.9 MiB memory, capability 6.0
% Tue Mar 3 18:43:12 2020
% Usage: ./testing_dsyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 4
% N CPU Time (sec) GPU Time (sec) |S-S_magma| |A-USU^H| |I-U^H U|
%============================================================================
1088 --- 3.2039 --- 2.06e-17 6.48e-17 ok
1088 --- 0.2568 --- 1.23e-17 6.68e-17 ok
1088 --- 0.2568 --- 6.84e-08 6.70e-06 failed
1088 --- 0.2593 --- 1.53e-17 6.51e-17 ok
1088 --- 0.2554 --- 1.44e-17 6.87e-17 ok
2112 --- 0.6574 --- 7.33e-18 6.04e-17 ok
2112 --- 0.6583 --- 1.59e-17 6.76e-17 ok
2112 --- 0.6598 --- 2.77e-18 6.42e-17 ok
2112 --- 0.7249 --- 4.79e-18 6.50e-17 ok
2112 --- 0.6582 --- 2.22e-18 6.14e-17 ok
3136 --- 1.3959 --- 2.54e-17 5.45e-17 ok
3136 --- 1.3857 --- 1.01e-17 5.70e-17 ok
3136 --- 1.3820 --- 2.37e-17 6.19e-17 ok
3136 --- 1.4300 --- 7.85e-18 5.51e-17 ok
3136 --- 1.3825 --- 3.57e-18 5.59e-17 ok
4160 --- 1.9709 --- 4.23e-09 1.04e-06 failed
4160 --- 1.9827 --- 1.12e-17 5.23e-17 ok
4160 --- 1.9705 --- 1.75e-17 5.57e-17 ok
4160 --- 1.9741 --- 2.12e-17 5.72e-17 ok
4160 --- 1.9651 --- 4.50e-18 5.36e-17 ok
5184 --- 3.0915 --- 1.88e-17 5.66e-17 ok
5184 --- 2.6725 --- 1.28e-17 5.83e-17 ok
5184 --- 2.6832 --- 1.85e-17 5.31e-17 ok
5184 --- 2.8734 --- 2.62e-09 2.13e-07 failed
5184 --- 2.6751 --- 7.31e-07 6.00e-05 failed
6208 --- 3.5111 --- 2.44e-08 2.99e-06 failed
6208 --- 3.5439 --- 5.87e-18 6.15e-17 ok
6208 --- 3.6768 --- 1.36e-17 5.46e-17 ok
6208 --- 3.7247 --- 1.86e-17 5.37e-17 ok
6208 --- 3.6468 --- 7.55e-08 3.33e-05 failed
7232 --- 4.9060 --- 1.08e-10 1.57e-08 failed
7232 --- 4.6172 --- 2.42e-17 5.82e-17 ok
7232 --- 4.5373 --- 1.54e-17 5.51e-17 ok
7232 --- 4.5184 --- 5.64e-18 5.55e-17 ok
7232 --- 4.5125 --- 1.32e-17 5.80e-17 ok
8256 --- 5.1735 --- 1.16e-17 6.23e-17 ok
8256 --- 5.4694 --- 3.05e-09 3.75e-07 failed
8256 --- 5.3396 --- 3.98e-10 2.29e-07 failed
8256 --- 5.7306 --- 4.29e-07 7.17e-05 failed
8256 --- 5.4754 --- 5.26e-10 1.59e-07 failed
9280 --- 6.4114 --- 3.92e-09 7.72e-07 failed
9280 --- 6.7893 --- 2.25e-08 2.22e-06 failed
9280 --- 6.1874 --- 2.92e-10 7.02e-08 failed
9280 --- 6.2186 --- 3.26e-08 1.51e-05 failed
9280 --- 6.4704 --- 4.46e-10 6.61e-08 failed
10304 --- 7.5922 --- 1.18e-07 4.28e-05 failed
10304 --- 7.3870 --- 3.73e-09 5.19e-07 failed
10304 --- 7.6400 --- 1.01e-07 2.91e-05 failed
10304 --- 7.2780 --- 4.08e-08 7.18e-06 failed
10304 --- 7.3946 --- 5.36e-10 1.48e-07 failed
30000 --- 66.6994 --- 2.76e-11 1.57e-08 failed
30000 --- 67.4788 --- 5.08e-18 6.51e-17 ok
30000 --- 63.7498 --- 7.73e-08 1.96e-05 failed
50000 --- 213.1231 --- 1.73e-12 4.78e-09 failed
50000 --- 210.9395 --- 1.36e-08 3.70e-06 failed
50000 --- 207.4701 --- 2.51e-10 9.76e-08 failed
70000 --- 488.2336 --- 7.23e-08 2.34e-05 failed
70000 --- 478.4960 --- 9.65e-09 2.84e-06 failed
Interestingly, while it takes only a few minutes to solve the eigenproblem,
it takes a few hours to check its error.
Why is lapackf77_dsyt21 so slow?
Also, are there any methods to refine the result (reduce the error)
after running dsyevd?
Thank you
Re: Limitations on precision
Posted: Wed Mar 11, 2020 1:04 pm
by Stan Tomov
Some of these errors seem to be large and inconsistent. This is what I get on one of our systems with a V100 and an Intel CPU.
[tomov@a04 testing]$ ./testing_dsyevd -JV --niter 5 -c -l -n 7000
% MAGMA 2.5.2 svn compiled for CUDA capability >= 7.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 9020, driver 10010. OpenMP threads 20. MKL 2017.0.1, MKL threads 20.
% device 0: Tesla V100-PCIE-16GB, 1380.0 MHz clock, 16130.5 MiB memory, capability 7.0
% Wed Mar 11 12:47:04 2020
% Usage: ./testing_dsyevd [options] [-h|--help]
% jobz = Vectors needed, uplo = Lower, ngpu = 1
% N CPU Time (sec) GPU Time (sec) |S-S_magma| |A-USU^H| |I-U^H U|
%============================================================================
7000 12.2292 4.6496 4.83e-19 4.08e-18 4.39e-17 ok
7000 12.2834 4.6451 1.11e-19 6.59e-18 4.20e-17 ok
These errors are what we expect in double precision. Did you by any chance modify the code, e.g., removing the scaling or changing the input matrices?
Here lapackf77_dsyt21 took 40 seconds. It is slower than the CPU dsyevd because of the way the norms are computed: if you look at the code, the computation is done through rank-1 and rank-2 updates. Your times seem to be considerably larger. In this experiment I used MKL on the CPU. What BLAS/LAPACK are you using on the CPU?
Related to refinement, here are some relevant papers:
http://www.netlib.org/utk/people/JackDo ... sicedr.pdf
http://www.netlib.org/utk/people/JackDo ... values.pdf
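On the remark above about rank-1 updates: the following is a minimal plain-Python sketch, not the LAPACK source, of forming the residual A - U S U^T one rank-1 update at a time versus as a single product. The function names and the 2-by-2 test matrix are made up for illustration. Both variants do the same arithmetic; the usual explanation for the speed difference in practice is that a sequence of rank-1 updates maps to memory-bound BLAS-2 calls, while the single product maps to a BLAS-3 call, and how large that gap is depends on the BLAS library in use.

```python
# Illustrative sketch only: a real symmetric A with eigenvalues s and
# orthonormal eigenvectors in the columns of U (hypothetical toy data).

def residual_rank1(A, U, s):
    """A - sum_k s[k] * u_k u_k^T, applied one rank-1 update at a time."""
    n = len(A)
    R = [row[:] for row in A]
    for k in range(n):                 # one rank-1 update per eigenpair
        for i in range(n):
            scaled = s[k] * U[i][k]
            for j in range(n):
                R[i][j] -= scaled * U[j][k]
    return R

def residual_matmul(A, U, s):
    """A - U diag(s) U^T, formed entry by entry as one product."""
    n = len(A)
    return [[A[i][j] - sum(U[i][k] * s[k] * U[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]

r = 0.5 ** 0.5
A = [[2.0, 1.0], [1.0, 2.0]]           # eigenvalues 1 and 3
U = [[r, r], [-r, r]]                  # columns are the eigenvectors
s = [1.0, 3.0]
print(residual_rank1(A, U, s))
print(residual_matmul(A, U, s))
```

Both calls should print a matrix whose entries are zero up to rounding.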
Re: Limitations on precision
Posted: Wed Mar 11, 2020 3:48 pm
by generalzod
Thank you for sharing your results, Mr. Tomov.
I did not alter the testing code (testing_dsyevd.cpp).
lapackf77_dsyt21 for a matrix with N=7000 doesn't take that long; I guess it's similar to yours. But
with N=70k, I think it takes around 10 hours to finish.
I am using P100s in an IBM POWER8 system.
My MAGMA build uses the latest OpenBLAS, without IBM MASS or the IBM XL compiler; it was just compiled with gcc and gfortran.
Re: Limitations on precision
Posted: Wed Mar 11, 2020 4:54 pm
by Stan Tomov
You may also want to try the 2-stage reduction algorithms, e.g.
./testing_dsyevdx_2stage -JV --niter 2 -n 7000
These are much faster, especially for the large sizes that you are targeting.
Using multiple GPUs may also help (add the "--ngpu 4" option).
You can also try ESSL. There is a make.inc example for that ("make.inc.power9-essl") that you may have to modify.