Performance issue
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Performance issue
Hi guys,
I am desperately trying to post a message, but it is flagged as spam and blocked. There is nothing fancy in it, only testing logs and a few questions. Can an admin PM me so that I can send the message?
Re: Performance issue
[posting on behalf of mathieu, due to issues with spam filter]
Hello,
I recently installed MAGMA 2.2 with MKL and CUDA 8.0 on my laptop running Mac OS X 10.11 and ran several tests. The very first check I wanted to make was the matrix-matrix product; the outputs of the single and double precision gemm tests are below.
The single precision results are fairly consistent, though I don't understand why MAGMA with its hybrid strategy doesn't beat cuBLAS, since the CPU results are clearly the best here. What I also don't understand is why MAGMA cannot allocate two 9280x9280 float matrices, which would fit into 700 MB.
Concerning the dgemm test, I have the same remark about memory, since MAGMA struggles to allocate two matrices that would fit into 620 MB. But my second concern is the drop in performance for double precision computation. What is wrong with MAGMA and CUDA?
Code: Select all
./testing_sgemm --lapack
MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
Wed Dec 14 23:36:49 2016
Usage: ./testing_sgemm [options] [-h|--help]
If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
transA = No transpose, transB = No transpose
M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
=======================================================================================================
1088 1088 1088 80.71 ( 31.91) 131.39 ( 19.60) 286.65 ( 8.99) 1.09e-08 1.09e-08 ok
2112 2112 2112 96.75 ( 194.74) 220.56 ( 85.43) 371.37 ( 50.73) 1.04e-08 1.04e-08 ok
3136 3136 3136 150.92 ( 408.71) 271.30 ( 227.36) 381.54 ( 161.67) 1.12e-08 1.12e-08 ok
4160 4160 4160 184.68 ( 779.63) 341.58 ( 421.52) 393.01 ( 366.35) 1.01e-08 1.01e-08 ok
5184 5184 5184 215.71 (1291.69) 349.91 ( 796.28) 391.74 ( 711.26) 1.16e-08 1.16e-08 ok
6208 6208 6208 212.68 (2249.84) 350.84 (1363.88) 358.48 (1334.80) 1.13e-08 1.13e-08 ok
7232 7232 7232 212.53 (3559.40) 351.36 (2153.07) 320.07 (2363.54) 1.06e-08 1.06e-08 ok
8256 8256 8256 213.53 (5270.77) 351.72 (3199.91) 287.80 (3910.63) 1.00e-08 1.00e-08 ok
Error: magma_smalloc( &dC, lddc*N )
failed at testing/testing_sgemm.cpp:130: error -113: cannot allocate memory on GPU device
./testing_dgemm --lapack
MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
Wed Dec 14 23:37:57 2016
Usage: ./testing_dgemm [options] [-h|--help]
If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
transA = No transpose, transB = No transpose
M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
=======================================================================================================
1088 1088 1088 11.27 ( 228.61) 18.15 ( 141.93) 158.61 ( 16.24) 2.03e-17 2.03e-17 ok
2112 2112 2112 20.55 ( 916.89) 26.77 ( 703.71) 183.64 ( 102.60) 1.93e-17 1.93e-17 ok
3136 3136 3136 25.98 (2374.10) 27.52 (2241.02) 182.26 ( 338.43) 2.09e-17 2.09e-17 ok
4160 4160 4160 25.93 (5552.00) 27.52 (5232.50) 166.53 ( 864.62) 1.88e-17 1.88e-17 ok
5184 5184 5184 37.66 (7398.38) 30753976.79 ( 0.01) 150.41 (1852.44) 1.04e-02 1.04e-02 failed
Error: magma_dmalloc( &dA, ldda*An )
failed at testing/testing_dgemm.cpp:128: error -113: cannot allocate memory on GPU device
Re: Performance issue
You asked a few different questions.
1) MAGMA vs. cuBLAS gemm performance. For BLAS operations, MAGMA BLAS operates entirely on the GPU, not as a hybrid. The MAGMA gemm hasn't been updated since around 2010, because NVIDIA has hand-optimized the cuBLAS gemm for the newer NVIDIA architectures (Kepler and later).
MAGMA uses hybrid algorithms for higher-level functions such as factoring a matrix or solving a linear system (getrf, gesv, etc.).
2) Memory use. The gemm tester allocates 3 matrices (not 2), so m = n = k = 9280 is about 985 MiB in single precision. That should fit in the 2047.6 MiB on your GPU, but the system may have other things allocated. Are you running other programs, especially anything with graphics? Unfortunately there's no nvidia-smi on Mac OS X to see which processes are on the GPU, that I'm aware of.
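That figure can be checked with quick arithmetic. This is just a sketch: it ignores the leading-dimension padding the tester actually uses, so the real footprint is slightly larger.

```python
def gemm_device_mib(n, bytes_per_element):
    """Approximate device memory for the gemm tester's dA, dB, dC:
    three square n-by-n matrices (leading-dimension padding ignored)."""
    return 3 * n * n * bytes_per_element / 2**20

single = gemm_device_mib(9280, 4)  # ~985 MiB in single precision
double = gemm_device_mib(9280, 8)  # double precision needs twice that
```

Note that the same size in double precision needs nearly all of the 2047.6 MiB on the card, which is why the dgemm test runs out of memory at a smaller n.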
On my laptop with an identical GPU, it can go larger:
Code: Select all
magma/testing> ./testing_sgemm -n 1088:14000:1024
% MAGMA 2.1.0 svn compiled for CUDA capability >= 3.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. MAGMA not compiled with OpenMP.
% device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
% Thu Dec 15 16:38:33 2016
% Usage: ./testing_sgemm [options] [-h|--help]
% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
% transA = No transpose, transB = No transpose
% M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
%========================================================================================================
1088 1088 1088 80.46 ( 32.01) 120.20 ( 21.43) --- ( --- ) 0.00e+00 --- ok
2112 2112 2112 99.13 ( 190.06) 219.11 ( 85.99) --- ( --- ) 0.00e+00 --- ok
3136 3136 3136 148.30 ( 415.92) 274.01 ( 225.11) --- ( --- ) 0.00e+00 --- ok
4160 4160 4160 192.81 ( 746.74) 344.92 ( 417.43) --- ( --- ) 0.00e+00 --- ok
5184 5184 5184 218.82 (1273.30) 349.83 ( 796.46) --- ( --- ) 0.00e+00 --- ok
6208 6208 6208 217.59 (2199.11) 350.66 (1364.57) --- ( --- ) 0.00e+00 --- ok
7232 7232 7232 216.16 (3499.61) 350.86 (2156.13) --- ( --- ) 0.00e+00 --- ok
8256 8256 8256 220.95 (5093.75) 351.54 (3201.62) --- ( --- ) 0.00e+00 --- ok
9280 9280 9280 218.79 (7305.48) 351.27 (4550.16) --- ( --- ) 0.00e+00 --- ok
10304 10304 10304 217.38 (10065.56) 351.28 (6228.61) --- ( --- ) 0.00e+00 --- ok
11328 11328 11328 219.41 (13250.71) 351.27 (8276.44) --- ( --- ) 0.00e+00 --- ok
Error: magma_smalloc( &dC, lddc*N )
failed at testing/testing_sgemm.cpp:130: error -113: cannot allocate memory on GPU device
3) Concerning double precision performance, the GeForce series are aimed at desktop graphics, which uses single precision. The hardware does not have fast double precision. See https://en.wikipedia.org/wiki/GeForce_7 ... s#Products
The Tesla series are aimed at high performance computing, with fast double precision.
-mark
Re: Performance issue
BTW: you can speed up compiling MAGMA and generate a smaller MAGMA library by setting GPU_TARGET = Kepler or sm30 in the make.inc file, if you are using this MAGMA library only on that GPU. It looks like you used the default, which also compiles for Fermi. Nothing wrong with that; it just compiles extra code.
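For reference, the relevant line in make.inc would look something like this (Kepler matches the GT 750M's compute capability 3.0 shown in the logs above):

```make
# make.inc -- build only for the GT 750M's architecture (sm_30)
GPU_TARGET = Kepler
```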
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Thank you Mark for your reply. Everything is clear now, and thanks for the tip concerning compilation. It looks like my setup needs an upgrade if I want to achieve decent GPU acceleration. It's a shame there's no nvidia-smi equivalent on Mac OS X; that, plus the lack of visibility regarding a Mac Pro update, makes Apple the worst HPC platform of the moment.
I tried to run the sgemm tests after a fresh reboot. I reached the 10304 step and then the program silently stopped; the test is reported as ok, but it ended prematurely.
Code: Select all
./testing_sgemm
% MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
% device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
% Fri Dec 16 10:31:57 2016
% Usage: ./testing_sgemm [options] [-h|--help]
% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
% transA = No transpose, transB = No transpose
% M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
%========================================================================================================
1088 1088 1088 82.96 ( 31.05) 133.63 ( 19.28) --- ( --- ) 0.00e+00 --- ok
2112 2112 2112 102.46 ( 183.89) 223.87 ( 84.16) --- ( --- ) 0.00e+00 --- ok
3136 3136 3136 148.15 ( 416.35) 272.00 ( 226.77) --- ( --- ) 0.00e+00 --- ok
4160 4160 4160 180.84 ( 796.18) 336.91 ( 427.37) --- ( --- ) 0.00e+00 --- ok
5184 5184 5184 218.07 (1277.68) 350.11 ( 795.82) --- ( --- ) 0.00e+00 --- ok
6208 6208 6208 217.46 (2200.47) 350.82 (1363.97) --- ( --- ) 0.00e+00 --- ok
7232 7232 7232 215.78 (3505.84) 351.32 (2153.28) --- ( --- ) 0.00e+00 --- ok
8256 8256 8256 220.77 (5097.91) 351.73 (3199.88) --- ( --- ) 0.00e+00 --- ok
9280 9280 9280 218.53 (7314.02) 351.90 (4542.12) --- ( --- ) 0.00e+00 --- ok
10304 10304 10304 239.27 (9144.47) 269915940.32 ( 0.01) --- ( --- ) 0.00e+00 --- ok
Re: Performance issue
The default range is -n 1088:10304:1024, which it completed. My tests specified a larger range, -n 1088:14000:1024.
Though it does look like there was some issue with your cuBLAS sgemm at n=10304. Notice the short time (0.01 s). If run with the -c flag to check the results, it should discover the error. (We don't check accuracy by default for efficiency.) Not sure why that error would occur; possibly cuBLAS internally ran out of memory.
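A quick sanity check makes the bogus entry obvious. A dense gemm performs 2*m*n*k floating-point operations, so the reported time directly determines the rate; this sketch (not MAGMA code) uses the numbers from the logs above.

```python
def gemm_gflops(m, n, k, time_ms):
    """Gflop/s for a dense gemm: 2*m*n*k flops over the given time."""
    return 2.0 * m * n * k / (time_ms * 1e-3) / 1e9

# The suspicious cuBLAS entry: n = 10304 finishing in 0.01 ms implies a
# rate of hundreds of millions of Gflop/s, far beyond any GPU, so the
# call cannot actually have run to completion.
bogus = gemm_gflops(10304, 10304, 10304, 0.01)

# A plausible entry from the earlier run for comparison:
# n = 10304 in 6228.61 ms gives roughly 351 Gflop/s.
plausible = gemm_gflops(10304, 10304, 10304, 6228.61)
```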
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Thanks Mark. Could it come from MAGMA not deallocating memory after its computation?
Re: Performance issue
The tester calls magmablas_sgemm and magma_sgemm (a wrapper around cublasSgemm) with the same dA, dB, dC matrices, which are allocated once for each m, n, k size and are all freed afterward. magmablas_sgemm doesn't do any internal allocation.
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Alright, so cuBLAS does. Thanks