Performance issue
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Performance issue
Hi guys,
I am desperately trying to post a message, but it is flagged as spam and blocked. There is nothing fancy in it, only testing logs and a few questions. Can an admin PM me so that I can send the message?
Re: Performance issue
[posting on behalf of mathieu, due to issues with spam filter]
Hello,
I recently installed MAGMA 2.2 with MKL and CUDA 8.0 on my laptop running Mac OS X 10.11 and ran several tests. The very first check I wanted to make was the matrix-matrix product; the outputs of the single and double precision gemm tests are below.
The single precision results are fairly consistent, though I don't understand why MAGMA with its hybrid strategy doesn't beat cuBLAS, since the CPU results are clearly the best here. What I also don't understand is why MAGMA cannot allocate two 9280x9280 float matrices, which would fit into 700 MB.
Concerning the dgemm test, I have the same remark about memory, since MAGMA struggles to allocate two matrices that would fit into 620 MB. But my second concern is the drop in performance for double precision computation. What is wrong with MAGMA and CUDA?
Code: Select all
./testing_sgemm --lapack
MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
Wed Dec 14 23:36:49 2016
Usage: ./testing_sgemm [options] [-h|--help]
If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
transA = No transpose, transB = No transpose
M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
=======================================================================================================
1088 1088 1088 80.71 ( 31.91) 131.39 ( 19.60) 286.65 ( 8.99) 1.09e-08 1.09e-08 ok
2112 2112 2112 96.75 ( 194.74) 220.56 ( 85.43) 371.37 ( 50.73) 1.04e-08 1.04e-08 ok
3136 3136 3136 150.92 ( 408.71) 271.30 ( 227.36) 381.54 ( 161.67) 1.12e-08 1.12e-08 ok
4160 4160 4160 184.68 ( 779.63) 341.58 ( 421.52) 393.01 ( 366.35) 1.01e-08 1.01e-08 ok
5184 5184 5184 215.71 (1291.69) 349.91 ( 796.28) 391.74 ( 711.26) 1.16e-08 1.16e-08 ok
6208 6208 6208 212.68 (2249.84) 350.84 (1363.88) 358.48 (1334.80) 1.13e-08 1.13e-08 ok
7232 7232 7232 212.53 (3559.40) 351.36 (2153.07) 320.07 (2363.54) 1.06e-08 1.06e-08 ok
8256 8256 8256 213.53 (5270.77) 351.72 (3199.91) 287.80 (3910.63) 1.00e-08 1.00e-08 ok
Error: magma_smalloc( &dC, lddc*N )
failed at testing/testing_sgemm.cpp:130: error -113: cannot allocate memory on GPU device
./testing_dgemm --lapack
MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
Wed Dec 14 23:37:57 2016
Usage: ./testing_dgemm [options] [-h|--help]
If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
transA = No transpose, transB = No transpose
M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
=======================================================================================================
1088 1088 1088 11.27 ( 228.61) 18.15 ( 141.93) 158.61 ( 16.24) 2.03e-17 2.03e-17 ok
2112 2112 2112 20.55 ( 916.89) 26.77 ( 703.71) 183.64 ( 102.60) 1.93e-17 1.93e-17 ok
3136 3136 3136 25.98 (2374.10) 27.52 (2241.02) 182.26 ( 338.43) 2.09e-17 2.09e-17 ok
4160 4160 4160 25.93 (5552.00) 27.52 (5232.50) 166.53 ( 864.62) 1.88e-17 1.88e-17 ok
5184 5184 5184 37.66 (7398.38) 30753976.79 ( 0.01) 150.41 (1852.44) 1.04e-02 1.04e-02 failed
Error: magma_dmalloc( &dA, ldda*An )
failed at testing/testing_dgemm.cpp:128: error -113: cannot allocate memory on GPU device
Re: Performance issue
You asked a few different questions.
1) MAGMA vs. cuBLAS gemm performance. For BLAS operations, MAGMA BLAS operates entirely on the GPU, not as a hybrid. The MAGMA gemm hasn't been updated since around 2010, because NVIDIA has hand-optimized the cuBLAS gemm for the newer NVIDIA architectures (Kepler and later).
MAGMA uses hybrid algorithms for higher-level functions such as factoring a matrix or solving a linear system (getrf, gesv, etc.).
2) Memory use. The gemm tester allocates 3 matrices (not 2), so m = n = k = 9280 is about 985 MiB in single precision. That should fit in the 2047.6 MiB on your GPU, but the system may have other things allocated. Are you running other programs, especially anything with graphics? Unfortunately there's no nvidia-smi on Mac OS X to see which processes are on the GPU, that I'm aware of.
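That figure can be checked with quick arithmetic. This is just a sketch: it ignores the leading-dimension padding the tester actually uses, so the real footprint is slightly larger.

```python
def gemm_device_mib(n, bytes_per_element):
    """Approximate device memory for the gemm tester's dA, dB, dC:
    three square n-by-n matrices (leading-dimension padding ignored)."""
    return 3 * n * n * bytes_per_element / 2**20

single = gemm_device_mib(9280, 4)  # ~985 MiB in single precision
double = gemm_device_mib(9280, 8)  # double precision needs twice that
```

Note that the same size in double precision needs nearly all of the 2047.6 MiB on the card, which is why the dgemm test runs out of memory at a smaller n.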
On my laptop with an identical GPU, it can go larger:
Code: Select all
magma/testing> ./testing_sgemm -n 1088:14000:1024
% MAGMA 2.1.0 svn compiled for CUDA capability >= 3.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. MAGMA not compiled with OpenMP.
% device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
% Thu Dec 15 16:38:33 2016
% Usage: ./testing_sgemm [options] [-h|--help]
% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
% transA = No transpose, transB = No transpose
% M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
%========================================================================================================
1088 1088 1088 80.46 ( 32.01) 120.20 ( 21.43) --- ( --- ) 0.00e+00 --- ok
2112 2112 2112 99.13 ( 190.06) 219.11 ( 85.99) --- ( --- ) 0.00e+00 --- ok
3136 3136 3136 148.30 ( 415.92) 274.01 ( 225.11) --- ( --- ) 0.00e+00 --- ok
4160 4160 4160 192.81 ( 746.74) 344.92 ( 417.43) --- ( --- ) 0.00e+00 --- ok
5184 5184 5184 218.82 (1273.30) 349.83 ( 796.46) --- ( --- ) 0.00e+00 --- ok
6208 6208 6208 217.59 (2199.11) 350.66 (1364.57) --- ( --- ) 0.00e+00 --- ok
7232 7232 7232 216.16 (3499.61) 350.86 (2156.13) --- ( --- ) 0.00e+00 --- ok
8256 8256 8256 220.95 (5093.75) 351.54 (3201.62) --- ( --- ) 0.00e+00 --- ok
9280 9280 9280 218.79 (7305.48) 351.27 (4550.16) --- ( --- ) 0.00e+00 --- ok
10304 10304 10304 217.38 (10065.56) 351.28 (6228.61) --- ( --- ) 0.00e+00 --- ok
11328 11328 11328 219.41 (13250.71) 351.27 (8276.44) --- ( --- ) 0.00e+00 --- ok
Error: magma_smalloc( &dC, lddc*N )
failed at testing/testing_sgemm.cpp:130: error -113: cannot allocate memory on GPU device
3) Concerning double precision performance, the GeForce series are aimed at desktop graphics, which uses single precision. The hardware does not have fast double precision. See https://en.wikipedia.org/wiki/GeForce_7 ... s#Products
The Tesla series are aimed at high performance computing, with fast double precision.
-mark
Re: Performance issue
BTW: you can speed up compiling MAGMA and generate a smaller MAGMA library by setting GPU_TARGET = Kepler or sm30 in the make.inc file, if you are using this MAGMA library only on that GPU. It looks like you used the default, which also compiles for Fermi. Nothing wrong with that; it just compiles extra code.
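For reference, the relevant line in make.inc would look something like this (Kepler matches the GT 750M's compute capability 3.0 shown in the logs above):

```make
# make.inc -- build only for the GT 750M's architecture (sm_30)
GPU_TARGET = Kepler
```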
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Thank you Mark for your reply. Everything is clear now, and thanks for the tip concerning compilation. It looks like my setup needs an upgrade if I want to achieve decent GPU acceleration. It's a shame there's no nvidia-smi equivalent on Mac OS X; that, plus the lack of visibility regarding a Mac Pro update, makes Apple the worst HPC platform of the moment.
I tried to run the sgemm tests after a fresh reboot. I reached the 10304 step and then the program silently stopped; the test is reported as ok, but it ended prematurely.
Code: Select all
./testing_sgemm
% MAGMA 2.2.0 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 8000, driver 8000. MAGMA not compiled with OpenMP. MKL 2017.0.0, MKL threads 4.
% device 0: GeForce GT 750M, 925.5 MHz clock, 2047.6 MiB memory, capability 3.0
% Fri Dec 16 10:31:57 2016
% Usage: ./testing_sgemm [options] [-h|--help]
% If running lapack (option --lapack), MAGMA and cuBLAS error are both computed
% relative to CPU BLAS result. Else, MAGMA error is computed relative to cuBLAS result.
% transA = No transpose, transB = No transpose
% M N K MAGMA Gflop/s (ms) cuBLAS Gflop/s (ms) CPU Gflop/s (ms) MAGMA error cuBLAS error
%========================================================================================================
1088 1088 1088 82.96 ( 31.05) 133.63 ( 19.28) --- ( --- ) 0.00e+00 --- ok
2112 2112 2112 102.46 ( 183.89) 223.87 ( 84.16) --- ( --- ) 0.00e+00 --- ok
3136 3136 3136 148.15 ( 416.35) 272.00 ( 226.77) --- ( --- ) 0.00e+00 --- ok
4160 4160 4160 180.84 ( 796.18) 336.91 ( 427.37) --- ( --- ) 0.00e+00 --- ok
5184 5184 5184 218.07 (1277.68) 350.11 ( 795.82) --- ( --- ) 0.00e+00 --- ok
6208 6208 6208 217.46 (2200.47) 350.82 (1363.97) --- ( --- ) 0.00e+00 --- ok
7232 7232 7232 215.78 (3505.84) 351.32 (2153.28) --- ( --- ) 0.00e+00 --- ok
8256 8256 8256 220.77 (5097.91) 351.73 (3199.88) --- ( --- ) 0.00e+00 --- ok
9280 9280 9280 218.53 (7314.02) 351.90 (4542.12) --- ( --- ) 0.00e+00 --- ok
10304 10304 10304 239.27 (9144.47) 269915940.32 ( 0.01) --- ( --- ) 0.00e+00 --- ok
Re: Performance issue
The default range is -n 1088:10304:1024, which it completed. My tests specified a larger range, -n 1088:14000:1024.
Though it does look like there was some issue with your cuBLAS sgemm at n=10304. Notice the short time (0.01 s). If run with the -c flag to check the results, it should discover the error. (We don't check accuracy by default for efficiency.) Not sure why that error would occur; possibly cuBLAS internally ran out of memory.
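A quick sanity check makes the bogus entry obvious. A dense gemm performs 2*m*n*k floating-point operations, so the reported time directly determines the rate; this sketch (not MAGMA code) uses the numbers from the logs above.

```python
def gemm_gflops(m, n, k, time_ms):
    """Gflop/s for a dense gemm: 2*m*n*k flops over the given time."""
    return 2.0 * m * n * k / (time_ms * 1e-3) / 1e9

# The suspicious cuBLAS entry: n = 10304 finishing in 0.01 ms implies a
# rate of hundreds of millions of Gflop/s, far beyond any GPU, so the
# call cannot actually have run to completion.
bogus = gemm_gflops(10304, 10304, 10304, 0.01)

# A plausible entry from the earlier run for comparison:
# n = 10304 in 6228.61 ms gives roughly 351 Gflop/s.
plausible = gemm_gflops(10304, 10304, 10304, 6228.61)
```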
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Thanks Mark. Could it come from MAGMA not deallocating memory after its computation?
Re: Performance issue
The tester calls magmablas_sgemm and magma_sgemm (a wrapper around cublasSgemm) with the same dA, dB, dC matrices, which are allocated once for each m, n, k size and are all freed afterward. magmablas_sgemm doesn't do any internal allocation.
-mark
-
mathieu321
- Posts: 4
- Joined: Wed Dec 14, 2016 7:40 pm
Re: Performance issue
Alright, so cuBLAS does. Thanks