GPU worse than CPU
Posted: Wed Oct 18, 2017 12:32 pm
by thanasis_giannis
Hello,
I have a piece of code that originally runs with MKL. I replaced the MKL calls with MAGMA ones, and the timing got worse. The function I am timing has no data transfers. I really don't know what might have gone wrong. Also, I am new to this :D
Re: GPU worse than CPU
Posted: Wed Oct 18, 2017 5:46 pm
by mgates3
Please provide more details. What function? What size matrix? What was MKL's performance and MAGMA's performance? What is your computer hardware?
The input & output of one of MAGMA's testers provides much of this, e.g., on a machine with two 8-core Intel E5-2670 (Sandy Bridge):
Code:
bunsen magma/testing> ./testing_dgetrf -n 4000 --lapack
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 16. MKL 11.3.0, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Wed Oct 18 17:42:42 2017
% Usage: ./testing_dgetrf [options] [-h|--help]
% ngpu 1, version 1
% M N CPU Gflop/s (sec) GPU Gflop/s (sec) |PA-LU|/(N*|A|)
%========================================================================
4000 4000 145.91 ( 0.29) 351.77 ( 0.12) ---
Re: GPU worse than CPU
Posted: Thu Oct 19, 2017 9:39 am
by thanasis_giannis
Well... maybe I have found something. According to Google, functions like axpy are likely to be slower on the GPU; something to do with memory. My code is full of functions like axpy, so I guess that's it...
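(For reference, the "something to do with memory" point can be made concrete: axpy is memory-bound because it does very little arithmetic per byte it moves. A quick back-of-the-envelope sketch, assuming 8-byte doubles; this is not MAGMA code, just the arithmetic:)

```python
# daxpy computes y = alpha*x + y: per element it reads x[i], reads y[i],
# writes y[i], and performs one multiply and one add.
def axpy_arithmetic_intensity(n, bytes_per_elem=8):
    flops = 2 * n                          # 1 mul + 1 add per element
    bytes_moved = 3 * n * bytes_per_elem   # read x, read y, write y
    return flops / bytes_moved

# About 1/12 flop per byte, independent of n -- far below what any modern
# CPU or GPU needs to keep its arithmetic units busy, so axpy performance
# is set almost entirely by memory bandwidth.
print(axpy_arithmetic_intensity(10_000))
```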
Re: GPU worse than CPU
Posted: Fri Oct 20, 2017 8:28 pm
by mgates3
It all depends on the problem size. GPUs have faster memory, so even an axpy can be faster on the GPU, but the vectors would have to be rather large. For example:
Code:
bunsen magma/testing> ./testing_daxpy -n 123 -n 1234 -n 1000:20000:1000
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 16. MKL 11.3.0, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Fri Oct 20 20:27:42 2017
% Usage: ./testing_daxpy [options] [-h|--help]
% M cnt cuBLAS Gflop/s (ms) CPU Gflop/s (ms) cuBLAS error
%===========================================================================
123 100 0.0401 ( 0.6130) 0.3155 ( 0.0780) 0.00e+00 ok
1234 100 0.4149 ( 0.5949) 1.3479 ( 0.1831) 0.00e+00 ok
1000 100 0.3343 ( 0.5982) 1.3422 ( 0.1490) 0.00e+00 ok
2000 100 0.6792 ( 0.5889) 1.3707 ( 0.2918) 0.00e+00 ok
3000 100 1.0030 ( 0.5982) 1.1382 ( 0.5271) 0.00e+00 ok
4000 100 1.2965 ( 0.6170) 1.3843 ( 0.5779) 0.00e+00 ok
5000 100 1.7239 ( 0.5801) 1.5336 ( 0.6521) 0.00e+00 ok
6000 100 1.8491 ( 0.6490) 1.6783 ( 0.7150) 0.00e+00 ok
7000 100 2.2507 ( 0.6220) 1.7327 ( 0.8080) 0.00e+00 ok
8000 100 2.4236 ( 0.6602) 1.9185 ( 0.8340) 0.00e+00 ok
9000 100 2.7109 ( 0.6640) 1.9524 ( 0.9220) 0.00e+00 ok
10000 100 3.1150 ( 0.6421) 2.0515 ( 0.9749) 0.00e+00 ok
11000 100 3.3180 ( 0.6630) 2.1091 ( 1.0431) 0.00e+00 ok
12000 100 3.4676 ( 0.6921) 2.1960 ( 1.0929) 0.00e+00 ok
13000 100 3.5861 ( 0.7250) 2.2071 ( 1.1780) 0.00e+00 ok
14000 100 3.9056 ( 0.7169) 2.3488 ( 1.1921) 0.00e+00 ok
15000 100 4.1094 ( 0.7300) 2.2558 ( 1.3299) 0.00e+00 ok
16000 100 4.1930 ( 0.7632) 2.3054 ( 1.3881) 0.00e+00 ok
17000 100 4.4859 ( 0.7579) 2.3256 ( 1.4620) 0.00e+00 ok
18000 100 4.7245 ( 0.7620) 2.3530 ( 1.5299) 0.00e+00 ok
19000 100 4.7506 ( 0.7999) 2.4517 ( 1.5500) 0.00e+00 ok
20000 100 5.1213 ( 0.7811) 2.4347 ( 1.6429) 0.00e+00 ok
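The pattern above (GPU time nearly flat until the vectors get large, CPU winning at small sizes) fits a simple model: fixed per-call overhead plus memory traffic divided by bandwidth. A rough sketch of where the crossover lands; the bandwidth and overhead figures below are illustrative assumptions, not measured values for these machines:

```python
def axpy_time_s(n, bandwidth_gbs, overhead_us):
    """Estimated time for one daxpy: fixed call overhead plus 3*8*n bytes of traffic."""
    bytes_moved = 3 * 8 * n   # read x, read y, write y, 8-byte doubles
    return overhead_us * 1e-6 + bytes_moved / (bandwidth_gbs * 1e9)

# Illustrative figures: GPU with higher bandwidth but higher per-call
# overhead (kernel launch), CPU with lower bandwidth but cheap calls.
GPU = dict(bandwidth_gbs=200.0, overhead_us=6.0)
CPU = dict(bandwidth_gbs=40.0, overhead_us=0.5)

# Find the first size where the GPU estimate wins.
for n in range(1000, 200001, 1000):
    if axpy_time_s(n, **GPU) < axpy_time_s(n, **CPU):
        print(f"GPU faster from roughly n = {n}")
        break
```

With these assumed numbers the crossover is in the range of ten thousand elements or so, which is the same order as the tester output above; the exact point depends entirely on the actual bandwidths and overheads of your hardware.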