Hello,
I have a piece of code that originally runs with MKL. I replaced the MKL calls with MAGMA calls, and the timing got worse. The function I am timing has no data transfers. I really don't know what might be wrong. Also, I am new to this :D
Gpu worse than cpu
Re: Gpu worse than cpu
Please provide more details. What function? What size matrix? What was MKL's performance and MAGMA's performance? What is your computer hardware?
The input & output of one of MAGMA's testers provides much of this, e.g., on a machine with two 8-core Intel E5-2670 (Sandy Bridge):
Code: Select all
bunsen magma/testing> ./testing_dgetrf -n 4000 --lapack
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 16. MKL 11.3.0, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Wed Oct 18 17:42:42 2017
% Usage: ./testing_dgetrf [options] [-h|--help]
% ngpu 1, version 1
% M N CPU Gflop/s (sec) GPU Gflop/s (sec) |PA-LU|/(N*|A|)
%========================================================================
4000 4000 145.91 ( 0.29) 351.77 ( 0.12) ---
thanasis_giannis
Re: Gpu worse than cpu
Well, maybe I have found something. According to Google, functions like axpy are likely to be slower on the GPU; it has something to do with memory. My code is full of functions like axpy, so I guess that's it.
Re: Gpu worse than cpu
It all depends on the problem size. GPUs have faster memory, so even an axpy can be faster on the GPU, but the vectors would have to be rather large. For example:
Code: Select all
bunsen magma/testing> ./testing_daxpy -n 123 -n 1234 -n 1000:20000:1000
% MAGMA 2.2.0 svn compiled for CUDA capability >= 3.5, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 9000. OpenMP threads 16. MKL 11.3.0, MKL threads 16.
% device 0: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% device 1: Tesla K40c, 745.0 MHz clock, 11439.9 MiB memory, capability 3.5
% Fri Oct 20 20:27:42 2017
% Usage: ./testing_daxpy [options] [-h|--help]
% M cnt cuBLAS Gflop/s (ms) CPU Gflop/s (ms) cuBLAS error
%===========================================================================
123 100 0.0401 ( 0.6130) 0.3155 ( 0.0780) 0.00e+00 ok
1234 100 0.4149 ( 0.5949) 1.3479 ( 0.1831) 0.00e+00 ok
1000 100 0.3343 ( 0.5982) 1.3422 ( 0.1490) 0.00e+00 ok
2000 100 0.6792 ( 0.5889) 1.3707 ( 0.2918) 0.00e+00 ok
3000 100 1.0030 ( 0.5982) 1.1382 ( 0.5271) 0.00e+00 ok
4000 100 1.2965 ( 0.6170) 1.3843 ( 0.5779) 0.00e+00 ok
5000 100 1.7239 ( 0.5801) 1.5336 ( 0.6521) 0.00e+00 ok
6000 100 1.8491 ( 0.6490) 1.6783 ( 0.7150) 0.00e+00 ok
7000 100 2.2507 ( 0.6220) 1.7327 ( 0.8080) 0.00e+00 ok
8000 100 2.4236 ( 0.6602) 1.9185 ( 0.8340) 0.00e+00 ok
9000 100 2.7109 ( 0.6640) 1.9524 ( 0.9220) 0.00e+00 ok
10000 100 3.1150 ( 0.6421) 2.0515 ( 0.9749) 0.00e+00 ok
11000 100 3.3180 ( 0.6630) 2.1091 ( 1.0431) 0.00e+00 ok
12000 100 3.4676 ( 0.6921) 2.1960 ( 1.0929) 0.00e+00 ok
13000 100 3.5861 ( 0.7250) 2.2071 ( 1.1780) 0.00e+00 ok
14000 100 3.9056 ( 0.7169) 2.3488 ( 1.1921) 0.00e+00 ok
15000 100 4.1094 ( 0.7300) 2.2558 ( 1.3299) 0.00e+00 ok
16000 100 4.1930 ( 0.7632) 2.3054 ( 1.3881) 0.00e+00 ok
17000 100 4.4859 ( 0.7579) 2.3256 ( 1.4620) 0.00e+00 ok
18000 100 4.7245 ( 0.7620) 2.3530 ( 1.5299) 0.00e+00 ok
19000 100 4.7506 ( 0.7999) 2.4517 ( 1.5500) 0.00e+00 ok
20000 100 5.1213 ( 0.7811) 2.4347 ( 1.6429) 0.00e+00 ok