Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Thu Jul 15, 2010 11:57 pm
Hi,
While running the supplied 'testing_cgetrf_gpu', I observed that a parallel CPU build achieves higher GFLOP/s than the GPU. Parts of the result tables follow:
a) Sequential CPU vs GPU
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 13.74 36.66 4.693558e-09
2048 17.67 50.95 4.678659e-09
3072 18.55 54.45 4.583328e-09
4032 19.20 58.67 4.651947e-09
b) Parallel CPU vs GPU
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 45.28 38.50 4.691306e-09
2048 116.63 57.24 4.679362e-09
3072 128.99 62.58 4.616083e-09
4032 131.83 64.67 4.631509e-09
I was just wondering whether this is to be expected, or whether I am doing something wrong?
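For context on how figures like these are produced: the GFLOP/s columns are typically an operation count divided by the measured time. For complex LU (cgetrf) of an n x n matrix the standard count is roughly (8/3)n^3 flops, i.e. 4x the real-case (2/3)n^3 (a complex multiply is 6 real flops, a complex add is 2). A small sketch, assuming MAGMA's tester uses this standard LAPACK-style count:

```python
# Sketch: how GFLOP/s figures like those in the tables are typically derived.
# Assumes the standard operation count for complex LU (cgetrf) of an
# n x n matrix: roughly (8/3) * n^3 flops (4x the real-case (2/3) * n^3).

def cgetrf_flops(n):
    """Approximate flop count for complex LU factorization of an n x n matrix."""
    return (8.0 / 3.0) * n**3

def gflops(n, seconds):
    """Convert a timed cgetrf run into a GFLOP/s rate."""
    return cgetrf_flops(n) / seconds / 1e9

# Example: the GPU row N=1024 at 36.66 GFlop/s implies a runtime of about
# cgetrf_flops(1024) / 36.66e9 seconds, i.e. roughly 78 ms.
print(round(cgetrf_flops(1024) / 36.66e9 * 1000, 1), "ms")
```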
-
Allan Menezes
- Posts: 14
- Joined: Wed Aug 05, 2009 10:01 pm
Post
by Allan Menezes » Fri Jul 16, 2010 10:46 pm
For both tables, could you please specify how many CPUs you used and which ones (Intel vs. AMD, number of cores), as well as the number and type of GPUs?
Allan
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Sat Jul 17, 2010 7:20 pm
Following is the information about the CPU & the GPU I'm using:
CPU: Two Intel Xeon E5530 quad-cores (2.4 GHz)
GPU: One NVIDIA Tesla C1060
Please let me know if more information is required.
Thanks,
Gaurav
-
Allan Menezes
- Posts: 14
- Joined: Wed Aug 05, 2009 10:01 pm
Post
by Allan Menezes » Tue Jul 20, 2010 7:52 am
Dear Gaurav,
Here are the specs of the Tesla C1060 at URL
http://www.nvidia.com/object/product_te ... 60_us.html
Here are the specs of the Intel E5530 at URL
http://ark.intel.com/Product.aspx?id=37103
As you can see, it peaks at only 78 GFlops in double precision, and you are getting close to that.
As for the Intel Xeon E5530, the per-core double-precision peak is CPU frequency * 4 = 9.6 GFlops without hyperthreading. If it has hyperthreading and that factors in twice, then it is 9.6 * 2 = 19.2 GFlops per core in double precision, which is what you observe in the sequential run, as it uses a single core with hyperthreading. In parallel, the two E5530s present 16 hardware threads with hyperthreading, giving 16 * 2.4 GHz * 4 = 153.6 GFlops double precision; your 131.83 works out to about 86% efficiency for the CPUs in parallel. So your results do make sense, as your GPU is not a Tesla Fermi!
That is all assuming cgetrf_gpu measures double-precision performance. For the CPU, sequential and parallel, I guess Intel MKL is what the 'testing_cgetrf_gpu' routine uses to measure its GFlops, and MAGMA 0.2 for the C1060.
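To make the arithmetic above concrete, here is the same accounting as a small script (the factor of 2 for hyperthreading is the assumption made above; the 4 flops/cycle figure is the SSE double-precision rate per core):

```python
# Peak-rate arithmetic for 2x quad-core Xeon E5530 at 2.4 GHz, following
# the accounting above (4 double-precision flops per core per cycle via
# SSE; the extra factor of 2 is the hyperthreading assumption made above).

def peak_gflops(cores, ghz, flops_per_cycle, factor=1):
    return cores * ghz * flops_per_cycle * factor

per_core   = peak_gflops(1, 2.4, 4)     # 9.6 GFlop/s per core
sequential = peak_gflops(1, 2.4, 4, 2)  # 19.2 GFlop/s, one core with the HT factor
parallel   = peak_gflops(8, 2.4, 4, 2)  # 153.6 GFlop/s, all 8 physical cores
print(per_core, sequential, round(parallel, 1),
      round(131.83 / parallel * 100), "% efficiency")
```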
Hope this helps,
Regards,
Allan
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Thu Jul 22, 2010 12:12 am
Dear Allan,
Thanks a lot for the detailed info. It does make sense to me. However, I have the following comments:
(i) I think 'cgetrf' should be working in single precision (following the LAPACK naming convention: 'c' for single-precision complex and 'z' for double-precision complex); looking at 'testing_cgetrf_gpu.cpp' in MAGMA's testing directory also indicates this. I was expecting closer to a TFLOP/s from the GPU. Maybe the problem size is not big enough to attain that, or the GPU implementation's performance is memory-bound.
(ii) I had the impression that all corresponding CUBLAS or MAGMA functions would outperform their LAPACK counterparts, which I probably built up by testing a few routines (e.g. _gemm) and noticing that CUBLAS performed better than LAPACK. But it seems that this is not always the case.
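For reference, the precision letter at the start of LAPACK/MAGMA routine names decodes as follows (a tiny illustrative helper, not part of MAGMA):

```python
# Decode the precision prefix of a LAPACK/MAGMA routine name, e.g. 'cgetrf'.
# 's'/'d' are real single/double; 'c'/'z' are complex single/double.

PRECISION = {
    's': 'single real',
    'd': 'double real',
    'c': 'single complex',
    'z': 'double complex',
}

def lapack_precision(routine):
    """Return the precision implied by a routine name's first letter."""
    return PRECISION[routine[0].lower()]

print(lapack_precision('cgetrf'))  # -> single complex
print(lapack_precision('zgetrf'))  # -> double complex
```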
Thanks!
Gaurav
-
Stan Tomov
- Posts: 283
- Joined: Fri Aug 21, 2009 10:39 pm
Post
by Stan Tomov » Thu Aug 05, 2010 3:59 pm
Hi,
The complex versions attain higher performance. The reason you see it lower is
that MAGMA 0.2 was released before CUBLAS added all the routines that we needed
in complex arithmetic. So we just wrote wrappers for the missing ones: copy the
data to the CPU, compute with CPU BLAS, and move the result back to the GPU.
I am going to recompile and, within a few days, post a version that uses CUBLAS
directly, as the complex routines that we need are now available.
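That fallback pattern can be sketched as follows (illustrative Python, not MAGMA's actual code; numpy stands in for CPU BLAS, plain lists stand in for GPU buffers, and the function names are hypothetical):

```python
# Illustrative sketch of the fallback pattern described above: when a
# complex routine was missing on the device, wrap it as copy-to-CPU,
# compute with CPU BLAS, copy the result back to the GPU.

import numpy as np

def device_to_host(d_a):   # stand-in for a device-to-host memcpy
    return np.array(d_a)

def host_to_device(h_a):   # stand-in for a host-to-device memcpy
    return h_a.tolist()

def fallback_cgemm(d_a, d_b):
    """Hypothetical wrapper: the 'device' lacks the complex routine, so
    compute the product on the CPU and move the result back."""
    h_a = device_to_host(d_a)
    h_b = device_to_host(d_b)
    h_c = h_a @ h_b        # CPU BLAS does the actual work here
    return host_to_device(h_c)

d_a = [[1 + 1j, 0], [0, 1 - 1j]]
d_b = [[2, 0], [0, 2]]
print(fallback_cgemm(d_a, d_b))
```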
I think Allan used the MAGMA sources to recompile MAGMA to call CUBLAS directly
(see also
this topic).
Stan