Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Thu Jul 15, 2010 11:57 pm
Hi,
While running the supplied 'testing_cgetrf_gpu', I observed that a parallel CPU build achieves higher GFLOP/s than the GPU. Parts of the result tables follow:
a) Sequential CPU vs GPU
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 13.74 36.66 4.693558e-09
2048 17.67 50.95 4.678659e-09
3072 18.55 54.45 4.583328e-09
4032 19.20 58.67 4.651947e-09
b) Parallel CPU vs GPU
N CPU GFlop/s GPU GFlop/s ||PA-LU|| / (||A||*N)
==========================================================
1024 45.28 38.50 4.691306e-09
2048 116.63 57.24 4.679362e-09
3072 128.99 62.58 4.616083e-09
4032 131.83 64.67 4.631509e-09
I was just wondering whether this is to be expected, or whether I am doing something wrong?
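For context on how figures like these are produced: the GFLOP/s columns are typically an operation count divided by the measured time. For complex LU (cgetrf) of an n x n matrix the standard count is roughly (8/3)n^3 flops, i.e. 4x the real-case (2/3)n^3 (a complex multiply is 6 real flops, a complex add is 2). A small sketch, assuming MAGMA's tester uses this standard LAPACK-style count:

```python
# Sketch: how GFLOP/s figures like those in the tables are typically derived.
# Assumes the standard operation count for complex LU (cgetrf) of an
# n x n matrix: roughly (8/3) * n^3 flops (4x the real-case (2/3) * n^3).

def cgetrf_flops(n):
    """Approximate flop count for complex LU factorization of an n x n matrix."""
    return (8.0 / 3.0) * n**3

def gflops(n, seconds):
    """Convert a timed cgetrf run into a GFLOP/s rate."""
    return cgetrf_flops(n) / seconds / 1e9

# Example: the GPU row N=1024 at 36.66 GFlop/s implies a runtime of about
# cgetrf_flops(1024) / 36.66e9 seconds, i.e. roughly 78 ms.
print(round(cgetrf_flops(1024) / 36.66e9 * 1000, 1), "ms")
```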
-
Allan Menezes
- Posts: 14
- Joined: Wed Aug 05, 2009 10:01 pm
Post
by Allan Menezes » Fri Jul 16, 2010 10:46 pm
For both tables, could you please specify how many CPUs you used and which ones (Intel vs. AMD, number of cores), as well as the number and type of GPUs?
Allan
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Sat Jul 17, 2010 7:20 pm
Following is the information about the CPU & the GPU I'm using:
CPU: Two Intel Xeon E5530 quad-cores (2.4 GHz)
GPU: One NVIDIA Tesla C1060
Please let me know if more information is required.
Thanks,
Gaurav
-
Allan Menezes
- Posts: 14
- Joined: Wed Aug 05, 2009 10:01 pm
Post
by Allan Menezes » Tue Jul 20, 2010 7:52 am
Dear Gaurav,
Here are the specs of the Tesla C1060 at URL
http://www.nvidia.com/object/product_te ... 60_us.html
Here are the specs of the Intel E5530 at URL
http://ark.intel.com/Product.aspx?id=37103
As you can see, it peaks at only 78 GFlops in double precision, and you are getting close to that.
As for the Intel Xeon E5530, the per-core double-precision peak is CPU frequency * 4 = 9.6 GFlops without hyperthreading. If it has hyperthreading and that factors in twice, then it is 9.6 * 2 = 19.2 GFlops per core in double precision, which is what you observe in the sequential run, as it uses a single core with hyperthreading. In parallel, the two E5530s present 16 hardware threads with hyperthreading, giving 16 * 2.4 GHz * 4 = 153.6 GFlops double precision; your 131.83 works out to about 86% efficiency for the CPUs in parallel. So your results do make sense, as your GPU is not a Tesla Fermi!
That is all assuming cgetrf_gpu measures double-precision performance. For the CPU, sequential and parallel, I guess Intel MKL is what the 'testing_cgetrf_gpu' routine uses to measure its GFlops, and MAGMA 0.2 for the C1060.
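To make the arithmetic above concrete, here is the same accounting as a small script (the factor of 2 for hyperthreading is the assumption made above; the 4 flops/cycle figure is the SSE double-precision rate per core):

```python
# Peak-rate arithmetic for 2x quad-core Xeon E5530 at 2.4 GHz, following
# the accounting above (4 double-precision flops per core per cycle via
# SSE; the extra factor of 2 is the hyperthreading assumption made above).

def peak_gflops(cores, ghz, flops_per_cycle, factor=1):
    return cores * ghz * flops_per_cycle * factor

per_core   = peak_gflops(1, 2.4, 4)     # 9.6 GFlop/s per core
sequential = peak_gflops(1, 2.4, 4, 2)  # 19.2 GFlop/s, one core with the HT factor
parallel   = peak_gflops(8, 2.4, 4, 2)  # 153.6 GFlop/s, all 8 physical cores
print(per_core, sequential, round(parallel, 1),
      round(131.83 / parallel * 100), "% efficiency")
```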
Hope this helps,
Regards,
Allan
-
gaurav
- Posts: 6
- Joined: Thu Aug 13, 2009 1:20 pm
Post
by gaurav » Thu Jul 22, 2010 12:12 am
Dear Allan,
Thanks a lot for the detailed info. It does make sense to me. However, I have the following comments:
(i) I think 'cgetrf' should be working in single precision (following the LAPACK naming convention: 'c' for single-precision complex and 'z' for double-precision complex); looking at 'testing_cgetrf_gpu.cpp' in MAGMA's testing directory also indicates this. I was expecting closer to a TFLOP/s from the GPU. Maybe the problem size is not big enough to attain that, or the GPU implementation's performance is memory-bound.
(ii) I had the impression that all corresponding CUBLAS or MAGMA functions would outperform their LAPACK counterparts, which I probably built up by testing a few routines (e.g. _gemm) and noticing that CUBLAS performed better than LAPACK. But it seems that this is not always the case.
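For reference, the precision letter at the start of LAPACK/MAGMA routine names decodes as follows (a tiny illustrative helper, not part of MAGMA):

```python
# Decode the precision prefix of a LAPACK/MAGMA routine name, e.g. 'cgetrf'.
# 's'/'d' are real single/double; 'c'/'z' are complex single/double.

PRECISION = {
    's': 'single real',
    'd': 'double real',
    'c': 'single complex',
    'z': 'double complex',
}

def lapack_precision(routine):
    """Return the precision implied by a routine name's first letter."""
    return PRECISION[routine[0].lower()]

print(lapack_precision('cgetrf'))  # -> single complex
print(lapack_precision('zgetrf'))  # -> double complex
```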
Thanks!
Gaurav
-
Stan Tomov
- Posts: 283
- Joined: Fri Aug 21, 2009 10:39 pm
Post
by Stan Tomov » Thu Aug 05, 2010 3:59 pm
Hi,
The complex versions attain higher performance. The reason you see it lower is
that MAGMA 0.2 was released before CUBLAS added all the routines that we needed
in complex arithmetic. So we just wrote wrappers for the missing ones: copy the
data to the CPU, compute with CPU BLAS, and move the result back to the GPU.
I am going to recompile and, within a few days, post a version that uses CUBLAS
directly, as the complex routines that we need are now available.
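That fallback pattern can be sketched as follows (illustrative Python, not MAGMA's actual code; numpy stands in for CPU BLAS, plain lists stand in for GPU buffers, and the function names are hypothetical):

```python
# Illustrative sketch of the fallback pattern described above: when a
# complex routine was missing on the device, wrap it as copy-to-CPU,
# compute with CPU BLAS, copy the result back to the GPU.

import numpy as np

def device_to_host(d_a):   # stand-in for a device-to-host memcpy
    return np.array(d_a)

def host_to_device(h_a):   # stand-in for a host-to-device memcpy
    return h_a.tolist()

def fallback_cgemm(d_a, d_b):
    """Hypothetical wrapper: the 'device' lacks the complex routine, so
    compute the product on the CPU and move the result back."""
    h_a = device_to_host(d_a)
    h_b = device_to_host(d_b)
    h_c = h_a @ h_b        # CPU BLAS does the actual work here
    return host_to_device(h_c)

d_a = [[1 + 1j, 0], [0, 1 - 1j]]
d_b = [[2, 0], [0, 2]]
print(fallback_cgemm(d_a, d_b))
```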
I think Allan used the MAGMA sources to recompile MAGMA to call CUBLAS directly
(see also
this topic).
Stan