MAGMA GEMM Sources for Fermi Released

Post by **admin** » Wed Aug 04, 2010 12:54 pm

The MAGMA BLAS SGEMM and DGEMM sources for Fermi GPUs are now released.
These improved GEMMs, developed by Rajib Nath and Stan Tomov, will be
part of the upcoming MAGMA 0.3 library release and will be included in
CUBLAS 3.2 as well.

The basic algorithm is described in:
Nath, R., Tomov, S., Dongarra, J. "An Improved MAGMA GEMM for Fermi GPUs,"
University of Tennessee Computer Science Technical Report, UT-CS-10-655
(also LAPACK working note 227), July 29, 2010.
http://icl.cs.utk.edu/projectsfiles/mag ... i_gemm.pdf

On a C2050 GPU the new DGEMM reaches up to 300 GFlop/s (58% of peak) and
the SGEMM up to 645 GFlop/s (63% of peak). On a GTX480, DGEMM reaches up to
166 GFlop/s and SGEMM up to 844 GFlop/s.
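
The percent-of-peak figures above can be sanity-checked against the theoretical peaks NVIDIA published for the Tesla C2050 (515 GFlop/s in double precision, 1030 GFlop/s in single precision). Those peak values are not stated in the post and are assumed here; `percent_of_peak` is a hypothetical helper used only for illustration.

```python
# Assumed theoretical peaks for the Tesla C2050 (not stated in the post).
C2050_PEAK_GFLOPS = {"double": 515.0, "single": 1030.0}


def percent_of_peak(achieved_gflops, peak_gflops):
    """Return achieved performance as a percentage of the theoretical peak."""
    return 100.0 * achieved_gflops / peak_gflops


dgemm_pct = percent_of_peak(300.0, C2050_PEAK_GFLOPS["double"])
sgemm_pct = percent_of_peak(645.0, C2050_PEAK_GFLOPS["single"])

print(f"DGEMM: {dgemm_pct:.0f}% of peak")  # DGEMM: 58% of peak
print(f"SGEMM: {sgemm_pct:.0f}% of peak")  # SGEMM: 63% of peak
```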

Post by **mbibby** » Thu Aug 05, 2010 10:04 am

When will we see the cgemm and zgemm equivalents?

Malcolm

Post by **Stan Tomov** » Thu Aug 05, 2010 12:32 pm

I am not sure if we would personally write the equivalents. NVIDIA is preparing CUBLAS 3.2
that will have improved c/z gemms using ideas from the s/d gemms.
Stan

Post by **Boxed Cylon** » Fri Aug 13, 2010 2:26 am

I preface this post with the declaration that I know just about nothing about details of these routines...

I was looking through the fermi_sgemm.cu routine to get some sense of how the code was engineered. I noticed the __mul24 function and wondered what it did. A Google search turned up the Fermi Tuning Guide, which says:

32-Bit Integer Multiplication
On devices of compute capability 1.x, 32-bit integer multiplication is implemented using multiple instructions as it is not natively supported. 24-bit integer multiplication is natively supported via the __[u]mul24 intrinsic.

On devices of compute capability 2.0, however, 32-bit integer multiplication is natively supported, but 24-bit integer multiplication is not. __[u]mul24 is therefore implemented using multiple instructions and should not be used (Section 5.4.1).

Should the fermi_sgemm.cu routine be using __mul24? (Or perhaps there are reasons 24-bit integers are employed?)

Post by **Stan Tomov** » Tue Sep 07, 2010 1:15 pm

There is no reason to use __mul24. We will remove it. Thanks for pointing this out.

Post by **Allan Menezes** » Sun Sep 12, 2010 11:51 pm

Dear Stan,
Since this is just pointer arithmetic, used in only a few places, it does not change the performance much at all, as my experiment below shows.
Just for fun, I added a single #define at the top of fermi_dgemm.cu and fermi_sgemm.cu, namely #define __mul24(a,b) ((a)*(b)), and there was no significant difference in GFlop/s; err was still 0.00 on a GTX-480.
Device memory on currently available Fermi devices is still under 4 GB; this will change with the Tesla C2070 and CUDA 3.2, which move to 64-bit addresses.
Thank you,
Allan
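
Why the #define above leaves results unchanged can be illustrated on the host. The following is a minimal sketch that emulates the documented behavior of the unsigned intrinsic __umul24 (multiply the low 24 bits of each operand, keep the low 32 bits of the product) and compares it with an ordinary 32-bit multiply. The function names and the index values are hypothetical and are not taken from fermi_sgemm.cu; the kernels use the signed __mul24, which behaves the same way for the small non-negative operands typical of index arithmetic.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def umul24(x, y):
    """Emulate __umul24: product of the low 24 bits of x and y, low 32 bits kept."""
    return ((x & MASK24) * (y & MASK24)) & MASK32


def mul32(x, y):
    """Emulate an ordinary 32-bit unsigned multiply, i.e. ((a)*(b))."""
    return (x * y) & MASK32


# Hypothetical index computation: block index times tile width.
block_index, tile_width = 1000, 96
print(umul24(block_index, tile_width) == mul32(block_index, tile_width))  # True

# Once an operand needs more than 24 bits, the two results diverge.
big = (1 << 24) + 5
print(umul24(big, 3), mul32(big, 3))  # 15 50331663
```

As long as both operands fit in 24 bits, the two multiplies agree, which is consistent with the unchanged err reported above.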

Post by **rramachand21** » Tue Nov 30, 2010 5:05 pm

Hello,

I am new to CUDA and this API. Could I please get generic source code for matrix-vector multiplication (sgemv and dgemv)?

Thanks,
Ranjith

Post by **anikam** » Fri Mar 16, 2018 5:13 pm

Hello,
Why does MAGMA BLAS only work when m, n, k are multiples of 96?
Can it work if m, n, k are not multiples of 96?

Thanks and Regards
Abhishek Nikam

Post by **mgates3** » Fri Mar 16, 2018 5:17 pm

It should work for any m, n, k, not just multiples of 96. If you are having a problem with other sizes, please post specifics, e.g., the output of magma/testing/testing_dgemm.

(There may be problems for very large matrices, due to exceeding GPU texture memory. As I recall, in these cases we just call cublas.)

-mark

Post by **anikam** » Fri Mar 16, 2018 5:36 pm

Hello,
Thanks for the reply. For my work I need to use only the open-source MAGMA BLAS GEMM.
Also, it calls cublasgemm if the dimensions are not multiples of 96.
It does not give any particular warning about large sizes (large sizes with dimensions that are multiples of 96 must work).
Also, would the MAGMA BLAS GEMM work for dimensions that are not multiples of 96 but are fairly small?
Are there any specific changes that need to be made for that?
Also, it does not work with the latest CUDA versions. Is there any way I can make it run with the latest CUDA versions?

Thanks and Regards
Abhishek Nikam
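
If a particular kernel build really did require dimensions that are multiples of 96, one generic workaround is to zero-pad the operands up to the next multiple and read back only the leading m × n block of the result. This is an assumption for illustration, not something the MAGMA sources are confirmed to do, and the helper names below are hypothetical. The padding zeros contribute nothing to the product, so the leading block equals the unpadded result.

```python
def round_up(n, multiple=96):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def zero_pad(mat, rows, cols):
    """Copy `mat` into the top-left corner of a rows x cols zero matrix."""
    padded = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(mat):
        padded[i][:len(row)] = row
    return padded


def matmul(a, b):
    """Naive reference product C = A * B on lists of lists."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


m, k, n = 5, 7, 3  # deliberately not multiples of 96
a = [[float(i + j) for j in range(k)] for i in range(m)]
b = [[float(i * j + 1) for j in range(n)] for i in range(k)]

# Pad every dimension up to the next multiple of 96, multiply, then crop.
mp, kp, np_ = round_up(m), round_up(k), round_up(n)
c_padded = matmul(zero_pad(a, mp, kp), zero_pad(b, kp, np_))
c_cropped = [row[:n] for row in c_padded[:m]]

print(mp, kp, np_)                # 96 96 96
print(c_cropped == matmul(a, b))  # True
```

The cost is extra memory and wasted flops on the padding, which matters most for small matrices.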
