The LAPACK forum has moved to https://github.com/Reference-LAPACK/lapack/discussions.

sgemm v sgemv performance for a matrix-vector multiply

Open discussion regarding features, bugs, issues, vendors, etc.

Postby panther » Mon Jan 15, 2007 1:22 pm

Hi all,

I was just wondering if anyone can comment on the efficiency of using dgemm to perform a matrix-vector multiply (for large matrices/vectors, dimensions > 2000) instead of the traditional dgemv routine, i.e. setting the number of columns of matrix B (the vector) to 1 and performing a matrix-matrix multiply with dgemm.

The reason I ask is that I use Intel's MKL library, which has a multithreaded Level 3 BLAS, so I can employ multiple threads to carry out the matrix-vector multiply when I use a coerced dgemm call as opposed to the more specific dgemv.

In general, I was just wondering if anyone knew the performance penalty of dgemm over dgemv for a matrix-vector multiply with the column size set to 1, for cases in which the MKL library is not available, i.e. when porting to another architecture.
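For concreteness, the "coerced" call the question describes can be sketched with SciPy's low-level BLAS wrappers (an assumption for illustration; the original code presumably calls MKL's dgemm/dgemv directly from Fortran or C):

```python
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

# Illustrative size; the post mentions dimensions > 2000.
n = 2000
rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

# Traditional matrix-vector multiply: y = alpha * A @ x
y_gemv = dgemv(1.0, A, x)

# Coerced matrix-matrix multiply: treat x as an n-by-1 matrix B,
# so dgemm is called with the second dimension set to 1.
y_gemm = dgemm(1.0, A, x.reshape(n, 1)).ravel()

# Both paths compute the same product.
assert np.allclose(y_gemv, y_gemm)
```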

Comments gratefully received.

Thanks.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing

Postby Julien Langou » Mon Jan 15, 2007 2:46 pm

Hello,

well I am not sure I get your question correctly but here are some facts:

(1) any good matrix-matrix multiply (gemm) called with n=1 will call the matrix-vector multiply routine (gemv) underneath.

(2) if you have an implementation of gemm around, you probably have a gemv, and conversely if you have a gemv, you probably have a gemm. Both routines are in general part of a BLAS library, and optimized BLAS libraries are available on most platforms. (Intel's, for example.)

Julien.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Postby panther » Mon Jan 15, 2007 3:09 pm

Dear Julien,

Thanks for your reply.

I suspected that DGEMM would call DGEMV when the number of columns in the second matrix is 1.

Intel's Math Kernel Library (MKL) has multithreaded support for Level 3 BLAS routines only, i.e. I can set OMP_NUM_THREADS on my SMP machine and get thread support for calls to DGEMM, for instance.

Currently I use a multithreaded DGEMM call for matrix-vector multiply so I can benefit from threads (due to the large matrices and vectors I am computing with).

In general, though, what happens when I wish to run my code on another architecture that only provides a standard serial BLAS library? Will I be penalised when I call DGEMM to perform a matrix-vector multiply as opposed to the traditional DGEMV? I don't really want to maintain separate codes for different architectures, for portability reasons, as you can imagine.

Thanks.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing

Postby Julien Langou » Mon Jan 15, 2007 3:15 pm

ok got it.

1- No, in the general case you will not be penalized if you call DGEMM with N=1 rather than DGEMV. (The penalty is the function call from DGEMM to DGEMV, which is a priori negligible.)

2- There is a good reason why Intel is not providing a multithreaded Level-2 BLAS implementation. It is not an easy task to get any speed-up at all on a Level-2 BLAS operation. I do not know of many BLAS implementations that provide a multithreaded Level-2 BLAS.

If you call DGEMM with N=1, do you see any speed-up with respect to DGEMV? I doubt it.
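Julien's question can be checked empirically. A minimal benchmark sketch, again assuming SciPy's BLAS wrappers stand in for a direct MKL call (whether the N=1 path shows a speed-up depends entirely on the BLAS build and thread settings, so no particular outcome is asserted):

```python
import time
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

def bench(fn, reps=20):
    """Average wall-clock time of fn over reps calls, after one warm-up."""
    fn()  # warm-up (thread pools, page faults)
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

n = 2000
rng = np.random.default_rng(1)
A = np.asfortranarray(rng.standard_normal((n, n)))
x = rng.standard_normal(n)
B = np.asfortranarray(x.reshape(n, 1))  # x viewed as an n-by-1 matrix

t_gemv = bench(lambda: dgemv(1.0, A, x))      # plain matrix-vector multiply
t_gemm = bench(lambda: dgemm(1.0, A, B))      # gemm coerced with N=1

print(f"dgemv: {t_gemv * 1e3:.2f} ms   dgemm(N=1): {t_gemm * 1e3:.2f} ms")
```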

Julien.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Postby buttari » Tue Jan 16, 2007 9:57 am

Julien Langou wrote:ok got it.

1- No, in the general case you will not be penalized if you call DGEMM with N=1 rather than DGEMV. (The penalty is the function call from DGEMM to DGEMV, which is a priori negligible.)

2- There is a good reason why Intel is not providing a multithreaded Level-2 BLAS implementation. It is not an easy task to get any speed-up at all on a Level-2 BLAS operation. I do not know of many BLAS implementations that provide a multithreaded Level-2 BLAS.

If you call DGEMM with N=1, do you see any speed-up with respect to DGEMV? I doubt it.

Julien.


Hi all,
Intel (and also ATLAS) does not provide multithreaded level-2 BLAS because using many threads on BLAS-2 operations doesn't provide any speedup in many cases and, in some cases, it also results in a slowdown (i.e. 2 threads run slower than 1). The reason for this lies in the famous surface-to-volume effect that happens for BLAS-3 (i.e. n^3 operations on n^2 data) but not for BLAS-2 (n^2 operations on n^2 data); this means that BLAS-2 are upper bounded by the speed of the bus. Since it's pretty easy to saturate the bus with one thread, using many may not help. If you want you can give a shot to GotoBLAS which also provides multithreaded BLAS-2 operations: I timed GEMV on my laptop (core duo with slow bus) and I found that 2 threads run slower than 1.
Hope this helps

Alfredo
buttari
 
Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm

Postby panther » Tue Jan 16, 2007 12:41 pm

Thanks for that information. At least I now know why only L3 BLAS seems to be multithreaded in most cases.

Regards.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing
