The LAPACK forum has moved to https://github.com/Reference-LAPACK/lapack/discussions.

sgemm v sgemv performance for a matrix-vector multiply

Open discussion regarding features, bugs, issues, vendors, etc.

Postby panther » Mon Jan 15, 2007 1:22 pm

Hi all,

I was just wondering if anyone can comment on the efficiency of using dgemm to perform a matrix-vector multiply (for large matrices/vectors, dimensions > 2000) instead of the traditional dgemv routine, i.e. setting the number of columns of matrix B (the vector) to 1 and performing a matrix-matrix multiply with dgemm.

The reason I ask is that I use Intel's MKL library, which has a multithreaded Level 3 BLAS, so I can employ multiple threads to carry out the matrix-vector multiply when I use a coerced dgemm call as opposed to the more specific dgemv.

In general, I was just wondering if anyone knew the performance penalty of dgemm over dgemv for a matrix-vector multiply with the column size set to 1, for cases in which the MKL library is not available, i.e. when porting to another architecture.
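For concreteness, the "coerced" call the question describes can be sketched with SciPy's low-level BLAS wrappers (an assumption for illustration; the original code presumably calls MKL's dgemm/dgemv directly from Fortran or C):

```python
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

# Illustrative size; the post mentions dimensions > 2000.
n = 2000
rng = np.random.default_rng(0)
A = np.asfortranarray(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

# Traditional matrix-vector multiply: y = alpha * A @ x
y_gemv = dgemv(1.0, A, x)

# Coerced matrix-matrix multiply: treat x as an n-by-1 matrix B,
# so dgemm is called with the second dimension set to 1.
y_gemm = dgemm(1.0, A, x.reshape(n, 1)).ravel()

# Both paths compute the same product.
assert np.allclose(y_gemv, y_gemm)
```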

Comments gratefully received.

Thanks.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing

Postby Julien Langou » Mon Jan 15, 2007 2:46 pm

Hello,

well I am not sure I get your question correctly but here are some facts:

(1) any good matrix-matrix multiply (gemm) called with n=1 will call the matrix-vector multiply routine (gemv) underneath.

(2) if you have an implementation of gemm around, you probably have a gemv, and conversely if you have a gemv, you probably have a gemm. Both routines are in general part of a BLAS library, and optimized BLAS libraries are available on most platforms. (Intel's, for example.)

Julien.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Postby panther » Mon Jan 15, 2007 3:09 pm

Dear Julien,

Thanks for your reply.

I suspected that DGEMM would call DGEMV when the number of columns in the second matrix is 1.

Intel's Math Kernel Library (MKL) has multithreaded support for Level 3 BLAS routines only, i.e. I can set OMP_NUM_THREADS on my SMP machine and get thread support for calls to DGEMM, for instance.

Currently I use a multithreaded DGEMM call for matrix-vector multiply so I can benefit from threads (due to the large matrices and vectors I am computing with).

In general, though, what happens when I wish to run my code on another architecture that only provides a standard serial BLAS library? Will I be penalised when I call DGEMM to perform a matrix-vector multiply as opposed to the traditional DGEMV? I don't really want to maintain separate codes for different architectures, for portability reasons, as you can imagine.

Thanks.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing

Postby Julien Langou » Mon Jan 15, 2007 3:15 pm

ok got it.

1- No, in the general case you will not be penalized if you call DGEMM with N=1 rather than DGEMV. (The penalty is the function call from DGEMM to DGEMV, which is a priori negligible.)

2- There is a good reason why Intel is not providing a multithreaded Level-2 BLAS implementation. It is not an easy task to get any speed-up at all on a Level-2 BLAS operation. I do not know of many BLAS implementations that provide a multithreaded Level-2 BLAS.

If you call DGEMM with N=1, do you see any speed-up with respect to DGEMV? I doubt it.
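Julien's question can be checked empirically. A minimal benchmark sketch, again assuming SciPy's BLAS wrappers stand in for a direct MKL call (whether the N=1 path shows a speed-up depends entirely on the BLAS build and thread settings, so no particular outcome is asserted):

```python
import time
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

def bench(fn, reps=20):
    """Average wall-clock time of fn over reps calls, after one warm-up."""
    fn()  # warm-up (thread pools, page faults)
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

n = 2000
rng = np.random.default_rng(1)
A = np.asfortranarray(rng.standard_normal((n, n)))
x = rng.standard_normal(n)
B = np.asfortranarray(x.reshape(n, 1))  # x viewed as an n-by-1 matrix

t_gemv = bench(lambda: dgemv(1.0, A, x))      # plain matrix-vector multiply
t_gemm = bench(lambda: dgemm(1.0, A, B))      # gemm coerced with N=1

print(f"dgemv: {t_gemv * 1e3:.2f} ms   dgemm(N=1): {t_gemm * 1e3:.2f} ms")
```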

Julien.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Postby buttari » Tue Jan 16, 2007 9:57 am

Julien Langou wrote:ok got it.

1- No, in the general case you will not be penalized if you call DGEMM with N=1 rather than DGEMV. (The penalty is the function call from DGEMM to DGEMV, which is a priori negligible.)

2- There is a good reason why Intel is not providing a multithreaded Level-2 BLAS implementation. It is not an easy task to get any speed-up at all on a Level-2 BLAS operation. I do not know of many BLAS implementations that provide a multithreaded Level-2 BLAS.

If you call DGEMM with N=1, do you see any speed-up with respect to DGEMV? I doubt it.

Julien.


Hi all,
Intel (and also ATLAS) does not provide multithreaded level-2 BLAS because using many threads on BLAS-2 operations doesn't provide any speedup in many cases and, in some cases, it also results in a slowdown (i.e. 2 threads run slower than 1). The reason for this lies in the famous surface-to-volume effect that happens for BLAS-3 (i.e. n^3 operations on n^2 data) but not for BLAS-2 (n^2 operations on n^2 data); this means that BLAS-2 are upper bounded by the speed of the bus. Since it's pretty easy to saturate the bus with one thread, using many may not help. If you want you can give a shot to GotoBLAS which also provides multithreaded BLAS-2 operations: I timed GEMV on my laptop (core duo with slow bus) and I found that 2 threads run slower than 1.
Hope this helps

Alfredo
buttari
 
Posts: 51
Joined: Tue Jul 11, 2006 2:11 pm

Postby panther » Tue Jan 16, 2007 12:41 pm

Thanks for that information. At least I now know why only L3 BLAS seems to be multithreaded in most cases.

Regards.
panther
 
Posts: 6
Joined: Mon Jan 15, 2007 1:06 pm
Location: Irish Centre for High-End Computing
