The LAPACK forum has moved to https://groups.google.com/a/icl.utk.edu/g/lapack.

cblas_gemv / cblas_gemm much faster with vector of 0s

Post here if you have a question about LAPACK performance

cblas_gemv / cblas_gemm much faster with vector of 0s

Postby kangshiyin » Mon Jun 27, 2016 1:11 pm

I saw a question on stackoverflow.

http://stackoverflow.com/questions/3804 ... rix-values

It shows that in a vector-matrix multiplication, gemv/gemm runs much faster when the vector contains a lot of 0s, and the result is correct.
So far I have only been able to reproduce this with the libblas-dev package on Ubuntu 14.04.

Could you help explain why?
kangshiyin
 
Posts: 2
Joined: Mon Jun 27, 2016 1:00 pm

Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Postby Julien Langou » Tue Jun 28, 2016 1:29 am

Hi Usman,

I have no idea. You multiply a 1-by-m vector A by the m-by-m identity B and want to compute C = AB. (So you will get C = A. Fine.)
Then, when A is a sparse vector, you observe that the code runs significantly faster than when A is a dense vector.
OK. No idea why on my end.

Cheers,
Julien.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Postby Julien Langou » Tue Jun 28, 2016 1:44 am

Jim Demmel says:
The reference BLAS (which may or may not be how your BLAS is implemented)
sometimes check for zero entries in the inputs to avoid unnecessary arithmetic
(look at sger.f). I could imagine that someone optimizing the BLAS might
do this for other BLAS implementations as well. For example, it only costs
O(n^2) to count the number of zeros in an n-by-n matrix, which is cheap
compared to the O(n^3) cost of matrix multiplication, and if the number of
zeros is large enough, they might use a "sparse matrix multiply" algorithm.

Interestingly, this means that the BLAS do not propagate exceptions
consistently: e.g., some implementations might multiply 0*inf and get a NaN,
while others may skip the multiplication entirely.
Julien Langou
 
Posts: 835
Joined: Thu Dec 09, 2004 12:32 pm
Location: Denver, CO, USA

Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Postby kangshiyin » Tue Jun 28, 2016 2:54 am

Thank you for the reply. An answer to the original question confirms that such optimization exists in that particular version of BLAS.
kangshiyin
 
Posts: 2
Joined: Mon Jun 27, 2016 1:00 pm

Re: cblas_gemv / cblas_gemm much faster with vector of 0s

Postby ChrisBragg » Thu Jul 14, 2016 1:23 am

When you optimize gemv and gemm, different techniques apply:
* For the matrix-matrix operation you use blocked algorithms; the block sizes depend on cache sizes.
* For the matrix-vector product you use so-called fused Level 1 operations (e.g. fused dot products or fused axpy).
ChrisBragg
 
Posts: 2
Joined: Wed Jul 13, 2016 5:05 am

