LAPACK/ScaLAPACK Development

Posted: **Thu Nov 09, 2006 3:43 pm**

This might be a simple question, but I hope someone can offer an opinion.

I am writing a code that calls a few simple level-1 BLAS subroutines
such as dnrm2 and daxpy. The increment is always one.
To keep my code simple, I would prefer to write these routines
myself rather than call BLAS.
They should be very simple, but I noticed that dnrm2 or daxpy
uses a more complicated "unrolled" loop when the increment is one. Does this make any significant difference in run time on a typical computer? My guess is that it does not. If so, why do these level-1 subroutines use such an implementation?

Many thanks

Posted: **Thu Nov 09, 2006 4:38 pm**

If you are looking at BLAS code, you are probably looking at the reference BLAS. That code is not tuned for performance. If possible, you should use the BLAS provided by the system or CPU vendor. Nevertheless, you are correct: no matter what one does at the level of BLAS1, the code is going to be memory-bandwidth bound. Whatever BLAS code you are looking at was probably written a long time ago. If all you need are simple operations like norms and axpys, go ahead and hand code them. There is no reason to introduce an external dependency into your code if you are not gaining anything. Also, if you doubt whether something is a performance gain, just time it.
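To make the "hand code it and time it" advice concrete, here is a minimal sketch of a unit-stride axpy written two ways: a plain loop, and a loop unrolled by four with a cleanup loop for the remainder, which is roughly the shape of the unrolled code the question describes. The function names and the `time_axpy` helper are hypothetical, and Python is used only for illustration. Timings from an interpreted language say little about compiled code, so the comparison would need to be repeated in the language and compiler actually used.

```python
import timeit


def axpy_plain(alpha, x, y):
    """y <- alpha*x + y for unit stride, one element per iteration."""
    for i in range(len(x)):
        y[i] += alpha * x[i]
    return y


def axpy_unrolled(alpha, x, y):
    """Same update, unrolled by 4 with a cleanup loop for n mod 4 elements."""
    n = len(x)
    m = n % 4
    # Cleanup loop: handle the leading n mod 4 elements one at a time.
    for i in range(m):
        y[i] += alpha * x[i]
    # Main loop: four updates per iteration over the remaining elements.
    for i in range(m, n, 4):
        y[i] += alpha * x[i]
        y[i + 1] += alpha * x[i + 1]
        y[i + 2] += alpha * x[i + 2]
        y[i + 3] += alpha * x[i + 3]
    return y


def time_axpy(fn, n=10_000, repeats=20):
    """Hypothetical helper: best wall-clock time of fn over several runs."""
    x = [float(i) for i in range(n)]
    y = [1.0] * n
    return min(timeit.repeat(lambda: fn(0.5, x, y), number=1, repeat=repeats))
```

Calling `time_axpy(axpy_plain)` and `time_axpy(axpy_unrolled)` on the target machine gives a direct comparison. Both variants perform the same arithmetic, so any difference comes from loop overhead rather than from the amount of data moved.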

Posted: **Thu Nov 09, 2006 5:36 pm**

Some BLAS routines, like dnrm2, are carefully written to avoid unnecessary overflow and harmful underflow. So if there is any chance that your vectors have very large or very small elements, use the BLAS routine.

Best wishes,

Sven Hammarling.
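The overflow and underflow point can be illustrated with a short sketch. The naive norm squares each element directly, so the intermediate sum can overflow or underflow even when the true norm is representable. The scaled version below keeps a running scale and a scaled sum of squares, similar in spirit to the approach taken in reference implementations of dnrm2. It is an illustration under that assumption, not the BLAS source, and the function names are hypothetical.

```python
import math


def nrm2_naive(x):
    """Euclidean norm by direct sum of squares; may overflow or underflow."""
    return math.sqrt(sum(v * v for v in x))


def nrm2_scaled(x):
    """Euclidean norm using a running scale so that no square leaves range.

    Invariant: scale**2 * ssq equals the sum of squares seen so far,
    where scale is the largest absolute value seen so far.
    """
    scale = 0.0
    ssq = 1.0
    for v in x:
        if v != 0.0:
            a = abs(v)
            if scale < a:
                # New largest element: rescale the accumulated sum to it.
                r = scale / a
                ssq = 1.0 + ssq * r * r
                scale = a
            else:
                r = a / scale
                ssq += r * r
    return scale * math.sqrt(ssq)
```

For a vector such as `[1e200, 1e200]`, the naive version returns infinity because `1e200 * 1e200` exceeds the double-precision range, while the scaled version returns a finite value close to `1.414e200`. For `[1e-200, 1e-200]`, the naive version returns zero and the scaled version does not.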

The need to use level-1 BLAS