This might be a simple question, but I hope someone can share their opinion.
I am writing code that calls a few simple level-1 BLAS subroutines,
such as dnrm2 and daxpy. The increment is always one.
To keep my code simple and self-contained, I would prefer to write
these subroutines myself rather than call BLAS.
They are supposed to be very simple, but I noticed that dnrm2 and daxpy
use a more complicated "unrolled" loop implementation when the increment is one. Does this give any significant speedup on a typical computer? My guess is that it does not. If so, why do these level-1 subroutines use such an implementation?
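To make the question concrete, here is a minimal sketch in C of the two variants I have in mind for the unit-increment case. The function names daxpy_simple and daxpy_unrolled and the unroll factor of 4 are my own choices for illustration, not the actual BLAS source.

```c
#include <stddef.h>

/* Straightforward version: y := a*x + y with unit increment. */
void daxpy_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Manually unrolled version (hypothetical; unroll factor 4 is an assumption).
   A short cleanup loop handles the first n % 4 elements, then the main
   loop processes four elements per iteration. */
void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;

    for (size_t i = 0; i < m; ++i)
        y[i] += a * x[i];

    for (size_t i = m; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both functions should produce the same result. My suspicion is that with optimization enabled a modern compiler may unroll or vectorize the simple loop on its own, so the manual version might not be faster, but I have not measured it.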
Many thanks

