Why is reference BLAS from Netlib faster in some cases?
When the matrices are small, I found that reference BLAS is much faster than the optimized BLAS libraries. By small I mean matrices smaller than about 30x30. I tested several libraries: reference BLAS, GotoBLAS, ACML, and ATLAS, and found that reference BLAS from Netlib is the fastest. Can anybody here tell me why? Thank you.
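
For context, a minimal sketch of how a small-matrix timing like this might be set up. It is not the actual test code: `naive_dgemm` is a hypothetical stand-in for a reference-style triple loop, `time_gemm` is an illustrative helper, and both names, the fill values, and the use of `clock()` are assumptions. To compare libraries, the call inside the timed loop would be swapped for each library's own `dgemm`.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for a reference-style DGEMM: C = A * B for
   n x n column-major matrices, plain triple loop, no blocking. */
void naive_dgemm(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i)
            C[i + j * n] = 0.0;
        for (int k = 0; k < n; ++k) {
            double b = B[k + j * n];
            for (int i = 0; i < n; ++i)
                C[i + j * n] += A[i + k * n] * b;
        }
    }
}

/* Average CPU seconds per call over `reps` calls on n x n matrices.
   Returns -1.0 if allocation fails. */
double time_gemm(int n, int reps)
{
    assert(n > 0 && reps > 0);

    size_t len = (size_t)n * (size_t)n;
    double *A = malloc(len * sizeof *A);
    double *B = malloc(len * sizeof *B);
    double *C = malloc(len * sizeof *C);
    if (!A || !B || !C) {
        free(A);
        free(B);
        free(C);
        return -1.0;
    }

    /* Arbitrary fill values; only the timing matters here. */
    for (size_t i = 0; i < len; ++i) {
        A[i] = 1.0;
        B[i] = 2.0;
    }

    /* Many repetitions, because one 30x30 multiply is far too
       short for the resolution of clock(). */
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        naive_dgemm(n, A, B, C);
    clock_t t1 = clock();

    free(A);
    free(B);
    free(C);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Calling `time_gemm(30, 100000)` would give an average time per multiply at the size in question. At these sizes a multiply is only a few tens of thousands of floating-point operations, so any fixed per-call cost shows up directly in the measured average.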