by Julien Langou » Tue Sep 01, 2015 9:26 pm
Quick explanation of why there is no subroutine in the BLAS so that x <- A*x
for a dense matrix A.
Observation: We cannot do the operation x <- A*x in place. We have to copy x in a temporaray
vector y and then perform x <- A*y. This is because, once we have computed the output value x(i) (for any i), we cannot reuse the input value x(i)
to compute the other output x(j) (for j not i). So we cannot do x <- A*x in place.
So, this leaves us to two choices to design a BLAS subroutine. (1) Either stick
with the idea that we need to create a subroutine such that x <- A*x. This
means that the BLAS will have to have a routine that allocates a temporary
workspace for y, copies y <- x, then calls the existing and useful x <- A*y,
then frees y. (2) Or we focus on performing x <- A*y and leave the tasks of
allocation, copy, free to user. The choice was (2) over (1) since it is easy
enough to get (1) from (2). The design choice for BLAS was to only implement
(2) and leave (1) to the user.
Comment #1: If A is triangular, then it is possible to implement x <- A*x in
place. If A is upper triangular, you first compute x(1) using x(1), x(2) . . .
x(n). And this is OK since x(1) is not needed to compute x(2) . . . x(n). Then
you compute x(2), then x(3), etc. This explains why the TRMV routine is by
design in-place.
Comment #2: This explanation about GEMV and TRMV holds for GEMM and TRMM. There
is no subroutine to perform A <- A*B in the BLAS. There is only GEMM of the
type A <- B*C. But there is TRMM such that A <- A*B when B is triangular or
A <- B*A when B is triangular. TRMM is in place by design.
Comment #3: If you really really are short in memory and have to implement say
A <- A*B where A is m-by-n, B is n-by-n, and n << m, you can perform the LU
factorization of B: B = P*L*U and then compute A <- (A*P)*L)*U). This way you
do not need to allocate an extra m-by-n workspace for a temporary C. You can
actually do A <- A*B in place. This is an extreme trick. I did use it a few
times in some iterative solvers where memory was scarce. For example, n is of
the order of 10, and m is of the order of 10^9, so the trade off is either
perform the LU factorization of 10-by-10 matrix B or allocate a 10^9-by-10
temporary array. Yes you may worry a little about stability with LU, so you can
use a more stable factorization for B like the QR factorization. That would
work similarly.
Julien