Hello.
Indeed, there is `no way` to do in-place matrix-matrix multiplication of the type A = A * B with A m-by-n and B n-by-n (for example) using a transformation (reordering of operations) of the standard algorithm.
If you are trying to minimize the memory requirement and use a transformation (reordering of operations) of the standard algorithm, then the best you can do is use the row-version of matrix-matrix multiplication ( for i=1:m, A(i,1:n) = A(i,1:n)*B(1:n,1:n); end; ) This requires a buffer of size n and regular copies from buffer to matrix and this can be done using GEMV. (As you wrote.)
Indeed another way to do in-place matrix-matrix multiplication is to do an LU factorization of B and then rely on the fact that TRMM can be done (and is done) in place. So a factorization of the type B=B1*B2 with B1 and B2 triangular and then A = ( A * B1 ) * B2. As far as stability, clearly we will want to use pivoting during the LU factorization and this is not a problem. We can also use a QR factorization and then rely on ORMQR (which can be done, and is done, in place) and TRMM.
Note that as far as stability goes, this kind of algorithm is normwise stable, as opposed to componentwise stable for matrix-matrix multiplication. So it is less stable than the standard algorithm. (Kind of the same `weakness` as Strassen's algorithm has for example.)
Note that B will either have to be overwritten by L and U, or we will need a buffer of size n^2 for B (in general n is not too big so not a big deal), or we can consider overwrite by L and U and reconstruct in-place in output (which sounds complicated and an overkill but possible with something like LAUUM).
I do not know many references for doing an LU factorization for enabling in-place matrix-matrix multiplication. One reference is my PhD thesis.

. See Section 2.4.3 (The LU–matrix–matrix product) on page 111. Let me know if there is a better reference.
Best wishes,
Julien.