While working on LDL factorization of symmetric indefinite matrices, it came to my attention that some key LAPACK routines such as *lasyf and *lahef could benefit greatly from a routine that computes only the lower/upper triangular part of the product C = L * W^T of two rectangular matrices. Such products arise naturally in LDL factorizations, where W = L * D and D is a symmetric matrix consisting only of 1-by-1 and 2-by-2 diagonal blocks. Since D is symmetric, C = L * (L * D)^T = L * D * L^T is itself symmetric, although this is not obvious from the form of C as defined.
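To spell out the symmetry argument (just restating the definitions above):

```latex
C \;=\; L\,W^{T} \;=\; L\,(L D)^{T} \;=\; L\,D^{T} L^{T} \;=\; L\,D\,L^{T},
\qquad\text{since } D^{T} = D,
```

and therefore C^T = (L D L^T)^T = L D L^T = C, so only one triangle of C needs to be computed and stored.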
Typically C is computed in LAPACK routines through blocking, with *GEMV called inside a loop to compute the diagonal part of each block. This is clearly inefficient: a level 3 BLAS routine would provide a much better FLOP-to-data-access ratio and much better performance overall.
It is also apparent that some sparse LDL solvers (including, I believe, MA57) would benefit greatly from a new BLAS routine that computes C directly in a single BLAS 3 call.
It turns out that BLAST already proposes a routine called SY_TRIDIAG_R2K, which computes the product L * D * L^T. However, it is not clear how efficient an implementation of such a function would be, nor how it would fit into the LAPACK framework: L * D is already precomputed inside the *lasyf and *lahef routines, and all that is needed is the product of L and (L * D)^T, so SY_TRIDIAG_R2K would simply waste time redoing that computation. Moreover, implementing the proposed routine C = op(A) * op(B) is trivial given an implementation of *SYRK, since it requires only very minor modifications. From my own experience, it took me about 25 minutes to write it once I had my own implementation of *SYRK. The kernels are exactly the same: only the arguments are different.
It would be nice if this were investigated as part of PLASMA development. I believe that *lasyf and *lahef routines modified this way would have performance quite close to that of the blocked Cholesky factorization.