I am trying to understand the block Cholesky (Level 3 BLAS) algorithm as implemented in LAPACK's DPOTRF. The explanation at http://www.netlib.org/utk/papers/factor/node9.html is quite clear and describes three steps per block iteration:
1. DPOTF2 (compute L11)
2. DTRSM (compute L21)
3. DSYRK (update A22)
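
Written out for one block step, this is how I read those three steps, using a 2-by-2 block partition of the symmetric positive definite matrix A (a sketch in my own notation, not a quote from the page):

$$
\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
$$

$$
A_{11} = L_{11} L_{11}^T, \qquad
L_{21} = A_{21} L_{11}^{-T}, \qquad
A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T
$$

So DPOTF2 factors the diagonal block, DTRSM solves for the panel $L_{21}$ below it, and DSYRK updates the trailing block $A_{22}$, on which the same procedure is then repeated.
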
That website appears to be based on the 1996 publication:
Choi, J., Dongarra, J. J., Ostrouchov, L. S., Petitet, A. P., Walker, D. W., & Whaley, R. C. (1996). Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5(3), 173-184.
However, the actual implementation of DPOTRF uses a seemingly different approach:
1. DSYRK
2. DPOTF2
3. DGEMM
4. DTRSM
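
To make the comparison concrete, here is a minimal pure-Python sketch of both orderings as I currently understand them. It is only an illustration under my own assumptions: `potf2` and `trsm` are simplified stand-ins named after the LAPACK/BLAS routines, `update`, `lower`, `cholesky_right_looking` and `cholesky_left_looking` are hypothetical names of mine, and the mapping of the four calls onto index ranges is my reading of the lower-triangular branch of DPOTRF, not something I found documented.

```python
import math


def potf2(a, j, jb):
    """Unblocked Cholesky of the jb-by-jb diagonal block at (j, j), in place."""
    for c in range(j, j + jb):
        d = a[c][c] - sum(a[c][k] ** 2 for k in range(j, c))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[c][c] = math.sqrt(d)
        for r in range(c + 1, j + jb):
            s = sum(a[r][k] * a[c][k] for k in range(j, c))
            a[r][c] = (a[r][c] - s) / a[c][c]


def trsm(a, j, jb):
    """Panel solve: overwrite the rows below the diagonal block with
    L21 = A21 * inv(L11)^T, one column of the panel at a time."""
    for r in range(j + jb, len(a)):
        for c in range(j, j + jb):
            s = sum(a[r][k] * a[c][k] for k in range(j, c))
            a[r][c] = (a[r][c] - s) / a[c][c]


def update(a, rows, cols, ks, lower_only):
    """a[r][c] -= sum over k in ks of a[r][k] * a[c][k].

    With lower_only=True only entries with c <= r are touched (the role
    DSYRK plays); otherwise the whole block is updated (the role of DGEMM).
    """
    for r in rows:
        for c in cols:
            if lower_only and c > r:
                continue
            a[r][c] -= sum(a[r][k] * a[c][k] for k in ks)


def lower(a):
    """Return the lower triangle of a, with zeros above the diagonal."""
    n = len(a)
    return [[a[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]


def cholesky_right_looking(mat, nb):
    """Three steps per block: DPOTF2, DTRSM, then DSYRK on the trailing block."""
    a = [row[:] for row in mat]
    n = len(a)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        rest = range(j + jb, n)
        potf2(a, j, jb)                                # factor A11
        trsm(a, j, jb)                                 # L21
        update(a, rest, rest, range(j, j + jb), True)  # A22 -= L21 * L21^T
    return lower(a)


def cholesky_left_looking(mat, nb):
    """Four steps per block, in the DSYRK, DPOTF2, DGEMM, DTRSM order."""
    a = [row[:] for row in mat]
    n = len(a)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        diag, below, done = range(j, j + jb), range(j + jb, n), range(0, j)
        update(a, diag, diag, done, True)    # DSYRK: apply finished columns to the diagonal block
        potf2(a, j, jb)                      # DPOTF2: factor the diagonal block
        update(a, below, diag, done, False)  # DGEMM: apply finished columns to the panel below
        trsm(a, j, jb)                       # DTRSM: solve for the panel
    return lower(a)


A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
print(cholesky_right_looking(A, 2))  # [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
print(cholesky_left_looking(A, 2))   # same factor
```

Both functions return the same factor L. The only difference I can see is when the updates are applied: the three-step version pushes each finished panel into the whole trailing block immediately, while the four-step version delays the updates and applies all previously finished columns to the current block column just before it is factored.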
I was unable to find an explanation of these steps. However, I did find an old LAPACK version of DPOTRF from 1993, which predates the 1996 publication and already uses the four steps above. Does anyone know of a reference describing the block Cholesky algorithm used in DPOTRF, and whether these four steps are more efficient than the three-step algorithm in the 1996 publication?

