The second optimization demonstrates the effect of blocking: the matrices are processed in blocks small enough that, as far as possible, the data being worked on stays entirely in cache memory. Each loop is therefore split into blocks (ib is the beginning of an i block, ie is the end of an i block, and likewise for j and k), and each of the indices i, j, k runs from the beginning of its block to the end of that block. Start with blocks of size 32; if you wish, you can experiment with the block size to find the optimal value.
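The blocking scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the course's reference routine: it is written in C with a column-major index macro to mirror the A(i,j) notation, it assumes square n-by-n matrices, and the names `matmul_blocked`, `IDX`, `min_sz`, and `bs` are hypothetical. It also assumes the routine accumulates into A, so the caller zeroes A first.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index, mirroring the A(i,j) layout (0-based here). */
#define IDX(i, j, n) ((i) + (j) * (n))

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* A = A + B*C for n-by-n column-major matrices, processed in blocks
 * of at most bs-by-bs elements. Assumption: A is zeroed by the caller
 * if a plain product is wanted. */
void matmul_blocked(size_t n, size_t bs,
                    double *A, const double *B, const double *C)
{
    assert(n > 0 && bs > 0);

    for (size_t jb = 0; jb < n; jb += bs) {          /* jb: start of a j block */
        size_t je = min_sz(jb + bs, n);              /* je: end of that block  */
        for (size_t kb = 0; kb < n; kb += bs) {
            size_t ke = min_sz(kb + bs, n);
            for (size_t ib = 0; ib < n; ib += bs) {
                size_t ie = min_sz(ib + bs, n);

                /* Each index runs only from the start to the end of its block. */
                for (size_t j = jb; j < je; ++j)
                    for (size_t k = kb; k < ke; ++k)
                        for (size_t i = ib; i < ie; ++i)
                            A[IDX(i, j, n)] += B[IDX(i, k, n)] * C[IDX(k, j, n)];
            }
        }
    }
}
```

Calling `matmul_blocked(n, 32, A, B, C)` corresponds to the suggested starting block size; passing other values for `bs` is one way to run the block-size experiment. The `min_sz` clamp handles matrix sizes that are not a multiple of the block size.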
The next logical step is to combine these two optimizations into a routine that is both blocked and unrolled; you will be asked to do this.
The final example tries to extract the core of the BLAS dgemm matrix-multiply routine. The blocking and unrolling are retained, but the additional trick here is to optimize the innermost loop: make sure that it references only items that lie within a column, and nothing outside one. To that end, B is copied and transposed into the temporary matrix T, with T(k,i) = B(i,k). Multiplying B(i,k)*C(k,j) is then equivalent to multiplying T(k,i)*C(k,j) (notice that the k index now occurs only as the row index, so the inner loop runs down a column of each matrix). Also, the result is not stored directly as A(i,j)=A(i,j)+B(i,k)*C(k,j); instead it is accumulated in a temporary variable, T1=T1+T(k,i)*C(k,j). The effect is that the inner k-loop has no extraneous references. After the inner loop has executed, A(i,j) is set to its correct value.
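A minimal sketch of this inner-loop idea is given here, again in C with column-major indexing and under several assumptions: the matrices are square, all of B is transposed up front (a simplification chosen for brevity), unrolling is left out to keep the example short, and the names `matmul_transposed`, `IDX`, `min_sz`, `bs`, and `t1` are hypothetical. It is not the actual dgemm source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Column-major index, mirroring the A(i,j) layout (0-based here). */
#define IDX(i, j, n) ((i) + (j) * (n))

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* A = A + B*C for n-by-n column-major matrices, using a transposed
 * copy of B and a scalar accumulator in the inner loop.
 * Returns 0 on success, -1 if the temporary matrix cannot be allocated. */
int matmul_transposed(size_t n, size_t bs,
                      double *A, const double *B, const double *C)
{
    assert(n > 0 && bs > 0);

    double *T = malloc(n * n * sizeof *T);
    if (T == NULL)
        return -1;

    /* T(k,i) = B(i,k) */
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k)
            T[IDX(k, i, n)] = B[IDX(i, k, n)];

    for (size_t jb = 0; jb < n; jb += bs) {
        size_t je = min_sz(jb + bs, n);
        for (size_t kb = 0; kb < n; kb += bs) {
            size_t ke = min_sz(kb + bs, n);
            for (size_t ib = 0; ib < n; ib += bs) {
                size_t ie = min_sz(ib + bs, n);

                for (size_t j = jb; j < je; ++j) {
                    for (size_t i = ib; i < ie; ++i) {
                        double t1 = 0.0;   /* T1 in the text */
                        /* k walks down column i of T and column j of C;
                         * A is not touched inside this loop. */
                        for (size_t k = kb; k < ke; ++k)
                            t1 += T[IDX(k, i, n)] * C[IDX(k, j, n)];
                        /* Store once, after the inner loop has finished. */
                        A[IDX(i, j, n)] += t1;
                    }
                }
            }
        }
    }

    free(T);
    return 0;
}
```

Because the k range is blocked, each pass contributes a partial sum, so the store after the inner loop adds `t1` to A(i,j) rather than overwriting it; A(i,j) holds its final value once the last k block has been processed.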