
The next logical step is to combine these two optimizations into a single routine that is both blocked and unrolled; you will be asked to do this.

The final example tries to extract the core of the BLAS dgemm matrix-multiply routine. The blocking and unrolling are retained; the additional trick is to optimize the innermost loop so that it references only items that lie in a column, and nothing else. To that end, B is copied and transposed into the temporary matrix T, with T(k,i) = B(i,k). Multiplying B(i,k)*C(k,j) is then equivalent to multiplying T(k,i)*C(k,j); notice that k is now the row index of both factors, so the inner loop walks down a column of each. Also, we do not accumulate the result directly with A(i,j)=A(i,j)+B(i,k)*C(k,j) but in a temporary variable, T1=T1+T(k,i)*C(k,j). The effect is that the inner k-loop has no extraneous references. After the inner loop has executed, A(i,j) is set to its correct value.
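The transpose-and-accumulate idea described above can be sketched as follows. This is a minimal illustration in C, not the routine from the slide or the actual dgemm source: the function name matmul_transposed and the IDX macro are hypothetical, the matrices are assumed to be square and stored column-major (as in Fortran), and the blocking and unrolling are left out so that only the inner-loop trick is visible.

```c
#include <assert.h>
#include <stdlib.h>

/* Column-major indexing, as in Fortran: element (i,j) of an n-by-n matrix.
   (Hypothetical helper, introduced only for this sketch.) */
#define IDX(i, j, n) ((size_t)(j) * (size_t)(n) + (size_t)(i))

/* Computes A = A + B*C for n-by-n column-major matrices.
   Returns 0 on success, -1 if the temporary matrix cannot be allocated. */
int matmul_transposed(int n, double *A, const double *B, const double *C)
{
    assert(n >= 0);
    if (n == 0)
        return 0;

    double *T = malloc((size_t)n * (size_t)n * sizeof *T);
    if (T == NULL)
        return -1;

    /* Copy and transpose: T(k,i) = B(i,k). */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            T[IDX(k, i, n)] = B[IDX(i, k, n)];

    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            /* Accumulate in a scalar so the k-loop touches only
               column i of T and column j of C, both with stride 1. */
            double t1 = 0.0;
            for (int k = 0; k < n; k++)
                t1 += T[IDX(k, i, n)] * C[IDX(k, j, n)];

            /* Only after the inner loop is A(i,j) updated. */
            A[IDX(i, j, n)] += t1;
        }
    }

    free(T);
    return 0;
}
```

Keeping the running sum in a scalar gives the compiler a chance to hold it in a register for the whole k-loop, which is presumably the point of the temporary T1 in the text.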
