Exercise 2: Matrix-Matrix Multiplication Optimization Using Blocking and Unrolling of Loops
Purpose: This exercise shows how to subdivide data into blocks and how to unroll loops. Subdividing data into blocks helps the working set fit into cache memory, so that data is reused before it is evicted. Unrolling loops decreases the number of branch instructions executed. Either technique can improve performance, although neither is guaranteed to. A final example shows how matrix multiplication performance can be improved by combining blocking, loop unrolling, temporary variables, and controlled access patterns.
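As a rough illustration of the techniques named above, the following C sketch computes the update A = A + B * C for square n-by-n matrices in two ways: with plain loops, and with the k and j loops split into blocks and the inner i loop unrolled by four. It is only a sketch under stated assumptions, not the code used in this exercise: the function names, the IDX macro, the column-major storage layout, the block size parameter bs, and the unroll factor of four are all choices made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of element (i, j) in an n-by-n matrix
   stored in column-major order (an assumption of this sketch). */
#define IDX(i, j, n) ((size_t)(j) * (size_t)(n) + (size_t)(i))

/* Reference version: A = A + B * C with plain kji loops.
   The arrays a, b and c must not overlap. */
void matmul_kji(int n, double *a, const double *b, const double *c)
{
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++) {
            double ckj = c[IDX(k, j, n)];   /* temporary: constant in the i loop */
            for (int i = 0; i < n; i++)
                a[IDX(i, j, n)] += b[IDX(i, k, n)] * ckj;
        }
}

/* Same update with the k and j loops split into blocks of size bs
   and the inner i loop unrolled by 4. */
void matmul_blocked_unrolled(int n, int bs, double *a,
                             const double *b, const double *c)
{
    assert(bs > 0);
    for (int kk = 0; kk < n; kk += bs) {
        int kmax = (kk + bs < n) ? kk + bs : n;
        for (int jj = 0; jj < n; jj += bs) {
            int jmax = (jj + bs < n) ? jj + bs : n;
            for (int k = kk; k < kmax; k++)
                for (int j = jj; j < jmax; j++) {
                    double ckj = c[IDX(k, j, n)];
                    double *acol = &a[IDX(0, j, n)];        /* column j of A */
                    const double *bcol = &b[IDX(0, k, n)];  /* column k of B */
                    int i = 0;
                    /* Unrolled body: four updates per loop branch. */
                    for (; i + 3 < n; i += 4) {
                        acol[i]     += bcol[i]     * ckj;
                        acol[i + 1] += bcol[i + 1] * ckj;
                        acol[i + 2] += bcol[i + 2] * ckj;
                        acol[i + 3] += bcol[i + 3] * ckj;
                    }
                    /* Cleanup loop for the rows left over when n % 4 != 0. */
                    for (; i < n; i++)
                        acol[i] += bcol[i] * ckj;
                }
        }
    }
}
```

Both functions add the products for each element of A in the same order of k, so they should produce identical results. Whether the second version is actually faster depends on the matrix size, the cache sizes, and what the compiler already does on its own, so bs and the unroll factor are worth timing rather than assuming.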
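Information: The matrix multiplication A = A + B * C can be executed using the simple code segment below. Its kji loop ordering should be one of the best access orderings among the six possible orderings of the simple i, j, k loops. With column-major storage (as in Fortran), the innermost i loop then moves down columns of A and B with unit stride.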