Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor