CTWatch
November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Erich Strohmaier, Lawrence Berkeley National Laboratory

5. Serial Execution

As a first example, I calculate the PC values of my four performance models for the serial performance measurements of Apex-Map.2,3 The memory size used was 512 MB, and I swept across 10 α and 17 L values. Figure 2 shows that for model 0 performance variation is highest on the vector processors and lowest on the PowerPC processor in BlueGene/L, which has a relatively flat memory hierarchy compared to the other superscalar processors. This model reflects the attitude of a programmer who does not take performance-relevant features into consideration in his or her coding style.

The same is basically true for model 1. Complexity values are unchanged on the vector processors. PC is slightly reduced for the superscalar processors, but not as much as we might expect. Inspection of the residual errors indicates that this is in part due to not considering the effects of fractional cache-line usage for very short L values, which greatly increases the error in this parameter region. This model also does not capture more advanced memory access features of modern superscalar processors, such as pre-fetching.
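The cache-line effect mentioned above is simple arithmetic. As an illustration only (the line length of 8 words and the traffic-overhead formula are my assumptions, not part of the article's models), a contiguous stream of L words touches ceil(L / words_per_line) full cache lines, so very short streams pay for far more memory traffic than they actually use:

```python
import math

def line_traffic_overhead(L, words_per_line=8):
    """Ratio of words transferred to words actually used for a
    contiguous stream of L words with a fixed cache-line length.
    An 8-word line is an illustrative assumption."""
    lines_touched = math.ceil(L / words_per_line)
    return lines_touched * words_per_line / L

for L in (1, 3, 8, 9, 64):
    print(L, line_traffic_overhead(L))
# L=1 transfers 8x the data it uses; L=8 and L=64 use every word (1.0);
# L=9 transfers 16 words for 9 used (~1.78x).
```

A model that charges one full cache-line transfer per L words, regardless of L, will therefore underestimate cost most severely exactly in the short-L region, matching the residual-error pattern described above.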


Figure 2. PC values derived with three simple models for serial execution and based on efficiencies [accesses/cycle].


Figure 3. PC values derived with the combined model 3 for serial execution and based on efficiencies [accesses/cycle].

PC values for model 2, which accounts for such effects, are overall substantially better than for the previous models. While this is no surprise at all on the vector systems, it is a strong indication of the importance of features such as pre-fetching for the performance of superscalar architectures. This model represents a programmer who is willing to structure his or her code with long loops to improve performance.


Figure 4. Examples for residual squared errors (SE) of model 3 for Itanium, Opteron, and Cray MSP. Note the differences in scale.

PC values for the combined model 3 are shown in Figure 3, where I switched to a logarithmic vertical scale to make them visible. Clearly, the performance of Apex-MAP is overall much better resolved by this model. This model reflects a programmer who codes for long loops as well as for high cache re-use, where the latter feature unfortunately is not easily controllable. The final ranking of processors by PC value shows the NEC SX8 with the lowest complexity, followed by the Cray X1 in MSP and then SSP mode, Opteron, Power3, Power5, Power4, Xeon, PowerPC, and Itanium. Different individuals and groups interviewed have produced very similar rankings of these processors on several occasions, which indicates that my methodology can produce PC values that fit intuitive expectations well.

The only processor with a highly unresolved PC is the Itanium. Inspecting the residual errors in Figure 4, we see high errors in the medium range of loop lengths, which indicates more complex behavior, perhaps due to peculiarities of pre-fetching. The residual errors for the Opteron and the Cray MSP are on a much smaller scale. For the Opteron, there is no discernible structure, while the residuals on the X1 are somewhat larger in the range where the vector length equals the vector register length.


Figure 5. Back-fitted latency and gap values for both memory hierarchies in model 3.

The fitted latency l and gap g parameters are shown in Figure 5 and show little indication of a memory hierarchy on the vector processors, which is no surprise, as the SX8 has none and the E-cache in the X1 has only a minor performance impact in most situations. To fit the models well, effective cache sizes were selected as 256 kB for the Xeon, Itanium, Opteron, PowerPC, and Cray X1, and 2 MB for the Power3, Power4, and Power5.
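The back-fitting of latency and gap values can be sketched as an ordinary least-squares fit. The article does not give the model equations in this excerpt, so the cost form below, T(L) = l + g·(L − 1) for a stream of L accesses, is a simplified assumption standing in for one level of the hierarchy; the actual combined model additionally weights two hierarchy levels by cache re-use:

```python
# Minimal sketch, assuming a single-level latency-gap cost model
#     T(L) = l + g * (L - 1)
# where L is the loop (vector) length. This is NOT the article's full
# model 3, which fits l and g separately for two memory hierarchies.

def fit_latency_gap(lengths, times):
    """Closed-form ordinary least-squares fit of T = l + g*(L-1).
    Returns (l, g): l is the startup latency, g the per-access gap."""
    xs = [L - 1 for L in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, times))
    g = sxt / sxx            # slope: incremental cost per extra access
    l = mean_t - g * mean_x  # intercept: cost of the first access
    return l, g

# Synthetic, noise-free timings from known parameters
# (l = 100 cycles, g = 2 cycles/access), so the fit recovers them exactly:
lengths = [1, 2, 4, 8, 16, 64, 256, 1024]
times = [100 + 2 * (L - 1) for L in lengths]
l, g = fit_latency_gap(lengths, times)
print(l, g)  # 100.0 2.0
```

On real measurements the residuals of such a fit are exactly the squared-error surfaces shown in Figure 4, and their structure reveals where the assumed cost model breaks down.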


Reference this article
"Performance Complexity: An Execution Time Metric to Characterize the Transparency and Complexity of Performance," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/performance-complexity-an-execution-time-metric-to-characterize-the-transparency-and-complexity-of-performance/
