We now consider the quality of the rankings produced using most of the machine characteristics summarized in Table 6: peak flops; interprocessor network bandwidth; interprocessor network latency; and the bandwidths of strided and of random accesses to L1 cache, to L2 cache, and to main memory.
We did not generate rankings based on the bandwidth of strided and random accesses to L3 cache because not all of the machines have L3 caches. All these characteristics are measured by simple benchmark probes, also summarized in Table 6. The fundamental difference between strided and random memory references is that the former are predictable, and thus prefetchable. Because random memory references are not predictable, the bandwidth of random accesses actually reflects the latency of an access to the specified level of the memory hierarchy.
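The strided-versus-random distinction can be illustrated with a pointer-chase, in the style of classic memory-probe microbenchmarks. The sketch below is not one of the actual probes summarized in Table 6; it is a minimal Python illustration (a real probe would be written in C, since interpreter overhead masks most hardware effects). Each load in the chain depends on the previous one, so when the chain is a random permutation the achieved rate is bounded by access latency rather than by peak bandwidth, which is why the random-access bandwidth reflects latency.

```python
import random
import time

def chase(next_index, steps):
    """Follow a chain of dependent loads: each access must complete
    before the address of the next is known, so the achieved rate is
    bounded by access latency, not peak bandwidth."""
    i = 0
    for _ in range(steps):
        i = next_index[i]
    return i

def make_chains(n, seed=0):
    """Build a sequential chain (predictable, hence prefetchable) and a
    random single-cycle chain (unpredictable) over the same n elements."""
    sequential = [(i + 1) % n for i in range(n)]
    order = list(range(n))
    random.Random(seed).shuffle(order)
    rand_chain = [0] * n
    for k in range(n):
        rand_chain[order[k]] = order[(k + 1) % n]
    return sequential, rand_chain

if __name__ == "__main__":
    n = 1 << 20  # chosen larger than typical caches so the random chain misses often
    sequential, rand_chain = make_chains(n)
    for name, chain in [("strided", sequential), ("random", rand_chain)]:
        t0 = time.perf_counter()
        chase(chain, n)
        print(f"{name:8s} {n / (time.perf_counter() - t0):.0f} accesses/s")
```

On real hardware (with a C implementation of the same idea), the strided rate far exceeds the random rate once the array no longer fits in cache; the gap at each array size is what distinguishes the per-level bandwidths in Table 6.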
Table 1 sums the number of thresholded inversions over all applications and processor counts. Because each application is run on a different set of processor counts, and not every case has been run on every machine, the numbers in Table 1 should not be compared across applications, but only across the rankings generated by different machine characteristics for a single application.
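A thresholded-inversion count of this kind can be sketched as follows. This is an illustrative reading, not the paper's exact definition: it assumes an inversion is a pair of machines that a characteristic ranks in the opposite order from observed runtime, with pairs whose metric values or runtimes are within relative tolerances (here called `alpha` and `beta`) treated as ties and skipped.

```python
from itertools import combinations

def thresholded_inversions(metric, runtime, alpha=0.01, beta=0.001):
    """Count machine pairs that the metric ranks in the opposite order
    from observed runtime, ignoring near-ties.

    metric[m]   -- machine characteristic for machine m (higher = better)
    runtime[m]  -- observed runtime for machine m (lower = better)
    alpha, beta -- relative tolerances on the metric and runtime sides;
                   pairs closer than these count as ties, not inversions.
    """
    inversions = 0
    for a, b in combinations(metric, 2):
        # Skip near-ties on either side.
        if abs(metric[a] - metric[b]) <= alpha * max(metric[a], metric[b]):
            continue
        if abs(runtime[a] - runtime[b]) <= beta * max(runtime[a], runtime[b]):
            continue
        # Inversion: the metric says a is better, but a ran slower (or vice versa).
        if (metric[a] > metric[b]) != (runtime[a] < runtime[b]):
            inversions += 1
    return inversions
```

Summing this count over every (application, processor count) case for one characteristic yields one column of Table 1.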
The last row of Table 1 shows that the bandwidth of strided accesses to main memory provides the single best overall ranking, with 309 total thresholded inversions. The ranking generated by the bandwidth of random accesses to L1 cache is a close second. However, the data also show that no single ranking is optimal for all applications: although the bandwidth of strided accesses to main memory is nearly perfect for avus, and does very well on wrf and cth7, it is outperformed by the bandwidth of both strided and random accesses to L1 cache for ranking performance on gamess. It is also worth noting that flops is the best predictor of rank only on lammps. One interpretation of the data is that these applications fall into three categories:
- codes dominated by time to perform floating-point operations,
- codes dominated by time to access main memory,
- and codes dominated by time to access L1 cache.
| Metric    | l1(1) | l1(r) | l2(1) | l2(r) | mm(1) | mm(r) | nw lat | nw bw | flops |
|-----------|-------|-------|-------|-------|-------|-------|--------|-------|-------|
| avus      | 51    | 26    | 44    | 42    | 1     | 61    | 19     | 30    | 22    |
| cth7      | 32    | 18    | 30    | 82    | 21    | 117   | 63     | 37    | 35    |
| gamess    | 25    | 16    | 40    | 55    | 48    | 76    | 65     | 35    | 25    |
| hycom     | 26    | 10    | 26    | 83    | 17    | 126   | 65     | 28    | 35    |
| lammps    | 136   | 107   | 133   | 93    | 80    | 157   | 95     | 116   | 68    |
| oocore    | 44    | 31    | 56    | 71    | 61    | 91    | 75     | 50    | 52    |
| overflow2 | 71    | 39    | 79    | 91    | 47    | 104   | 108    | 81    | 44    |
| wrf       | 99    | 63    | 92    | 134   | 34    | 203   | 103    | 83    | 60    |
| Total     | 484   | 310   | 500   | 651   | 309   | 935   | 593    | 460   | 341   |
If we ask how well any fixed ranking can do for these particular applications, an exhaustive search through the space of viable rankings reveals a ranking whose total number of inversions is 195, with α = .01 and β = 0.17. Although this choice of β is stricter than using β = .001, it is unavoidable since it is unclear what the metric values should be. While this optimal ranking is significantly better than any of the rankings in Table 1 generated from single machine characteristics, an exhaustive search is an unrealistic methodology. In addition to its cost, adding a new machine or application would require re-evaluating the ranking; no intuition can be gained from the ranking; and the result does not correspond to any easy-to-observe characteristic of machines or applications.
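Such a search can be sketched as brute force over all orderings of the machines. The helper names below are hypothetical, and the inversion rule is the same illustrative thresholded-pair reading used above (runtime differences within a relative tolerance `beta` are ignored); the point is the cost, which grows as n! in the number of machines, times the per-ranking comparison work over every case.

```python
from itertools import combinations, permutations

def inversions_vs_runtime(ranking, runtime, beta=0.001):
    """Inversions of a fixed ranking (best machine first) against one
    case's observed runtimes, ignoring near-tied runtimes."""
    pos = {m: i for i, m in enumerate(ranking)}
    count = 0
    for a, b in combinations(ranking, 2):
        if abs(runtime[a] - runtime[b]) <= beta * max(runtime[a], runtime[b]):
            continue
        # Inversion: ranking places a ahead of b, but a ran slower (or vice versa).
        if (pos[a] < pos[b]) != (runtime[a] < runtime[b]):
            count += 1
    return count

def best_fixed_ranking(machines, runtimes_per_case, beta=0.001):
    """Brute-force the single ranking minimizing total inversions.
    runtimes_per_case: one {machine: runtime} dict per
    (application, processor count) case. Cost: O(n! * cases * n^2),
    which is why exhaustive search does not scale."""
    return min(
        permutations(machines),
        key=lambda r: sum(inversions_vs_runtime(r, rt, beta)
                          for rt in runtimes_per_case),
    )
```

Adding one machine multiplies the number of candidate rankings by n+1, and the winning permutation carries no explanation of why it wins, which is the intuition behind the objections above.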