| Application | l1(1, r) | mm(1, r) | mm(1), l1(r) |
|---|---|---|---|
| avus | 12 | 21 | 9 |
| cth7 | 14 | 80 | 9 |
| gamess | 16 | 77 | 26 |
| hycom | 2 | 44 | 2 |
| lammps | 107 | 148 | 81 |
| oocore | 31 | 100 | 44 |
| overflow2 | 34 | 78 | 34 |
| wrf | 63 | 158 | 44 |
| overall sum | 279 | 706 | 249 |
Table 2 shows the results of these experiments. Each column gives the number of thresholded inversions for each of the eight applications under the specified choice of strided and random access bandwidths. Within each column, the results use the application whose m1, mr, and f led to the smallest number of inversions. With the L1 bandwidths, the best application was overflow2; with the main memory bandwidths, the best application was oocore; and with the combination of L1 and main memory bandwidths, avus and hycom produced equally good rankings.
Comparing the results in Table 2 to those in Table 1, we see that partitioning the memory accesses is useful as long as the random accesses are considered to hit in L1 cache. When we use only the bandwidth of accesses to main memory, the quality of the resulting order falls between that of the ranking based on the bandwidth of random accesses and that of the ranking based on the bandwidth of strided accesses to main memory. Using the bandwidth of random accesses to L1 cache alone did fairly well, but the ranking is improved by incorporating the bandwidth of strided accesses to L1 cache, and improved even more by incorporating the bandwidth of strided accesses to main memory.
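The inversion counts compared above can be sketched in code. The following is a minimal illustration, not the paper's exact procedure: the function name, the use of a relative runtime difference for the threshold, and the default threshold value are all assumptions. It counts machine pairs whose predicted order (from a bandwidth-based score) disagrees with their measured order, ignoring pairs whose measured runtimes are nearly tied.

```python
from itertools import combinations

def thresholded_inversions(scores, times, threshold=0.05):
    """Count pairs ranked in the wrong order by the predicted scores.

    scores: predicted performance score per machine (higher = faster).
    times:  measured runtimes per machine (lower = faster).
    A pair counts only when the measured runtimes differ by more than
    `threshold` (as a fraction of the smaller), so near-ties are ignored.
    """
    inversions = 0
    for i, j in combinations(range(len(times)), 2):
        lo, hi = sorted((times[i], times[j]))
        if (hi - lo) / lo <= threshold:
            continue  # measured runtimes too close to call
        # Predicted order disagrees with measured order when the score
        # difference and the (reversed) time difference have opposite signs.
        if (scores[i] - scores[j]) * (times[j] - times[i]) < 0:
            inversions += 1
    return inversions
```

For example, if the predicted scores order three machines exactly opposite to their measured runtimes, all three pairs count as inversions; a perfectly consistent prediction yields zero.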
We believe the combination of mm(1) and l1(r) works well because of a more general fact: applications with a large memory footprint and many strided accesses benefit from high bandwidth to main memory, because the whole cache line is used and prefetching further exploits the full main memory bandwidth. For many of these codes, main memory bandwidth is thus the limiting performance factor. On the other hand, applications with many random accesses waste most of each cache line, and these accesses do not benefit from prefetching. The performance of these codes is limited by the latency hiding capabilities of the machine's cache, which is captured by measuring the bandwidth of random accesses to L1 cache.