We now consider the quality of the rankings produced using most of the machine characteristics summarized in Table 6: peak flops; interprocessor network bandwidth; interprocessor network latency; and the bandwidths of strided and of random accesses to L1 cache, to L2 cache, and to main memory.
We did not generate rankings based on the bandwidth of strided and random accesses to L3 cache because not all of the machines have L3 caches. All these characteristics are measured by simple benchmark probes, also summarized in Table 6. The fundamental difference between strided and random memory references is that the former are predictable, and thus prefetchable. Because random memory references are not predictable, the bandwidth of random accesses actually reflects the latency of an access to the specified level of the memory hierarchy.
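The strided-versus-random distinction can be illustrated with a pointer-chase, in the style of classic memory-probe microbenchmarks. The sketch below is not one of the actual probes summarized in Table 6; it is a minimal Python illustration (a real probe would be written in C, since interpreter overhead masks most hardware effects). Each load in the chain depends on the previous one, so when the chain is a random permutation the achieved rate is bounded by access latency rather than by peak bandwidth, which is why the random-access bandwidth reflects latency.

```python
import random
import time

def chase(next_index, steps):
    """Follow a chain of dependent loads: each access must complete
    before the address of the next is known, so the achieved rate is
    bounded by access latency, not peak bandwidth."""
    i = 0
    for _ in range(steps):
        i = next_index[i]
    return i

def make_chains(n, seed=0):
    """Build a sequential chain (predictable, hence prefetchable) and a
    random single-cycle chain (unpredictable) over the same n elements."""
    sequential = [(i + 1) % n for i in range(n)]
    order = list(range(n))
    random.Random(seed).shuffle(order)
    rand_chain = [0] * n
    for k in range(n):
        rand_chain[order[k]] = order[(k + 1) % n]
    return sequential, rand_chain

if __name__ == "__main__":
    n = 1 << 20  # chosen larger than typical caches so the random chain misses often
    sequential, rand_chain = make_chains(n)
    for name, chain in [("strided", sequential), ("random", rand_chain)]:
        t0 = time.perf_counter()
        chase(chain, n)
        print(f"{name:8s} {n / (time.perf_counter() - t0):.0f} accesses/s")
```

On real hardware (with a C implementation of the same idea), the strided rate far exceeds the random rate once the array no longer fits in cache; the gap at each array size is what distinguishes the per-level bandwidths in Table 6.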
Table 1 sums the number of thresholded inversions over all applications and processor counts. Because each application is run on a different set of processor counts, and not every case has been run on every machine, the numbers in Table 1 should not be compared across applications, but only across the rankings generated by different machine characteristics for a single application.
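A thresholded-inversion count of this kind can be sketched as follows. This is an illustrative reading, not the paper's exact definition: it assumes an inversion is a pair of machines that a characteristic ranks in the opposite order from observed runtime, with pairs whose metric values or runtimes are within relative tolerances (here called `alpha` and `beta`) treated as ties and skipped.

```python
from itertools import combinations

def thresholded_inversions(metric, runtime, alpha=0.01, beta=0.001):
    """Count machine pairs that the metric ranks in the opposite order
    from observed runtime, ignoring near-ties.

    metric[m]   -- machine characteristic for machine m (higher = better)
    runtime[m]  -- observed runtime for machine m (lower = better)
    alpha, beta -- relative tolerances on the metric and runtime sides;
                   pairs closer than these count as ties, not inversions.
    """
    inversions = 0
    for a, b in combinations(metric, 2):
        # Skip near-ties on either side.
        if abs(metric[a] - metric[b]) <= alpha * max(metric[a], metric[b]):
            continue
        if abs(runtime[a] - runtime[b]) <= beta * max(runtime[a], runtime[b]):
            continue
        # Inversion: the metric says a is better, but a ran slower (or vice versa).
        if (metric[a] > metric[b]) != (runtime[a] < runtime[b]):
            inversions += 1
    return inversions
```

Summing this count over every (application, processor count) case for one characteristic yields one column of Table 1.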
The last row of Table 1 shows that the bandwidth of strided accesses to main memory provides the single best overall ranking, with 309 total thresholded inversions. The ranking generated by the bandwidth of random accesses to L1 cache is a close second. However, the data also show that no single ranking is optimal for all applications: although the bandwidth of strided accesses to main memory is nearly perfect for avus, and does very well on wrf and cth7, it is outperformed by the bandwidth of both strided and random accesses to L1 cache for ranking performance on gamess. It is also worth noting that flops is the best predictor of rank only on lammps. One interpretation of the data is that these applications fall into three categories:
- codes dominated by time to perform floating-point operations,
- codes dominated by time to access main memory,
- and codes dominated by time to access L1 cache.
| Metric    | l1(1) | l1(r) | l2(1) | l2(r) | mm(1) | mm(r) | nw lat | nw bw | flops |
|-----------|-------|-------|-------|-------|-------|-------|--------|-------|-------|
| avus      | 51    | 26    | 44    | 42    | 1     | 61    | 19     | 30    | 22    |
| cth7      | 32    | 18    | 30    | 82    | 21    | 117   | 63     | 37    | 35    |
| gamess    | 25    | 16    | 40    | 55    | 48    | 76    | 65     | 35    | 25    |
| hycom     | 26    | 10    | 26    | 83    | 17    | 126   | 65     | 28    | 35    |
| lammps    | 136   | 107   | 133   | 93    | 80    | 157   | 95     | 116   | 68    |
| oocore    | 44    | 31    | 56    | 71    | 61    | 91    | 75     | 50    | 52    |
| overflow2 | 71    | 39    | 79    | 91    | 47    | 104   | 108    | 81    | 44    |
| wrf       | 99    | 63    | 92    | 134   | 34    | 203   | 103    | 83    | 60    |
| Total     | 484   | 310   | 500   | 651   | 309   | 935   | 593    | 460   | 341   |
If we ask how well any fixed ranking can do for these particular applications, an exhaustive search through the space of viable rankings reveals a ranking whose total number of inversions is 195, with α = .01 and β = 0.17. Although this choice of β is stricter than using β = .001, it is unavoidable since it is unclear what the metric values should be. While this optimal ranking is significantly better than any of the rankings in Table 1 generated from single machine characteristics, an exhaustive search is an unrealistic methodology. In addition to its cost, adding a new machine or application would require re-evaluating the ranking; no intuition can be gained from the ranking; and the result does not correspond to any easy-to-observe characteristic of machines or applications.
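Such a search can be sketched as brute force over all orderings of the machines. The helper names below are hypothetical, and the inversion rule is the same illustrative thresholded-pair reading used above (runtime differences within a relative tolerance `beta` are ignored); the point is the cost, which grows as n! in the number of machines, times the per-ranking comparison work over every case.

```python
from itertools import combinations, permutations

def inversions_vs_runtime(ranking, runtime, beta=0.001):
    """Inversions of a fixed ranking (best machine first) against one
    case's observed runtimes, ignoring near-tied runtimes."""
    pos = {m: i for i, m in enumerate(ranking)}
    count = 0
    for a, b in combinations(ranking, 2):
        if abs(runtime[a] - runtime[b]) <= beta * max(runtime[a], runtime[b]):
            continue
        # Inversion: ranking places a ahead of b, but a ran slower (or vice versa).
        if (pos[a] < pos[b]) != (runtime[a] < runtime[b]):
            count += 1
    return count

def best_fixed_ranking(machines, runtimes_per_case, beta=0.001):
    """Brute-force the single ranking minimizing total inversions.
    runtimes_per_case: one {machine: runtime} dict per
    (application, processor count) case. Cost: O(n! * cases * n^2),
    which is why exhaustive search does not scale."""
    return min(
        permutations(machines),
        key=lambda r: sum(inversions_vs_runtime(r, rt, beta)
                          for rt in runtimes_per_case),
    )
```

Adding one machine multiplies the number of candidate rankings by n+1, and the winning permutation carries no explanation of why it wins, which is the intuition behind the objections above.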