Ideally, we would be able to rank all machines perfectly in an application-independent manner, using only machine characteristics. Although this may be impossible, we can still measure how close any given ranking comes to perfect. In this section, we describe a ranking metric based on thresholded inversions, identify the optimal ranking under that metric for our combination of machines and applications, and explore how close we can come to optimal using machine characteristics both individually and in combination.
We evaluate the quality of a ranking by the number of thresholded inversions that it contains. For example, consider a list of n systems m₁, m₂, …, mₙ, sorted from highest to lowest interprocessor network bandwidth, so that i < j means that the bandwidth achievable on system mᵢ is at least as high as the bandwidth achievable on system mⱼ. Given an application A and times t(1), t(2), …, t(n), where t(k) is the measured running time of A on machine mₖ, we would normally say that machines mᵢ and mⱼ are inverted if i < j but t(i) > t(j). Intuitively, all other things being equal, application A should run faster on the machine with the higher network bandwidth; if it does not, we call that pair an inversion. The number of inversions in the ranking, then, is the number of pairs of machines that are inverted [15]; in our example, this is the number of pairs for which interprocessor network bandwidth incorrectly predicts which machine runs A faster. Clearly this number is at least zero and no larger than n(n − 1)/2.
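As a concrete illustration (our sketch, not part of the original study), the following Python function counts plain, unthresholded inversions; it assumes the machines are already sorted from best to worst by the characteristic under test, so position in the list encodes the ranking.

```python
from itertools import combinations

def count_inversions(runtimes):
    """Count inverted pairs in a ranking.

    runtimes[i] is the measured runtime of the application on machine
    m_i, where machines are listed from best to worst according to the
    characteristic (e.g., network bandwidth) that defines the ranking.
    A pair (i, j) with i < j is inverted when the supposedly better
    machine is actually slower: t(i) > t(j).
    """
    return sum(1 for i, j in combinations(range(len(runtimes)), 2)
               if runtimes[i] > runtimes[j])

# Four machines ranked by bandwidth; at most n(n - 1)/2 = 6 inversions.
# The top-ranked machine (120 s) is slower than the next two machines,
# so the count is 2.
print(count_inversions([120.0, 95.0, 110.0, 130.0]))  # -> 2
```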
In practice, to allow for variance in the measured runtimes, we count only thresholded inversions. Given a threshold α satisfying 0 ≤ α ≤ 1, machines mᵢ and mⱼ are considered inverted only if mᵢ is predicted to be faster than mⱼ and t(i) > (1 + α)t(j). If α = 0, this reduces to the inversion count described in the previous paragraph. Choosing a small, nonzero α, however, lets us tolerate variations in timing measurements such as those discussed in [16].
Furthermore, to allow for variance in the benchmark timings that yield the machine metrics, we use a second threshold parameter β, where 0 ≤ β ≤ 1. Even if two machines mᵢ and mⱼ would be considered inverted by the criterion above because t(i) > (1 + α)t(j), we count the inversion only if the benchmark results (e.g., network bandwidths) also differ by more than a relative factor of β. In general, β should be chosen smaller than α because one expects less variance in benchmark times than in full application runtimes. In this paper we use α = 0.01, meaning that a difference of up to 1% in application runtimes is considered insignificant, and β = 0.001.
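Combining the two thresholds, a minimal sketch of the full metric might look like the following; `count_thresholded_inversions` is a hypothetical name of ours, and the sketch assumes higher benchmark scores are better (as with bandwidth).

```python
from itertools import combinations

def count_thresholded_inversions(scores, runtimes, alpha=0.01, beta=0.001):
    """Count thresholded inversions for a ranking defined by benchmark scores.

    scores[k]   -- machine m_k's benchmark result (higher is better,
                   e.g., interprocessor network bandwidth)
    runtimes[k] -- measured runtime of the application on m_k

    A pair counts as inverted only if (1) the benchmark predicts one
    machine to be faster by more than a relative margin beta, and
    (2) that machine is nevertheless slower by more than a relative
    margin alpha.
    """
    count = 0
    for i, j in combinations(range(len(scores)), 2):
        # Orient the pair so `hi` is the machine the benchmark favors.
        hi, lo = (i, j) if scores[i] >= scores[j] else (j, i)
        benchmark_gap = scores[hi] > (1 + beta) * scores[lo]      # differ by > beta
        runtime_flip = runtimes[hi] > (1 + alpha) * runtimes[lo]  # t(hi) > (1+alpha) t(lo)
        if benchmark_gap and runtime_flip:
            count += 1
    return count
```

With the paper's values α = 0.01 and β = 0.001, a pair whose runtimes differ by less than 1%, or whose benchmark scores differ by less than 0.1%, never counts as inverted.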
Note that a metric based on thresholded inversions is practical because it is monotonic, in the sense that adding another machine mₖ and its associated runtime t(k) cannot decrease the number of inversions in a ranking. Within our context of large parallel applications, this is useful because we have only partial runtime data: not all of the applications in Table 4 were run at all possible processor counts on all of the machines in Table 5. Some of these gaps are unavoidable because some systems do not have enough compute processors to run every application at every processor count. Moreover, in practice it is not unusual for timings collected at different times, on different machines, by different people to be incomplete in this way.
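Since missing measurements only remove pairs from consideration, partial runtime data can be accommodated by skipping any pair that lacks a timing; the count computed this way is a lower bound on what complete data would give, which is exactly the monotonicity property noted above. A sketch under that assumption (the None convention for missing timings is ours):

```python
from itertools import combinations

def count_inversions_partial(scores, runtimes, alpha=0.01, beta=0.001):
    """Thresholded inversions when some runtimes are missing (None).

    Pairs with a missing measurement contribute nothing, so the result
    can only grow as more timings are added -- the monotonicity that
    makes the metric usable on incomplete data.
    """
    count = 0
    for i, j in combinations(range(len(scores)), 2):
        if runtimes[i] is None or runtimes[j] is None:
            continue  # application never run on one of the machines
        hi, lo = (i, j) if scores[i] >= scores[j] else (j, i)
        if scores[hi] > (1 + beta) * scores[lo] and \
           runtimes[hi] > (1 + alpha) * runtimes[lo]:
            count += 1
    return count
```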