For the purposes of this article, the performance impacts of various configuration options will be estimated using an analytical model with coefficients "tuned" to provide the best fit for a large subset of the SPECfp2000 and SPECfp_rate2000 results collected as of March 2006 [4]. The analysis included 508 SPECfp2000 results and 730 SPECfp_rate2000 results. The remaining 233 results were excluded from the analysis because either advanced compiler optimizations or unusual hardware configurations made them inappropriate for comparison with the bulk of the results.
The performance model has been described previously [5,6], but it has been extended here to include a much more complete set of data and has been applied to each of the 14 SPECfp2000 benchmarks as well as to the geometric mean values. Although the model does not capture some of the details of the performance characteristics of these benchmarks, using a least-squares fit to a large number of results greatly reduces the random "noise" associated with individual results and provides a significant degree of platform independence.
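As a rough illustration of how such coefficient tuning can be carried out, the sketch below fits a parametric execution-time model to a set of published results by least squares. The data values, the stand-in model form, and the coefficient names are all hypothetical; the actual model is described in the next paragraph.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical published results for one benchmark: each row is one system.
# Columns: peak GFLOP/s, memory-rate proxy, L2 cache size (MB), measured time (s).
# All numbers are made up for illustration only.
results = np.array([
    [4.8, 1.2, 1.0, 210.0],
    [5.6, 1.5, 2.0, 180.0],
    [9.6, 1.4, 1.0, 170.0],
    [5.6, 2.8, 4.0, 130.0],
])

def placeholder_model(coeffs, peak, mem_rate, cache_mb):
    # Stand-in parametric model; the actual model form is described below.
    cpu_work, mem_work_base, mem_work_per_mb = coeffs
    mem_work = np.maximum(mem_work_base - mem_work_per_mb * cache_mb, 0.0)
    return cpu_work / peak + mem_work / mem_rate

def residuals(coeffs):
    peak, mem_rate, cache_mb, measured = results.T
    # Relative residuals so that fast and slow systems are weighted comparably.
    return (placeholder_model(coeffs, peak, mem_rate, cache_mb) - measured) / measured

fit = least_squares(residuals, x0=[500.0, 300.0, 10.0], bounds=(0.0, np.inf))
print("tuned coefficients:", fit.x)
```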
In brief, the model assumes that the execution time of each benchmark is the sum of a "CPU time" and a "memory time," where the amount of "work" done by the memory subsystem is a simple function of the cache size: it decreases linearly from a maximum value with no cache to a minimum value at a "large" cache size (where "large" is also a parameter of the model), and remains constant for caches larger than that size. The rate at which CPU work is completed is assumed to be proportional to the peak floating-point performance of the chip for 64-bit IEEE arithmetic, while the rate at which memory work is completed is assumed to be proportional to the performance of the system on the 171.swim (base) benchmark. Previous studies have shown a strong correlation between the performance of the 171.swim benchmark and direct measurements of sustained memory performance using the STREAM benchmark [7].
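To make the model's structure concrete, here is a minimal sketch of the execution-time calculation as just described. The function and parameter names (cpu_work, mem_work_max, mem_work_min, large_cache_mb) are illustrative labels for the per-benchmark coefficients; no actual coefficient values are implied.

```python
def benchmark_time(peak_gflops, swim_rate, cache_mb,
                   cpu_work, mem_work_max, mem_work_min, large_cache_mb):
    """Modeled execution time = CPU time + memory time.

    - CPU work completes at a rate proportional to peak 64-bit FP performance.
    - Memory work completes at a rate proportional to 171.swim (base)
      performance, used as a proxy for sustained memory bandwidth (cf. STREAM).
    - Memory work falls linearly from mem_work_max (no cache) to mem_work_min
      at the "large" cache size, and stays constant for larger caches.
    """
    if cache_mb >= large_cache_mb:
        mem_work = mem_work_min
    else:
        frac = cache_mb / large_cache_mb
        mem_work = mem_work_max + (mem_work_min - mem_work_max) * frac

    cpu_time = cpu_work / peak_gflops
    mem_time = mem_work / swim_rate
    return cpu_time + mem_time
```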
The model results are strongly correlated with the measured results, with 75% of the measurements falling within 15% of the projection. This suggests that the underlying model assumptions are reasonably consistent with the actual performance characteristics of these systems on these benchmarks. Although there are some indications of systematic errors in the model, not all of the differences between the model and the observations are due to oversimplification of the hardware assumptions; much of the variance also appears to be due to differences in compilers, compiler options, operating systems, and benchmark configurations. Overall, the model appears sufficiently robust to serve as a basis for illustrating performance and price/performance sensitivities in microprocessor-based systems.
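The "75% within 15%" figure can be reproduced mechanically from any set of projections and measurements. A small sketch of that check, using made-up numbers:

```python
import numpy as np

def fraction_within(predicted, measured, tolerance=0.15):
    """Fraction of results whose measured value lies within `tolerance`
    (relative error) of the model's projection."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    relative_error = np.abs(measured - predicted) / predicted
    return np.mean(relative_error <= tolerance)

# Example with made-up numbers: 3 of 4 results fall within 15%, so this prints 0.75.
print(fraction_within([100, 200, 150, 300], [110, 190, 160, 240]))
```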
For the performance and price/performance analysis, we will assume the following (a brief cost tally for these configurations is sketched after the list):
- The bare, two-socket system (with disks, memory, and network interfaces, but without CPUs) costs $1,500.
- The base CPU configuration is a single-core processor at 2.4 GHz with a 1 MB L2 cache, costing $300.
- The die area is assumed to be split roughly evenly between the CPU core and the L2 cache, with all other on-die functionality limited to a small fraction of the total area.
- The "smaller chip" configuration is a single-core processor at 2.8 GHz with a 1 MB L2 cache, costing $150.
- The "lots of cache" configuration is a single-core processor at 2.8 GHz with a 3 MB L2 cache, costing $300.
- The "more cores" configuration is a dual-core processor at 2.0 GHz with 1 MB L2 cache per core, costing $300.