CTWatch Quarterly » Lilliputians of Supercomputing Have Arrived!

Lilliputians of Supercomputing Have Arrived!

Jose Castanos, George Chiu, Paul Coteus, Alan Gara, Manish Gupta, Jose Moreira, IBM T.J. Watson Research Center

We chose an embedded processor optimized for low power and low frequency rather than for raw performance. Such a processor has a performance/power advantage over a high performance, high power processor. A simple relation is

performance/rack = performance/watt × watt/rack.

The last term in this expression, watt/rack, is determined by the cooling capability available for a given rack volume. It therefore imposes the same limit (of the order of 25 kilowatts) whether the rack holds high-frequency, high-power chips or low-frequency, low-power chips. To maximize performance/rack, it is the performance/watt term that must be compared among different CMOS technologies. This clearly illustrates one of the areas in which electrical power is critical to achieving rack density. We have found that in terms of performance/watt, the low frequency, low power embedded IBM PowerPC 440 core consistently outperforms high frequency, high power microprocessors by a factor of about ten, regardless of the system manufacturer. This is one of the main reasons we chose the low power design point for our Blue Gene/L supercomputer. Figure 1 illustrates the power efficiency of some recent supercomputers. The data is based on total peak Gflops (giga floating-point operations per second) divided by total system power in watts, when that data is available. If the data is not available, we approximate it using Gflops/chip power (an overestimate of the true system Gflops/power number).
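As a rough illustration of the relation performance/rack = performance/watt × watt/rack, the following sketch compares two chip types under the same rack power budget. The 25 kW budget and the factor of about ten come from the text; the absolute Gflops/watt figure is a placeholder chosen for illustration, not measured data, and the function and variable names are hypothetical.

```python
# Rack power budget: "of the order of 25 kilowatts" per the text.
RACK_POWER_WATTS = 25_000


def performance_per_rack(gflops_per_watt, watts_per_rack=RACK_POWER_WATTS):
    """Return peak Gflops per rack: performance/watt times watt/rack."""
    return gflops_per_watt * watts_per_rack


# Placeholder efficiency for a high power chip (assumed, not measured).
high_power_gflops_per_watt = 0.02
# The text reports an advantage of about a factor of ten for the low power core.
low_power_gflops_per_watt = 10 * high_power_gflops_per_watt

high_power_rack = performance_per_rack(high_power_gflops_per_watt)
low_power_rack = performance_per_rack(low_power_gflops_per_watt)
rack_ratio = low_power_rack / high_power_rack

print(f"high power rack: {high_power_rack:.0f} Gflops")
print(f"low power rack:  {low_power_rack:.0f} Gflops")
print(f"ratio:           {rack_ratio:.1f}x")
```

Because watt/rack is the same for both designs, the ratio of rack performance reduces to the ratio of performance/watt, whatever the absolute numbers are.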

Figure 1. Power efficiencies of recent supercomputers. (Blue = IBM Machines, black = other U.S. machines, red = Japanese machine)

This chart presents empirical evidence that, under a common power envelope, the collective peak performance per unit volume is superior with low-power CMOS technology. We now explain the theoretical basis of the superior collective performance of low power systems. Any performance metric such as flops, MIPS (millions of instructions per second), or SPEC benchmarks is linearly proportional to the chip clock frequency. On the other hand, the power consumption of the i^th transistor is given by the expression:

P_i = switching power of transistor i + leakage power of transistor i
= ½ C_Li V² f_i + leakage power of transistor i,

where C_Li is the load capacitance of the i^th transistor, V = V_DD is the supply voltage, and f_i is the switching frequency of the i^th transistor. Note that not every transistor switches on every cycle of the chip clock, whose frequency we denote f. Although leakage power is increasingly important for 90 nm, 65 nm and 45 nm technologies, we ignore it for the Blue Gene/L chips, which are built in 130 nm technology and for which leakage contributes less than 2% of the system power. The switching power consumed in a chip is the sum of the power of all switching nodes. It can be expressed as:

P_chip = Σ switching power of transistor i = ½ C_sw V² f,

where the average switching chip capacitance is given by

C_sw = (Σ C_Li f_i) / f.
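The definition of C_sw can be checked numerically with a minimal sketch: summing ½ C_Li V² f_i over transistors gives the same result as ½ C_sw V² f. The three-transistor capacitances, switching frequencies, supply voltage and clock frequency below are toy values assumed purely for illustration, and the function names are hypothetical.

```python
def effective_switched_capacitance(load_caps, switch_freqs, clock_freq):
    """C_sw = (sum over i of C_Li * f_i) / f."""
    return sum(c * f for c, f in zip(load_caps, switch_freqs)) / clock_freq


def chip_switching_power(c_sw, vdd, clock_freq):
    """P_chip = 1/2 * C_sw * V^2 * f (leakage ignored, as in the text)."""
    return 0.5 * c_sw * vdd**2 * clock_freq


def summed_transistor_power(load_caps, switch_freqs, vdd):
    """Sum of per-transistor switching power, 1/2 * C_Li * V^2 * f_i."""
    return sum(0.5 * c * vdd**2 * f for c, f in zip(load_caps, switch_freqs))


# Toy values (assumed): capacitances in farads, frequencies in hertz.
load_caps = [1e-15, 2e-15, 4e-15]
switch_freqs = [700e6, 350e6, 70e6]  # not every transistor switches every cycle
clock_freq = 700e6
vdd = 1.5

c_sw = effective_switched_capacitance(load_caps, switch_freqs, clock_freq)
p_from_c_sw = chip_switching_power(c_sw, vdd, clock_freq)
p_from_sum = summed_transistor_power(load_caps, switch_freqs, vdd)

print(f"C_sw = {c_sw:.3e} F")
print(f"P_chip via C_sw = {p_from_c_sw:.3e} W")
print(f"P_chip via sum  = {p_from_sum:.3e} W")
```

The two power figures agree up to floating-point rounding, which is all that the definition of C_sw is meant to guarantee.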

It is difficult to predict C_sw accurately because we seldom know the switching frequency f_i of every transistor in every cycle, and furthermore f_i differs from one application to the next. To simplify the discussion, we use an averaged value of C_sw obtained either from direct measurement or from power modeling tools. For high power, high frequency CMOS chips, the clock frequency f is roughly proportional to the supply voltage V, so the power consumed per chip, P_chip, is proportional to V² f, that is, to f³. In this cubed-frequency regime, doubling the frequency increases the power by a factor of eight. Conversely, halving the frequency cuts the power per chip to one eighth: if we use eight moderate frequency chips, each running at half the frequency of the original high frequency chip, we burn the same amount of power as one original chip, yet deliver four times the aggregate flops, a fourfold increase in flops/watt. This, then, is the basis of our Blue Gene/L design philosophy.

One might ask if we can do this indefinitely. If 100,000 processors at some frequency are good, are not 800,000 processors at half the frequency even better? The answer is complex, because we must also consider the mechanical component sizes, the power needed to communicate between processors, the failure rate of those processors, the cost of packaging those processors, and so on. Blue Gene/L is a complex balance of these factors and many more. Moreover, as we lower the frequency, the power consumed per chip drops from cubic frequency dependence to quadratic dependence and finally to linear dependence. In the linear regime, both power and performance are proportional to frequency, so there is no advantage in reducing frequency further at that point.
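The scaling argument above can be reproduced with a few lines of arithmetic. This is a simplified sketch that assumes performance is exactly proportional to f and chip power exactly proportional to f raised to a fixed exponent (3, 2 or 1 for the cubic, quadratic and linear regimes named in the text); real chips move between regimes gradually, and the function names are hypothetical.

```python
def relative_chip_power(freq_ratio, exponent=3):
    """Chip power relative to the original chip, assuming power ~ f**exponent."""
    return freq_ratio**exponent


def flops_per_watt_gain(freq_ratio, exponent=3):
    """Gain in flops/watt when running at freq_ratio times the original frequency.

    Performance scales as f and power as f**exponent, so the gain is
    freq_ratio / freq_ratio**exponent.
    """
    return freq_ratio / relative_chip_power(freq_ratio, exponent)


# Eight chips at half the frequency, in the cubic regime.
chips = 8
freq_ratio = 0.5
total_power = chips * relative_chip_power(freq_ratio)  # 8 * 0.125 = 1.0
total_performance = chips * freq_ratio                 # 8 * 0.5   = 4.0

print(f"total power vs. one original chip:       {total_power}")
print(f"total performance vs. one original chip: {total_performance}")

# Halving the frequency helps less and less as the exponent drops.
for name, exponent in [("cubic", 3), ("quadratic", 2), ("linear", 1)]:
    gain = flops_per_watt_gain(freq_ratio, exponent)
    print(f"{name:9s} regime: flops/watt gain = {gain}")
```

In the linear case the gain is exactly 1, which matches the observation that lowering the frequency brings no efficiency benefit once power and performance both scale with f.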
