CTWatch
February 2007
The Promise and Perils of the Coming Multicore Revolution and Its Impact
John McCalpin, Advanced Micro Devices, Inc.
Chuck Moore, Advanced Micro Devices, Inc.
Phil Hester, Advanced Micro Devices, Inc.

3. Current and Near-Term Issues
The "Power Problem"

Like performance, power is more multidimensional than one might initially assume. In the context of high-performance microprocessor-based systems, the "power problem" might mean any of:

  • Bulk current delivery to the chip through a large number of very tiny pins/pads. (Note that as voltages go down, currents go up and resistive heating in the pins/pads increases even at the same bulk power level; a short illustrative calculation appears below this list.)
  • Bulk heat removal to prevent die temperature from exceeding threshold values associated with significantly shortened product lifetime.
  • "Hot Spot" power density in small areas of the chip that can cause localized failures. (Note that halving the size of the processor core on the die while increasing frequency to maintain the same bulk power envelope causes a doubling of power density within the core.)
  • Bulk power delivery to the facility housing the servers – including the cost of electricity plus the cost of electrical upgrades.
  • Bulk heat removal from the facility housing the servers – including the cost of electricity plus the cost of site cooling system upgrades.
  • Bulk heat removal from the processor chips that preheats air flowing over other thermally sensitive components (memory, disk drives, etc.).

So the power problem actually refers to at least a half-dozen inter-related, but distinct, technical and economic issues.
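
The current-delivery point above can be made concrete with a small back-of-the-envelope calculation. The chip power, supply voltages, and effective pin/pad resistance below are hypothetical round numbers chosen only for illustration; the point is simply that at fixed chip power the delivered current scales as 1/V, so resistive (I^2*R) heating in the pins and pads scales as 1/V^2.

    # Illustrative only: chip power, supply voltages, and effective pin/pad
    # resistance are hypothetical round numbers, not data for any real part.
    chip_power = 100.0          # watts delivered to the die
    pin_resistance = 0.001      # ohms, effective resistance of the pin/pad network

    for supply_voltage in (1.2, 1.0, 0.8):             # volts
        current = chip_power / supply_voltage          # I = P / V
        pin_heating = current ** 2 * pin_resistance    # I^2 * R loss in the pins/pads
        print(f"V = {supply_voltage:.1f} V: I = {current:5.1f} A, "
              f"pin I^2R loss = {pin_heating:4.1f} W")

With these numbers, dropping the supply from 1.2 V to 0.8 V at the same 100 W raises the pin current by 50% and the resistive heating in the pins by 125%, even though the bulk power delivered to the die is unchanged.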

Throughput vs Power/Core vs Number of Cores

Restricting our attention to the issue of delivering increased performance at a constant power envelope, we can discuss the issues associated with using more and more cores to deliver increased throughput.

Depending on exactly what parameters are held constant, power consumption typically rises as the square or cube of the CPU frequency, while performance increases less than linearly with frequency. For workloads that can take advantage of multiple threads, multiple cores provide significantly better throughput per watt. As we saw earlier however, this increased throughput comes at the cost of reduced single-thread performance.
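
As a rough sketch of this tradeoff, assume that per-core power scales as the cube of frequency and that per-core throughput scales linearly with frequency (deliberate simplifications of the "square or cube" and "less than linear" statements above). Holding the chip-level power envelope fixed then gives:

    # Toy model, not a characterization of any real processor: per-core power is
    # assumed to scale as f^3 and per-core throughput as f (both simplifications).
    power_budget = 1.0                                # fixed chip-level power envelope

    for n_cores in (1, 2, 4, 8):
        per_core_power = power_budget / n_cores       # split the envelope evenly
        freq = per_core_power ** (1.0 / 3.0)          # invert power ~ f^3
        single_thread = freq                          # throughput ~ f
        aggregate = n_cores * single_thread           # multi-threaded throughput
        print(f"{n_cores} core(s) at f = {freq:.2f}: aggregate = {aggregate:.2f}, "
              f"single-thread = {single_thread:.2f}")

Under these assumptions, eight half-speed cores deliver four times the aggregate throughput of one full-speed core in the same power envelope, at half the single-thread performance.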

In addition to waiting for lithography to shrink so that we can fit more of our current cores on a chip, the opportunity also exists to create CPU cores that are intrinsically smaller and more energy efficient. This has not been common in the recent past (with the exception of the Sun T1 processor chips) because of the assumption that single-thread performance was too important to sacrifice. We will come back to that issue in the section on Longer-Term Extrapolations, but a direct application of the performance model above shows that as long as the power consumption of the CPU cores drops faster than their peak throughput, the optimum throughput will always be obtained by an infinite number of infinitesimally fast cores. Obviously, such a system would suffer from infinitesimal single-thread performance, which points to a shortcoming of optimizing to a single figure of merit. In response to this conundrum, one line of reasoning would define a minimum acceptable single-thread performance and then optimize the chip to include as many such cores as possible, subject to area and bulk power constraints.
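
The design rule at the end of the preceding paragraph can be sketched as a simple packing problem. The power model (per-core power proportional to f^3), the single-thread floor, and the power and area budgets below are all hypothetical; the sketch only shows that, once the floor is met, the slowest acceptable core maximizes aggregate throughput within the constraints.

    # Hypothetical budgets and power model (per-core power ~ f^3), for illustration only.
    MIN_SINGLE_THREAD = 0.6   # minimum acceptable single-thread performance (f >= 0.6)
    POWER_BUDGET = 4.0        # chip power envelope, in units where one f = 1.0 core costs 1.0
    AREA_BUDGET = 16          # maximum number of cores that fit on the die

    for freq in (0.5, 0.6, 0.8, 1.0):             # candidate core design points
        if freq < MIN_SINGLE_THREAD:
            continue                              # rejected: below the single-thread floor
        per_core_power = freq ** 3                # power model: per-core power ~ f^3
        n_cores = min(AREA_BUDGET, int(POWER_BUDGET / per_core_power))
        print(f"f = {freq:.1f}: {n_cores:2d} cores, aggregate throughput = {n_cores * freq:.1f}")

With these numbers the f = 0.6 design is limited by die area rather than by power, and it still delivers more than twice the aggregate throughput of the f = 1.0 design.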

There are other reasons to keep the number of cores modest. One factor, which is missing in the simple throughput models used here, is communication and synchronization. If one desires to use more than one core in a single parallel/threaded application, some type of communication/synchronization will be required, and it is essentially guaranteed that, for a fixed workload, the overhead of that communication/synchronization will be a monotonically increasing function of the number of CPU cores employed in the job.

This leads to a simple modification of Amdahl's Law:

T = Ts + Tp/N + N*To

Here T is the total time to solution, Ts is the time required to do serial (non-overlapped) work, Tp is the time required to do all of the parallelizable work, N is the number of processors applied to the parallel work, and To is the overhead per processor associated with the communication and synchronization required by the implementation of the application. This last term, representing the growth in overhead as more processors are added, is the addition to the traditional Amdahl's Law formulation.

In the standard model (no overhead), the total time to solution is a monotonically decreasing function of N, asymptotically approaching Ts. In the modified formulation, it is easy to see that as N increases there will come a point at which the total time to solution begins to increase due to the communication overheads. In the simple example above, a completely parallel application (Ts = 0) will have an optimal number of processors, found by setting the derivative of T with respect to N to zero:

Noptimal = sqrt(Tp/To)

So, for example, if To is 1% of Tp, then maximum performance will be obtained using 10 processors. This may or may not be an interesting design point, depending on the relative importance of other performance and performance/price metrics.
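
The modified model is easy to evaluate numerically. The short sketch below uses Ts = 0 and To equal to 1% of Tp, matching the worked example above (the specific values of Tp and To are arbitrary; only their ratio matters), and confirms that the minimum time to solution occurs at N = sqrt(Tp/To) = 10 processors.

    import math

    # Modified Amdahl's Law from the text: T(N) = Ts + Tp/N + N*To
    def time_to_solution(n, t_s, t_p, t_o):
        return t_s + t_p / n + n * t_o

    t_s = 0.0      # completely parallel application (Ts = 0)
    t_p = 100.0    # parallelizable work, in arbitrary time units
    t_o = 1.0      # per-processor overhead, To = 1% of Tp

    best_n = min(range(1, 101), key=lambda n: time_to_solution(n, t_s, t_p, t_o))
    print(f"numerical optimum: N = {best_n}")                      # prints N = 10
    print(f"analytic optimum:  N = {math.sqrt(t_p / t_o):.0f}")    # sqrt(Tp/To) = 10

Note that the time-to-solution curve is quite flat near the optimum in this example (T(9) and T(11) are within about 0.6% of T(10)), so the penalty for running somewhat off the optimal processor count is small.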

Reference this article
McCalpin, J., Moore, C., Hester, P. "The Role of Multicore Processors in the Evolution of General-Purpose Computing," CTWatch Quarterly, Volume 3, Number 1, February 2007. http://www.ctwatch.org/quarterly/articles/2007/02/the-role-of-multicore-processors-in-the-evolution-of-general-purpose-computing/
