CTWatch
August 2005
The Coming Era of Low Power, High-Performance Computing — Trends, Promises, and Challenges
GUEST EDITOR
Satoshi Matsuoka, Tokyo Institute of Technology

All is not favorable, however. There are various drawbacks to low power designs, some of which are generic and some of which are more peculiar to HPC. Both types require substantial research and engineering.

Cons:

  • Increased system complexity — Low power design obviously adds complexity to the overall system, in hardware and software as well as in overall management. We omit a more detailed discussion here for the sake of brevity.
  • Increased sporadic failures — In low power systems, the chance of sporadic failures generally increases for several reasons, including reduced noise margins caused by lowering the supply voltage, tighter timing tolerances, etc. Such failures will have to be compensated for by careful and somewhat conservative circuit design, sanity checking, redundancy, etc. Another possibility is to employ software checking and recovery more extensively (a minimal sketch follows this list), but such measures tend to be difficult to implement without some hardware support.
  • Increased failures as the number of components is scaled up — In some low power HPC architectures, the desire to exploit a “slow and parallel” strategy leads to designs with a larger number of nodes and thus a larger number of components in the system. For example, the largest BlueGene/L on the Top500 to date sports 65,536 CPU cores, an order of magnitude more than any other machine on the list; by comparison, the Earth Simulator has only 5,120 cores. Certainly, core count is only one metric and cannot by itself account for overall machine stability. In fact, BlueGene/L has gone to great lengths to reduce the number of overall system components, and results from early deployments have demonstrated that it is a quite reliable machine. Nonetheless, as we approach the petaflops range, the growth in component count will become substantially more demanding (a back-of-the-envelope calculation follows this list).
  • Reliance on extremely high parallel efficiency to extract performance — Since each processor in such low power designs will be slow, achieving good performance will require a much higher degree of parallel efficiency than with conventional high-performance, high-power CPUs. Unless the application can exhibit considerable parallel efficiency, we will not be able to attain proper performance from the system. If the inefficiency is due to the software or the underlying hardware, solutions may be available to resolve it. However, if the cause is fundamental to the algorithm, with unavoidable serialization capping the available parallelism (as Amdahl's law quantifies; see the sketch after this list), then we will have to resort to somehow rethinking the fundamental application algorithm. This is often very difficult, especially for very large legacy applications.
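
To make the software checking and recovery option concrete, here is a minimal sketch of redundancy-based checking; the code and its names (`checked_run`, `kernel`) are illustrative assumptions of ours, not a mechanism described in this article:

```python
# Minimal sketch of software-level checking and recovery: run a deterministic
# kernel twice and accept the result only when the two runs agree, retrying
# on mismatch. This catches sporadic (transient) faults purely in software;
# production schemes would more likely use checksums or algorithm-based fault
# tolerance rather than costly full re-execution.

def checked_run(kernel, *args, max_retries=3):
    for _ in range(max_retries):
        first = kernel(*args)
        second = kernel(*args)      # redundant execution
        if first == second:         # agreement: accept the result
            return first
        # disagreement suggests a transient fault; fall through and retry
    raise RuntimeError("persistent mismatch: likely a hard fault")

# Hypothetical usage with a deterministic kernel:
print(checked_run(sum, range(1_000_000)))
```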
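As a rough illustration of the component-scaling concern (the failure rates below are assumed for illustration, not taken from this article), system-level mean time between failures shrinks roughly linearly with component count when failures are independent:

```python
# Back-of-the-envelope: system MTBF ~ component MTBF / N for N independent
# components. Even very reliable nodes yield frequent system-level failures
# at large scale.

HOURS_PER_YEAR = 8766
node_mtbf_hours = 10 * HOURS_PER_YEAR        # assume one node failure per decade

for n_nodes in (5_120, 65_536, 1_000_000):   # ES-class, BG/L-class, petaflops-class
    system_mtbf_hours = node_mtbf_hours / n_nodes
    print(f"{n_nodes:>9} nodes -> system MTBF ~ {system_mtbf_hours:5.2f} hours")
```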
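The parallel-efficiency point can be quantified with Amdahl's law, speedup(N) = 1 / (s + (1 - s)/N) for serial fraction s on N processors. The following sketch (our illustration, using the 65,536-core figure cited above) shows how quickly even a small serial fraction erodes efficiency:

```python
# Amdahl's law: a tiny unavoidable serial fraction caps the speedup
# attainable on tens of thousands of slow processors.

def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

N = 65_536  # BlueGene/L core count cited above
for s in (0.0, 0.0001, 0.001, 0.01):
    speedup = amdahl_speedup(s, N)
    print(f"serial fraction {s:7.4%}: speedup {speedup:8.0f}, "
          f"parallel efficiency {speedup / N:7.2%}")
```

Even a 0.01% serial fraction cuts parallel efficiency on 65,536 processors to roughly 13%, which is why "slow and parallel" designs demand near-perfect scalability from their applications.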
Table 1 compares three contemporary design points: (A) advanced vector (Earth Simulator → NEC SX-8); (B) high density cluster (Itanium Montecito blade + InfiniBand 4x); and (C) low power CPU, super high density (BlueGene/L).

                                        (A) SX-8    (B) Cluster    (C) BG/L
GFLOPS per CPU core                     16          8              2.8
CPU cores per chip                      1           2              2
CPU chips per cabinet                   8           72             1024
TFLOPS per cabinet                      0.128       1.152          5.7344
Memory BW per chip (GB/s)               64          10.672         6.4
Memory BW per cabinet (GB/s)            512         768.384        6553.6
Network BW per chip (MB/s)              N/A         625            1050
Network bytes/s per flop                0.125       0.0390625      0.1875
Cabinets for 1 PF (+30% network)        10156       1128           174
Physical size relative to ES            13.22       1.47           0.23
Power per cabinet (kW)                  9           15             25
Total power incl. 30% cooling (MW)      118.83      22.00          5.66
Power relative to ES (8 MW)             14.85       2.75           0.71
Cost per cabinet ($M US)                1           1              1.5
Total cost ($B US)                      10.16       1.13           0.26
Cost relative to ES ($400M US)          25.39       2.82           0.65

Table 1. Modern HPC Machine Parameters
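
The derived rows of Table 1 follow mechanically from the per-cabinet parameters. The sketch below reproduces them under two assumptions inferred from the row labels: the +30% cabinet overhead for an external network is skipped for BlueGene/L (whose interconnect is integrated), and the +30% cooling overhead applies to power for all three designs.

```python
# Reproduce Table 1's derived rows from its base parameters.
# name: (GFLOPS/core, cores/chip, chips/cabinet, kW/cabinet, $M/cabinet, external net?)
machines = {
    "SX-8 vector":     (16.0, 1,    8,  9.0, 1.0, True),
    "Itanium cluster": ( 8.0, 2,   72, 15.0, 1.0, True),
    "BlueGene/L":      ( 2.8, 2, 1024, 25.0, 1.5, False),  # integrated interconnect
}
ES_POWER_MW, ES_COST_BUSD = 8.0, 0.4  # Earth Simulator baselines from the table

for name, (gflops, cores, chips, kw, cost_m, ext_net) in machines.items():
    tflops_cab = gflops * cores * chips / 1000.0
    cabinets = 1000.0 / tflops_cab                  # cabinets for 1 PFLOPS
    if ext_net:
        cabinets *= 1.3                             # +30% cabinets for the network
    cabinets = round(cabinets)                      # round before deriving power/cost
    power_mw = cabinets * kw / 1000.0 * 1.3         # +30% power for cooling
    cost_busd = cabinets * cost_m / 1000.0
    print(f"{name:15s}: {tflops_cab:6.4f} TF/cab, {cabinets:5d} cabinets, "
          f"{power_mw:6.2f} MW ({power_mw / ES_POWER_MW:5.2f}x ES), "
          f"${cost_busd:5.2f}B ({cost_busd / ES_COST_BUSD:5.2f}x ES)")
```

Running this reproduces the table's 10156 / 1128 / 174 cabinet counts and the corresponding power and cost figures, which supports the reading that BlueGene/L's cabinet count carries no separate network overhead.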


