Low Power Computing for Fleas, Mice, and Mammoth — Do They Speak the Same Language?
GUEST EDITOR
Satoshi Matsuoka, Tokyo Institute of Technology
CTWatch Quarterly
August 2005

Introduction

The main theme of this issue of CTWatch Quarterly is the new trend within high performance computing (HPC) toward lower power requirements. Low power computing itself is not new; it has a long history in embedded systems, where battery life is at a premium. Moreover, the applicability of low power has widened in both directions on the power consumption scale. Lower power consumption in the microwatt arena, so-called “ultra low power” (ULP), is necessary to enable applications such as wireless remote sensing, where a device may have to run on a single small battery for months while remaining networked to collect data. In a more familiar context, most PCs have recently become Energy Star [1] compliant. A truly dramatic shift in design emphasis occurred around 2003-2004, when the industry began to move from the pursuit of desktop performance alone to the pursuit of desktop performance/power in combination. Processors initially designed for energy efficient notebooks, such as Intel’s Pentium-M, have started to find their way into desktop units, and there is strong speculation that future mainstream PC processors will be successors of the power efficient Pentium-M design.

But why would we want to save power in the HPC arena, where the goal has always been to go faster at almost any cost? Certainly it is fair to say that performance/power has always been an engineering concern in designing HPC machines. For example, NEC claims to have achieved five times better performance/power efficiency in its SX-6 model over the previous generation SX-5 [2]. Where HPC machines function as large servers in datacenters, reducing power also results in substantial savings in operating costs. And of course, there are important social and economic reasons for reducing the extremely high power consumption of many HPC installations.

However, the recent attention to low power in HPC systems is not driven by such “energy-conscious” requirements alone. Recent research results, especially those spearheaded by the BlueGene/L [3] group, indicate that low power design may be fundamental to future system scalability, including future petascale systems, personalized terascale systems, and beyond. The purpose of the articles in this issue is to reveal such new trends and discuss the future of HPC from the perspective of low power computing.

In the remainder of this article, we will show how low power designs in the traditional arena of embedded computing, plus the very interesting ultra low power systems that are now receiving considerable attention, relate to low power HPC. In particular, we will discuss how technologies developed for low power embedded systems might be applicable to low power HPC and what the future holds for further research and development in this area that aims for greater performance in next generation HPC.

Is Saving Power Anything Special?

From an engineering point of view, it is obvious that one would want to save power to attain maximum efficiency in any widely deployed infrastructure, as we mentioned earlier. But the metrics and tradeoffs of power vs. performance differ vastly depending on the application. Similar differences exist in other technology areas. With automobiles, for example, one metric is to shoot for maximum speed, as with a Formula One race car, which achieves only a little over one kilometer per liter in fuel efficiency. On the other hand, there are fuel-efficiency competitions in which one attempts to maximize the distance traveled on a single liter of fuel; the current world record is 5134 km, nearly four orders of magnitude removed from the race car example. Combustion technology, however, is recognized as being fairly mature, so we seldom observe in it the kind of exponential growth we see in the IT industry.

Still, year-to-year advances in fuel efficiency in the “standard” automotive industry are in the low percentage points, and even disruptive technologies such as fuel cells or battery-based EVs (Electric Vehicles) will not improve efficiency by an order of magnitude. In the IT industry, by contrast, Moore’s Law has driven exponential performance increases since the 1960s, and this is expected to continue until at least 2015. However, some of the problematic phenomena that drive up power consumption follow the same exponential curve. For example, static leakage current is directly related to the number of transistors, the very quantity whose growth gave rise to the exponential performance increase in the first place.
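To make the relationship concrete, chip power is commonly approximated as the sum of a dynamic (switching) term and a static (leakage) term; the formula below is the standard textbook CMOS model, not one taken from this article:

```latex
P_{\mathrm{total}} \;\approx\;
  \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic (switching)}}
  \;+\;
  \underbrace{N\, I_{\mathrm{leak}}\, V}_{\text{static (leakage)}}
```

Here α is the switching activity factor, C the switched capacitance, V the supply voltage, f the clock frequency, N the transistor count, and I_leak the average leakage current per transistor. The static term grows with N, which is why leakage tracks Moore’s Law, while the V²f dependence of the dynamic term is what the “slow and parallel” approach discussed later exploits.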

Pros and Cons of Low Power, Especially in HPC

While low power consumption may seem to be an obvious engineering ideal for computing systems, especially in HPC, achieving it requires designers to make various tradeoffs that have their own pros and cons.

Pros:

All is not favorable, however. There are various drawbacks to low power designs, some of which are generic and some of which are more peculiar to HPC. Both types require substantial research and engineering.

Cons:

Columns:
  [A] Advanced Vector (Earth Simulator => SX-8)
  [B] High Density Cluster (Itanium Montecito blade + InfiniBand 4x)
  [C] Low Power CPU, Super High Density (BlueGene/L)

                                      [A]        [B]          [C]
GFLOPS/CPU                            16         8            2.8
CPU cores/chip                        1          2            2
CPU chips/cabinet                     8          72           1024
TFLOPS/cabinet                        0.128      1.152        5.7344
Memory BW/chip (GB/s)                 64         10.672       6.4
Memory BW/cabinet (GB/s)              512        768.384      6553.6
Network BW/chip (MB/s)                N/A        625          1050
Network bytes/s per flop              0.125      0.0390625    0.1875
Cabinets for 1 PF (+30% network)      10156      1128         174
Physical size relative to ES          13.22      1.47         0.23
Power/cabinet (kW)                    9          15           25
Total power incl. 30% cooling (MW)    118.83     22.00        5.66
Power relative to ES (8 MW)           14.85      2.75         0.71
Cost/cabinet ($M US)                  1          1            1.5
Total cost ($B US)                    10.16      1.13         0.26
Cost relative to ES ($400M US)        25.39      2.82         0.65

Table 1. Modern HPC Machine Parameters
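The derived rows of Table 1 follow from the per-chip and per-cabinet figures by straightforward arithmetic. The short sketch below reproduces them; the treatment of the 30% network and cooling overheads is an assumption read off the row labels (and BlueGene/L appears to carry no extra network cabinets, presumably because its interconnect is integrated), so the script is illustrative rather than authoritative.

```python
# Reproduce the derived rows of Table 1 from per-chip/per-cabinet figures.
# The 1.3 factors (extra cabinets for the network, extra power for cooling)
# are assumptions inferred from the row labels; BlueGene/L is given no
# network overhead so that the published cabinet count (174) is matched.

systems = {
    # name: (GFLOPS/CPU, cores/chip, chips/cabinet, kW/cabinet, $M/cabinet, net overhead)
    "Earth Simulator class": (16.0, 1,    8,  9.0, 1.0, 1.3),
    "High density cluster":  ( 8.0, 2,   72, 15.0, 1.0, 1.3),
    "BlueGene/L":            ( 2.8, 2, 1024, 25.0, 1.5, 1.0),
}

PETAFLOP_IN_TFLOPS = 1000.0
COOLING_OVERHEAD = 1.3        # +30% power for cooling

for name, (gf_cpu, cores, chips, kw_cab, cost_cab, net_ovh) in systems.items():
    tflops_cab = gf_cpu * cores * chips / 1000.0
    cabinets = round(PETAFLOP_IN_TFLOPS / tflops_cab * net_ovh)
    power_mw = cabinets * kw_cab / 1000.0 * COOLING_OVERHEAD
    cost_billion = cabinets * cost_cab / 1000.0
    print(f"{name:22s} {tflops_cab:7.4f} TF/cabinet  {cabinets:6d} cabinets  "
          f"{power_mw:7.2f} MW  ${cost_billion:5.2f}B")
```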

So, Where Do We Obtain the Power Savings?

With mainstream information technology, such as standard office application suites, speed requirements may have “matured.” But the majority of application areas, in particular the ones mentioned in this article, still need significant (even exponential) improvements in both absolute performance and relative performance/power metrics over the next ten years, as we progress toward building a “true” cyberinfrastructure for science and engineering. Such is quite obviously the case for traditional HPC applications, where even a petaflop machine may not satisfy the needs of the most demanding applications. It is also evident in application areas that are taking a leap to next generation algorithms in order to increase scale, accuracy, etc. One example is large scale text processing/data mining, where the proliferation of the web and the associated explosion of data call for more sophisticated search and mining algorithms to deal with the “data deluge.” Another is the push to develop humanoid robots, which are said to require five to six orders of magnitude more processing power while retaining the human form factor.

The question is, can we achieve these goals? If so, will the techniques/technologies employed in respective domains, as well as their respective requirements, be different? If there are such differences, will this cause one power range to be more likely than the others? Or are there some uncharted territories of disruptive technologies with even more possibilities?

The major power saving techniques, in particular those exploited by more traditional embedded systems as well as by the recent breed of low power HPC systems, can be categorized as follows:

How do these techniques apply to different types of systems in order to optimize different kinds of metrics? To clarify the differences, we have divided the power spectrum into three ranges, each separated by orders of magnitude: microwatts, milliwatts, and watts and beyond. The table below shows the resulting power ranges and their principal application domains, metrics, technical characteristics, example systems, etc. One can observe that the respective properties of these systems diverge significantly.

Average power consumption: Microwatt to Milliwatt
  Application domain: Ubiquitous sensor networks
  Important metrics: Longevity (powering a device for months to years on a single battery); environmental harvesting of power
  Technical characteristics: programming of long duty cycle applications in a tiny CPU/storage environment; ultra low power wireless; autonomous configuration among a group of nodes; fault tolerance via massive redundancy
  Example systems: Mote, TinyOS (UC Berkeley)

Average power consumption: Milliwatt to Watt
  Application domain: Standard embedded devices
  Important metrics: Long battery life for dedicated, real-time applications
  Technical characteristics: various “classical” low power techniques; adjusting CPU speed and voltage in periodic real-time processing; dynamic reconfiguration via software/hardware co-design
  Example systems: Various embedded OSes

Average power consumption: Over one Watt
  Application domain: PCs/workstations, servers, HPC
  Important metrics: Maintaining high performance; high thermal density
  Technical characteristics: CPU power consumption dominant, hence “slow and parallel”; significant fine-grained software control (measurement, prediction, planning, DVS control); need for low power, high performance networking; high reliability and scalability
  Example systems: Notebook PC/blade server, BlueGene/L, Green Destiny

Table 2. Low Power System Power Range Categorizations and their Properties
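One entry in the milliwatt-to-watt range, adjusting CPU speed and voltage for periodic real-time processing, can be illustrated with the classic dynamic voltage/frequency scaling (DVS) argument sketched below; the cubic power model and the utilization figures are illustrative assumptions, not measurements from any system named here.

```python
# A minimal sketch of the classic DVS argument for periodic real-time
# work: if a job needs only a fraction U of each period at full speed,
# running the whole period at roughly U * f_max still meets the deadline
# while cutting dynamic power.  Assumes voltage scales with frequency,
# so dynamic power ~ f**3; real processors offer only discrete steps.

F_MAX = 1.0  # normalized peak frequency

def slowest_feasible_frequency(cpu_demand: float) -> float:
    """Lowest normalized frequency that still finishes the job on time."""
    return min(F_MAX, cpu_demand * F_MAX)

def relative_dynamic_power(freq: float) -> float:
    """Dynamic power relative to running at F_MAX, under the cubic model."""
    return (freq / F_MAX) ** 3

for cpu_demand in (1.0, 0.75, 0.5, 0.25):
    f = slowest_feasible_frequency(cpu_demand)
    # Energy per period: run-fast-then-idle uses full power for a fraction
    # cpu_demand of the period; the DVS version runs the whole period at f.
    energy_fast_then_idle = 1.0 * cpu_demand
    energy_dvs = relative_dynamic_power(f) * 1.0
    ratio = energy_dvs / energy_fast_then_idle
    print(f"CPU demand {cpu_demand:4.0%}: run at {f:.2f}*f_max, "
          f"energy ~{ratio:.0%} of run-fast-then-idle")
```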

Examining this table, one might argue that systems in the Microwatt and the Milliwatt ranges do have some similarities, four of which are outlined below:

  1. Devices in both categories are typically driven by batteries and/or independent (solar or energy-harvesting) generators, without AC electrical wiring, and as such the longevity of battery life is of utmost concern.
  2. Their application space is dedicated, or at least fairly restricted for each device; they are not meant to be general-purpose computing devices used for every possible application. The applications also tend to be real-time in nature, primarily sensing, device control, and multimedia. These application characteristics have two consequences. First, when combined, these properties allow duty cycling to be performed extensively. For example, the Berkeley Mote envisions applications in which a single battery will last for months, with sensing and networking duty cycled in phases of tens of seconds to minutes (a rough estimate of the effect is sketched after this list). Second, they sometimes allow dedicated and/or reconfigurable hardware to be employed for the performance/energy demanding portion of the application, such as multimedia encoding/decoding, which in some cases brings about orders of magnitude improvement in the performance/power ratio.
  3. Their physical locations tend to be spaced apart, as with mobile devices. Coupled with their very low power, this means thermal density is not the primary concern (it may be in some modern embedded multimedia devices, but even there it is not the driving motivation for achieving low power).
  4. Although in modern applications they are often networked, these devices do not work together in a tightly-coupled fashion to execute a single application, and as a result network bytes/flop is not as demanding as it is with HPC systems.
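To make the duty cycling point in item 2 concrete, the back-of-the-envelope battery-life arithmetic for a sensor node might look like the following; all capacity and current figures are illustrative assumptions, not specifications of the Berkeley Mote.

```python
# Back-of-the-envelope battery life for a duty-cycled sensor node.
# All numbers are illustrative assumptions, not Mote specifications.

BATTERY_MAH = 2000.0   # e.g. a pair of AA cells
ACTIVE_MA   = 20.0     # radio + CPU while sensing/transmitting
SLEEP_MA    = 0.01     # deep-sleep current (10 microamps)

def lifetime_months(duty_cycle: float) -> float:
    """Expected lifetime in months for a given active-time fraction."""
    avg_ma = duty_cycle * ACTIVE_MA + (1.0 - duty_cycle) * SLEEP_MA
    hours = BATTERY_MAH / avg_ma
    return hours / (24 * 30)

for duty in (1.0, 0.1, 0.01, 0.001):
    print(f"duty cycle {duty:6.1%}: ~{lifetime_months(duty):6.1f} months")
```

Under these assumptions, dropping the active fraction from 100% to 1% stretches the lifetime from a few days to over a year, which is the effect the Mote-style designs depend on.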

In the HPC arena, by contrast, the important point to note is that low power is now being considered an essential means to achieve the traditional goal of high performance. This may at first seem oxymoronic, since lower power usually means lower performance in embedded and ULP devices, where great efforts are made to “recover” as much of the lost performance as possible. However, BlueGene/L and other HPC machines that utilize low power technologies have demonstrated that, by exploiting “slow and parallel” designs, we may actually achieve higher overall performance.
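To see why “slow and parallel” can win, consider the standard dynamic power relation P ∝ V²f with supply voltage assumed to scale roughly with frequency; the sketch below is a simplified illustration under those assumptions, not a model of BlueGene/L.

```python
# Simplified "slow and parallel" arithmetic: dynamic power ~ V^2 * f, and
# supply voltage is assumed to scale roughly with frequency, so per-chip
# power ~ f^3.  Halving the clock while doubling the chip count keeps
# aggregate throughput but cuts power.  Leakage, memory and network power
# are ignored here, which is why real-world gains are smaller.

def cluster_power(n_chips: int, rel_freq: float) -> float:
    """Total dynamic power relative to one chip at full frequency."""
    return n_chips * rel_freq ** 3

def cluster_throughput(n_chips: int, rel_freq: float) -> float:
    """Aggregate throughput relative to one chip at full frequency,
    assuming the workload parallelizes perfectly."""
    return n_chips * rel_freq

configs = {
    "fast serial":       (1, 1.0),   # one chip at full speed
    "slow and parallel": (2, 0.5),   # two chips at half speed
}

for name, (n, f) in configs.items():
    print(f"{name:18s} throughput {cluster_throughput(n, f):.2f}  "
          f"power {cluster_power(n, f):.2f}")
```

Under this idealized model the two-chip, half-speed configuration delivers the same throughput at one quarter of the dynamic power, which is the essence of the scalability argument.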

Still, the properties of HPC systems present a different opportunity space for low power than the one embedded or ULP devices confront:

Are Low Power HPC Systems Too Divergent from Traditional Embedded Low Power Systems?

Given the observations above, could we go as far as to say that low power HPC systems are so divergent from traditional embedded systems that there are no research results or engineering techniques they can share? As a matter of fact, there are commonalities that permit such sharing, and we are starting to see some “convergence” between the low power realization techniques in HPC and those in other power ranges. Here are some examples:

The Future of Low Power HPC is “Overdesign” and “Portability”

We have examined the relationships between the various areas of low power computing, focusing especially on their similarities and differences. Overall, the importance of low power in HPC is still not well recognized by the community, despite the success of BlueGene/L. In particular, controlling power requires a sophisticated form of system self-control. This type of control is practiced as a norm in other disciplines but remains quite crude in computers, especially in large HPC machines.

For example, modern fighter aircraft are deliberately made somewhat aerodynamically unstable in order to improve their maneuverability; to recover and maintain operational stability, they rely on massive, dynamic, computer-assisted real-time control. Modern automobiles embed extensive self-control for engines and handling, without which the car would easily break down or at least suffer from poorer performance. Compared to these technology domains, power/performance controls in modern-day HPC machines are meager at best. They may contain some simple feedback loops that, for example, speed up the cooling fans when the internal chassis temperature climbs, or that apply crude automated control of voltage/frequency without regard for application characteristics. There are other promising avenues of research, as the other articles in this issue show, but further investigation is required to identify the limits of such control methodologies, as well as to discover better ways to conserve power.

One promising conceptual design principle that this author envisions is to “overdesign” the system, i.e., engineer it so that, without software self-control, the system would break down (say, thermally), hit other power limits, or become very power/performance inefficient. Most of the machines we design now follow quite conservative engineering disciplines, so that no matter how much we hammer them they will not break; alternatively, we design for maximum efficiency relative to the achievable theoretical peak. Now that we are quickly approaching the one billion transistor mark in our CPUs (and heading toward ten billion), there are plenty of transistors available to consume power, whether exploited directly or used for alternative purposes. Moreover, we will have a better understanding of how to monitor and control power depending on the system and application states (including multiple applications within the system). With multiple failovers in place, we could “overdesign” the system so that it operates at the maximum performance/power ratio (which may be somewhat below the maximum computational efficiency), while driving the efficiency above that point would “break” the system. To achieve such a subtle balance, various hardware and software sensors would monitor performance/power metrics and feed regulatory feedback into the system, enabling dynamic fine tuning of both software (such as scheduling) and hardware (such as DVS).
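As a concrete, if simplified, sketch of such a regulatory loop, the code below steps a node's DVS level up or down to hold it near an assumed thermal set point; read_temperature and set_dvs_level are hypothetical placeholders for hardware hooks, not an existing API.

```python
# A minimal sketch of the regulatory feedback loop described above: keep a
# node as close as possible to a thermal set point by stepping its DVS
# level.  read_temperature() and set_dvs_level() are hypothetical hardware
# hooks, not a real API; the set point and steps are illustrative.

import time

TEMP_SET_POINT_C = 85.0   # assumed safe operating target
TEMP_MARGIN_C    = 2.0    # dead band to avoid oscillation
DVS_LEVELS       = [0.6, 0.7, 0.8, 0.9, 1.0]   # normalized frequency steps

def read_temperature(node: int) -> float:
    """Placeholder for an on-board thermal sensor reading (deg C)."""
    raise NotImplementedError

def set_dvs_level(node: int, level: float) -> None:
    """Placeholder for a DVS/DVFS actuator on the given node."""
    raise NotImplementedError

def regulate(node: int, period_s: float = 1.0) -> None:
    """Run as fast as the thermal budget allows, throttling before any
    hardware failover mechanism would have to trip."""
    idx = len(DVS_LEVELS) - 1            # start at full speed
    while True:
        temp = read_temperature(node)
        if temp > TEMP_SET_POINT_C + TEMP_MARGIN_C and idx > 0:
            idx -= 1                     # too hot: step frequency down
        elif temp < TEMP_SET_POINT_C - TEMP_MARGIN_C and idx < len(DVS_LEVELS) - 1:
            idx += 1                     # headroom left: step frequency up
        set_dvs_level(node, DVS_LEVELS[idx])
        time.sleep(period_s)
```

A production version would of course fold in power sensors, application-level performance counters, and coordination across nodes, but the shape of the loop, measure, compare to a budget, actuate, is the same.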

Such a design principle may allow substantial improvement in the various metrics that motivate the pursuit of low power in HPC in the first place. For example, one may put an extensive set of thermal sensors in a densely packed machine to intricately control power/performance so as to maintain thermal consistency throughout the system. In such a machine, it would be impossible to achieve the theoretical maximum performance, since doing so would break the system and some failover mechanism would have to kick in to throttle it. Overall, its performance per volume may nevertheless be substantially greater than that of a conservatively designed machine, for various reasons, including that it would be running more units in parallel at the best performance/energy tradeoff point.

Many technical challenges would have to be conquered for such a system to become a reality, however. For example, most current motherboards, including server-grade, high-end versions, lack the sensors required to perform such intricate monitoring of thermal conditions and power consumption. In many cases, the only available sensors may be a few thermistors, with no power sensors present except voltage meters on power lines. Although the state of the art in the analysis of performance/power tradeoffs is advancing (as seen in Dr. Feng’s article [4] mentioned previously), most of the results are still early, with no broad-based community efforts, such as standardization, to enable, facilitate, or promote usage of the technology. In fact, because of the significant effect such low power systems will have on the software infrastructure, including compilers, run-time systems, libraries, performance monitors, etc., it is currently impractical to expect any portability across different types of machines. Here, theoretical modeling of such machines, leading to eventual standardization, will be necessary for realistic deployment to occur.

[1] The Energy Star Home Page, http://www.energystar.gov/
[2] Computers Division, “Design of Eco Products SX-6,” NEC Technical Journal, Vol. 57, No. 1, 2004, http://www.nec.co.jp/techrep/ja/journal/g04/n01/t040105.pdf (in Japanese).
[3] IBM Journal of Research and Development, special double issue on Blue Gene, Vol. 49, No. 2/3, March/May 2005.
[4] W. Feng, “Making a Case for Efficient Supercomputing,” ACM Queue, 1(7):54-64, October 2003.
[5] Hiroshi Nakashima, Hiroshi Nakamura, Mitsuhisa Sato, Taisuke Boku, Satoshi Matsuoka, et al., “MegaProto: 1 TFlops/10 kW Rack Is Feasible Even with Only Commodity Technology,” Proc. IEEE/ACM Supercomputing 2005, IEEE Computer Society Press, Nov. 2005 (to appear).

URL to article: http://www.ctwatch.org/quarterly/articles/2005/08/low-power-computing-for-fleas-mice-and-mammoth/