Low Power Computing for Fleas, Mice, and Mammoth — Do They Speak the Same Language?
GUEST EDITOR
Satoshi Matsuoka, Tokyo Institute of Technology
CTWatch Quarterly
August 2005

Introduction

The main theme of this issue of CTWatch Quarterly is the new trend within high performance computing (HPC) toward lower power requirements. Low power computing itself is not new; it has a long history in embedded systems, where battery life is at a premium. Moreover, the applicability of low power has widened in both directions on the power consumption scale. Lower power consumption in the microwatt arena, so-called “ultra low power” (ULP), is necessary to enable applications such as wireless remote sensing, where a device may have to run on a single small battery for months while remaining networked to collect data. In a more familiar context, most PCs have recently become Energy Star [1] compliant. A truly dramatic shift in design emphasis occurred around 2003-2004, when the industry began to move from the pursuit of desktop performance alone to the pursuit of desktop performance/power in combination. Processors initially designed for energy efficient notebooks, such as Intel’s Pentium-M, have started to find their way into desktop units, and there is strong speculation that future mainstream PC processors will be successors of the power efficient Pentium-M design.

But why would we want to save power in the HPC arena, where the goal has always been to go faster at almost any cost? Certainly it is fair to say that performance/power has always been an engineering concern in designing HPC machines. For example, NEC claims to have achieved five times better performance/power efficiency in its SX-6 model over the previous generation SX-5 [2]. Where HPC machines function as large servers in datacenters, reducing power also results in substantial savings in operating costs. And of course, there are important social and economic reasons for reducing the extremely high power consumption of many HPC installations.

However, the recent attention to low power in HPC systems is not driven by such “energy-conscious” requirements alone. Recent research results, especially those spearheaded by the BlueGene/L [3] group, indicate that low power design may be fundamental to future system scalability, including future petascale systems, personalized terascale systems, and beyond. The purpose of the articles in this issue is to reveal such new trends and discuss the future of HPC from the perspective of low power computing.

In the remainder of this article, we will show how low power designs in the traditional arena of embedded computing, plus the very interesting ultra low power systems that are now receiving considerable attention, relate to low power HPC. In particular, we will discuss how technologies developed for low power embedded systems might be applicable to low power HPC and what the future holds for further research and development in this area that aims for greater performance in next generation HPC.

Is Saving Power Anything Special?

From an engineering point of view, it is obvious that one would want to save power to attain maximum efficiency in any widely deployed infrastructure, as we mentioned earlier. But the metrics and tradeoffs of power vs. performance differ vastly depending on the application. Similar differences exist in other technology areas. With automobiles, for example, one metric is to shoot for maximum speed, as with a Formula One race car, which achieves only a little over one kilometer per liter in fuel efficiency. On the other hand, there are fuel-efficiency competitions in which one attempts to maximize the distance traveled on a single liter of fuel; the current world record is 5134 km, nearly four orders of magnitude removed from the race car example. Combustion technology, however, is recognized as being fairly mature, so we seldom observe in it the kind of exponential growth we see in the IT industry.

Still, year-to-year advances in fuel efficiency in the “standard” automotive industry are in the low percentage points, and even disruptive technologies such as fuel cells or battery-based EVs (Electric Vehicles) will not improve efficiency by an order of magnitude. In the IT industry, by contrast, Moore’s Law has driven exponential performance increases since the 1960s, and this is expected to continue until at least 2015. However, some of the problematic phenomena that drive up power consumption follow the same exponential curve. For example, static leakage current is directly related to the number of transistors, the very quantity whose growth gave rise to the exponential performance increase in the first place.
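To make the relationship concrete, chip power is commonly approximated as the sum of a dynamic (switching) term and a static (leakage) term; the formula below is the standard textbook CMOS model, not one taken from this article:

```latex
P_{\mathrm{total}} \;\approx\;
  \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic (switching)}}
  \;+\;
  \underbrace{N\, I_{\mathrm{leak}}\, V}_{\text{static (leakage)}}
```

Here α is the switching activity factor, C the switched capacitance, V the supply voltage, f the clock frequency, N the transistor count, and I_leak the average leakage current per transistor. The static term grows with N, which is why leakage tracks Moore’s Law, while the V²f dependence of the dynamic term is what the “slow and parallel” approach discussed later exploits.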

Pros and Cons of Low Power, Especially in HPC

While low power consumption may seem to be an obvious engineering ideal for computing systems, especially in HPC, achieving it requires designers to make various tradeoffs that have their own pros and cons.

Pros:

All is not favorable, however. There are various drawbacks to low power designs, some of which are generic and some of which are more peculiar to HPC. Both types require substantial research and engineering.

Cons:

Columns:
  [A] Advanced Vector (Earth Simulator => SX-8)
  [B] High Density Cluster (Itanium Montecito blade + InfiniBand 4x)
  [C] Low Power CPU, Super High Density (BlueGene/L)

                                      [A]        [B]          [C]
GFLOPS/CPU                            16         8            2.8
CPU cores/chip                        1          2            2
CPU chips/cabinet                     8          72           1024
TFLOPS/cabinet                        0.128      1.152        5.7344
Memory BW/chip (GB/s)                 64         10.672       6.4
Memory BW/cabinet (GB/s)              512        768.384      6553.6
Network BW/chip (MB/s)                N/A        625          1050
Network bytes/s per flop              0.125      0.0390625    0.1875
Cabinets for 1 PF (+30% network)      10156      1128         174
Physical size relative to ES          13.22      1.47         0.23
Power/cabinet (kW)                    9          15           25
Total power incl. 30% cooling (MW)    118.83     22.00        5.66
Power relative to ES (8 MW)           14.85      2.75         0.71
Cost/cabinet ($M US)                  1          1            1.5
Total cost ($B US)                    10.16      1.13         0.26
Cost relative to ES ($400M US)        25.39      2.82         0.65

Table 1. Modern HPC Machine Parameters
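The derived rows of Table 1 follow from the per-chip and per-cabinet figures by straightforward arithmetic. The short sketch below reproduces them; the treatment of the 30% network and cooling overheads is an assumption read off the row labels (and BlueGene/L appears to carry no extra network cabinets, presumably because its interconnect is integrated), so the script is illustrative rather than authoritative.

```python
# Reproduce the derived rows of Table 1 from per-chip/per-cabinet figures.
# The 1.3 factors (extra cabinets for the network, extra power for cooling)
# are assumptions inferred from the row labels; BlueGene/L is given no
# network overhead so that the published cabinet count (174) is matched.

systems = {
    # name: (GFLOPS/CPU, cores/chip, chips/cabinet, kW/cabinet, $M/cabinet, net overhead)
    "Earth Simulator class": (16.0, 1,    8,  9.0, 1.0, 1.3),
    "High density cluster":  ( 8.0, 2,   72, 15.0, 1.0, 1.3),
    "BlueGene/L":            ( 2.8, 2, 1024, 25.0, 1.5, 1.0),
}

PETAFLOP_IN_TFLOPS = 1000.0
COOLING_OVERHEAD = 1.3        # +30% power for cooling

for name, (gf_cpu, cores, chips, kw_cab, cost_cab, net_ovh) in systems.items():
    tflops_cab = gf_cpu * cores * chips / 1000.0
    cabinets = round(PETAFLOP_IN_TFLOPS / tflops_cab * net_ovh)
    power_mw = cabinets * kw_cab / 1000.0 * COOLING_OVERHEAD
    cost_billion = cabinets * cost_cab / 1000.0
    print(f"{name:22s} {tflops_cab:7.4f} TF/cabinet  {cabinets:6d} cabinets  "
          f"{power_mw:7.2f} MW  ${cost_billion:5.2f}B")
```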

So, Where Do We Obtain the Power Savings?

With mainstream information technology, such as standard office application suites, speed requirements may have “matured.” But the majority of application areas, in particular the ones mentioned in this article, still need significant (even exponential) improvements in both absolute performance and relative performance/power metrics over the next ten years, as we progress toward building a “true” cyberinfrastructure for science and engineering. Such is quite obviously the case for traditional HPC applications, where even a petaflop machine may not satisfy the needs of the most demanding applications. It is also evident in application areas that are taking a leap to next generation algorithms in order to increase scale, accuracy, etc. One example is large scale text processing/data mining, where the proliferation of the web and the associated explosion of data call for more sophisticated search and mining algorithms to deal with the “data deluge.” Another is the push to develop humanoid robots, which are said to require five to six orders of magnitude more processing power while retaining the human form factor.

The question is, can we achieve these goals? If so, will the techniques/technologies employed in respective domains, as well as their respective requirements, be different? If there are such differences, will this cause one power range to be more likely than the others? Or are there some uncharted territories of disruptive technologies with even more possibilities?

The major power saving techniques, in particular those exploited by more traditional embedded systems as well as by the recent breed of low power HPC systems, can be categorized as follows:

How do these techniques apply to different types of systems in order to optimize different kinds of metrics? To clarify the differences, we have divided the power spectrum into three ranges, each separated by orders of magnitude: microwatts, milliwatts, and watts and beyond. The table below shows the resulting power ranges and their principal application domains, metrics, technical characteristics, example systems, etc. One can observe that the respective properties of these systems diverge significantly.

Average power consumption: Microwatt to Milliwatt
  Application domain: Ubiquitous sensor networks
  Important metrics: Longevity (powering a device for months to years on a single battery); environmental harvesting of power
  Technical characteristics: programming of long duty cycle applications in a tiny CPU/storage environment; ultra low power wireless; autonomous configuration among a group of nodes; fault tolerance via massive redundancy
  Example systems: Mote, TinyOS (UC Berkeley)

Average power consumption: Milliwatt to Watt
  Application domain: Standard embedded devices
  Important metrics: Long battery life for dedicated, real-time applications
  Technical characteristics: various “classical” low power techniques; adjusting CPU speed and voltage in periodic real-time processing; dynamic reconfiguration via software/hardware co-design
  Example systems: Various embedded OSes

Average power consumption: Over one Watt
  Application domain: PCs/workstations, servers, HPC
  Important metrics: Maintaining high performance; high thermal density
  Technical characteristics: CPU power consumption dominant, hence “slow and parallel”; significant fine-grained software control (measurement, prediction, planning, DVS control); need for low power, high performance networking; high reliability and scalability
  Example systems: Notebook PC/blade server, BlueGene/L, Green Destiny

Table 2. Low Power System Power Range Categorizations and their Properties
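One entry in the milliwatt-to-watt range, adjusting CPU speed and voltage for periodic real-time processing, can be illustrated with the classic dynamic voltage/frequency scaling (DVS) argument sketched below; the cubic power model and the utilization figures are illustrative assumptions, not measurements from any system named here.

```python
# A minimal sketch of the classic DVS argument for periodic real-time
# work: if a job needs only a fraction U of each period at full speed,
# running the whole period at roughly U * f_max still meets the deadline
# while cutting dynamic power.  Assumes voltage scales with frequency,
# so dynamic power ~ f**3; real processors offer only discrete steps.

F_MAX = 1.0  # normalized peak frequency

def slowest_feasible_frequency(cpu_demand: float) -> float:
    """Lowest normalized frequency that still finishes the job on time."""
    return min(F_MAX, cpu_demand * F_MAX)

def relative_dynamic_power(freq: float) -> float:
    """Dynamic power relative to running at F_MAX, under the cubic model."""
    return (freq / F_MAX) ** 3

for cpu_demand in (1.0, 0.75, 0.5, 0.25):
    f = slowest_feasible_frequency(cpu_demand)
    # Energy per period: run-fast-then-idle uses full power for a fraction
    # cpu_demand of the period; the DVS version runs the whole period at f.
    energy_fast_then_idle = 1.0 * cpu_demand
    energy_dvs = relative_dynamic_power(f) * 1.0
    ratio = energy_dvs / energy_fast_then_idle
    print(f"CPU demand {cpu_demand:4.0%}: run at {f:.2f}*f_max, "
          f"energy ~{ratio:.0%} of run-fast-then-idle")
```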

Examining this table, one might argue that systems in the Microwatt and the Milliwatt ranges do have some similarities, four of which are outlined below:

  1. Devices in both categories are typically driven by batteries and/or independent (solar or energy-harvesting) generators, without AC electrical wiring, and as such the longevity of battery life is of utmost concern.
  2. Their application space is dedicated, or at least fairly restricted for each device; they are not meant to be general-purpose computing devices used for every possible application. The applications also tend to be real-time in nature, primarily sensing, device control, and multimedia. These application characteristics have two consequences. First, when combined, these properties allow duty cycling to be performed extensively. For example, the Berkeley Mote envisions applications in which a single battery will last for months, with sensing and networking duty cycled in phases of tens of seconds to minutes (a rough estimate of the effect is sketched after this list). Second, they sometimes allow dedicated and/or reconfigurable hardware to be employed for the performance/energy demanding portion of the application, such as multimedia encoding/decoding, which in some cases brings about orders of magnitude improvement in the performance/power ratio.
  3. Their physical locations tend to be spaced apart, as with mobile devices. Coupled with their very low power, this means thermal density is not the primary concern (it may be in some modern embedded multimedia devices, but even there it is not the driving motivation for achieving low power).
  4. Although in modern applications they are often networked, these devices do not work together in a tightly-coupled fashion to execute a single application, and as a result network bytes/flop is not as demanding as it is with HPC systems.
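To make the duty cycling point in item 2 concrete, the back-of-the-envelope battery-life arithmetic for a sensor node might look like the following; all capacity and current figures are illustrative assumptions, not specifications of the Berkeley Mote.

```python
# Back-of-the-envelope battery life for a duty-cycled sensor node.
# All numbers are illustrative assumptions, not Mote specifications.

BATTERY_MAH = 2000.0   # e.g. a pair of AA cells
ACTIVE_MA   = 20.0     # radio + CPU while sensing/transmitting
SLEEP_MA    = 0.01     # deep-sleep current (10 microamps)

def lifetime_months(duty_cycle: float) -> float:
    """Expected lifetime in months for a given active-time fraction."""
    avg_ma = duty_cycle * ACTIVE_MA + (1.0 - duty_cycle) * SLEEP_MA
    hours = BATTERY_MAH / avg_ma
    return hours / (24 * 30)

for duty in (1.0, 0.1, 0.01, 0.001):
    print(f"duty cycle {duty:6.1%}: ~{lifetime_months(duty):6.1f} months")
```

Under these assumptions, dropping the active fraction from 100% to 1% stretches the lifetime from a few days to over a year, which is the effect the Mote-style designs depend on.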

In the HPC arena, by contrast, the important point to note is that low power is now being considered an essential means to achieve the traditional goal of high performance. This may at first seem oxymoronic, since lower power usually means lower performance in embedded and ULP devices, where great efforts are made to “recover” as much of the lost performance as possible. However, BlueGene/L and other HPC machines that utilize low power technologies have demonstrated that, by exploiting “slow and parallel” designs, we may actually achieve higher overall performance.
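To see why “slow and parallel” can win, consider the standard dynamic power relation P ∝ V²f with supply voltage assumed to scale roughly with frequency; the sketch below is a simplified illustration under those assumptions, not a model of BlueGene/L.

```python
# Simplified "slow and parallel" arithmetic: dynamic power ~ V^2 * f, and
# supply voltage is assumed to scale roughly with frequency, so per-chip
# power ~ f^3.  Halving the clock while doubling the chip count keeps
# aggregate throughput but cuts power.  Leakage, memory and network power
# are ignored here, which is why real-world gains are smaller.

def cluster_power(n_chips: int, rel_freq: float) -> float:
    """Total dynamic power relative to one chip at full frequency."""
    return n_chips * rel_freq ** 3

def cluster_throughput(n_chips: int, rel_freq: float) -> float:
    """Aggregate throughput relative to one chip at full frequency,
    assuming the workload parallelizes perfectly."""
    return n_chips * rel_freq

configs = {
    "fast serial":       (1, 1.0),   # one chip at full speed
    "slow and parallel": (2, 0.5),   # two chips at half speed
}

for name, (n, f) in configs.items():
    print(f"{name:18s} throughput {cluster_throughput(n, f):.2f}  "
          f"power {cluster_power(n, f):.2f}")
```

Under this idealized model the two-chip, half-speed configuration delivers the same throughput at one quarter of the dynamic power, which is the essence of the scalability argument.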

Still, the properties of HPC systems present a different opportunity space for low power than the one embedded or ULP devices confront:

Are Low Power HPC Systems Too Divergent from Traditional Embedded Low Power Systems?

Given the observations above, could we go as far as to say that low power HPC systems are so divergent from traditional embedded systems that there are no research results or engineering techniques they can share? As a matter of fact, there are commonalities that permit such sharing, and we are starting to see some “convergence” between the low power realization techniques in HPC and those in other power ranges. Here are some examples:

The Future of Low Power HPC is “Overdesign” and “Portability”

We have examined the relationships between the various areas of low power computing, focusing especially on their similarities and differences. Overall, the importance of low power in HPC is still not well recognized by the community, despite the success of BlueGene/L. In particular, controlling power requires a sophisticated form of system self-control. This type of control is practiced as a norm in other disciplines but remains quite crude in computers, especially in large HPC machines.

For example, modern fighter aircraft are deliberately made somewhat aerodynamically unstable in order to improve their maneuverability; to recover and maintain operational stability, they rely on massive, dynamic, computer-assisted real-time control. Modern automobiles embed extensive self-control for engines and handling, without which the car would easily break down or at least suffer from poorer performance. Compared to these technology domains, power/performance controls in modern-day HPC machines are meager at best. They may contain some simple feedback loops that, for example, speed up the cooling fans when the internal chassis temperature climbs, or that apply crude automated control of voltage/frequency without regard for application characteristics. There are other promising avenues of research, as the other articles in this issue show, but further investigation is required to identify the limits of such control methodologies, as well as to discover better ways to conserve power.

One promising conceptual design principle that this author envisions is to “overdesign” the system, i.e., engineer it so that, without software self-control, the system would break down (say, thermally), hit other power limits, or become very power/performance inefficient. Most of the machines we design now follow quite conservative engineering disciplines, so that no matter how much we hammer them they will not break; alternatively, we design for maximum efficiency relative to the achievable theoretical peak. Now that we are quickly approaching the one billion transistor mark in our CPUs (and heading toward ten billion), there are plenty of transistors available to consume power, whether exploited directly or used for alternative purposes. Moreover, we will have a better understanding of how to monitor and control power depending on the system and application states (including multiple applications within the system). With multiple failovers in place, we could “overdesign” the system so that it operates at the maximum performance/power ratio (which may be somewhat below the maximum computational efficiency), while driving the efficiency above that point would “break” the system. To achieve such a subtle balance, various hardware and software sensors would monitor performance/power metrics and feed regulatory feedback into the system, enabling dynamic fine tuning of both software (such as scheduling) and hardware (such as DVS).
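As a concrete, if simplified, sketch of such a regulatory loop, the code below steps a node's DVS level up or down to hold it near an assumed thermal set point; read_temperature and set_dvs_level are hypothetical placeholders for hardware hooks, not an existing API.

```python
# A minimal sketch of the regulatory feedback loop described above: keep a
# node as close as possible to a thermal set point by stepping its DVS
# level.  read_temperature() and set_dvs_level() are hypothetical hardware
# hooks, not a real API; the set point and steps are illustrative.

import time

TEMP_SET_POINT_C = 85.0   # assumed safe operating target
TEMP_MARGIN_C    = 2.0    # dead band to avoid oscillation
DVS_LEVELS       = [0.6, 0.7, 0.8, 0.9, 1.0]   # normalized frequency steps

def read_temperature(node: int) -> float:
    """Placeholder for an on-board thermal sensor reading (deg C)."""
    raise NotImplementedError

def set_dvs_level(node: int, level: float) -> None:
    """Placeholder for a DVS/DVFS actuator on the given node."""
    raise NotImplementedError

def regulate(node: int, period_s: float = 1.0) -> None:
    """Run as fast as the thermal budget allows, throttling before any
    hardware failover mechanism would have to trip."""
    idx = len(DVS_LEVELS) - 1            # start at full speed
    while True:
        temp = read_temperature(node)
        if temp > TEMP_SET_POINT_C + TEMP_MARGIN_C and idx > 0:
            idx -= 1                     # too hot: step frequency down
        elif temp < TEMP_SET_POINT_C - TEMP_MARGIN_C and idx < len(DVS_LEVELS) - 1:
            idx += 1                     # headroom left: step frequency up
        set_dvs_level(node, DVS_LEVELS[idx])
        time.sleep(period_s)
```

A production version would of course fold in power sensors, application-level performance counters, and coordination across nodes, but the shape of the loop, measure, compare to a budget, actuate, is the same.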

Such a design principle may allow substantial improvement in the various metrics that motivate the pursuit of low power in HPC in the first place. For example, one may put an extensive set of thermal sensors in a densely packed machine to intricately control power/performance so as to maintain thermal consistency throughout the system. In such a machine, it would be impossible to achieve the theoretical maximum performance, since doing so would break the system and some failover mechanism would have to kick in to throttle it. Overall, its performance per volume may nevertheless be substantially greater than that of a conservatively designed machine, for various reasons, including that it would be running more units in parallel at the best performance/energy tradeoff point.

Many technical challenges would have to be conquered for such a system to become a reality, however. For example, most current motherboards, including server-grade, high-end versions, lack the sensors required to perform such intricate monitoring of thermal conditions and power consumption. In many cases, the only available sensors may be a few thermistors, with no power sensors present except voltage meters on power lines. Although the state of the art in the analysis of performance/power tradeoffs is advancing (as seen in Dr. Feng’s article [4] mentioned previously), most of the results are still early, with no broad-based community efforts, such as standardization, to enable, facilitate, or promote usage of the technology. In fact, because of the significant effect such low power systems will have on the software infrastructure, including compilers, run-time systems, libraries, performance monitors, etc., it is currently impractical to expect any portability across different types of machines. Here, theoretical modeling of such machines, leading to eventual standardization, will be necessary for realistic deployment to occur.

[1] The Energy Star Home Page, http://www.energystar.gov/
[2] Computers Division, “Design of Eco Products SX-6,” NEC Technical Journal, Vol. 57, No. 1, 2004, http://www.nec.co.jp/techrep/ja/journal/g04/n01/t040105.pdf (in Japanese).
[3] IBM Journal of Research and Development, special double issue on Blue Gene, Vol. 49, No. 2/3, March/May 2005.
[4] W. Feng, “Making a Case for Efficient Supercomputing,” ACM Queue, 1(7):54-64, October 2003.
[5] Hiroshi Nakashima, Hiroshi Nakamura, Mitsuhisa Sato, Taisuke Boku, Satoshi Matsuoka, et al., “MegaProto: 1 TFlops/10 kW Rack Is Feasible Even with Only Commodity Technology,” Proc. IEEE/ACM Supercomputing 2005, IEEE Computer Society Press, Nov. 2005 (to appear).

URL to article: http://www.ctwatch.org/quarterly/articles/2005/08/low-power-computing-for-fleas-mice-and-mammoth/