CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Garth Gibson, Carnegie Mellon University
Bianca Schroeder, Carnegie Mellon University
Joan Digney, Carnegie Mellon University

Better Fault Tolerance for Petascale Computers

As LANL's data suggests that the failure rate grows in proportion to the number of sockets, keeping the number of sockets constant should stave off an increase in the failure rate. Doing so, however, means either failing to keep up with the top500.org aggregate performance trend, or increasing the performance of each processor chip faster than currently projected.[8] Chip designers consider a return to the era of rapidly decreasing processor cycle times unlikely because of the power consumption implications. The remaining alternative, increasing the number of cores per chip more quickly, would probably not be effective even if it were possible, because memory bandwidth per chip will not keep up. We therefore expect the number of processor chip sockets to continue to increase in order to keep performance on the top500.org trend.

Socket Reliability: The increase in failure rates could also be prevented if individual processor chip sockets were made more reliable each year, i.e., if per-socket MTTI were to increase in proportion to the number of sockets per system over time. Unfortunately, LANL's data does not indicate that hardware has become more reliable over time, suggesting that as long as the number of sockets is rising, the system-wide MTTI will drop.

Partitioning: The number of interrupts an application sees depends on the number of processor chips it uses in parallel. One way to stave off the drop in per-application MTTI would be to run each application on a constant-sized sub-partition of the machine, rather than on all of its nodes. Unfortunately, while this works for small applications whose performance does not need to grow faster than per-chip performance, it is not appealing for the most demanding "hero" applications, for which the largest new computers are often justified.

Faster Checkpointing: The basic premise of the checkpoint-restart approach to fault tolerance, often called the balanced system design, is that storage bandwidth increases in proportion to total performance, so that the time to take a checkpoint stays constant provided MTTI does not decrease.[1] Though achieving balance (i.e., doubling storage bandwidth every year) is already a difficult challenge for storage systems, one way to cope with increasing failure rates is to push storage bandwidth even further. For example, assuming that the number of sockets, and hence the failure rate, grows by 40% per year, effective application utilization would stay the same if checkpoints were taken in 30% less time each year. Projections show that this sort of increase in bandwidth is orders of magnitude higher than the commonly expected increase in bandwidth per disk drive (generally about 20% per year). The additional bandwidth would therefore have to come from rapid growth in the total number of drives, well over 100% per year, increasing the cost of the storage system much faster than any other part of a petascale computer. This might be possible, but it is not very desirable.
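To make the arithmetic behind this claim concrete, the following is a minimal sketch in Python of a first-order checkpoint-restart utilization model using Young's approximation for the optimal checkpoint interval. The 40% and 30% annual rates are the assumptions stated above; the starting MTTI and checkpoint time are illustrative values, not measurements from any particular system.

    # Minimal sketch: first-order checkpoint/restart utilization model using
    # Young's approximation for the optimal checkpoint interval.
    import math

    def utilization(mtti_hours, ckpt_hours):
        """Fraction of machine time spent on useful work when checkpointing
        at roughly the optimal interval sqrt(2 * checkpoint_time * MTTI)."""
        interval = math.sqrt(2.0 * ckpt_hours * mtti_hours)
        checkpoint_overhead = ckpt_hours / interval    # time spent writing checkpoints
        rework = interval / (2.0 * mtti_hours)         # expected recomputation after failures
        return max(0.0, 1.0 - checkpoint_overhead - rework)

    mtti, ckpt = 8.0, 0.5          # assumed starting MTTI and checkpoint time, in hours
    for year in range(6):
        print(f"year {year}: utilization ~ {utilization(mtti, ckpt):.2f}")
        mtti /= 1.4                # failure rate (1/MTTI) grows ~40% per year
        ckpt *= 0.7                # checkpoints taken in ~30% less time each year

Because writing checkpoints in roughly 1/1.4 of the previous year's time holds the ratio of checkpoint time to MTTI essentially constant, the printed utilization stays flat even as failures become more frequent.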

Another option is to decrease the amount of memory being checkpointed, either by not growing total memory as fast, or by compressing application data so that only a fraction of memory is written in each checkpoint. Growing total memory at a slower-than-balanced rate would help reduce total system cost, and is perhaps independently likely, but may not be acceptable for the most demanding applications. Of the two, compression seems the more appealing approach, and it is entirely under the control of application programmers.

Achieving higher checkpoint speedups purely through compression will require significantly better compression ratios each year. As early as 2010, an application would have to construct checkpoints no larger than 50% of total memory. Once the 50% mark is crossed, other options become viable, such as diskless checkpointing, where the checkpoint is written to the volatile memory of another node rather than to disk,[11, 12] or hybrid approaches.[13] We recommend that any application capable of compressing its checkpoints pursue this path; considering the increasing number of cycles that will go into checkpointing, the compute time needed for compression may be time well spent.
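As a rough illustration of how this requirement compounds, the sketch below assumes, as above, that checkpoints must be written about 30% faster each year and, pessimistically, that none of that speedup comes from additional storage bandwidth, so the entire reduction must come from shrinking the checkpoint itself; the 2008 starting point is an assumption for illustration only.

    # Sketch: required checkpoint size as a fraction of memory, if checkpoints
    # must shrink ~30% per year and storage bandwidth provides no extra help.
    fraction = 1.0                       # assumed starting point: full memory in 2008
    for year in range(2008, 2013):
        print(f"{year}: checkpoint must fit in {fraction:.0%} of memory")
        fraction *= 0.7                  # ~30% smaller each year

Under these assumptions the 50% mark is crossed around 2010.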

A third approach to taking checkpoints faster is to introduce special devices between memory and storage that accept a checkpoint at speeds that scale with memory, then relay the checkpoint to storage after the application has resumed computing. Although such an intermediate memory could be very expensive, since it must be as large as main memory, it might be a good application for cheaper but write-limited technologies such as flash memory, because checkpoints are written relatively infrequently.
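The sketch below illustrates this two-stage idea under simplifying assumptions: ordinary host memory and a background thread stand in for the dedicated intermediate device, the state size and file name are made up, and a sleep simulates a slow disk. The point is only that the application blocks for the fast copy and resumes while the checkpoint drains.

    # Sketch: two-stage checkpointing through an intermediate buffer.
    # The application blocks only while copying state into the (fast) buffer;
    # a background thread drains the buffer to (slow) persistent storage.
    import threading, time

    def drain_to_storage(buffered_checkpoint, path):
        time.sleep(2.0)                            # stand-in for a slow write to disk
        with open(path, "wb") as f:
            f.write(buffered_checkpoint)
        print("checkpoint drained to", path)

    app_state = bytearray(64 * 1024 * 1024)        # made-up application state (64 MB)

    # Phase 1: block only for the fast copy into the intermediate buffer.
    buffered_checkpoint = bytes(app_state)

    # Phase 2: drain to slow storage in the background; computing resumes now.
    threading.Thread(target=drain_to_storage,
                     args=(buffered_checkpoint, "checkpoint.dat")).start()

    print("application resumes computing while the checkpoint drains to disk")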

Non-Checkpoint-based Fault Tolerance: Process-pairs duplication, in which every computation is done twice and checked, is a traditional method for tolerating detectable faults that has not been applied to HPC systems.[14, 15, 16] Every operation is performed on two different nodes, so the later failure of one node does not destroy the operation's results. Process pairs would eliminate both the cost of writing checkpoints, because none are needed, and the work lost in the case of failure. However, process pairs are expensive: they require giving up 50% of the hardware to compute each operation twice on different nodes, and they introduce further overhead to keep the pairs in sync. Still, if no other method keeps utilization above 50%, this sacrifice might become appropriate, and it bounds the decrease in effectiveness to about 50%, perhaps without requiring special hardware.
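The following is a minimal sketch of the process-pairs idea, not a reconstruction of the cited systems: each task is run on two simulated "nodes" (threads here), a random exception stands in for a node failure, and the job continues as long as at least one replica of each task survives, with disagreement between surviving replicas treated as a detected error.

    # Sketch: process-pairs style redundancy. Each task is computed twice on
    # different "nodes" (threads stand in for nodes); the result survives the
    # failure of either replica. Failures are simulated with random exceptions.
    import random
    from concurrent.futures import ThreadPoolExecutor

    def run_on_node(node_id, task):
        if random.random() < 0.2:                  # simulated node failure
            raise RuntimeError(f"node {node_id} failed")
        return task * task                         # stand-in for the real computation

    def run_as_pair(task):
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(run_on_node, node, task) for node in (0, 1)]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except RuntimeError:
                pass                               # tolerate the loss of one replica
        if len(results) == 2 and results[0] != results[1]:
            raise RuntimeError("replicas disagree: detected a silent error")
        if not results:
            return run_as_pair(task)               # both replicas lost: rerun the pair
        return results[0]

    print([run_as_pair(task) for task in range(5)])

The 50% hardware cost is visible directly in the sketch: two nodes are consumed to produce each result, and in a real system the comparison and failover would be handled by the runtime rather than by the application itself.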


