CTWatch Quarterly » Failure Tolerance in Petascale Computers

Failure Tolerance in Petascale Computers

Garth Gibson, Carnegie Mellon University
Bianca Schroeder, Carnegie Mellon University
Joan Digney, Carnegie Mellon University

Lower Mean Time To Interrupt (MTTI) in Petascale Computers

What does our data analysis, examined in the light of recent technology trends, predict for the reliability and availability of future HPC systems?

Our essential prediction is that the number of processor chips will grow with time, increasing failure rates and fault tolerance overheads.

First, we expect petascale computers will be conceived and constructed according to long standing trends (aggregate compute performance doubling every year) shown on the top500.org list of the largest documented computers.⁷ Second, we expect little or no increase in clock speed, but an increase in the number of processor cores per processor chip, commonly referred to as a socket in the new multi-core processor era, at a fast rate, estimated as doubling every two years.⁸ Our data also predicts that failure rates will grow in proportion to the number of sockets in the system and that there is no indication that the failure rate per socket will decrease over time with technology changes. Therefore, as the number of sockets in future systems increases to achieve top500.org performance trends, we expect the system wide failure rate will increase.

In an attempt to quantify what one might expect to see in future systems, we examined the LANL data and found that an optimistic estimate for the failure rate per year per socket is 0.1. Our data does not predict how failure rates will change with increasing numbers of cores per processor chip core, but it is reasonable to predict that many failure prone mechanisms operate at the chip level, so we make the (possibly highly optimistic) assumption that failure rates will increase only with the number of chip sockets, and not with the number of cores per chip.

As a baseline for our projections, we modeled the Jaguar system at Oak Ridge National Laboratory (ORNL). After it is expanded to a Petaflop system in 2008, Jaguar is expected to have around 11,000 processor sockets (dual-core Opterons), 45 TB of main memory and a storage bandwidth of 55 GB/s.⁹ Predictions for system expansion are bracketed with three projected rates of growth, with numbers of cores doubling every 18, 24 and 30 months.

Figure 4. (a) The expected growth in failure rate and (b) decrease in MTTI, assuming that the number of cores per socket grows by a factor of two every 18, 24 and 30 months, respectively, and the number of sockets increases so that aggregate performance conforms to top500.org.

Figure 4 plots the expected increase in failure rate and corresponding decrease in mean time to interrupt (MTTI), based on the above assumptions. Even if we assume a zero increase in failure rate with more cores per socket (a stretch), the failure rates across the biggest machines in the top 500 lists of the future can be expected to grow dramatically.

Pages: 1 2 3 4 5 6 7

CTWatch is a collaborative effort				Sponsored By