The most demanding applications, which are often the same applications that justify the largest computers, will see ever-increasing failure rates if the trends seen at top500.org continue. Under the standard checkpoint-restart fault tolerance strategy, the efficacy of petascale machines running these applications will fall off. Historical data gives little reason to expect computer vendors to counter this trend, and relying on disk storage bandwidth to counter it is likely to be expensive at best. We recommend that these applications consider spending an increasing number of cycles compressing checkpoints. We also recommend experimenting with process-pairs fault tolerance for supercomputing. And if technologies such as flash memory are appropriate, we recommend experimenting with special devices devoted to checkpointing.
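To make the checkpoint-compression recommendation concrete, the following minimal sketch uses Young's first-order approximation [10] to estimate how utilization changes as the mean time to interrupt (MTTI) shrinks and as checkpoints get smaller. The function names, the 600-second checkpoint time, and the MTTI values are illustrative assumptions, not measurements from our data. The model also ignores restart time and the cycles spent on compression itself.

```python
import math


def young_interval(checkpoint_s: float, mtti_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * MTTI)."""
    return math.sqrt(2.0 * checkpoint_s * mtti_s)


def approx_utilization(checkpoint_s: float, mtti_s: float) -> float:
    """First-order estimate of the fraction of time spent on useful work.

    Overhead per unit time is the cost of writing checkpoints (delta / tau)
    plus the expected work lost to a failure (tau / (2 * MTTI)).
    Restart time and compression cost are ignored (simplifying assumptions).
    """
    tau = young_interval(checkpoint_s, mtti_s)
    overhead = checkpoint_s / tau + tau / (2.0 * mtti_s)
    return max(0.0, 1.0 - overhead)


if __name__ == "__main__":
    base_checkpoint_s = 600.0  # hypothetical time to write one uncompressed checkpoint
    for mtti_hours in (24.0, 8.0, 2.0):  # hypothetical MTTI values
        for size_ratio in (1.0, 0.5):  # 0.5 assumes compression halves the write time
            delta = base_checkpoint_s * size_ratio
            util = approx_utilization(delta, mtti_hours * 3600.0)
            print(f"MTTI={mtti_hours:>4.0f} h  checkpoint={delta:>5.0f} s  "
                  f"utilization~{util:.2f}")
```

Under this model the overhead at the optimum interval is sqrt(2 * delta / MTTI), so halving the checkpoint time offsets a halving of the MTTI. This is only a rough estimate of why compression can help as failure rates rise.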
The work described in this article is part of a broader research agenda whose goal is to analyze and make publicly available the failure data from a large variety of real production systems. To date, large-scale studies of failures in real production systems are scarce, probably because the owners of such systems are reluctant to release failure data. We have therefore built a public Computer Failure Data Repository (CFDR), hosted by the USENIX Association [4], with the goal of accelerating research on system reliability by filling the nearly empty collection of public data with detailed failure data from a variety of large production systems. We encourage all petascale computing organizations to collect and publish failure data for their systems in the repository.
We would like to thank James Nunez and Gary Grider from the High Performance Computing Division at Los Alamos National Lab for collecting and providing us with data and for helping us interpret it. We thank the members and companies of the PDL Consortium (including APC, Cisco, Google, EMC, Hewlett-Packard, Hitachi, IBM, Intel, LSI, Microsoft, Network Appliance, Oracle, Panasas, Seagate, and Symantec) for their interest and support. This material is based upon work supported by the Department of Energy under Award Number DE-FC02-06ER25767 [3] and on research sponsored in part by the Army Research Office, under agreement number DAAD19-02-1-0389.
2 Schroeder, B., Gibson, G. “Understanding Failures in Petascale Computers,” in SciDAC 2007: Journal of Physics: Conference Series 78 (2007) 012022.
3 Scientific Discovery through Advanced Computing (SciDAC), The Petascale Data Storage Institute (PDSI). www.pdsi-scidac.org, 2006.
4 The Computer Failure Data Repository (CFDR) - cfdr.usenix.org.
5 Schroeder, B., Gibson, G. “A large-scale study of failures in high-performance computing systems,” in Proc. of the 2006 International Conference on Dependable Systems and Networks (DSN’06), 2006.
6 The LANL raw data and more information are available at: www.lanl.gov/projects/computerscience/data/.
7 Top 500 supercomputing sites - www.top500.org, 2007.
8 Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., Patterson, D. A., Plishker, W. L., Shalf, J., Williams, S. W., Yelick, K. A. “The landscape of parallel computing research: A view from Berkeley,” Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
9 Roth, P. C. “The Path to Petascale at Oak Ridge National Laboratory,” in
10 Young, J. W. “A first order approximation to the optimum checkpoint interval,” Commun. ACM, 17(9):530–531, 1974.
11 Plank, J. S., Li, K. “Faster checkpointing with N + 1 parity,” in Proc. 24th International Symposium on Fault Tolerant Computing, 1994.
12 Plank, J. S., Li, K., Puening, M. A. “Diskless checkpointing,” IEEE Trans. Parallel Distrib. Syst., 9(10):972–986, 1998.
13 Vaidya, N. H. “A case for two-level distributed recovery schemes,” in Proceedings of the 1995 ACM SIGMETRICS conference, 1995.
14 Bressoud, T. C., Schneider, F. B. “Hypervisor-based fault tolerance,” ACM Trans. Comput. Syst., 14(1):80–107, 1996.
15 Chapin, J., Rosenblum, M., Devine, S., Lahiri, T., Teodosiu, D., Gupta, A. “Hive: fault containment for shared-memory multiprocessors,” in SOSP ’95: Proceedings of the fifteenth ACM symposium on Operating systems principles, 1995.
16 McEvoy, D. “The architecture of Tandem’s NonStop system,” in ACM ’81: Proceedings of the ACM ’81 conference, page 245, New York, NY, USA, 1981. ACM Press.