The primary data set we are studying was collected between 1995 and 2005 at Los Alamos National Laboratory (LANL, www.lanl.gov) and covers 22 high-performance computing systems, including a total of 4,750 machines and 24,101 processors [5]. Figure 1 shows pictures of two LANL systems. The data contain an entry for every failure during the nine-year period that resulted in an application interruption or a node outage. They cover all aspects of system failures: software failures, hardware failures, failures due to operator error, network failures, and failures due to environmental problems (e.g., power outages). For each failure, the data record the start time and end time, the system and node affected, and categorized root cause information. To the best of our knowledge, this is the largest failure data set studied to date, both in the time period it spans and in the number of systems and processors it covers. It is also the first to be publicly available to researchers [6].
![Figure 1a](../../../../../figures/12/gibson-figure1a.jpg)
![Figure 1b](../../../../../figures/12/gibson-figure1b.jpg)
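To make the per-failure record described above concrete, the following is a minimal Python sketch of how one entry could be represented and summarized. The class, field, and function names (`FailureRecord`, `system_id`, `node_id`, `downtime_by_cause`) are hypothetical and do not reflect the actual layout of the LANL release. Only the kinds of information stored (start and end time, system, node, root cause category) come from the description of the data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class RootCause(Enum):
    # Categories follow the text; the identifiers themselves are illustrative.
    SOFTWARE = "software"
    HARDWARE = "hardware"
    OPERATOR = "operator error"
    NETWORK = "network"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class FailureRecord:
    # Hypothetical field names; the real data set may use a different schema.
    system_id: int
    node_id: int
    start: datetime
    end: datetime
    root_cause: RootCause

    def __post_init__(self):
        # A record whose end precedes its start cannot describe a real outage.
        if self.end < self.start:
            raise ValueError("end time precedes start time")

    @property
    def downtime(self) -> timedelta:
        # Time from the start of the failure until the node was repaired.
        return self.end - self.start


def downtime_by_cause(records):
    """Sum the downtime of all records, grouped by root cause category."""
    totals = {}
    for rec in records:
        totals[rec.root_cause] = totals.get(rec.root_cause, timedelta()) + rec.downtime
    return totals


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the LANL data.
    example = FailureRecord(
        system_id=1,
        node_id=10,
        start=datetime(2000, 1, 1, 0, 0),
        end=datetime(2000, 1, 1, 2, 0),
        root_cause=RootCause.HARDWARE,
    )
    print(example.downtime)
```

Keeping the start and end times as timestamps, rather than storing a precomputed duration, lets the same records be used both for time-between-failure analysis and for repair-time analysis.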