Assuming failure independence: are we right to be wrong?

Title	Assuming failure independence: are we right to be wrong?
Publication Type	Conference Paper
Year of Publication	2017
Authors	Aupy, G., Y. Robert, and F. Vivien
Conference Name	The 3rd International Workshop on Fault Tolerant Systems (FTS)
Date Published	2017-09
Publisher	IEEE
Conference Location	Honolulu, Hawaii
Abstract	This paper revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascade-aware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, there are a few logs that contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
