| Title | Optimal Checkpointing Period with replicated execution on heterogeneous platforms |
| Publication Type | Conference Paper |
| Year of Publication | 2017 |
| Authors | Benoit, A., A. Cavelan, V. Le Fèvre, and Y. Robert |
| Conference Name | 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale |
| Date Published | 2017-06 |
| Publisher | IEEE Computer Society Press |
| Conference Location | Washington, DC |
| Abstract | In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size W for a periodic checkpointing strategy where both platforms concurrently try to execute W units of work before checkpointing. The first platform that completes its pattern takes a checkpoint, and the other platform interrupts its execution to synchronize from that checkpoint. We compare this strategy to a simpler on-failure checkpointing strategy, where a checkpoint is taken by one platform only whenever the other platform encounters a failure. We use first- or second-order approximations to compute overheads and optimal pattern sizes, and show through extensive simulations that these models are very accurate. The simulations show the usefulness of a secondary platform to reduce execution time, even when the platforms have relatively different speeds: on average, over a wide range of scenarios, the overhead is reduced by 30%. The simulations also demonstrate that the periodic checkpointing strategy is globally more efficient, unless platform speeds are quite close. |
| DOI | 10.1145/3086157.3086165 |



