Title | A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? |
Publication Type | Journal Article |
Year of Publication | 2024 |
Authors | Bautista-Gomez, L., A. Benoit, S. Di, T. Herault, Y. Robert, and H. Sun |
Journal | Future Generation Computer Systems |
Date Published | 2024-07 |
ISSN | 0167-739X |
Abstract | The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. We provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication. |
URL | https://linkinghub.elsevier.com/retrieve/pii/S0167739X24003777 |
DOI | 10.1016/j.future.2024.07.022 |
Short Title | Future Generation Computer Systems |