A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

Title: A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
Publication Type: Journal Article
Year of Publication: 2024
Authors: Bautista-Gomez, L., A. Benoit, S. Di, T. Herault, Y. Robert, and H. Sun
Journal: Future Generation Computer Systems
Date Published: 2024-07
ISSN: 0167-739X
Abstract

The Young/Daly formula provides an approximation of the optimal checkpointing period for a parallel application executing on a supercomputing platform. It was originally designed to handle fail-stop errors for preemptible tightly-coupled applications, but has been extended to other application and resilience frameworks. We provide some background and survey various scenarios to assess the usefulness and limitations of the formula, both for preemptible applications and workflow applications represented as a graph of tasks. We also discuss scenarios with uncertainties, and extend the study to silent errors. We exhibit cases where the optimal period is of a different order than that dictated by the Young/Daly formula, and finally we explain how checkpointing can be further combined with replication.
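
For readers unfamiliar with the formula the abstract refers to, the sketch below computes its commonly quoted first-order form, sqrt(2 * mu * C). The symbols are assumptions for illustration and do not appear in this record: C is taken as the time to write one checkpoint and mu as the platform mean time between failures, both in seconds. The function names are hypothetical, and the overhead model is the usual first-order approximation, not necessarily the exact formulation used in the paper.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * mtbf * checkpoint_cost).

    Assumed units: both arguments in seconds (illustrative choice).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_overhead(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time wasted for a given period.

    First-order model: checkpoint_cost / period accounts for time spent
    checkpointing, and period / (2 * mtbf) for work lost on average
    when a failure strikes.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    c, mu = 50.0, 10_000.0  # hypothetical values, in seconds
    w = young_daly_period(c, mu)
    print(f"period = {w:.1f} s, overhead = {first_order_overhead(w, c, mu):.3f}")
```

Under this first-order model, periods shorter or longer than the computed value both increase the overhead, which is the sense in which the formula approximates an optimum.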

URL: https://linkinghub.elsevier.com/retrieve/pii/S0167739X24003777
DOI: 10.1016/j.future.2024.07.022
Short Title: Future Generation Computer Systems