%0 Journal Article %J International Journal of Networking and Computing %D 2019 %T Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors %A Anne Benoit %A Aurelien Cavelan %A Florina M. Ciorba %A Valentin Le Fèvre %A Yves Robert %K checkpoint %K fail-stop error; silent error %K HPC %K linear workflow %K Replication %X Large-scale platforms currently experience errors from two different sources, namely fail-stop errors (which interrupt the execution) and silent errors (which strike unnoticed and corrupt data). This work combines checkpointing and replication for the reliable execution of linear workflows on platforms subject to these two error types. While checkpointing and replication have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in error-prone environments. Moreover, combined checkpointing and replication has not yet been studied in the presence of both fail-stop and silent errors. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it to ensure its reliable execution. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques, lead to improved performance.
%B International Journal of Networking and Computing %V 9 %P 2-27 %8 2019 %G eng %U http://www.ijnc.org/index.php/ijnc/article/view/194 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2018 %T Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing %A Anne Benoit %A Aurelien Cavelan %A Franck Cappello %A Padma Raghavan %A Yves Robert %A Hongyang Sun %K checkpointing %K fail-stop errors %K Fault tolerance %K High-performance computing %K Replication %K silent errors %X This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale platforms. Compared with fail-stop errors that are immediately detected when they occur, silent errors require a detection mechanism. To detect silent errors, many application-specific techniques are available, either based on algorithms (e.g., ABFT), invariant preservation or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level (duplication, triplication or more) of replication for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and even to correct some errors, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes. Each process is replicated, and the platform is composed of process pairs, or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times. The platform is partitioned into two halves (or three thirds).
In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as well as when fail-stop errors have struck. We provide a detailed analytical study for all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that nicely corroborates the analytical model. %B Journal of Parallel and Distributed Computing %V 122 %P 209–225 %8 2018-12 %G eng %R https://doi.org/10.1016/j.jpdc.2018.08.002