Publications
“Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
“Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
“Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2–27.
“Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
“Resilience for Stencil Computations with Latent Errors,”
International Conference on Parallel Processing (ICPP), Bristol, UK, IEEE Computer Society Press, August 2017.
“Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
“Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
“Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
“