Publications
Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
DOI: 10.1016/j.jpdc.2018.08.002
(837 KB)
“
Efficient Checkpoint/Verification Patterns,”
International Journal on High Performance Computing Applications, July 2015.
DOI: 10.1177/1094342015594531
(392.76 KB)
“
Scheduling Computational Workflows on Failure-prone Platforms,”
International Journal of Networking and Computing, vol. 6, no. 1, pp. 2-26, 2016.
(503.81 KB)
“
Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
DOI: 10.1109/TC.2016.2643660
(1.39 MB)
“
On the Combination of Silent Error Detection and Checkpointing,”
UT-CS-13-710: University of Tennessee Computer Science Technical Report, June 2013.
(1.29 MB)
“