Publications
“Co-Scheduling Algorithms for Cache-Partitioned Systems,”
19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017.
DOI: 10.1109/IPDPSW.2017.60 (584.76 KB)
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
Cluster 2018, Belfast, UK, IEEE Computer Society Press, September 2018.
(423.75 KB)
“Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs,”
22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(696.21 KB)
“Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
DOI: 10.1145/3086157.3086162 (865.68 KB)
“Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
DOI: 10.1145/3086157.3086165 (1.02 MB)
“Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
DOI: 10.1109/IPDPS.2016.39 (603.58 KB)
“A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
(868.44 KB)
“Replication is More Efficient Than You Think,”
The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
(975.69 KB)
“When to checkpoint at the end of a fixed-length reservation?,”
Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop, Denver, United States, August 2023.
“Checkpointing à la Young/Daly: An Overview,”
IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, Noida, India, ACM Press, pp. 701-710, August 2022.
DOI: 10.1145/3549206 (639.77 KB)
“Max-Stretch Minimization on an Edge-Cloud Platform,”
IPDPS'2021, the 34th IEEE International Parallel and Distributed Processing Symposium: IEEE Computer Society Press, 2021.
(4.94 MB)
“Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
DOI: 10.1145/2897189 (573.71 KB)
“Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27.
(754.6 KB)
“Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
DOI: 10.1016/j.jpdc.2018.08.002 (837 KB)
“Co-Scheduling Amdahl Applications on Cache-Partitioned Systems,”
International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 123–138, January 2018.
DOI: 10.1177/1094342017710806 (672.52 KB)
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019.
DOI: 10.1177/1094342019846956 (930.28 KB)
“Efficient Checkpoint/Verification Patterns,”
International Journal of High Performance Computing Applications, July 2015.
DOI: 10.1177/1094342015594531 (392.76 KB)
“Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“Resilient Co-Scheduling of Malleable Applications,”
International Journal of High Performance Computing Applications (IJHPCA), May 2017.
DOI: 10.1177/1094342017704979 (1.62 MB)
“Resilient scheduling heuristics for rigid parallel jobs,”
International Journal of Networking and Computing, vol. 11, no. 1, pp. 2-26, 2021.
(8.67 MB)
“Scheduling Computational Workflows on Failure-prone Platforms,”
International Journal of Networking and Computing, vol. 6, no. 1, pp. 2-26, 2016.
(503.81 KB)
“Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
DOI: 10.1109/TC.2016.2643660 (1.39 MB)
“Revisiting I/O bandwidth-sharing strategies for HPC applications,”
INRIA Research Report, no. RR-9502: INRIA, March 2023.
“On the Combination of Silent Error Detection and Checkpointing,”
UT-CS-13-710: University of Tennessee Computer Science Technical Report, June 2013.
(1.29 MB)
“Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
(397.75 KB)
“Optimal Checkpointing Period: Time vs. Energy,”
University of Tennessee Computer Science Technical Report (also LAWN 281), no. ut-eecs-13-718: University of Tennessee, October 2013.
(440.13 KB)