Publications
Export 89 results:
Filters: Author is Yves Robert [Clear All Filters]
Fault Tolerance Techniques for High-performance Computing,”
University of Tennessee Computer Science Technical Report (also LAWN 289), no. UT-EECS-15-734: University of Tennessee, May 2015.
“A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
(1.04 MB)
“Failure Detection and Propagation in HPC Systems,”
Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
“Evaluating Task Dropping Strategies for Overloaded Real-Time Systems (Work-In-Progress),”
42nd Real Time Systems Symposium (RTSS): IEEE Computer Society Press, 2021.
(217.13 KB)
“Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms,”
49th International Conference on Parallel Processing (ICPP 2020), Edmonton, AB, Canada, ACM Press, 2020.
(804.96 KB)
“Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
(397.75 KB)
“Efficient Checkpoint/Verification Patterns,”
International Journal on High Performance Computing Applications, July 2015.
(392.76 KB)
“Dynamic DAG scheduling under memory constraints for shared-memory platforms,”
Int. J. of Networking and Computing, vol. 11, no. 1, pp. 27-49, 2021.
(574.64 KB)
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
(360.72 KB)
“Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
“Distributed Termination Detection for HPC Task-Based Environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-14: University of Tennessee, June 2018.
“Designing LU-QR Hybrid Solvers for Performance and Stability,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(4.2 MB)
“Designing LU-QR hybrid solvers for performance and stability,”
University of Tennessee Computer Science Technical Report (also LAWN 282), no. ut-eecs-13-719: University of Tennessee, October 2013.
(4.11 MB)
“Design and Implementation of the PULSAR Programming System for Large Scale Computing,”
Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017.
(764.96 KB)
“Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs,”
22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(696.21 KB)
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019.
(930.28 KB)
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
Cluster 2018, Belfast, UK, IEEE Computer Society Press, September 2018.
(423.75 KB)
“Co-Scheduling Amdhal Applications on Cache-Partitioned Systems,”
International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 123–138, January 2018.
(672.52 KB)
“Co-Scheduling Algorithms for Cache-Partitioned Systems,”
19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017.
(584.76 KB)
“Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
(837 KB)
“Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors,”
Parallel Computing, vol. 75, pp. 41–60, July 2018.
(2.56 MB)
“Computing Dense Tensor Decompositions with Optimal Dimension Trees,”
Algorithmica, vol. 81, issue 5, pp. 2092–2121, May 2019.
(638.4 KB)
“Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing,”
International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, January 2015.
(755.54 KB)
“Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms,”
Parallel Computing, vol. 85, pp. 1–12, July 2019.
(865.18 KB)
“Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27.
(754.6 KB)
“On the Combination of Silent Error Detection and Checkpointing,”
UT-CS-13-710: University of Tennessee Computer Science Technical Report, June 2013.
(1.29 MB)
“Checkpointing Workflows for Fail-Stop Errors,”
IEEE Transactions on Computers, vol. 67, issue 8, pp. 1105–1120, August 2018.
“Checkpointing Workflows for Fail-Stop Errors,”
IEEE Cluster, Honolulu, Hawaii, IEEE, September 2017.
(400.64 KB)
“Checkpointing Strategies for Shared High-Performance Computing Platforms,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 28–52, 2019.
(490.5 KB)
“Checkpointing à la Young/Daly: An Overview,”
IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, Noida, India, ACM Press, pp. 701-710, August 2022.
(639.77 KB)
“Budget-aware scheduling algorithms for scientific workflows with stochastic task weights on IaaS Cloud platforms,”
Concurrency and Computation: Practice and Experience, vol. 33, no. 17, pp. e6065, 2021.
(1.99 MB)
“Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, IEEE, May 2018.
(1.31 MB)
“Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL, IEEE, May 2017.
(328.15 KB)
“Assuming failure independence: are we right to be wrong?,”
The 3rd International Workshop on Fault Tolerant Systems (FTS), Honolulu, Hawaii, IEEE, September 2017.
(597.11 KB)
“Assessing the Impact of ABFT and Checkpoint Composite Strategies,”
16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.02 MB)
“Assessing the impact of ABFT and Checkpoint composite strategies,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
(968.47 KB)
“Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results,”
Parallel Computing, vol. 52, pp. 22-41, February 2016.
(2.06 MB)
“Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
(573.71 KB)
“Algorithmic Issues on Heterogeneous Computing Platforms,”
Parallel Processing Letters, vol. 9, no. 2, pp. 197-213, January 1999.
(301.17 KB)
“