Title | Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms |
Publication Type | Conference Paper |
Year of Publication | 2018 |
Authors | Herault, T., Y. Robert, A. Bouteiller, D. Arnold, K. Ferreira, G. Bosilca, and J. Dongarra |
Conference Name | 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award |
Date Published | 2018-05 |
Publisher | IEEE |
Conference Location | Vancouver, BC, Canada |
Abstract | In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. Adding to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance. |
DOI | 10.1109/IPDPSW.2018.00127 |
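
For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following minimal sketch computes the classical first-order period for a single application, T = sqrt(2 · C · MTBF), where C is the checkpoint cost and MTBF is the mean time between failures. The function names, parameter names, and example values are illustrative assumptions and do not come from the paper. The sketch covers only the single-application baseline; it does not implement the cooperative, contention-aware model the paper proposes.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * MTBF).

    Both arguments are in seconds. Names are hypothetical, chosen for
    illustration only.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order waste fraction C/T + T/(2 * MTBF).

    This simplified expression ignores downtime and recovery cost, and it
    assumes the application has the I/O bandwidth to itself.
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Assumed example values: 60 s checkpoint, one failure per day on average.
    cost, mtbf = 60.0, 86_400.0
    period = young_daly_period(cost, mtbf)
    print(f"period: {period:.0f} s, waste: {first_order_waste(period, cost, mtbf):.4f}")
```

With the assumed values above, the period comes out at roughly 3220 seconds. Under I/O contention the effective checkpoint cost of each application depends on what the others are doing, which is why, according to the abstract, this per-application formula alone is not sufficient at platform scale.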