Checkpointing Strategies for Shared High-Performance Computing Platforms

Thomas Herault; Yves Robert; Aurelien Bouteiller; Dorian Arnold; Kurt Ferreira; George Bosilca; Jack Dongarra

Submitted by claxton on Fri, 05/17/2019 - 17:57

Title	Checkpointing Strategies for Shared High-Performance Computing Platforms
Publication Type	Journal Article
Year of Publication	2019
Authors	Herault, T., Y. Robert, A. Bouteiller, D. Arnold, K. Ferreira, G. Bosilca, and J. Dongarra
Journal	International Journal of Networking and Computing
Volume	9
Number	1
Pagination	28–52
ISSN	2185-2847
Abstract	Input/output (I/O) from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Overall, these results provide critical analysis and direct guidance on how to design efficient, CR ready, large-scale platforms without a large investment in the I/O subsystem.
URL	http://www.ijnc.org/index.php/ijnc/article/view/195

Project Tags:

smurfs

File:

icl-utk-1346-2019.pdf
