%0 Report
%D 2023
%T Revisiting I/O bandwidth-sharing strategies for HPC applications
%A Anne Benoit
%A Thomas Herault
%A Lucas Perotin
%A Yves Robert
%A Frederic Vivien
%K bandwidth sharing
%K HPC applications
%K I/O
%K scheduling strategy
%X This work revisits I/O bandwidth-sharing strategies for HPC applications. When several applications post concurrent I/O operations, well-known approaches include serializing these operations (First-Come First-Served) or fair-sharing the bandwidth across them (FairShare). Another recent approach, I/O-Sets, assigns priorities to the applications, which are classified into different sets based upon the average length of their iterations. We introduce several new bandwidth-sharing strategies, some of them simple greedy algorithms, and some of them more complicated to implement, and we compare them with existing ones. Our new strategies do not rely on any a-priori knowledge of the behavior of the applications, such as the length of work phases, the volume of I/O operations, or some expected periodicity. We introduce a rigorous framework, namely steady-state windows, which enables to derive bounds on the competitive ratio of all bandwidth-sharing strategies for three different objectives: minimum yield, platform utilization, and global efficiency. To the best of our knowledge, this work is the first to provide a quantitative assessment of the online competitiveness of any bandwidth-sharing strategy. This theory-oriented assessment is complemented by a comprehensive set of simulations, based upon both synthetic and realistic traces. The main conclusion is that our simple and low-complexity greedy strategies significantly outperform First-Come First-Served, FairShare and I/O-Sets, and we recommend that the I/O community implements them for further assessment.
%B INRIA Research Report
%I INRIA
%8 2023-03
%G eng
%U https://hal.inria.fr/hal-04038011

%0 Conference Paper
%B Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop
%D 2023
%T When to checkpoint at the end of a fixed-length reservation?
%A Quentin Barbut
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%X This work considers an application executing for a fixed duration, namely the length of the reservation that it has been granted. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. We address two scenarios. In the first scenario, a checkpoint can be taken at any time; despite its simplicity, this natural problem has not been considered yet (to the best of our knowledge). We provide the optimal solution for a variety of probability distribution laws modeling checkpoint duration. The second scenario is more involved: the application is a linear workflow consisting of a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute the optimal number of tasks before the application checkpoints at the beginning of the execution. Then, we design a dynamic strategy that decides whether to checkpoint or to continue executing at the end of each task. We instantiate this second scenario with several examples of probability distribution laws for task durations.
%B Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop
%C Denver, United States
%8 2023-08
%G eng
%U https://inria.hal.science/hal-04215554

%0 Conference Proceedings
%B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing
%D 2022
%T Checkpointing à la Young/Daly: An Overview
%A Anne Benoit
%A Yishu Du
%A Thomas Herault
%A Loris Marchal
%A Guillaume Pallez
%A Lucas Perotin
%A Yves Robert
%A Hongyang Sun
%A Frederic Vivien
%X The Young/Daly formula provides an approximation of the optimal checkpoint period for a parallel application executing on a supercomputing platform. The Young/Daly formula was originally designed for preemptible tightly-coupled applications. We provide some background and survey various application scenarios to assess the usefulness and limitations of the formula.
%B IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing
%I ACM Press
%C Noida, India
%P 701-710
%8 2022-08
%@ 9781450396752
%G eng
%U https://dl.acm.org/doi/fullHtml/10.1145/3549206.3549328
%R 10.1145/3549206

%0 Conference Proceedings
%B 42nd Real Time Systems Symposium (RTSS)
%D 2021
%T Evaluating Task Dropping Strategies for Overloaded Real-Time Systems (Work-In-Progress)
%A Yiqin Gao
%A Guillaume Pallez
%A Yves Robert
%A Frederic Vivien
%B 42nd Real Time Systems Symposium (RTSS)
%I IEEE Computer Society Press
%G eng

%0 Conference Paper
%B 49th International Conference on Parallel Processing (ICPP 2020)
%D 2020
%T Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms
%A Li Han
%A Yiqin Gao
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 49th International Conference on Parallel Processing (ICPP 2020)
%I ACM Press
%C Edmonton, AB, Canada
%G eng

%0 Conference Paper
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%D 2020
%T Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints
%A Li Han
%A Louis-Claude Canon
%A Jing Liu
%A Yves Robert
%A Frederic Vivien
%B 40th IEEE Real-Time Systems Symposium (RTSS 2019)
%I IEEE Press
%C York, UK
%8 2020-02
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%B International Journal of High Performance Computing Applications
%V 33
%P 1255-1274
%8 2019-11
%G eng
%N 6
%R https://doi.org/10.1177/1094342019866891

%0 Conference Paper
%B IEEE Cluster 2019
%D 2019
%T Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms
%A Yiqin Gao
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%B IEEE Cluster 2019
%I IEEE Computer Society Press
%C Albuquerque, New Mexico
%8 2019-09
%G eng

%0 Journal Article
%J International Journal of High Performance Computing Applications
%D 2019
%T Scheduling Independent Stochastic Tasks under Deadline and Budget Constraints
%A Louis-Claude Canon
%A Aurélie Kong Win Chang
%A Yves Robert
%A Frederic Vivien
%X This article discusses scheduling strategies for the problem of maximizing the expected number of tasks that can be executed on a cloud platform within a given budget and under a deadline constraint. The execution times of tasks follow independent and identically distributed probability laws. The main questions are how many processors to enroll and whether and when to interrupt tasks that have been executing for some time. We provide complexity results and an asymptotically optimal strategy for the problem instance with discrete probability distributions and without deadline. We extend the latter strategy for the general case with continuous distributions and a deadline and we design an efficient heuristic which is shown to outperform standard approaches when running simulations for a variety of useful distribution laws.
%B International Journal of High Performance Computing Applications
%V 34
%P 246-264
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1177/1094342019852135

%0 Journal Article
%J IEEE Transactions on Computers
%D 2018
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%K checkpoint
%K fail-stop error
%K resilience
%K workflow
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGS), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
%B IEEE Transactions on Computers
%V 67
%P 1105–1120
%8 2018-08
%G eng
%U http://ieeexplore.ieee.org/document/8279499/
%N 8

%0 Conference Paper
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%D 2018
%T A Generic Approach to Scheduling and Checkpointing Workflows
%A Li Han
%A Valentin Le Fèvre
%A Louis-Claude Canon
%A Yves Robert
%A Frederic Vivien
%X This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target failstop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrarily to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is an overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrarily to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gain over both CkptAll and CkptNone, for a wide variety of workflows.
%B The 47th International Conference on Parallel Processing (ICPP 2018)
%I IEEE Computer Society Press
%C Eugene, OR
%8 2018-08
%G eng

%0 Conference Paper
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%D 2017
%T Assuming failure independence: are we right to be wrong?
%A Guillaume Aupy
%A Yves Robert
%A Frederic Vivien
%X This paper revisits the failure temporal independence hypothesis which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design and compare several cascadeaware checkpointing algorithms to quantify the maximum gain that could be obtained, and we report extensive simulation results with archive and synthetic failure logs. Altogether, there are a few logs that contain cascades, but we show that the gain that can be achieved from this knowledge is not significant. The conclusion is that we can wrongly, but safely, assume failure independence!
%B The 3rd International Workshop on Fault Tolerant Systems (FTS)
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng

%0 Conference Paper
%B IEEE Cluster
%D 2017
%T Checkpointing Workflows for Fail-Stop Errors
%A Li Han
%A Louis-Claude Canon
%A Henri Casanova
%A Yves Robert
%A Frederic Vivien
%X We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class  of graphs, Minimal Series-Parallel Graphs (M-SPGS). It turns out that many real-world workflow applications are naturally structured as M-SPGS. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failure than the latter.
%B IEEE Cluster
%I IEEE
%C Honolulu, Hawaii
%8 2017-09
%G eng

%0 Generic
%D 2013
%T On the Combination of Silent Error Detection and Checkpointing
%A Guillaume Aupy
%A Anne Benoit
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%K checkpointing
%K error recovery
%K High-performance computing
%K silent data corruption
%K verification
%X In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delays following a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
%B UT-CS-13-710
%I University of Tennessee Computer Science Technical Report
%8 2013-06
%G eng
%U http://www.netlib.org/lapack/lawnspdf/lawn278.pdf

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2013
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%X In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
%B Concurrency and Computation: Practice and Experience
%8 2013-11
%G eng
%R 10.1002/cpe.3173

%0 Generic
%D 2012
%T Unified Model for Assessing Checkpointing Protocols at Extreme-Scale
%A George Bosilca
%A Aurelien Bouteiller
%A Elisabeth Brunet
%A Franck Cappello
%A Jack Dongarra
%A Amina Guermouche
%A Thomas Herault
%A Yves Robert
%A Frederic Vivien
%A Dounia Zaidouni
%B University of Tennessee Computer Science Technical Report (also LAWN 269)
%8 2012-06
%G eng

%0 Conference Proceedings
%B 2008 PPoPP Conference
%D 2008
%T Matrix Product on Heterogeneous Master Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Frederic Vivien
%B 2008 PPoPP Conference
%C Salt Lake City, Utah
%8 2008-01
%G eng

%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS)
%D 2008
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS)
%V 19
%P 1317-1336
%8 2008-12
%G eng

%0 Journal Article
%J International Journal of Foundations of Computer Science (IJFCS) (accepted)
%D 2007
%T Revisiting Matrix Product on Master-Worker Platforms
%A Jack Dongarra
%A Jean-Francois Pineau
%A Yves Robert
%A Zhiao Shi
%A Frederic Vivien
%B International Journal of Foundations of Computer Science (IJFCS) (accepted)
%8 2007-00
%G eng

%0 Journal Article
%J Parallel Processing Letters
%D 1999
%T Algorithmic Issues on Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Fabrice Rastello
%A Yves Robert
%A Frederic Vivien
%B Parallel Processing Letters
%V 9
%P 197-213
%8 1999-01
%G eng

%0 Journal Article
%J Parallel Computing
%D 1999
%T Static Tiling for Heterogeneous Computing Platforms
%A Pierre Boulet
%A Jack Dongarra
%A Yves Robert
%A Frederic Vivien
%B Parallel Computing
%V 25
%P 547-568
%8 1999-01
%G eng