Title | Checkpointing Workflows for Fail-Stop Errors |
Publication Type | Conference Paper |
Year of Publication | 2017 |
Authors | Han, L., L-C. Canon, H. Casanova, Y. Robert, and F. Vivien |
Conference Name | IEEE Cluster |
Date Published | 2017-09 |
Publisher | IEEE |
Conference Location | Honolulu, Hawaii |
Abstract | We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class |