| Title | Checkpointing Workflows for Fail-Stop Errors |
| Publication Type | Conference Paper |
| Year of Publication | 2017 |
| Authors | Han, L., L-C. Canon, H. Casanova, Y. Robert, and F. Vivien |
| Conference Name | IEEE Cluster |
| Date Published | 2017-09 |
| Publisher | IEEE |
| Conference Location | Honolulu, Hawaii |
| Abstract | We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class |



