Toward Resilient Applications for Extreme-Scale Systems - Part III of IV
Abstract
As leadership-class computing systems increase in complexity and transistor feature sizes decrease, application codes find themselves
less and less able to treat a system as a reliable digital machine. In fact, the high performance computing community has grown
increasingly concerned that applications will have to manage resilience issues beyond the current practice of global checkpoint
restart. That approach is expensive at scale and cannot fix all types of errors. We discuss alternatives in software and numerical algorithms that can improve
the resiliency of applications and manage a variety of faults anticipated in future extreme-scale computing systems.
Organizers
- Keita Teranishi, Sandia National Laboratories, USA
- Mark Hoemmen, Sandia National Laboratories, USA
- Jaideep Ray, Sandia National Laboratories, USA
- Michael A. Heroux, Sandia National Laboratories, USA
Part IV
Wednesday, February 19
MS9
10:35 AM - 12:15 PM Room: Salon F