1st Extending MPI for Resilience (EMPIRe) Workshop


September 25, 2017, Chicago, IL in collaboration with Euro MPI/USA 2017

Call for Papers (pdf)

This workshop has been cancelled. The website will remain up as a legacy, but the event will not be held in 2017.

 

About the workshop


The continuing trend in hardware architectures towards smaller, more efficient, and more cost-effective components, together with the increase in scale of systems for computational science, opens the door to a deeper and clearer understanding of the physical phenomena governing our surroundings. On the other hand, the reduction in feature sizes due to improvements in photolithography, combined with a growing number of components and the volatility of computational resources on some types of platforms, leads, from the application perspective, to a decrease in the mean time to failure. Failures manifest on all types and scales of execution platforms with consistently dramatic consequences: the loss of data, computations, and results. Most of the parallel programming paradigms and runtimes used in high performance computing have been impermeable to notions of resilience and provide little support for programmatically dealing with any type of fault. Moreover, solutions widely used in industry have been slow to make their way into the high performance computing field.

The EMPIRe workshop articulates novel and promising directions for addressing the challenges of scale and resilience, and fosters discussion between the MPI community and its users about the different approaches to be pursued, including mixed techniques that increase the opportunity to handle faults across larger and more varied application domains.

This workshop targets cross-cutting research into resilience, to ensure that scientific computations deliver their results in a timely manner on all execution platforms, free of defects. Its scope is to highlight the complexity of faults and isolate some of their root causes, to investigate solutions that address the natural increase in fault diversity and rates, and to provide efficient and portable solutions that encompass all types of parallel execution platforms, runtimes, and applications. While the main focus of this workshop is message passing programming paradigms, we welcome other solutions that are not bound to a particular parallel programming model or runtime system, ideally ones that are portable across programming languages and paradigms and capable of delivering on their promises at all execution platform sizes.

Submit a paper!

Topics of interest


  • Failure detection, prediction and characterization
  • Checkpoint/restart: optimal checkpoint interval, lossy compression of checkpoints, application-level interfaces (SCR, FTI)
  • SDC detection: predictors, auxiliary methods and recovery, ABFT
  • Resilient software stack: global OS, file system, runtimes (MPI, ULFM)
  • Algorithms: Resilient numerical methods
  • Methodology: failure and SDC injectors, detection (recall, precision) and reliability metrics
  • Models for fault prediction, impact, management and application costs
  • Resource management for system resiliency and availability
  • Naturally fault-tolerant, self-healing, or fault-oblivious scientific algorithms
  • Programming model and system software support for scalability and resilience

Submission


Authors are invited to submit manuscripts in English, structured either as technical papers not exceeding 10 letter-size (8.5in x 11in) pages or as short papers limited to 4 pages in the same format, using the ACM 2017 Template. As for the main conference, Euro MPI/USA 2017, the page limit includes figures, tables, and appendices, but does not include references, for which there is no page limit. Margins and font sizes should not be modified. Authors should submit their work through the EMPIRe Submission Site.

In collaboration with the main conference, Euro MPI/USA 2017, select papers will be invited to submit revised and extended versions to be considered for inclusion in an invitation-only special issue of the Elsevier Parallel Computing journal. The extended version of the paper must have at least 30% additional content compared to the version published at Euro MPI/USA 2017.

  • Full paper submission: July 21, 2017 AoE (firm)
  • Notification of acceptance: August 10, 2017
  • Final paper submission: August 15, 2017
  • Workshop/conference early registration:
  • Workshop: September 25, 2017

Program


TBD

Venue


The workshop will be held under the umbrella of the Euro MPI/USA 2017 conference. Information on travel, hotels, registration, and all other conference logistics can be found on the Travel page of Euro MPI/USA 2017.

Program Committee

  • Wesley Bland, Intel, USA
  • Aurelien Bouteiller, University of Tennessee, Knoxville, USA
  • Franck Cappello, ANL, USA
  • Camille Coti, Université Paris 13, France
  • Marc Gamell, Intel, USA
  • Dominik Goeddeke, University of Stuttgart, Germany
  • Atsushi Hori, RIKEN AICS, Japan
  • Ignacio Laguna, LLNL, USA
  • Pierre Lemarinier, IBM Dublin, Ireland
  • Esteban Meneses Rojas, TEC, Costa Rica
  • Frank Mueller, North Carolina State University, USA
  • Mitsuhisa Sato, RIKEN AICS, Japan
  • Anthony (Tony) Skjellum, Auburn University, USA
  • Marc Snir, UIUC, USA
  • Peter Strazdins, Australian National University, Australia
  • Osman Unsal, Barcelona Supercomputing Center, Spain

Planning Committee

  • George Bosilca, University of Tennessee, Knoxville, USA
  • Martin Schulz, LLNL, USA
  • Keita Teranishi, Sandia National Labs, USA