Surviving Errors with OpenSHMEM

Title: Surviving Errors with OpenSHMEM
Publication Type: Conference Proceedings
Year of Publication: 2016
Authors: Bouteiller, A., G. Bosilca, and M G. Venkata
Editor: Venkata, M G., N. Imam, S. Pophale, and T. M. Mintz
Conference Name: OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments
Publisher: Springer International Publishing
Conference Location: Baltimore, MD, USA
ISBN Number: 978-3-319-50995-2

Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result of these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurrence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application.
