Title | Surviving Errors with OpenSHMEM |
Publication Type | Conference Proceedings |
Year of Publication | 2016 |
Authors | Bouteiller, A., G. Bosilca, and M G. Venkata |
Editor | Venkata, M G., N. Imam, S. Pophale, and T. M. Mintz |
Conference Name | OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments |
Pagination | 66–81 |
Publisher | Springer International Publishing |
Conference Location | Baltimore, MD, USA |
ISBN Number | 978-3-319-50995-2 |
Abstract | Unexpected error conditions stem from a variety of underlying causes, including resource exhaustion, network failures, hardware failures, or program errors. As the scale of HPC systems continues to grow, so does the probability of encountering a condition that causes a failure; meanwhile, error recovery and run-through failure management are becoming mature, and interoperable HPC programming paradigms are beginning to feature advanced error management. As a result of these developments, it becomes increasingly desirable to gracefully handle error conditions in OpenSHMEM. In this paper, we present the design and rationale behind an extension of the OpenSHMEM API that can (1) notify user code of unexpected erroneous conditions, (2) permit customized user response to errors without incurring overhead on an error-free execution path, (3) propagate the occurrence of an error condition to all Processing Elements, and (4) consistently close the erroneous epoch in order to resume the application. |