The primary goal of the Coordinated Infrastructure for Fault Tolerance in High Performance Computing Systems (CIFTS) is to design and implement interfaces for integrating fault tolerance features into multiple layers of the system software stack, from the application and programming system layer through the file system and other system software down to the operating system. Such integration will make possible a level of fault prediction, notification, management, and recovery that is impossible today but critical to the productive use of future peta- and exascale systems.
FT-LA is integrated with the core component of CIFTS, the Fault Tolerance Bus (FTB), in order to benefit from health status updates and to enable systemwide cooperation toward failure recovery. The main feature demonstrated so far in FT-LA is cooperation with the batch scheduler and the resource manager of the host system. Even if the application is able to tolerate failures, its efforts are wasted if the scheduling architecture terminates any job hit by a failure, regardless of the job's current condition or ability to continue. FT-LA therefore defines a set of events that notify job schedulers of its ability to tolerate failures and of its current status with respect to any ongoing failure. Among these status updates, FT-LA can transmit estimates of how much the expected time-to-completion changes as a result of recovery actions. Several estimates are given, one for each number of spare processes that could be allocated to the library. All these notifications are optimistic, in the sense that FT-LA can resume normal operation regardless of the decisions of the job scheduler (this is especially important when the job scheduler is not FTB enabled). However, we expect these notifications to improve platform efficiency by allowing the job scheduler to predict the application's behavior more accurately and to act accordingly.
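
The kind of status update described above can be pictured with a small sketch. It is illustrative only: it does not use the FTB API, and the names `RecoveryStatusEvent`, `delay_by_spares`, and `scheduler_decision`, as well as the decision policy itself, are assumptions introduced here rather than part of FT-LA or CIFTS. It shows one way a scheduler might use per-spare-count estimates of the change in time-to-completion instead of terminating the job outright.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class RecoveryStatusEvent:
    """Hypothetical status payload; NOT the actual FTB event schema."""
    job_id: str
    can_tolerate_failure: bool
    recovery_in_progress: bool
    # Assumed layout: number of spare processes -> estimated extra seconds
    # added to the time-to-completion by the recovery.
    delay_by_spares: Dict[int, float] = field(default_factory=dict)


def scheduler_decision(
    event: RecoveryStatusEvent,
    remaining_walltime_s: float,
    expected_remaining_s: float,
    spares_available: int,
) -> Tuple[str, int]:
    """Return (action, spares_to_grant) under a hypothetical scheduler policy."""
    if not event.can_tolerate_failure:
        # Without a tolerance guarantee, fall back to the usual behavior.
        return ("terminate", 0)

    # Keep only the estimates the scheduler can actually satisfy.
    candidates = {
        n: delay for n, delay in event.delay_by_spares.items()
        if n <= spares_available
    }
    if not candidates:
        # No usable estimate: the library can proceed on its own, so let it run.
        return ("continue", 0)

    # Choose the spare count with the smallest estimated delay.
    best_spares = min(candidates, key=candidates.get)
    projected = expected_remaining_s + candidates[best_spares]

    if projected <= remaining_walltime_s:
        return ("continue", best_spares)
    return ("extend_allocation", best_spares)


if __name__ == "__main__":
    event = RecoveryStatusEvent(
        job_id="job-1",
        can_tolerate_failure=True,
        recovery_in_progress=True,
        delay_by_spares={0: 600.0, 1: 120.0, 2: 90.0},
    )
    print(scheduler_decision(event, 1000.0, 800.0, spares_available=1))
```

Because the notifications are optimistic, a scheduler that ignores such an event entirely loses nothing; the sketch only illustrates what a cooperating scheduler could gain from the estimates.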