FT-MPI

Overview

HARNESS introduction

HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) is an experimental metacomputing system that aims to provide a highly dynamic, fault-tolerant computing environment for high-performance computing applications.

The major goal of HARNESS is to provide a framework for distributed computing. A major difference between previous projects (e.g. PVM) and HARNESS is the dynamic plug-in modularity of the latter. To give users of HARNESS instant application support, both a PVM and an MPI plug-in were envisaged. As the HARNESS system itself was both dynamic and fault tolerant (no single points of failure), it became possible to build an MPI plug-in with added capabilities such as dynamic process management and fault tolerance.

FT-MPI Introduction

FT-MPI has been developed within the framework of the HARNESS project. The goal of FT-MPI is to give the end user a communication library with an MPI API that benefits from the fault tolerance of the HARNESS system. To this end, FT-MPI implements the whole MPI-1.2 specification and some parts of the MPI-2 document, and extends some of the semantics of MPI so that the application can recover from failed processes.

FT-MPI survives the crash of n-1 processes in an n-process job and, if required, can respawn them. However, it is still the responsibility of the application to recover the data structures and the data of the crashed processes.

FT-MPI semantics

FT-MPI provides four different error modes, which can be specified when starting the application. These modes are called 'communicator modes'.

ABORT: like any other MPI application, FT-MPI can abort on an error.

BLANK: failed processes are not replaced; all surviving processes have the same rank as before the crash, and MPI_COMM_WORLD has the same size as before.

SHRINK: failed processes are not replaced; however, the new communicator after the crash has no 'holes' in its list of processes. Thus, processes might have a new rank after recovery, and the size of MPI_COMM_WORLD changes.

REBUILD: failed processes are respawned; surviving processes have the same rank as before. The REBUILD mode is the default and the best-tested mode of FT-MPI.
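To make the rank and size rules of the three non-aborting modes concrete, the following plain-C sketch models them without calling any MPI function. All names (`model_comm_mode`, `rank_after_recovery`, `size_after_recovery`) are hypothetical and are not part of the FT-MPI API. The sketch also assumes that SHRINK renumbers the survivors in their original order, which the description above does not state explicitly.

```c
/* Hypothetical model of the communicator modes; not FT-MPI API. */
enum model_comm_mode { MODEL_BLANK, MODEL_SHRINK, MODEL_REBUILD };

/* alive[i] is nonzero if the process with old rank i survived the crash.
 * Returns the rank that old_rank maps to after recovery, or -1 if no
 * running process holds that rank any more. */
int rank_after_recovery(enum model_comm_mode mode,
                        const int *alive, int n, int old_rank)
{
    if (old_rank < 0 || old_rank >= n)
        return -1;

    /* REBUILD: failed processes are respawned, so every rank is valid. */
    if (mode == MODEL_REBUILD)
        return old_rank;

    /* BLANK and SHRINK: a failed process is not replaced. */
    if (!alive[old_rank])
        return -1;

    /* BLANK: survivors keep their rank; failed ranks remain as holes. */
    if (mode == MODEL_BLANK)
        return old_rank;

    /* SHRINK: holes are removed, so the new rank is the number of
     * survivors with a smaller old rank (assumed order-preserving). */
    int new_rank = 0;
    for (int i = 0; i < old_rank; i++)
        if (alive[i])
            new_rank++;
    return new_rank;
}

/* Size of MPI_COMM_WORLD after recovery in this model. */
int size_after_recovery(enum model_comm_mode mode, const int *alive, int n)
{
    if (mode != MODEL_SHRINK)
        return n;               /* BLANK and REBUILD keep the old size */

    int survivors = 0;
    for (int i = 0; i < n; i++)
        if (alive[i])
            survivors++;
    return survivors;
}
```

For example, in a five-process job where ranks 1 and 4 fail, old rank 3 stays 3 under BLANK and REBUILD but becomes 2 under SHRINK, and only SHRINK reduces the size from 5 to 3.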

The second parameter, the 'communication mode', indicates how messages that are in flight when an error occurs are treated.

CONT/CONTINUE: all operations which returned the error code MPI_SUCCESS will finish properly, even if a process failure occurs during the operation (unless the communication partner has failed).

NOOP/RESET: all ongoing messages are dropped. The assumption behind this mode is that, on error, the application returns to its last consistent state, and all currently ongoing operations are of no further interest.
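The difference between the two communication modes can be illustrated with a small plain-C model of in-flight messages. This is only a sketch of the rule described above: the types and the function `surviving_messages` are hypothetical, no MPI call is involved, and every message in the array is assumed to belong to an operation that already returned MPI_SUCCESS.

```c
/* Hypothetical model of the communication modes; not FT-MPI API. */
enum model_msg_mode { MODEL_CONT, MODEL_NOOP };

struct model_msg {
    int src;   /* rank of the sender   */
    int dst;   /* rank of the receiver */
};

/* alive[r] is nonzero if rank r survived; src and dst must be valid
 * indices into alive[]. Sets completes[i] to 1 if message i still
 * finishes after the failure and to 0 if it is lost or dropped.
 * Returns the number of messages that finish. */
int surviving_messages(enum model_msg_mode mode,
                       const struct model_msg *msgs, int nmsgs,
                       const int *alive, int *completes)
{
    int finished = 0;
    for (int i = 0; i < nmsgs; i++) {
        if (mode == MODEL_NOOP) {
            /* NOOP/RESET: every ongoing message is dropped. */
            completes[i] = 0;
        } else {
            /* CONT: the message finishes unless a partner has failed. */
            completes[i] = alive[msgs[i].src] && alive[msgs[i].dst];
        }
        finished += completes[i];
    }
    return finished;
}
```

In this model, a message between two surviving ranks completes under CONT, a message to or from a failed rank does not, and under NOOP nothing completes because the application is expected to restart from its last consistent state.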

Difference between FT-MPI and other MPIs

Differences between the semantics of FT-MPI and MPI-1.2 or MPI-2:

- The default error handler in FT-MPI is MPI_ERRORS_RETURN (in contrast to MPI_ERRORS_ARE_FATAL in MPI-1.2).

- MPI_Init returns MPI_INIT_RESTARTED_NODE if a process has been respawned (in contrast to MPI_SUCCESS in MPI-1.2).

- FT-MPI provides two additional attributes: one indicating how many processes have failed (FTMPI_NUM_FAILED_PROCS) and one holding an error code (FTMPI_ERRCODE_FAILED). Using the value of the second attribute, the user can query an error string with MPI_Error_string, which then contains the list of processes that have crashed.

- The only operation allowed after an error occurs is MPI_Comm_dup. At the moment, the 'newcomm' argument for initiating a recovery with MPI_Comm_dup has to be set to FT_MPI_CHECK_RECOVER. This restriction will be removed in future releases. (Note: an automatic mode, in which MPI_COMM_WORLD and MPI_COMM_SELF are automatically rebuilt for the user, is currently being tested and will be available in future releases. The MPI_Comm_dup call that initiates the recovery can then be dropped if the user so wishes.)

Supported Platforms

HARNESS and FT-MPI have been tested on a wide range of platforms and compilers. The settings in parentheses indicate the architecture setting used for the specific platform. Among the supported platforms are:

Intel IA32 platforms using LINUX and gnu compilers (HARNESS_LINUX)

Intel IA64 platforms using LINUX and gnu compilers (HARNESS_LINUXIA64)

Intel IA64 platforms using LINUX and Intel compilers (HARNESS_LINUXIA64)

AMD Opteron64 platforms using LINUX and gnu compilers (HARNESS_LINUXAMD64)*

Sun platforms using SOLARIS and SUNWSpro compilers (HARNESS_SUN4SOL2)

Sun platforms using SOLARIS and gnu compilers (HARNESS_SUN4SOL2)

SGI platforms using IRIX6.5 and SGI compilers (HARNESS_SGI6 or HARNESS_SGI64)

SGI platforms using IRIX6.5 and gnu compilers (HARNESS_SGI6 or HARNESS_SGI64)

ALPHA platforms using OSF1 and gnu compilers (HARNESS_ALPHA)

IBM platforms using AIX and IBM compilers (HARNESS_AIX46K)

Notes: * On Opteron64, client hlib-controlled loading of plug-ins is currently not supported; this will be resolved in the next release. See the HARNESS web page for more information.

Known bugs and problems

- MPI_Ssend, MPI_Issend and MPI_Ssend_init do not currently provide a true synchronous send mode. This will be fixed in future releases.

- MPI_Cancel on Isend operations might fail in situations where the MPI standard requires the cancel operation to succeed. The reason is that FT-MPI uses an eager send mode for short messages, so the cancel operation can no longer stop the send operation.

- By default, FT-MPI uses receiver-side data conversion on heterogeneous platforms. This leads to the problem that, when packing or unpacking data with the corresponding MPI functions, the origin of the message is not known. To circumvent this problem, packed data is converted to XDR format before it is sent, so a packed message can be unpacked correctly on any architecture or process. This, however, causes problems for 'mixed mode' communication, e.g. when the user packs the message on the sender side but does not unpack it on the receiver side, and instead receives it directly as, for example, MPI_FLOAT. This is explicitly allowed by the standard but is currently not supported by FT-MPI on heterogeneous platforms. If your application uses this datatype matching pattern (packing on the sender side and receiving as a regular datatype, or sending as a regular datatype and receiving as a packed message), please switch on the XDR data conversion during the configure step.

Note: The problem described above does NOT apply to homogeneous configurations.

Note: turning on the XDR data conversion using configure has only been tested rudimentarily. In case of any problems, please contact the FT-MPI team.

- In the current release the SHRINK mode is fully supported when used with the CONT message mode.
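As background for the XDR conversion mentioned in the packing item above: XDR stores a single-precision float as four bytes in IEEE 754 format, most significant byte first, so a buffer in this format can be decoded without knowing the sender's architecture. The following C sketch shows the idea for one float. The function names are hypothetical and this is not FT-MPI's conversion code; it assumes the host `float` is a 32-bit IEEE 754 value stored with the same byte order as the host's integers.

```c
#include <stdint.h>
#include <string.h>

/* Encode one float into 4 bytes in XDR order (big-endian IEEE 754).
 * Illustrative only; assumes a 32-bit IEEE 754 host float. */
void xdr_model_encode_float(float value, unsigned char out[4])
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* reinterpret without aliasing issues */
    out[0] = (unsigned char)(bits >> 24); /* most significant byte first */
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)(bits);
}

/* Decode 4 bytes in XDR order back into a host float. */
float xdr_model_decode_float(const unsigned char in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) |
                    ((uint32_t)in[1] << 16) |
                    ((uint32_t)in[2] << 8)  |
                     (uint32_t)in[3];
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the shifts operate on the numeric value rather than on memory layout, the same code yields the same byte sequence on little-endian and big-endian hosts. This also shows why 'mixed mode' communication is problematic: a receiver that reads such a buffer directly as MPI_FLOAT would interpret canonical-order bytes as native-order data.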

