This file gives a basic introduction to HARNESS and FT-MPI. 
For any questions about this document or for further
information, please contact

   harness@cs.utk.edu
   ftmpi@cs.utk.edu

This document covers the following topics:

1. HARNESS
1.1 Introduction
1.2 Concept
1.3 The HARNESS API

2. FT-MPI
2.1 Introduction
2.2 FT-MPI semantics
2.3 Supported platforms
2.4 Known problems

3. Further references

Separate documents are available about the CONSOLE (CONSOLE.README),
an installation guide (INSTALL) and a quickstart reference for the
impatient user/reader (QUICKSTART).


--------------------------------------------------------------------  
1. HARNESS

--------------------------------------------------------------------  
1.1 Introduction 

HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) is
an experimental metacomputing system that aims to provide a highly
dynamic, fault-tolerant computing environment for high-performance
computing applications.

The major goal of HARNESS is to provide a framework for distributed
computing. A major difference between previous projects (e.g. PVM)
and HARNESS is the dynamic plug-in modularity of the latter. To
provide users of HARNESS with instant application support, both a
PVM and an MPI plug-in were envisaged. As the HARNESS system itself
was both dynamic and fault tolerant (no single point of failure), it
became possible to build an MPI plug-in with added capabilities such
as dynamic process management and fault tolerance.

--------------------------------------------------------------------  
1.2 Concept

The General HARNESS CORE (G_HCORE) is a daemon that provides a very
lightweight infrastructure from which to build distributed
systems. The capabilities of the G_HCORE are exploited via remote
procedure calls (RPCs) as provided by the user-level library
(HLIB). The core provides a number of very simple services that can
be dynamically extended. The simplest service is the ability to load
additional code in the form of a dynamic library (shared object),
known as a plug-in, and make it available either to a remote process
or directly to the core itself. Once the code is loaded, it can be
invoked using a number of different techniques:

- Direct invocation: the core calls the code as a function, or a
  program uses the core as a runtime library to load the function, which
  it then calls directly itself.

- Indirect invocation: the core loads the function and then
  handles requests to the function on behalf of the calling program, or,
  it sets the function up as a separate service and advertises how to
  access the function.

An application built for HARNESS might not interact with the host OS
directly, but could instead install plug-ins that provide the required
functionality. The handling of different OS capabilities would then be
left to the plug-in developers, as is the case with FT-MPI.

One of the powerful concepts behind HARNESS is the idea of having
several Distributed Virtual Machines (DVMs), which can be used and
managed separately. For example, one site can run its own DVM, merge
it with the DVM of another site for a short collaboration, and then
separate again. More on the DVM concept can be found at
http://www.csm.ornl.gov/harness/

--------------------------------------------------------------------  
1.3 HARNESS API

For a detailed list of the HLIB API, please refer to the man pages
provided with this distribution.


--------------------------------------------------------------------  
2. FT-MPI

--------------------------------------------------------------------  
2.1 Introduction

FT-MPI has been developed within the framework of the HARNESS
project. The goal of FT-MPI is to provide the end user with a
communication library offering an MPI API that benefits from the
fault tolerance of the HARNESS system. To this end, FT-MPI
implements the whole MPI-1.2 specification and parts of the MPI-2
document, and extends some of the semantics of MPI to give the
application the possibility to recover from failed processes.

FT-MPI survives the crash of n-1 processes in an n-process job and,
if required, can respawn them. However, it is still the
responsibility of the application to recover the data structures and
the data on the crashed processes.


--------------------------------------------------------------------  
2.2 FT-MPI semantics

FT-MPI provides four different error modes, which can be specified
when starting the application. These modes are called 'communicator
modes'.

ABORT:   like any other MPI application, FT-MPI can abort on an error.

BLANK:   failed processes are not replaced; all surviving processes
         keep the same rank as before the crash, and MPI_COMM_WORLD
         keeps the same size as before.

SHRINK:  failed processes are not replaced; however, the new
         communicator after the crash has no 'holes' in its list of
         processes. Thus, processes might have a new rank after
         recovery, and the size of MPI_COMM_WORLD changes.

REBUILD: failed processes are respawned; surviving processes keep
         the same rank as before. The REBUILD mode is the default
         and the best-tested mode of FT-MPI.

The second parameter, the 'communication mode', indicates how
messages that are in flight when an error occurs are treated.

CONT/CONTINUE: all operations which returned the error code MPI_SUCCESS
               will finish properly, even if a process failure occurs
               during the operation (unless the communication partner
               has failed).
NOOP/RESET:    all ongoing messages are dropped. The assumption
               behind this mode is that on error the application
               returns to its last consistent state, and all
               currently ongoing operations are not of any interest.

Examples of how an application can recover from a failure are
provided in the examples directory.


Differences between the semantics of FT-MPI and MPI-1.2 or MPI-2:
- the default error handler in FT-MPI is MPI_ERRORS_RETURN (in
  contrast to MPI_ERRORS_ARE_FATAL in MPI 1.2).

- MPI_Init returns MPI_INIT_RESTARTED_NODE in case a process has
  been respawned (in contrast to MPI_SUCCESS in MPI 1.2).

- FT-MPI provides two additional attributes indicating how many
  processes have failed (FTMPI_NUM_FAILED_PROCS) and an error code
  (FTMPI_ERRCODE_FAILED). Using the value of the second attribute,
  the user can obtain an error string via MPI_Error_string, which
  then contains the list of processes that have crashed.

- The only operation allowed after an error occurs is MPI_Comm_dup.
  At the moment, the 'newcomm' argument for initiating a recovery
  with MPI_Comm_dup has to be set to FT_MPI_CHECK_RECOVER. This
  restriction will be removed in future releases.
  (Note: an automatic mode, in which MPI_COMM_WORLD and MPI_COMM_SELF
  are automatically rebuilt for the user, is currently being tested
  and will be available in future releases. MPI_Comm_dup as the
  operation initiating the recovery can then be dropped if desired
  by the user.)
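Putting these semantics together, a REBUILD-mode application
typically checks the return code of MPI_Init to detect a respawn,
wraps its communication in error checks, and re-establishes the
communicator on failure. The following sketch is illustrative only;
it requires an FT-MPI installation to compile (FT_MPI_CHECK_RECOVER
and MPI_INIT_RESTARTED_NODE are FT-MPI-specific), and the examples
directory contains the authoritative versions:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rc, rank, size;

    rc = MPI_Init(&argc, &argv);
    if (rc == MPI_INIT_RESTARTED_NODE) {
        /* This process was respawned after a crash (REBUILD mode):
           skip initial setup and recover the application state. */
        fprintf(stderr, "respawned process: recovering state\n");
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... application work; with MPI_ERRORS_RETURN (the FT-MPI
       default) communication calls return an error code instead of
       aborting the job ... */
    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* A peer failed: initiate the recovery with MPI_Comm_dup,
           the only operation allowed after an error in the current
           release. */
        MPI_Comm newcomm = FT_MPI_CHECK_RECOVER;
        MPI_Comm_dup(MPI_COMM_WORLD, &newcomm);
        /* ... restore data structures on the respawned ranks, then
           continue ... */
    }

    MPI_Finalize();
    return 0;
}
```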


----------------------------------------------------------------
2.3 Supported platforms

HARNESS and FT-MPI have been tested on a wide range of platforms
and compilers. The setting in parentheses indicates the
architecture setting used for the specific platform. Among the
supported platforms are:

  Intel IA32 platforms using LINUX and gnu compilers (HARNESS_LINUX)
  Intel IA64 platforms using LINUX and gnu compilers (HARNESS_LINUXIA64)
  Intel IA64 platforms using LINUX intel compilers   (HARNESS_LINUXIA64)
  AMD Opteron64 platforms using LINUX and gnu compilers (HARNESS_LINUXAMD64)*
  Sun platforms using SOLARIS and SUNWSpro compilers (HARNESS_SUN4SOL2)
  Sun platforms using SOLARIS and gnu compilers      (HARNESS_SUN4SOL2)
  SGI platforms using IRIX6.5 and SGI compilers      (HARNESS_SGI6 or 
                                                      HARNESS_SGI64)
  SGI platforms using IRIX6.5 and gnu compilers      (HARNESS_SGI6 or 
                                                      HARNESS_SGI64)
  ALPHA platforms using OSF1 and gnu compilers       (HARNESS_ALPHA)
  IBM platforms using AIX and IBM compilers          (HARNESS_AIX46K)

Note: * HLIB-controlled loading of plug-ins by Opteron64 clients is
currently not supported; this will be resolved in the next release.

----------------------------------------------------------------
2.4 Known bugs and problems:

- MPI_Ssend, MPI_Issend and MPI_Ssend_init do not currently provide
  a true synchronous send mode. This will be fixed in future
  releases.

- MPI_Cancel on Isend operations might fail in situations where the
  MPI standard requires the cancel operation to succeed. The reason
  is that FT-MPI uses an eager send mode for short messages;
  therefore, the cancel operation can no longer stop the send
  operation.

- FT-MPI uses receiver-side data conversion by default on
  heterogeneous platforms. This leads to the problem that, while
  packing/unpacking data using the corresponding MPI functions, the
  origin of the message is not known. To circumvent this problem,
  packed data is converted into XDR format before it is sent. Thus,
  a packed message can be unpacked correctly on any
  architecture/process.
  This leads, however, to problems for 'mixed mode' communications,
  e.g. the user packs the message on the sender side but does not
  unpack it on the receiver side, and instead receives it directly
  as, e.g., MPI_FLOAT. This is explicitly allowed by the standard,
  but is currently not supported by FT-MPI on heterogeneous
  platforms. If your application does have this datatype matching
  pattern (packing on the sender side and receiving it as a regular
  datatype, or sending as a regular datatype and receiving it as a
  packed message), please switch the XDR data conversion on during
  the configure step.

  Note: The problem described above does NOT apply to homogeneous
        configurations.

  Note: turning the XDR data conversion on via configure has only
        received rudimentary testing. In case of any problems,
        please contact the FT-MPI team.

- In the current release the SHRINK mode is fully supported only
  when used with the CONT message mode.

----------------------------------------------------------------
3. Further references

FT-MPI webpage: 
  http://icl.cs.utk.edu/ftmpi/

HARNESS webpage at UT:
  http://icl.cs.utk.edu/harness/

HARNESS webpage at Oak Ridge:
  http://www.csm.ornl.gov/harness/

HARNESS webpage at Emory University:
  http://www.mathcs.emory.edu/harness/

Some papers:

Graham E. Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George
Bosilca, Antonin Bukovsky, Jack J. Dongarra, 'Fault Tolerant
Communication Library and Applications for High Performance
Computing', LACSI Symposium 2003, Santa Fe, October 27-29, 2003.

