Several systems exist to allow users to co-reserve time on grid resources easily. GUR (Grid Universal Remote) [4] is one such system, developed at the San Diego Supercomputer Center (SDSC). The GUR tool is a Python script that builds on the ssh and scp commands to let users reserve compute time and co-schedule jobs. GUR is installed on the TeraGrid IA-64 systems at SDSC, the National Center for Supercomputing Applications (NCSA) and Argonne National Laboratory (ANL), and is expected to be available at other TeraGrid sites soon.
HARC (Highly Available Robust Co-scheduler) is one of the most robust and widely deployed open-source systems that allow users to reserve multiple distributed resources in a single step [5]. These resources can be of different types, including multiprocessor machines, visualisation engines, dedicated network connections, storage, and scientific or clinical instruments. HARC can be used to co-allocate resources for use at the same time, for example when a clinical instrument transfers data over a high-speed network link to remote computational resources for real-time processing. It can also be used to reserve resources at different times for the scheduling of workflow applications. We envisage clinical scenarios in which patient-specific simulations are timetabled and reserved in advance: an instrument is booked, network links and storage facilities are reserved, high-end compute resources are then reserved to process the data, and finally visualisation facilities are reserved to interpret the data so that critical clinical decisions can be made.
Currently, HARC can be used to book computing resources and lightpaths across networks based on GMPLS (Generalised Multi-protocol Label Switching) with simple topologies. HARC is also designed to be extensible, so that new types of resource can be added easily; it is this extensibility that differentiates HARC from other co-allocation solutions. There are multiple deployments of HARC in use today: the US TeraGrid, the EnLIGHTened testbed in the United States, the regional North-West Grid in England, and the National Grid Service (NGS) in the UK. We use HARC on a regular basis to make single- and multiple-machine reservations, within which we are able to run numerous applications, including HemeLB (see Section 4.1).
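The all-or-nothing character of co-allocation described above can be illustrated with a minimal Python sketch. It is not HARC's protocol or API: the `Resource` class and the `co_allocate` function are hypothetical names introduced here, times are plain numbers (for example, hours), and the sketch assumes a single process with no concurrent bookings and at most one request per resource.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A hypothetical reservable resource (compute, network link, instrument)."""
    name: str
    bookings: list = field(default_factory=list)  # list of (start, end) tuples

    def is_free(self, start, end):
        # Half-open intervals [a, b) and [c, d) overlap when a < d and c < b.
        return all(not (start < b_end and b_start < end)
                   for b_start, b_end in self.bookings)

    def book(self, start, end):
        self.bookings.append((start, end))


def co_allocate(requests):
    """Book every (resource, start, end) request, or none of them.

    Returns True when all requests were booked, False otherwise.
    """
    # Phase 1: check every request before changing any state.
    if not all(res.is_free(start, end) for res, start, end in requests):
        return False
    # Phase 2: all checks passed, so commit every booking.
    for res, start, end in requests:
        res.book(start, end)
    return True
```

Requests that share a time window model simultaneous use, such as an instrument streaming data to a cluster. Requests with consecutive windows model a workflow, such as instrument, then compute, then visualisation. In both cases a single conflict leaves every resource unbooked.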
SPRUCE (SPecial PRiority and Urgent Computing Environment) [6] is an urgent computing solution that has been developed to address the growing number of problem domains where critical decisions must be made quickly with the aid of large-scale computation. SPRUCE uses a simple authentication mechanism based on transferable ‘right of way’ tokens. These tokens allow privileged users to invoke an urgent computing session on pre-defined resources, during which they can request an elevated priority for jobs. The computations can be run at different levels of urgency. For example, they can have a ‘next to run’ priority, such that the computation is run once the current job on the machine completes, or a ‘run immediately’ priority, such that existing jobs on the system are removed to make way for the ‘emergency’ computation. This pre-emptive mode is the most extreme form of urgent computing. The neurovascular blood-flow simulator HemeLB (discussed in Section 4.1) has been used with SPRUCE in a ‘next to run’ fashion on the large-scale Lonestar cluster at the Texas Advanced Computing Center (TACC). It was also demonstrated live on the show floor at SuperComputing 2007, where real-time visualisation and steering were used to control HemeLB within an urgent computing session.
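The difference between the two urgency levels described above can be made concrete with a small Python sketch. It is an illustration only and does not use or reproduce any SPRUCE interface: the `Urgency` enumeration and the `Machine` class are hypothetical names, the machine is assumed to run one job at a time, and token validation is not modelled.

```python
from collections import deque
from enum import Enum


class Urgency(Enum):
    NORMAL = 0
    NEXT_TO_RUN = 1
    RUN_IMMEDIATELY = 2


class Machine:
    """A hypothetical single-job machine with a waiting queue."""

    def __init__(self):
        self.running = None
        self.queue = deque()
        self.removed = []  # jobs removed by pre-emption

    def submit(self, job, urgency=Urgency.NORMAL):
        if self.running is None:
            # Idle machine: any job starts straight away.
            self.running = job
        elif urgency is Urgency.RUN_IMMEDIATELY:
            # Pre-emption: the running job is removed to make way.
            self.removed.append(self.running)
            self.running = job
        elif urgency is Urgency.NEXT_TO_RUN:
            # Jump to the head of the queue; the running job completes first.
            self.queue.appendleft(job)
        else:
            self.queue.append(job)

    def finish_current(self):
        """Complete the running job and start the next queued one, if any."""
        done = self.running
        self.running = self.queue.popleft() if self.queue else None
        return done
```

In this sketch a ‘next to run’ job waits only for the job that is currently executing, whereas a ‘run immediately’ job displaces it at once. What happens to a removed job (for example, whether it is requeued) is left open here.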
The TeraGrid also provides a contrasting solution to the need to run urgent simulations on its resources. SDSC provides an ‘On-Demand’ computer cluster, made available to researchers via the TeraGrid, to support scientists who need to run urgent scientific applications. The cluster is configured to give top priority to urgent simulations whose results are needed to plan responses to real-time events. When the system is not being used for on-demand work, it runs normal batch compute jobs, as do the majority of other TeraGrid resources. Many of the urgent scenarios currently considered concern the need to anticipate the effects of natural disasters, such as earthquakes and hurricanes, by performing simulations to predict possible consequences while the event is actually happening. Patient-specific medical simulations present another natural set of use cases for the resource.