CTWatch
March 2008
Urgent Computing: Exploring Supercomputing's New Role
Gabrielle Allen, Center for Computation & Technology and Department of Computer Science, Louisiana State University
Philip Bogden, Department of Physics, Louisiana State University
Tevfik Kosar, Archit Kulshrestha, Gayathri Namala, Sirish Tummala, Center for Computation & Technology and Department of Computer Science, Louisiana State University
Edward Seidel, Center for Computation & Technology and Department of Physics, Louisiana State University

3.3 Immediate Access to Resources

Using preemption or other mechanisms to enable urgent simulations on supercomputers is not new. However, the traditional procedure for implementing preemption is to run such jobs in a special queue to which access is granted only for a fixed set of users. The policy, queue configuration, and set of users on each machine, and particularly at each site, would need to be carefully negotiated (and usually frequently renegotiated). These procedures are usually not documented, making it difficult and time-consuming to add new users for urgent computing, or to change the configuration of machines, for example to accommodate larger simulations. To resolve some of these issues, the Special PRiority and Urgent Computing Environment (SPRUCE) [7] was incorporated into the workflow. SPRUCE is a specialized software system that supports urgent or event-driven computing on both traditional supercomputers and distributed Grids. It is being developed by the University of Chicago and Argonne National Laboratory and presently functions as a TeraGrid science gateway.

SPRUCE uses a token-based authentication system for resource allocation. Users are provided with right-of-way tokens, unique 16-character strings that can be activated through a web portal. Each token is created against the Common Name (CN) value of the administrator. When a token is activated, several other parameters are set, including:

  • Resources for urgent jobs: the activated token can be used to access any resource specified in this list, by any person registered on it.
  • Lifetime of the token: each token is given a specific time period; once active, the token can be used only during this period.
  • Maximum urgency that can be requested: specified by the colors red, orange, and yellow.
  • People to be notified when the token is used: for example, the local administrators.
SPRUCE is Grid middleware that integrates with the resource manager on each system. When SPRUCE is installed, the resource manager is equipped with an authentication filter that checks for a valid token associated with the submitting user name or Distinguished Name (DN). If an activated token is found, the job is submitted to a higher-priority queue, as sketched below.
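As a concrete illustration, the following minimal sketch (in Python) shows the kind of check such a filter performs at submission time. All names, fields, and the notify() hook are assumptions made for illustration, not the actual SPRUCE interfaces.

    # Minimal sketch of a SPRUCE-style token check at job submission.
    # All names, fields, and the notify() hook are illustrative assumptions,
    # not the actual SPRUCE implementation or API.
    import time

    URGENCY_LEVELS = {"yellow": 1, "orange": 2, "red": 3}

    def select_queue(job, active_tokens, default_queue="batch"):
        """Route a job to the urgent queue only if a valid activated token covers it."""
        for token in active_tokens:
            if job["user_dn"] not in token["authorized_dns"]:
                continue                          # token not issued to this user
            if job["resource"] not in token["resources"]:
                continue                          # resource not covered by this token
            if time.time() > token["expires_at"]:
                continue                          # token lifetime has elapsed
            if URGENCY_LEVELS[job["urgency"]] > URGENCY_LEVELS[token["max_urgency"]]:
                continue                          # requested urgency exceeds the token's ceiling
            notify(token["notify"], job)          # e.g., alert the local administrators
            return "urgent"                       # submit to the higher-priority queue
        return default_queue                      # no valid token: best-effort queue

    def notify(recipients, job):
        print(f"urgent job {job['id']}: notifying {recipients}")

In a real deployment this check runs inside the resource manager's submit filter, before the job ever reaches a queue.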
4. Results

The SCOOP on-demand system was demonstrated at the Supercomputing 2007 (SC07) conference in Reno, Nevada, using resources of SURAgrid and the Louisiana Optical Network Initiative (LONI). The demo illustrated how a hurricane event triggered the use of on-demand resources, and how the priority-aware scheduler scheduled the runs on the appropriate queues in the appropriate order. Because each ensemble member is guaranteed to run as soon as its input data has been generated, the set of runs chosen as high priority can in turn be guaranteed to complete before the six-hour deadline. Separate work benchmarking the models on different architectures was used to estimate the CPU time a model would need to complete, given the number of on-demand processors available.
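As a rough illustration of that estimate (again in Python, with assumed parameters rather than the production scheduler), the sketch below packs benchmarked member runtimes onto the concurrent slots the on-demand pool provides and compares the resulting makespan with the six-hour deadline.

    # Back-of-the-envelope deadline check using assumed benchmarked runtimes;
    # this is an illustration, not the production SCOOP scheduler.

    def ensemble_makespan(runtimes_s, procs_per_member, on_demand_procs):
        """Pack members longest-first onto the concurrent slots the pool allows."""
        slots = on_demand_procs // procs_per_member   # members that can run at once
        finish = [0.0] * slots                        # running finish time per slot
        for runtime in sorted(runtimes_s, reverse=True):
            i = finish.index(min(finish))             # next slot to free up
            finish[i] += runtime
        return max(finish)

    DEADLINE_S = 6 * 3600                             # six-hour forecast deadline
    runtimes = [90.0] * 4                             # e.g., four short test members
    makespan = ensemble_makespan(runtimes, procs_per_member=8, on_demand_procs=16)
    print(f"estimated makespan {makespan:.0f} s; deadline "
          f"{'met' if makespan <= DEADLINE_S else 'missed'}")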

SPRUCE was used to acquire the on-demand processors on some resources, which highlighted several advantages. For example, the SCOOP workflow was no longer tied to running as a small set of special users, which in turn meant there was no need to negotiate access to the on-demand queues with the resource owners. Using SPRUCE also gave resource owners the ability to restrict use of the system in on-demand mode while still providing on-demand resources to anyone who needs them; in the past this could only be done by adding and deleting user access on a case-by-case basis. SPRUCE tokens can now be handed out by an allocation committee, removing from system administrators the burden of evaluating users' need for on-demand resources.

Figures 7(a) and 7(b). Ensemble member execution and wait times for (a) best-effort and (b) on-demand execution.

Figures 7(a) and 7(b) show the execution and wait times for the various stages of the SCOOP workflow. Figure 7(a) shows execution with only best-effort resources: the pink bars depict the execution and queue wait times of the core Wave Watch III run on eight processors, and the queue wait times clearly account for most of the total time. Figure 7(b) depicts ensemble execution using on-demand resources. In this case, 16 processors were available for on-demand use, so two eight-processor ensemble members ran simultaneously while the others waited for them to finish.

A closer look at Figure 7(b) shows that ensemble members p38 and p02 executed first, followed by p14 and e10. The pink bars for p14 and e10 are twice as long as those for p38 and p02, showing that they began execution only after the first two members finished. Comparing the two graphs, the last run finished in about 700 seconds using on-demand resources, compared to about 2100 seconds without them. Note that these tests used a short three-hour forecast run that completes in about 90 seconds on the chosen platform.
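The doubled bar lengths follow directly from the slot arithmetic. The toy calculation below, assuming ~90 s per member and two eight-processor slots, reproduces the pattern seen in Figure 7(b).

    # Toy timeline assuming ~90 s per member and two eight-processor slots
    # (16 on-demand processors); parameters are assumptions for illustration.
    members = ["p38", "p02", "p14", "e10"]
    RUNTIME_S, SLOTS = 90.0, 2
    slot_free = [0.0] * SLOTS
    for name in members:
        i = slot_free.index(min(slot_free))   # earliest available slot
        start = slot_free[i]                  # queue wait before this member starts
        slot_free[i] = start + RUNTIME_S
        print(f"{name}: waits {start:.0f} s, finishes at {slot_free[i]:.0f} s")
    # p38 and p02 start at once; p14 and e10 wait ~90 s, so wait + run is about
    # 180 s, which is why their bars are roughly twice as long.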
