Supercomputing On Demand: SDSC Supports Event-Driven Science
March 2008
Somewhere in Southern California a large earthquake strikes without warning, and the news media and the public clamor for information about the temblor -- Where was the epicenter? How large was the quake? What areas did it impact?
A picture is worth a thousand words – or numbers – and the San Diego Supercomputer Center (SDSC) at UC San Diego is helping to provide the answers. Caltech computational seismologist Jeroen Tromp can now give the public movies that tell the story in a language that’s easy to understand, revealing waves of ground motion spreading out from the earthquake -- and he can deliver these movies in just 30 minutes with the help of a supercomputer at SDSC. But he can’t do it by submitting a job to a traditional computing batch queue and waiting hours or days for the results.
Tromp is an example of the new users in today’s uncertain world who require immediate access to supercomputing resources. To meet this need, SDSC has introduced OnDemand, a new supercomputing resource that will support event-driven science.
“This is the first time that an allocated National Science Foundation (NSF) TeraGrid supercomputing resource will support on-demand users for urgent science applications,” said Anke Kamrath, director of User Services at SDSC. “In opening this new computing paradigm we’ve had to develop novel ways of handling this type of allocation as well as scheduling and job handling procedures.”
In addition to supporting important research now, this system will serve as a model for developing on-demand capabilities on additional TeraGrid systems in the future. TeraGrid is an NSF-funded, large-scale production grid that links some of the nation’s largest supercomputer centers, including SDSC, for open scientific research.
Urgent applications that will make use of OnDemand range from making movies of Southern California earthquakes to systems that will help give near real-time warnings based on predicting the path of a tornado or hurricane, or foretell the most likely direction of a toxic plume released by an industrial accident or terrorist incident.
When an earthquake greater than magnitude 3.5 strikes Southern California, typically once or twice a month, Tromp’s simulation code needs to use 144 processors of the OnDemand system for about 28 minutes. Shortly after the earthquake strikes, a job is automatically submitted and immediately allowed to run. Any “normal” jobs running at the time are interrupted to make way for the on-demand job, and the simulation code launches.
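The trigger described above can be sketched in a few lines of Python. This is only an illustration, not the ShakeMovie code: the threshold, processor count, and run time come from the article, while `UrgentJobRequest` and `request_for_event` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional

# Figures quoted in the article.
MAGNITUDE_THRESHOLD = 3.5   # events must be greater than this magnitude
PROCESSORS_NEEDED = 144     # processors used by the simulation code
EXPECTED_MINUTES = 28       # approximate run time on OnDemand


@dataclass(frozen=True)
class UrgentJobRequest:
    """Hypothetical description of one on-demand simulation job."""
    event_id: str
    processors: int
    expected_minutes: int


def request_for_event(event_id: str, magnitude: float) -> Optional[UrgentJobRequest]:
    """Return a job request for events above the threshold, otherwise None."""
    if magnitude > MAGNITUDE_THRESHOLD:
        return UrgentJobRequest(event_id, PROCESSORS_NEEDED, EXPECTED_MINUTES)
    return None
```

In this sketch an event of exactly magnitude 3.5 does not trigger a run, since the article specifies events "greater than" 3.5.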
SDSC computational expert Dong Ju Choi worked extensively with Tromp to ensure that the simulation code will run efficiently in on-demand mode on the new system.
“SDSC’s new OnDemand system is an important step forward for our event-driven earthquake science,” said Tromp. “We’re getting very good performance that will let us cut the time to deliver earthquake movies from about 45 to 30 minutes or less, and every minute is important.”
The movies that result from the computations are made available as part of the ShakeMovie project in Caltech's Near Real-Time Simulation of Southern California Seismic Events Portal. But behind the scenes of these dramatic earthquake movies, a great deal of coordinated activity is rapidly taking place in a complex, automated workflow.
The system springs to life every time an earthquake occurs in Southern California. When an event takes place, thousands of seismograms, or ground motion measurements, are recorded at hundreds of stations across the region, and the earthquake’s epicenter, or location, as well as its depth and intensity are determined.
The waiting ShakeMovie system at Caltech collects these seismic recordings automatically over the Internet. Then, for events greater than magnitude 3.5, to fill in the gaps between the actual ground motion recorded at specific locations in the region, the scientists use the recorded data to guide a computer model that creates a “virtual earthquake,” giving an overall view of the ground motion throughout the region.
The animations rely on the SPECFEM3D_BASIN software, which simulates seismic wave propagation in sedimentary basins. The software computes the motion of the earth in 3-D based on the actual earthquake recordings and what is known about the subsurface structure of the region, which greatly affects the wave motion -- bending, speeding or slowing, and reflecting energy in complex ways.
After the full 3-D wave simulation is run on the OnDemand system at SDSC, and on a second system at Caltech for redundancy, data that capture the surface motion (displacement, velocity, and acceleration) are collected, mapped onto the topography of Southern California, and rendered into movies. The movies are then automatically published via the portal, and an email is sent to subscribers, including the news media and the public.
OnDemand is a Dell cluster with 64 Intel dual-socket, dual-core compute nodes for a total of 256 processors. Each 2.33 GHz, 4-way node has 8 GB of memory. The system, which has a nominal theoretical peak performance of 2.4 Tflops, is running the SDSC-developed Rocks open-source Linux cluster operation software and the IBRIX parallel file system. Jobs are scheduled by the Sun Grid Engine.
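The quoted processor count and peak performance can be checked with simple arithmetic. The node count, socket and core counts, and clock speed below come from the article; the figure of 4 floating-point operations per core per clock cycle is an assumption made here, chosen because it is consistent with the stated 2.4 Tflops.

```python
# Hardware figures quoted in the article.
NODES = 64
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 2
CLOCK_GHZ = 2.33

# Assumption (not stated in the article): each core completes
# 4 floating-point operations per clock cycle.
FLOPS_PER_CYCLE = 4


def total_cores() -> int:
    """Number of processor cores across the whole cluster."""
    return NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET


def peak_tflops() -> float:
    """Theoretical peak: cores x clock (GHz) x flops per cycle, in Tflops."""
    gflops = total_cores() * CLOCK_GHZ * FLOPS_PER_CYCLE
    return gflops / 1000.0


if __name__ == "__main__":
    print(f"{total_cores()} cores, about {peak_tflops():.1f} Tflops peak")
```

Under that assumption the estimate rounds to the article's nominal 2.4 Tflops.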
OnDemand also makes use of the SPRUCE system developed by a team at Argonne National Laboratory. SPRUCE provides production-level functionality, including access controls, reporting, and fine-grained control for urgent computing jobs. An organization can issue tokens to its user groups who have been approved for urgent computing runs. Different colors (classes) of SPRUCE tokens represent varying urgency levels. A yellow token will put the requested job in the normal queue in the Sun Grid Engine scheduler; an orange token goes to the high priority queue; and a job submitted with a red token will preempt running jobs if necessary.
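The token scheme can be pictured as a small lookup table. The sketch below is not the SPRUCE API; `Token`, `Policy`, and `policy_for` are hypothetical names, and the queue labels are placeholders. The article does not name the queue used for red tokens, so the label `urgent` is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Token(Enum):
    """Token colors described in the article."""
    YELLOW = "yellow"
    ORANGE = "orange"
    RED = "red"


@dataclass(frozen=True)
class Policy:
    queue: str          # placeholder queue label, not a real scheduler name
    may_preempt: bool   # whether running jobs may be interrupted


# Yellow -> normal queue, orange -> high-priority queue,
# red -> may preempt running jobs if necessary.
POLICIES = {
    Token.YELLOW: Policy(queue="normal", may_preempt=False),
    Token.ORANGE: Policy(queue="high_priority", may_preempt=False),
    Token.RED: Policy(queue="urgent", may_preempt=True),  # queue label assumed
}


def policy_for(color: str) -> Policy:
    """Look up the policy for a token color; raises ValueError if unknown."""
    return POLICIES[Token(color.strip().lower())]
```

Keeping the mapping in one table makes the escalation path easy to read: only the red class carries the right to interrupt running work.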
The researchers are working to develop additional capabilities. Currently, jobs with the least amount of accumulated CPU are the first to be preempted. In the future, preempted backfill jobs may be held and restarted when appropriate, without being killed, and investigation of checkpoint and restart systems is ongoing.
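The current preemption rule, in which jobs with the least accumulated CPU go first, can be sketched as follows. This is an illustration of the stated ordering only, not the Sun Grid Engine or SPRUCE implementation; `RunningJob` and `choose_victims` are hypothetical names, and stopping as soon as enough processors are free is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical record for a backfill job currently on the cluster."""
    name: str
    processors: int
    cpu_seconds: float  # CPU time accumulated so far


def choose_victims(running: List[RunningJob],
                   free_processors: int,
                   needed: int) -> List[RunningJob]:
    """Pick jobs to preempt, least accumulated CPU first, until the
    urgent job's processor request can be met."""
    shortfall = needed - free_processors
    victims: List[RunningJob] = []
    for job in sorted(running, key=lambda j: j.cpu_seconds):
        if shortfall <= 0:
            break
        victims.append(job)
        shortfall -= job.processors
    if shortfall > 0:
        raise ValueError("request cannot be met even if every job is preempted")
    return victims
```

Preempting the jobs with the least accumulated CPU first limits the amount of completed work that is lost when a job is killed.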
Backfill jobs consist of a variety of regular user jobs, primarily parallel scientific computing and visualization applications using MPI. Users who run on the OnDemand cluster are made aware of the cluster’s mission to prioritize jobs that require immediate turnaround.
One of the most interesting and successful applications using OnDemand is a commercial application called Star-P, which extends easy access to supercomputing to a much wider range of researchers. Users can code models and algorithms on their desktop computers using familiar applications like MATLAB, Python and R, and then run them interactively on SDSC's OnDemand cluster through the Star-P platform. This eliminates the need to re-program applications to run on parallel systems, so that programming that took months can now be done in days, and simulations that took days on the desktop can now be done in minutes. Lowering the barrier to supercomputing resources will let researchers jumpstart research that otherwise wouldn't get done.
Star-P lets researchers use HPC clusters transparently through a client-server framework: the client runs in the user’s desktop environment and the server runs in the HPC cluster environment. For example, existing MATLAB users on a PC desktop can now achieve parallel scalability from the same MATLAB desktop interface with a simple set of Star-P commands. This has enabled many users to achieve the tremendous speed-ups that advanced research groups see by laboriously reprogramming applications using MPI.
Researchers on SDSC’s OnDemand are using Star-P in a variety of application areas, including science, engineering, medical and financial disciplines. Several research groups have seen true performance breakthroughs through Star-P, which fundamentally changes the type of problems they are able to explore. A close collaboration with SDSC also won the Interactive Supercomputing HPC Challenge at SC07.
SDSC and its academic and industrial partners, including Argonne National Laboratory and Interactive Supercomputing, are aggressively continuing to improve the cluster environment to enhance this urgent computing service. With the experience SDSC is accumulating, OnDemand is playing a critical role as a testbed as the team works to further develop the urgent computing paradigm and a robust infrastructure to support it.