CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Arie Shoshani, Lawrence Berkeley National Laboratory
Ilkay Altintas, San Diego Supercomputer Center
Alok Choudhary, Northwestern University
Terence Critchlow, Pacific Northwest National Laboratory
Chandrika Kamath, Lawrence Livermore National Laboratory
Bertram Ludäscher, University of California, Davis
Jarek Nieplocha, Pacific Northwest National Laboratory
Steve Parker, University of Utah
Rob Ross, Argonne National Laboratory
Nagiza Samatova, Oak Ridge National Laboratory
Mladen Vouk, North Carolina State University

3. Descriptions of technologies developed and used in the SDM Center

In this section we describe the SDM Center technologies and include examples of their application in various scientific projects, proceeding from the top layer to the bottom layer.

The Kepler Scientific Workflow System

A practical bottleneck to more effective use of available computational and data resources is often the design of the processes that access and use those resources, and of the corresponding execution environments, i.e., the scientific workflow environment of end-user scientists. The goal of the Kepler system2 is to provide solutions and products for effective and efficient modeling, design, and execution of scientific workflows. Kepler is a multi-site open source effort, co-founded by the SDM Center, to extend the Ptolemy system (from UC Berkeley) and create an integrated scientific workflow infrastructure. We have also started to incorporate data, process, system, and workflow provenance, along with run-time tracking and monitoring. We have worked closely with application scientists to design, implement, and deploy workflows that address their real-world needs. In particular, we have active users on the SciDAC Terascale Supernova Initiative (TSI) team and an LLNL Biotechnology project, as well as at the Center for Plasma Edge Simulation (CPES) fusion project. While the Scientific Process Automation (SPA) layer uses Kepler to achieve workflow automation, it is the specific task components (called “actors” in Kepler) developed by the SDM Center that make our work uniquely useful to scientific applications.
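To make the actor/dataflow idea concrete, the following is a minimal, hypothetical Python sketch of a two-actor pipeline. Kepler itself is built in Java on Ptolemy II, so the class and method names here (Actor, fire, Workflow) are illustrative only and not Kepler's actual API.

```python
# Hypothetical sketch of the actor/dataflow model behind Kepler.
# All names are illustrative placeholders, not Kepler's real (Java) API.

class Actor:
    """A reusable task component that consumes and produces named tokens."""
    def fire(self, inputs):
        raise NotImplementedError

class FetchSimulationOutput(Actor):
    """Stage a remote file produced by a simulation run."""
    def fire(self, inputs):
        remote = inputs["remote_path"]
        return {"local_path": "/scratch/" + remote.split("/")[-1]}

class ConvertFormat(Actor):
    """Convert raw simulation output into an analysis-ready format."""
    def fire(self, inputs):
        return {"converted": inputs["local_path"] + ".h5"}

class Workflow:
    """A directed pipeline of actors; in Kepler, a 'director' schedules them."""
    def __init__(self, actors):
        self.actors = actors
    def run(self, tokens):
        for actor in self.actors:
            tokens = actor.fire(tokens)
        return tokens

if __name__ == "__main__":
    wf = Workflow([FetchSimulationOutput(), ConvertFormat()])
    print(wf.run({"remote_path": "host:/runs/tsi/output.bin"}))
```

The point of the sketch is that each domain-specific task becomes a reusable component with well-defined inputs and outputs, so scientists can compose and rearrange pipelines without reprogramming the underlying steps.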

Figure 2. An abstract representation of a scientific workflow.

Underlying challenges related to simulations, data analysis, and data manipulation include scalable parallel numerical algorithms for the solution of large, often sparse linear systems, flow equations, and large eigenvalue problems; running simulations on supercomputers; moving large amounts of data over large distances; collaborative visualization and computational steering; and collecting appropriate process- and simulation-related status and provenance information. Addressing them requires interdisciplinary teams of application scientists and computer scientists working together to define the workflows and cast them into the Kepler workflow framework. The general underlying “templates” are often similar across disciplines: large-scale parallel computation and steering (hundreds of processors, gigabytes of memory, hours to weeks of CPU time), data movement and reduction (terabytes of data), and visualization and analytics (interactive, retrospective, and auditable). An abstraction of such a workflow and its Kepler translation are illustrated in Figures 2 and 3 for a particular astrophysics project, called the Terascale Supernova Initiative (TSI).3 Figure 3 shows the capability of the Kepler system to represent hierarchically structured workflows: in the center of the figure there are four simple high-level tasks, each of which is expanded into lower-level tasks that manage the detailed processes, as sketched after the figure caption below.

Figure 3. Instantiation of the abstract workflow in Kepler.
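The hierarchical structure in Figure 3 can be summarized with a small, hypothetical Python sketch: each high-level step is itself a sub-workflow of lower-level tasks. All task names below are illustrative placeholders; Kepler expresses the same idea with composite actors rather than Python functions.

```python
# Hypothetical sketch of hierarchical workflow composition (cf. Figure 3).
# Each high-level task expands into lower-level tasks that manage details.

def submit_job(ctx):
    ctx["job_id"] = "sim-001"            # e.g., batch submission on a supercomputer
    return ctx

def monitor_job(ctx):
    ctx["status"] = "completed"          # poll until the simulation finishes
    return ctx

def transfer_files(ctx):
    ctx["staged_path"] = "/archive/sim-001/"   # wide-area data movement
    return ctx

def render_images(ctx):
    ctx["frames"] = 42                   # visualization / analytics step
    return ctx

def record_provenance(ctx):
    ctx["provenance"] = "logged"         # run-time tracking and provenance
    return ctx

# Four high-level tasks, each expanded into its lower-level sub-workflow.
workflow = [
    ("Run simulation",   [submit_job, monitor_job]),
    ("Move data",        [transfer_files]),
    ("Visualize",        [render_images]),
    ("Track provenance", [record_provenance]),
]

ctx = {}
for step_name, subtasks in workflow:
    print("== " + step_name + " ==")
    for task in subtasks:
        ctx = task(ctx)
print(ctx)
```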

