CTWatch Quarterly » Scientific Data Management: Essential Technology for Accelerating Scientific Discoveries

Scientific Data Management: Essential Technology for Accelerating Scientific Discoveries

Arie Shoshani, Lawrence Berkeley National Laboratory
Ilkay Altintas, San Diego Supercomputer Center
Alok Choudhary, Northwestern University
Terence Critchlow, Pacific Northwest National Laboratory
Chandrika Kamath, Lawrence Livermore National Laboratory
Bertram Ludäscher, University of California, Davis
Jarek Nieplocha, Pacific Northwest National Laboratory
Steve Parker, University of Utah
Rob Ross, Argonne National Laboratory
Nagiza Samatova, Oak Ridge National Laboratory
Mladen Vouk, North Carolina State University

The Three-Layer Organization of the SDM Center

As part of our evolutionary technology development and deployment process (from research through prototypes to deployment and infrastructure), we have organized our activities into three layers that abstract the end-to-end data flow described above. We labeled the layers Storage Efficient Access (SEA), Data Mining and Analytics (DMA), and Scientific Process Automation (SPA). The SEA layer sits immediately on top of the hardware and operating systems, providing parallel data access to files and transparent access to archival storage. The DMA layer, which builds on the functionality of the SEA layer, consists of indexing, feature selection, and parallel statistical analysis technology. The SPA layer, which sits on top of the DMA layer, provides the ability to compose workflows from the components in the DMA layer as well as from application-specific modules. Figure 1 shows this organization and the components developed by the center and applied to various scientific applications.

Figure 1. The three-layer organization of technologies in the SDM Center.

Over the last several years, the technologies supported by the SDM Center have been deployed in a variety of application domains. Some of the most notable achievements are:

More than a tenfold speedup in writing and reading netCDF files has been achieved with Parallel netCDF, software built on MPI-IO that is now used by astrophysics and climate applications and by Parallel VTK.
An improved version of PVFS is freely available to the community and is offered through cluster vendors. In addition to operating on clusters, it is routinely used on the IBM BlueGene/L and will soon be used on the BlueGene/P.
Methods for the correct classification of orbits in puncture plots and for “blob tracking” from the National Compact Stellarator eXperiment (NCSX) at PPPL were developed using a combination of image processing, statistics, and pattern recognition techniques.
A new bitmap indexing method has enabled an efficient search over billions of collisions (events) in High Energy Physics, and is being applied to combustion, astrophysics, and visualization domains. It achieves more than a tenfold speedup in generating regions and tracking them over time.
Parallel R, an open source parallel version of the popular statistical package R, has been developed and is being applied to climate, GIS, and mass spec proteomics applications.
A scientific workflow management and execution system (called Kepler) has been developed and deployed within multiple scientific domains, including genomics and astrophysics. The system supports the design and the execution of flexible and reusable, component-oriented workflows.
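
To give a feel for why the bitmap indexing listed above can make searches over very large event collections fast, here is a minimal sketch of a binned bitmap index. It is not the center's software: the function names, the attribute names (energy, n_tracks), the bin edges, and the sample values are all hypothetical, and a production index would also compress its bitmaps, which this sketch omits. The idea shown is only that each bin of an attribute gets one bitmap, so a range condition becomes a bitwise OR over bins and a multi-attribute condition becomes a bitwise AND.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i of a bitmap is set when values[i] is in that bin.

    Bitmaps are plain Python ints used as bit sets (an illustrative choice).
    Bin 0 holds v < bin_edges[0]; bin k holds bin_edges[k-1] <= v < bin_edges[k];
    the last bin holds v >= bin_edges[-1].
    """
    bitmaps = [0] * (len(bin_edges) + 1)
    for row, v in enumerate(values):
        bitmaps[bisect_right(bin_edges, v)] |= 1 << row
    return bitmaps


def rows_at_least(bitmaps, bin_edges, threshold):
    """Bitmap of rows with value >= threshold.

    For simplicity the threshold must coincide with a bin edge, so the answer
    is exact without re-checking the raw values.
    """
    first_bin = bin_edges.index(threshold) + 1
    hits = 0
    for bitmap in bitmaps[first_bin:]:
        hits |= bitmap  # OR together every bin at or above the threshold
    return hits


def to_rows(bitmap):
    """Convert a bitmap back into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical event table: one entry per collision event.
energy = [12.0, 87.5, 45.1, 99.9, 3.2, 60.0]
n_tracks = [3, 8, 5, 2, 9, 7]

energy_edges = [25.0, 50.0, 75.0]
track_edges = [4, 6]

energy_index = build_bitmap_index(energy, energy_edges)
track_index = build_bitmap_index(n_tracks, track_edges)

# Query: energy >= 50.0 AND n_tracks >= 6, answered with bitwise operations only.
selected = (rows_at_least(energy_index, energy_edges, 50.0)
            & rows_at_least(track_index, track_edges, 6))
print(to_rows(selected))
```

Once the indexes are built, the query never scans the raw columns. That is the property that lets this style of index scale to very large collections, although the speedups reported above depend on details this sketch leaves out.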
