CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Arie Shoshani, Lawrence Berkeley National Laboratory
Ilkay Altintas, San Diego Supercomputer Center
Alok Choudhary, Northwestern University
Terence Critchlow, Pacific Northwest National Laboratory
Chandrika Kamath, Lawrence Livermore National Laboratory
Bertram Ludäscher, University of California, Davis
Jarek Nieplocha, Pacific Northwest National Laboratory
Steve Parker, University of Utah
Rob Ross, Argonne National Laboratory
Nagiza Samatova, Oak Ridge National Laboratory
Mladen Vouk, North Carolina State University

1. Introduction

Terascale computing and large scientific experiments produce enormous quantities of data that require effective and efficient management. The task of managing scientific data is so overwhelming that scientists spend much of their time developing special-purpose data management solutions rather than on scientific investigation and discovery. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages, from the initial data acquisition to the final analysis of the data. Fortunately, the data management problems encountered by most scientific domains are common enough to be addressed through shared technology solutions.

Based on community input, we have identified three significant requirements. First, more efficient access to storage systems is needed. In particular, parallel file system improvements are needed to read and write large volumes of data without slowing a simulation, analysis, or visualization engine. These processes are complicated by the fact that scientific data are structured differently for specific application domains and are stored in specialized file formats. Second, scientists require technologies that facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis and searches over large data sets. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. To facilitate efficient access, it is necessary to keep track of the location of the datasets, effectively manage storage resources, and efficiently select subsets of the data. Finally, generating the data, collecting and storing the results, post-processing the data, and analyzing the results is a tedious, fragmented process. Tools for automation of this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration.

The Scientific Data Management (SDM) Center,1 funded under the DOE SciDAC program, focuses on the application of known and emerging data management technologies to scientific applications. The Center’s goals are to integrate and deploy software-based solutions for the efficient and effective management of large volumes of data generated by scientific applications. Our purpose is not only to achieve efficient storage of and access to the data using specialized indexing, compression, and parallel storage and access technology, but also to make more effective use of the scientist’s time by eliminating unproductive simulations, providing specialized data-mining techniques, streamlining time-consuming tasks, and automating the scientist’s workflows. Our approach is to provide an integrated scientific data management framework whose components can be chosen by scientists and applied to their specific domains. By overcoming data management bottlenecks and unnecessary information-technology overhead through the use of this integrated framework, scientists are freed to concentrate on their science and achieve new scientific insights.

Reference this article
Shoshani, A., Altintas, I., Choudhary, A., Critchlow, T., Kamath, C., Ludäscher, B., Nieplocha, J., Parker, S., Ross, R., Samatova, N., Vouk, M. "Scientific Data Management: Essential Technology for Accelerating Scientific Discoveries," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/scientific-data-management-essential-technology-for-accelerating-scientific-discoveries/
