CTWatch
February 2005
Trends in High Performance Computing
Jim Gray, Microsoft
David T. Liu, University of California at Berkeley
Maria Nieto-Santisteban, Johns Hopkins University
Alex Szalay, Johns Hopkins University
David DeWitt, University of Wisconsin
Gerd Heber, Cornell University

Some hints of success

There are early signs that this is a good approach. One of us has shown that doing the analysis atop a database system is vastly simpler and runs much faster than the corresponding file-oriented approach.8 The speedup is due to better indexing and parallelism.
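To make the contrast concrete, here is a minimal sketch, not drawn from the systems cited above, of the same question asked two ways: pulling every record into the program and filtering by hand, versus declaring the filter to the database and letting an index (and, in a server-class engine, parallel scans) do the work. The objects table and its columns are hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
# A minimal sketch with a hypothetical "objects" table; sqlite3 stands in for
# the server-class SQL engines used in the deployments described in the text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER, ra REAL, dec REAL, r_mag REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [(i, (i * 0.01) % 360, -30 + (i * 0.001) % 60, 14 + (i % 10) * 0.5)
     for i in range(100_000)],
)

# File-oriented style: every record crosses into the program and is filtered there.
bright_loop = [row for row in conn.execute("SELECT obj_id, r_mag FROM objects")
               if row[1] < 16.0]

# Database style: declare the question; an index lets the engine skip most rows,
# and a parallel engine can further partition the scan across disks and CPUs.
conn.execute("CREATE INDEX idx_rmag ON objects (r_mag)")
bright_set = conn.execute(
    "SELECT obj_id, r_mag FROM objects WHERE r_mag < 16.0"
).fetchall()

assert len(bright_loop) == len(bright_set)
```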

We have also had considerable success in adding user-defined functions and stored procedures to astronomy databases. The MyDB and CasJobs work for the Sloan Digital Sky Survey gives a good example of moving programs to the database.9
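To show the shape of the idea (this is not the CasJobs or SkyServer code, which runs inside a commercial SQL engine), the sketch below registers an ordinary function with the database and then invokes it from a query, so the computation runs next to the data rather than in a separate fetch-and-filter loop. The ang_dist function, the objects table, and the use of sqlite3 are all illustrative assumptions.

```python
import math
import sqlite3

# Hypothetical user-defined function: angular distance (degrees) between two
# sky positions, evaluated inside the query rather than in client code.
def ang_dist(ra1, dec1, ra2, dec2):
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = (math.sin(dec1) * math.sin(dec2)
         + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

conn = sqlite3.connect(":memory:")
conn.create_function("ang_dist", 4, ang_dist)
conn.execute("CREATE TABLE objects (obj_id INTEGER, ra REAL, dec REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                 [(1, 180.0, 0.0), (2, 180.1, 0.05), (3, 30.0, 45.0)])

# The "program" is now part of the query: find objects near a target position.
near = conn.execute(
    "SELECT obj_id FROM objects WHERE ang_dist(ra, dec, 180.0, 0.0) < 0.2"
).fetchall()
print(near)   # objects 1 and 2 are within 0.2 degrees of the target
```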

The BaBar experiments at SLAC manage a petabyte store of event data. The system uses Oracle to manage part of the file archive, and a physics-specific system called Root for data analysis.10

Adaptive Finite Element simulations spend considerable time and programming effort on input, output, and checkpointing. We (Heber) use a database to represent large Finite Element models. The initial model is represented in the database, and each checkpoint and analysis step is written to the database. Using a database allows queries to define more sophisticated mesh partitions and allows concurrent, indexed access to the simulation data for visualization and computational steering. Commercial Finite Element packages each use a proprietary form of a "database." They are, however, limited in scope, functionality, and scalability, and are typically buried inside the particular application stack.

Each worker in the MPI job gets its partition from the database (as a query) and dumps its progress to the database. These dumps are two to four orders of magnitude larger than the input mesh and represent a performance challenge in both traditional and database environments. The database approach has the added benefit that visualization tools can watch and steer the computation by reading and writing the database.

Finally, while we have focused on the ability of databases to simplify and speed up the production of raw simulation data, we cannot overstate their core competency: providing declarative data-analysis interfaces. It is with these tools that scientists spend most of their time. We hope to apply similar concepts to some turbulence studies being done at Johns Hopkins.
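The sketch below shows the shape of that interaction for a single worker, with hypothetical table names and sqlite3 standing in for the server-class database the text describes: the worker selects its mesh partition with a query, appends each checkpoint as ordinary rows, and a visualization or steering tool can read (or update) the same tables concurrently.

```python
import sqlite3

# A single worker's view. RANK stands in for the MPI rank (e.g. from mpi4py's
# MPI.COMM_WORLD.Get_rank()); the schema and file name are hypothetical.
RANK = 0
conn = sqlite3.connect("fe_model.db")

conn.executescript("""
CREATE TABLE IF NOT EXISTS elements   (elem_id INTEGER PRIMARY KEY, partition INTEGER,
                                       x REAL, y REAL, z REAL);
CREATE TABLE IF NOT EXISTS checkpoint (step INTEGER, elem_id INTEGER,
                                       stress REAL, displacement REAL);
CREATE INDEX IF NOT EXISTS idx_part ON elements (partition);
CREATE INDEX IF NOT EXISTS idx_step ON checkpoint (step, elem_id);
""")

# 1. The worker's mesh partition arrives as the answer to a query, not a file format.
my_elements = conn.execute(
    "SELECT elem_id, x, y, z FROM elements WHERE partition = ?", (RANK,)
).fetchall()

# 2. After each solver step, progress is dumped back as ordinary rows.
def write_checkpoint(step, results):
    """results: iterable of (elem_id, stress, displacement) from the solver."""
    conn.executemany(
        "INSERT INTO checkpoint VALUES (?, ?, ?, ?)",
        [(step, e, s, d) for (e, s, d) in results],
    )
    conn.commit()

# Placeholder solver output for the sketch: zero stress and displacement everywhere.
write_checkpoint(1, [(e, 0.0, 0.0) for (e, _x, _y, _z) in my_elements])

# 3. A visualization or steering tool can poll the same tables concurrently, e.g.:
#    SELECT elem_id, stress FROM checkpoint
#    WHERE step = (SELECT MAX(step) FROM checkpoint)
```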

Summary

Science centers that curate and serve science data are emerging around next-generation science instruments. The World-Wide Telescope, GenBank, and BaBar collaborations are prototypes of this trend. One group of scientists is collecting the data and managing these archives. A larger group of scientists is exploring these archives the way previous generations explored their private data. Often the results of the analysis are fed back to the archive to add to the corpus.

Because data collection is now separated from data analysis, extensive metadata describing the data in standard terms is needed so people and programs can understand the data. Good metadata becomes central for data sharing among different disciplines and for data analysis and visualization tools.

There is a convergence of the nascent databases (HDF, NetCDF, FITS, ...), which focus primarily on metadata issues and data interchange, and the traditional data management systems (SQL and others), which have focused on managing and analyzing very large datasets. The traditional systems have the virtues of automatic parallelism, indexing, and non-procedural access, but they need to embrace the data types of the science community and need to co-exist with data in file systems. We believe the emphasis on extending database systems by unifying databases with programming languages, so that one can either embed or link new object types into the data management system, will enable this synthesis.
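One flavor of that unification can be sketched even with lightweight tools: below, Python's sqlite3 adapter and converter hooks (standing in for the richer language integration the text has in mind) teach the engine to store and return an application-defined object type, so values round-trip between program and query without a hand-written translation layer. The Vec3 type and the particles table are hypothetical.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical application-defined type to be embedded in the database.
@dataclass
class Vec3:
    x: float
    y: float
    z: float

# Teach the engine how to flatten the type going in and rebuild it coming out.
sqlite3.register_adapter(Vec3, lambda v: f"{v.x};{v.y};{v.z}".encode())
sqlite3.register_converter("VEC3", lambda b: Vec3(*map(float, b.split(b";"))))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE particles (pid INTEGER, velocity VEC3)")
conn.execute("INSERT INTO particles VALUES (?, ?)", (1, Vec3(0.1, -2.0, 3.5)))

pid, v = conn.execute("SELECT pid, velocity FROM particles").fetchone()
print(pid, v)   # 1 Vec3(x=0.1, y=-2.0, z=3.5)
```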

Three technical advances will be crucial to scientific analysis: (1) extensive metadata and metadata standards that will make it easy to discover what data exists, make it easy for people and programs to understand the data, and make it easy to track data lineage; (2) great analysis tools that allow scientists to easily ask questions, and to easily understand and visualize the answers; and (3) set-oriented, data-parallel access supported by new indexing schemes and new algorithms that allow us to interactively explore peta-scale datasets.

The goal is a smart notebook that empowers scientists to explore the world’s data. Science data centers with computational resources to explore huge data archives will be central to enabling such notebooks. Because data is so large, and IO bandwidth is not keeping pace, moving code to data will be essential to performance. Consequently, science centers will remain the core vehicle and federations will likely be secondary. Science centers will provide both the archives and the institutional infrastructure to develop these peta-scale archives and the algorithms and tools to analyze them.

References
1 Committee on Data Management, Archiving, and Computing (CODMAC) Data Level Definitions science.hq.nasa.gov/research/earth_science_formats.html
2 hdf.ncsa.uiuc.edu/HDF5/
3 my.unidata.ucar.edu/content/software/netcdf/
4 fits.gsfc.nasa.gov/
5 vizier.u-strasbg.fr/doc/UCD.htx
6 Dean, J., Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters," ACM OSDI, Dec. 2004.
7 DeWitt, D., Gray, J. "Parallel Database Systems: the Future of High Performance Database Systems," CACM, Vol. 35, No. 6, June 1992.
8 M. Nieto-Santisteban, et al. "When Database Systems Meet the Grid," CIDR, 2005, www-db.cs.wisc.edu/cidr/papers/P13.pdf
9 O'Mullane, W., et al. "Batch is back: CasJobs serving multi-TB data on the Web," in preparation.
10 Becla, J., Wang, D. "Lessons Learned from Managing a Petabyte," CIDR, 2005, www-db.cs.wisc.edu/cidr/papers/P06.pdf


Reference this article
Gray, J., Liu, D., Nieto-Santisteban, M., Szalay, A., DeWitt, D., Heber, G. "Scientific Data Management in the Coming Decade," CTWatch Quarterly, Volume 1, Number 1, February 2005. http://www.ctwatch.org/quarterly/articles/2005/02/scientific-data-management/
