Today’s large-scale computational runs often result in large-scale data output. It is not uncommon for a simulation to generate a million files and tens of terabytes of data with over 30 individuals collaborating on the application runs. This level of data output requires dedicated handling to move the data from the originating disk cache into a digital library for future access, with replication on an archival storage system.
SDSC’s digital data collections are representative of the state of the art. Digital collections developed for specific scientific disciplines typically have unique usage models but can share the same evolving data management infrastructure, with the difference between usage and storage models mainly tied to differences in management policies for sustainability and governance. Table 1 lists three categories of digital holdings at SDSC, loosely characterized as data grids (primarily created to support data sharing), digital libraries (created to formally publish the digital holdings), and persistent archives (focused on the management of technology evolution).
Data management requirements can be derived from Table 1. Today, it is not uncommon for a collection to contain 10 to 100 terabytes of data, with two to 10 million files. In fact, collections are now assembled that have too many files to house in a single file system; containers are used to aggregate files into a larger package before storage, or files are distributed across multiple file systems. The number of individuals who collaborate on developing a shared collection can range from tens to hundreds. In Table 1, the rightmost column, labeled Users with ACLs (users with access controls), shows how many individuals (including staff) are typically involved in writing files, adding metadata, or changing the digital holdings in the collection. The number of individuals who access the collection can be much larger, as most of the collections are publicly accessible.
Project | 5/17/02: GBs of data stored | 5/17/02: 1000’s of files | 6/30/04: GBs of data stored | 6/30/04: 1000’s of files | 6/30/04: Users with ACLs | 1/3/06: GBs of data stored | 1/3/06: 1000’s of files | 1/3/06: Users with ACLs |
Data Grid | | | | | | | | |
NSF / NVO | 17,800 | 5,139 | 51,380 | 8,690 | 80 | 93,252 | 11,189 | 100 |
NSF / NPACI | 1,972 | 1,083 | 17,578 | 4,694 | 380 | 34,452 | 7,235 | 380 |
Hayden | 6,800 | 41 | 7,201 | 113 | 178 | 8,013 | 161 | 227 |
Pzone | 438 | 31 | 812 | 47 | 49 | 19,674 | 10,627 | 68 |
NSF / LDAS-SALK | 239 | 1 | 4,562 | 16 | 66 | 104,494 | 131 | 67 |
NSF / SLAC-JCSG | 514 | 77 | 4,317 | 563 | 47 | 15,703 | 1,666 | 55 |
NSF / TeraGrid | | | 80,354 | 685 | 2,962 | 195,012 | 4,071 | 3,267 |
NIH / BIRN | | | 5,416 | 3,366 | 148 | 13,597 | 13,329 | 351 |
Digital Library | | | | | | | | |
NSF / LTER | 158 | 3 | 233 | 6 | 35 | 236 | 34 | 36 |
NSF / Portal | 33 | 5 | 1,745 | 48 | 384 | 2,620 | 53 | 460 |
NIH / AfCS | 27 | 4 | 462 | 49 | 21 | 733 | 94 | 21 |
NSF / SIO Explorer | 19 | 1 | 1,734 | 601 | 27 | 2,452 | 1,068 | 27 |
NSF / SCEC | | | 15,246 | 1,737 | 52 | 153,159 | 3,229 | 73 |
Persistent Archive | | | | | | | | |
NARA | 7 | 2 | 63 | 81 | 58 | 2,703 | 1,906 | 58 |
NSF / NSDL | | | 2,785 | 20,054 | 119 | 5,205 | 50,586 | 136 |
UCSD Libraries | | | 127 | 202 | 29 | 190 | 208 | 29 |
NHPRC / PAT | | | | | | 101 | 474 | 28 |
TOTAL | 28 TB | 6 mil | 194 TB | 40 mil | 4,635 | 655 TB | 106 mil | 5,383 |
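The container approach mentioned above, in which many files are aggregated into a larger package before storage, can be illustrated with a minimal sketch. It assumes tar archives as the container format and a simple size threshold; the function name `pack_containers` and its parameters are hypothetical and do not describe the mechanism SDSC actually uses.

```python
import tarfile
from pathlib import Path


def pack_containers(src_dir: Path, container_dir: Path, max_bytes: int) -> list[Path]:
    """Aggregate the files under src_dir into tar containers of about max_bytes each.

    Assumption: container_dir is located outside src_dir.
    """
    container_dir.mkdir(parents=True, exist_ok=True)
    containers: list[Path] = []
    current = None
    current_size = 0
    for path in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        size = path.stat().st_size
        # Open a new container when the current one would exceed the threshold.
        if current is None or (current_size > 0 and current_size + size > max_bytes):
            if current is not None:
                current.close()
            name = container_dir / f"container_{len(containers):05d}.tar"
            current = tarfile.open(name, "w")
            containers.append(name)
            current_size = 0
        # Store each file under its path relative to the collection root.
        current.add(path, arcname=path.relative_to(src_dir).as_posix())
        current_size += size
    if current is not None:
        current.close()
    return containers
```

The storage system then sees a handful of large objects instead of millions of small ones, at the cost of having to unpack a container (or index its members) to reach a single file.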
For many digital holdings, the collection may be replicated among different storage systems and/or sites. The replication serves multiple purposes:
- To meet governance and sustainability policies, with a copy at the institution that has assumed long-term management of the collection
- To mitigate the risk of data loss. At least five different loss mechanisms are mitigated through replication: media corruption (e.g., disk crash or tape parity error), systemic vendor product error (such as bad microcode in a tape drive), operational error, malicious user attack, and natural disaster (e.g., fire, flood, or hurricane).
- To improve access via disk caches. Wide-area networks are characterized by access latencies (typically tens to hundreds of milliseconds) that are substantially higher than those of a spinning disk. Replicating data onto a local disk cache ensures interactive access for local users. Replicating data onto a remote disk cache ensures interactive access for the remote users.
- To provide high availability. Having multiple independent copies ensures that when any single system component is taken offline for maintenance, or is down because of failure, the digital holdings can still be accessed.
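The loss-mitigation and availability purposes listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the data grid's own implementation: replicas are modeled as plain directories, integrity is checked with a SHA-256 checksum, and the helper names (`sha256_of`, `replicate`, `read_first_good`) are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, replica_dirs: list[Path]) -> str:
    """Copy source into every replica directory and verify each copy."""
    checksum = sha256_of(source)
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != checksum:
            raise IOError(f"replica at {target} failed verification")
    return checksum


def read_first_good(name: str, replica_dirs: list[Path], checksum: str) -> bytes:
    """Return the content of the first replica that exists and is uncorrupted."""
    for directory in replica_dirs:
        candidate = directory / name
        try:
            if sha256_of(candidate) == checksum:
                return candidate.read_bytes()
        except OSError:
            continue  # replica missing or storage offline; try the next one
    raise FileNotFoundError(f"no intact replica of {name}")
```

Because the reader falls through to the next copy whenever one is missing or fails its checksum, a single corrupted or offline replica does not interrupt access.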
For many collections, data sources are inherently distributed. The National Virtual Observatory collection provides an example of this. Thus, a data management environment must provide the capabilities needed to manage data distributed over a wide-area network. This requirement can be characterized as latency management and is typically achieved by minimizing the number of messages that are sent over wide-area networks. Common mechanisms for latency management include
- replication,
- bulk operations for manipulating small files and loading metadata, and
- remote procedures to parse or filter data directly at the remote storage system.
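Why minimizing message count matters can be seen from a back-of-the-envelope cost model. The sketch below assumes each message costs one full round trip and that payload time is simply bytes over bandwidth; the function `transfer_time` and all numeric values are illustrative assumptions, not measurements.

```python
import math


def transfer_time(n_files: int, bytes_per_file: int, rtt_s: float,
                  bandwidth_bps: float, files_per_message: int = 1) -> float:
    """Estimate the seconds needed to move n_files over a wide-area link.

    Assumption: every message pays one round-trip time (rtt_s), and the
    payload moves at the full link bandwidth (bits per second).
    """
    messages = math.ceil(n_files / files_per_message)
    latency_cost = messages * rtt_s
    payload_cost = n_files * bytes_per_file * 8 / bandwidth_bps
    return latency_cost + payload_cost


# 100,000 files of 10 KB each, 50 ms round trip, 1 Gb/s link (assumed values).
one_by_one = transfer_time(100_000, 10_000, 0.05, 1e9)
in_bulk = transfer_time(100_000, 10_000, 0.05, 1e9, files_per_message=1_000)
```

Under these assumptions the one-file-per-message case takes roughly 5,008 seconds, almost all of it latency, while batching 1,000 files per message takes roughly 13 seconds. The payload time is the same 8 seconds in both cases, which is why bulk operations pay off most for small files.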
Many data collections at SDSC are managed on top of federated data grids. Having multiple independent data grids, each with a copy of the data and metadata (both descriptive attributes and state information generated by operations on the data), ensures that no single disaster can destroy the aggregated digital holdings. Federation allows the management of shared name spaces between the independent data grids, enabling the cross registration of files, metadata, user names, and storage resources. The types of federation environments range from peer-to-peer data grids, with only public information shared between data grids, to central archives that hold a copy of records from otherwise independent data grids, to worker data grids that receive their data from a master data grid.
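Cross registration between peer data grids, with only public information shared, can be pictured with a toy data model. Everything in this sketch is hypothetical: the `Entry` and `DataGrid` classes, the `/grid-name/file` naming scheme, and the `public` flag are assumptions made for illustration and do not reflect the catalog schema of any real data grid software.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    physical_path: str    # storage resource and path that hold the bytes
    metadata: dict        # descriptive attributes
    public: bool = True   # whether peer grids may see this record


@dataclass
class DataGrid:
    name: str
    catalog: dict[str, Entry] = field(default_factory=dict)

    def register(self, logical_name: str, entry: Entry) -> None:
        """Register a file under this grid's own logical name space."""
        self.catalog[f"/{self.name}/{logical_name}"] = entry

    def cross_register_from(self, peer: "DataGrid") -> int:
        """Record the peer's own public entries, keeping the peer's names."""
        added = 0
        prefix = f"/{peer.name}/"
        for logical_name, entry in peer.catalog.items():
            is_native = logical_name.startswith(prefix)
            if entry.public and is_native and logical_name not in self.catalog:
                self.catalog[logical_name] = entry
                added += 1
        return added
```

Prefixing logical names with the owning grid keeps the shared name spaces from colliding, and restricting the exchange to a peer's native, public entries mirrors the peer-to-peer case described above. A central archive or master-worker arrangement would change which entries are copied and in which direction.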