CTWatch Quarterly » Designing and Supporting Data Management and Preservation Infrastructure

1. Introduction

The 20th century brought about an “information revolution” that has forever altered the way we work, communicate, and live. In the 21st century, data is ubiquitous. Available in digital format via the web, desktop, personal device, and other venues, data collections both directly and indirectly enable a tremendous number of advances in modern science and engineering.

Today’s data collections span the spectrum in discipline, usage characteristics, size, and purpose. The life science community utilizes the continually expanding Protein Data Bank¹ as a worldwide resource for studying the structures of biological macromolecules and their relationships to sequence, function, and disease. The Panel Study of Income Dynamics (PSID),² a longitudinal study initiated in 1968, provides social scientists detailed information about more than 65,000 individuals spanning as many as 36 years of their lives. The National Virtual Observatory³ is providing an unprecedented resource for aggregating and integrating data from a wide variety of astronomical catalogs, observation logs, image archives, and other resources for astronomers and the general public. Such collections have broad impact, are used by tens of thousands of individuals on a regular basis, and constitute critical and valuable community resources.

However, the collection, management, distribution, and preservation of such digital resources does not come without cost. Curation of digital data requires real support in the form of hardware infrastructure, software infrastructure, expertise, human infrastructure, and funding. In this article, we look beyond digital data to its supporting infrastructure, and provide a holistic view of the software, hardware, human infrastructure, and costs required to support modern data-oriented applications in research, education, and practice.

2. Digital Data Curation

Digital data curation focuses on the generation of descriptive metadata and validation of the quality of the data. Digital data preservation focuses on the characterization of the data authenticity (provenance information), and the management of data integrity across multiple generations of storage technology. An example of a curated community digital collection is the Protein Data Bank (PDB). The PDB is a global resource for structural information about proteins that is maintained by the Worldwide Protein Data Bank (wwPDB). This organization is composed of the Research Collaboratory for Structural Bioinformatics (RCSB), a consortium consisting of groups at UCSD/SDSC and Rutgers; the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) in Hinxton, UK; and PDBj in Osaka Japan.

When a user accesses a data portal for the PDB for information on HIV-1 protease, a target in the fight against AIDS, considerable infrastructure is provided to support this. Behind the scenes, the following components are involved in providing information on HIV-1 protease to the user:

Data are collected, annotated and validated at one of three wwPDB deposition sites; in the US this site is located at the Rutgers site of RCSB. The wwPDB has adopted the PDB Exchange Dictionary as a means of standardizing semantics to ensure uniform data and provides the foundation for data exchange within the wwPDB and delivery of a standard data representation to the public.
The acquisition, annotation and validation of PDB data requires about 20 highly trained personnel as well as significant computational, storage and network resources. An individual PDB dataset may consist of more than 1000 individual data items with some containing as many as 1M instances. Data acquisition requires reliable, low-latency network connections and high-performance servers to deliver real-time data validation. The validation and computation of derived features for these datasets is computationally intensive and is performed on clusters of Linux and Solaris servers at each of the wwPDB sites. PDB data that are ready for public release are transferred to the RCSB PDB site at SDSC, the main US PDB distribution site.
The RCSB PDB data portal accessed by the user is served from a Linux Cluster at SDSC using software written and maintained by the RCSB PDB team in collaboration with SDSC researchers. The portal comprises a Linux commodity cluster controlled by dual Cisco load balancers that handle traffic from 10,000 scientists a day making 2M page requests. Each cluster node has a redundant copy of the PDB (approx 1TB) and state is maintained using JBOSS and the load balancers. A failover is provided using UltraDNS (third party) that fails over to a small cluster at Rutgers University in New Jersey on the rare occasions that power or networking is lost to all nodes at SDSC. In this way 99.99% uptime is maintained. The RCSB PDB previously maintained mirrors around the world for fast access, but current high-speed networking to SDSC is good enough that these are no longer needed for this purpose.
A group of nine staff and students in the PDB group at UCSD develop access and mining tools for the PDB community. In addition to the PDB site at UCSD the wwPDB sites in Europe and Japan provide complementary views of the common set of archival data.
The PDB currently requires 20TB of storage associated with the collection at SDSC alone.
Additional costs for maintaining the curation involve business office and HR costs, the costs of administering the project by its leadership and management, the costs of advisory and evaluation input (travel, time, etc.), and other related costs.

The RCSB PDB database infrastructure at UCSD is coded in Java and completely built from public domain software. A MySQL database is used at SDSC to instantiate this schema, with middle and presentation layers built around Hibernate. Focus on software development includes systems to organize the material and provide services to support discovery, browsing, and presentation. The RCSB PDB infrastructure was developed to allow the collection to expand continuously and to allow ingestion and evaluation for new entries.

In short, to accommodate the request for information about HIV-1 protease from the PDB, substantial software, hardware, human, and funding support is required. At a recent AAAS Panel on data,⁴ RCSB PDB Director Helen Berman estimated that in 2005 more than one billion dollars of research funding were spent to generate the data that were collected, curated and distributed by the PDB.

3. Preserving Data over Time

Some digital collections will continue to be valuable resources for the foreseeable future. These typically include irreplaceable collections (e.g., the Shoah Collection of Holocaust survivor testimony),⁵ valuable community reference collections (e.g., PDB, NVO, PSID), and historically valuable collections such as federal digital records.6 ⁷ For these digital collections, lifetime is measured in decades, with continuous active preservation, and often new material is added over time. Over a collection’s decades of existence, the media on which it is stored will go through tens of generations, standard encoding formats will evolve, preservation staff and institutions may change, etc. In short, everything involved with the collection may evolve, and evolution must be planned and executed in a way that maintains the integrity of the data collection and minimizes disruption to access from its user community.

Because the time periods over which long-term digital collections are preserved are measured in decades, the need for preservation environments is critical. At SDSC, some of the current data collections have been migrated over the last 20 years onto six generations of storage technology. Over that period, the trend in tape media costs per byte has been exponential, dropping by half approximately every three years. If this exponential trend continues, the total life-time cost of media is only twice the original media cost, being

(1 + 1/2 + 1/4 + …) * (original cost).

Of course, tape media are only a modest portion of the true cost of long-term storage and the labor for administering the storage system, in particular managing the transitions between generations of storage technology, must be incorporated into cost models (see below). Generally, the number of individuals managing the collections can stay constant, after the initial period of implementation, even though both the size of the data files and the size of the storage media are growing. This means that costs related to storage management labor are increasing slower than costs related to collection building and maintenance.

4. Data Management and Preservation for the Science and Engineering Community

Today’s large-scale computational runs often result in large-scale data output. It is not uncommon for a simulation to generate a million files and tens of terabytes of data with over 30 individuals collaborating on the application runs. This level of data output requires dedicated handling to move the data from the originating disk cache into a digital library for future access, with replication on an archival storage system.

SDSC’s digital data collections are representative of the state of the art. Digital collections developed for specific scientific disciplines typically have unique usage models but can share the same evolving data management infrastructure, with the difference between usage and storage models mainly tied to differences in management policies for sustainability and governance. Table 1 lists three categories of digital holdings at SDSC, loosely characterized as data grids (primarily created to support data sharing), digital libraries (created to formally publish the digital holdings), and persistent archives (focused on the management of technology evolution).

Data management requirements can be derived from Table 1. Today, it is not uncommon for a collection to contain 10 to 100 hundred terabytes of data, with two to 10 million files. In fact, collections are now assembled that have too many files to house in a single file system – containers are used to aggregate files into a larger package before storage, or files are distributed across multiple file systems. The number of individuals that collaborate on developing a shared collection can range from tens to hundreds. In Table 1, the column on the right labeled ACLs (Users with Access Controls) shows how many individuals (including staff) are typically involved in writing files, adding metadata, or changing the digital holdings in the collection. The number of individuals who access the collection can be much larger, as most of the collections are publicly accessible.

Date	5/17/02		6/30/04			1/3/06
Project	GBs of data stored	1000’s of files	GBs of data stored	1000’s of files	Users with ACLs	GBs of data stored	1000’s of files	Users with ACLs
Data Grid
NSF / NVO	17,800	5,139	51,380	8,690	80	93,252	11,189	100
NSF / NPACI	1,972	1,083	17,578	4,694	380	34,452	7,235	380
Hayden	6,800	41	7,201	113	178	8,013	161	227
Pzone	438	31	812	47	49	19,674	10,627	68
NSF / LDAS-SALK	239	1	4,562	16	66	104,494	131	67
NSF / SLAC-JCSG	514	77	4,317	563	47	15,703	1,666	55
NSF / TeraGrid			80,354	685	2,962	195,012	4,071	3,267
NIH / BIRN			5,416	3,366	148	13,597	13,329	351
Digital Library
NSF / LTER	158	3	233	6	35	236	34	36
NSF / Portal	33	5	1,745	48	384	2,620	53	460
NIH / AfCS	27	4	462	49	21	733	94	21
NSF / SIO Explorer	19	1	1,734	601	27	2,452	1,068	27
NSF / SCEC			15,246	1,737	52	153,159	3,229	73
Persistent Archive
NARA	7	2	63	81	58	2,703	1,906	58
NSF / NSDL			2,785	20,054	119	5,205	50,586	136
UCSD Libraries			127	202	29	190	208	29
NHPRC / PAT						101	474	28
TOTAL	28 TB	6 mil	194 TB	40 mil	4,635	655 TB	106 mil	5,383

Table 1. Evolution of digital holdings at SDSC

For many digital holdings, the collection may be replicated among different storage systems and/or sites. The replication serves multiple purposes:

To meet governance and sustainability policies, with a copy at the institution that has assumed long-term management of the collection
To mitigate the risk of data loss. At least five different loss mechanisms are mitigated through replication; media corruption (e.g., disk crash or tape parity error), systemic vendor product error (such as bad microcode in a tape drive), operational error, malicious user attack, and natural disaster (e.g., fire, flood, hurricane, etc.).
To improve access via disk caches. Wide-area-networks are characterized by access latencies (typically tens to hundreds of milliseconds) that are substantially higher than that of a spinning disk. Replicating data onto a local disk cache ensures interactive access for local users. Replicating data onto a remote disk cache ensures interactive access for the remote users.
To provide high availability. Having multiple independent copies ensures that when any single system component is taken offline for maintenance, or is down because of failure, the digital holdings can still be accessed.

For many collections, data sources are inherently distributed. The National Virtual Observatory collection provides an example of this. Thus, a data management environment must provide the capabilities needed to manage data distributed over a wide-area-network. This requirement can be characterized as latency management and is typically achieved by minimizing the number of messages that are sent over wide-area-networks. Common mechanisms for latency management include

replication,
bulk operations for manipulating small files and loading metadata, and
remote procedures to parse or filter data directly at the remote storage system.

Many data collections at SDSC are managed on top of federated data grids. Having multiple independent data grids, each with a copy of the data and metadata (both descriptive attributes and state information generated by operations on the data), ensures that no single disaster can destroy the aggregated digital holdings. Federation allows the management of shared name spaces between the independent data grids, enabling the cross registration of files, metadata, user names, and storage resources. The types of federation environments range from peer-to-peer data grids, with only public information shared between data grids, to central archives that hold a copy of records from otherwise independent data grids, to worker data grids that receive their data from a master data grid.

5. Data Management System Components and Costs

The data management systems supporting digital collections support a variety of functions including sharing of data, publishing of data, preserving of data, and analyzing of data.

Data management system components may include the following:

Authenticity system for validating the identity of users
Authorization software system for controlling updates
Disk cache for interactive access
Archival storage for long term preservation of data
Databases for organizing metadata for each collection
Data grids for managing distributed data
High-performance networks between the disk caches and archives, capable of supporting parallel I/O streams
Workflow software for managing curation and preservation processes such as checksum validation, synchronization of replicas, transformative migration of encoding formats to new standards.
Portals for managing access to the collections, including digital library services for presenting the data
Analysis systems for applying classification and categorization filters to the data
Knowledge systems for managing the resulting relationships or inferences that have been made on the data

The costs of such data management environments are driven by the need for integrity (eg., multiple replicas, validation of checksums over time, management of access controls), authenticity (management of provenance information to understand data context), scalability (management of the future number of files and amount of storage), and access (support for interactive versus batch access). By minimizing capabilities, the cost can be reduced. For a system that promotes the advancement of science, ability to support intensive analysis is key. For a system that ensures high reliability, the risk of data loss must be minimized.

Component costs of the data management system include the costs of installation, maintenance, and evolution of

Authentication and authorization systems. Data grids are able to use the Grid Security Infrastructure to authenticate users. This requires a Certificate Authority and an associated dedicated server.
Disk storage. Today, disk storage costs between $1000-$5000 per Terabyte per year (amortized cost of capital equipment and labor to administer the storage) depending on the type of disk and built-in redundancy (e.g., mirroring).
Tape storage. Current tape archives cost about $400 per Terabyte per year (amortized cost of media, tape silos, archive software and labor to administer the storage). The amortized media cost is about 1/6 of this annual cost. However, note that this is the cost for a single copy. Three copies are preferred to minimize risk of data loss (original plus two replicas). If one of these copies is kept on disk, then only two tape copies are needed, preferably stored at different locations on different vendor equipment under different administrative control.
The Database. The metadata used for provenance and discovery is stored on-line in a database. A single database can support multiple database instances, allowing all of the collections to be managed using the same software. The cost of management of a database instance is about $5000 per year, for a database that supports 15-20 instances. The cost of the database software depends upon the vendor, with open-source databases requiring more local expertise to run and commercial databases requiring a service contract.
The Data grid. The software that supports distributed data is freely available to academic institutions, and the administrative support is similar to that of the database administrator.
The High-performance network. Note that the movement of a Terabyte of data per day is equivalent to a sustained data rate of 11.6 MB/second. Current collections at SDSC are growing in size at the rate of 1-2 Terabytes per day. The replication and access of this data requires networks that sustain 25-50 MB/second.
Workflow software. The platforms that support the workflows in the past have been the same as the application computer platforms and the data analyses. For the manipulation and analysis of 10 Terabytes of data per day, 10-Teraflop systems are required.
User portals. The server supporting the portal or digital library interface is typically accessed over the web. This implies the need to support JSR 168 java portlets as well as web servers. This is typically done on a server separate from the database server.
The Knowledge system. The management of relationships on the data is enabled by modern digital library middleware. This system typically runs on the same server as the database technology.

6. Management

The long-term management of data requires a sustainability and governance model that specifies the policies that will be used to guarantee funding support, minimize risk of data loss, assure integrity, and assure authenticity.

The management plan needs to address plans for future access if the sustainability model fails, where the collection might be housed, and how the material will be migrated to the new environment. The concept of infrastructure independence in persistent archives can be extended to include independence from a particular sustainability model through federation with other institutions that use alternate sustainability models. Guaranteed access to a collection requires a community that is willing to curate the collection, identify risks to the maintenance of the collection, and seek opportunities to replicate the collection as widely as possible.

7. Conclusion

For science and engineering, as in life, there is “no free lunch.” The ability to organize, analyze, and utilize today’s deluge of data to drive research, education, and practice incurs costs for management, curation, preservation and distribution. These costs must be included in project budgeting and infrastructure planning, and are non-zero.

They are better than the alternative, however. Without responsible data planning as part of the process of project development, organization, and management, valuable data collections will be lost, damaged, or become unavailable. Lack of planning can incur substantive cost for resurrecting, re-generating, or rescuing a data collection, and without critical data, science and engineering advancement and discovery can be slowed. At the end of the day, the costs of thoughtful and strategic data management, curation and preservation are a bargain.

Acknowledgements
The authors would like to thank Helen Berman, Phil Bourne, and Richard Moore for their comments and improvements.

¹ http://www.pdb.org/pdb/Welcome.do
² http://psidonline.isr.umich.edu/
³ http://www.us-vo.org/
⁴ http://php.aaas.org/meetings/MPE_01.php?detail=1110
⁵ http://www.usc.edu/schools/college/vhi/
⁶ http://www.archives.gov/
⁷ http://www.loc.gov/index.html