CTWatch Quarterly » Designing and Supporting Data Management and Preservation Infrastructure

Designing and Supporting Data Management and Preservation Infrastructure

Fran Berman and Reagan Moore, San Diego Supercomputer Center

2. Digital Data Curation

Digital data curation focuses on generating descriptive metadata and validating the quality of the data. Digital data preservation focuses on characterizing data authenticity (provenance information) and managing data integrity across multiple generations of storage technology. An example of a curated community digital collection is the Protein Data Bank (PDB). The PDB is a global resource for structural information about proteins, maintained by the Worldwide Protein Data Bank (wwPDB). This organization is composed of the Research Collaboratory for Structural Bioinformatics (RCSB), a consortium consisting of groups at UCSD/SDSC and Rutgers; the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) in Hinxton, UK; and PDBj in Osaka, Japan.

When a user accesses a PDB data portal for information on HIV-1 protease, a target in the fight against AIDS, considerable infrastructure supports the request. Behind the scenes, the following components are involved in providing that information to the user:

Data are collected, annotated and validated at one of three wwPDB deposition sites; in the US, deposition takes place at the Rutgers site of the RCSB. The wwPDB has adopted the PDB Exchange Dictionary to standardize semantics and ensure uniform data; the dictionary provides the foundation for data exchange within the wwPDB and for delivery of a standard data representation to the public.
The acquisition, annotation and validation of PDB data require about 20 highly trained personnel as well as significant computational, storage and network resources. An individual PDB dataset may consist of more than 1,000 individual data items, some with as many as 1M instances. Data acquisition requires reliable, low-latency network connections and high-performance servers to deliver real-time data validation. The validation of these datasets and the computation of their derived features are computationally intensive and are performed on clusters of Linux and Solaris servers at each of the wwPDB sites. PDB data that are ready for public release are transferred to the RCSB PDB site at SDSC, the main US PDB distribution site.
The RCSB PDB data portal accessed by the user is served from a Linux cluster at SDSC using software written and maintained by the RCSB PDB team in collaboration with SDSC researchers. The portal comprises a commodity Linux cluster behind dual Cisco load balancers that handle traffic from 10,000 scientists a day making 2M page requests. Each cluster node holds a redundant copy of the PDB (approx. 1TB), and state is maintained using JBoss and the load balancers. Failover is provided by a third-party service, UltraDNS, which redirects traffic to a small cluster at Rutgers University in New Jersey on the rare occasions when power or networking is lost to all nodes at SDSC. In this way 99.99% uptime is maintained. The RCSB PDB previously maintained mirrors around the world for fast access, but current high-speed networking to SDSC is good enough that these are no longer needed for this purpose.
A group of nine staff and students in the PDB group at UCSD develops access and mining tools for the PDB community. In addition to the PDB site at UCSD, the wwPDB sites in Europe and Japan provide complementary views of the common set of archival data.
The collection at SDSC alone currently requires 20TB of storage.
Additional costs of maintaining the curated collection include business office and HR costs, the costs of administering the project by its leadership and management, the costs of advisory and evaluation input (travel, time, etc.), and other related costs.
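
The traffic and availability figures quoted for the portal are easier to appreciate after a little arithmetic. The Python sketch below is only a back-of-the-envelope illustration: it assumes the load is spread evenly over a 24-hour day and that a year has 365 days, and its function names are hypothetical rather than part of any PDB software.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the text.
# Assumptions (not from the source): load is uniform over the day,
# and a year has 365 days.

SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

PAGE_REQUESTS_PER_DAY = 2_000_000  # "2M page requests" a day
USERS_PER_DAY = 10_000             # "10,000 scientists a day"
UPTIME_FRACTION = 0.9999           # "99.99% uptime"


def average_requests_per_second(requests_per_day: float) -> float:
    """Mean request rate if traffic were perfectly uniform."""
    return requests_per_day / SECONDS_PER_DAY


def pages_per_user(requests_per_day: float, users_per_day: float) -> float:
    """Mean number of page requests made by each user per day."""
    return requests_per_day / users_per_day


def downtime_minutes_per_year(uptime_fraction: float) -> float:
    """Downtime budget implied by an uptime fraction."""
    return (1.0 - uptime_fraction) * MINUTES_PER_YEAR


if __name__ == "__main__":
    rate = average_requests_per_second(PAGE_REQUESTS_PER_DAY)
    pages = pages_per_user(PAGE_REQUESTS_PER_DAY, USERS_PER_DAY)
    downtime = downtime_minutes_per_year(UPTIME_FRACTION)
    print(f"average load: {rate:.1f} requests/s")
    print(f"pages per user: {pages:.0f}")
    print(f"downtime budget: {downtime:.1f} minutes/year")
```

Under these assumptions the portal averages roughly 23 page requests per second, each user views about 200 pages a day, and 99.99% uptime leaves a budget of under an hour of downtime per year. Real traffic is bursty, so peak load would be higher than the average.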

The RCSB PDB database infrastructure at UCSD is coded in Java and built entirely from public domain software. A MySQL database at SDSC instantiates the schema, with middle and presentation layers built around Hibernate. Software development focuses on systems that organize the material and provide services to support discovery, browsing, and presentation. The RCSB PDB infrastructure was developed to allow the collection to expand continuously and to allow new entries to be ingested and evaluated.
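
The layering described above (a database schema, a middle layer, and a presentation layer, with ingestion and evaluation of new entries) can be illustrated with a toy sketch. This is not the RCSB PDB code: the real system is written in Java on MySQL and Hibernate, whereas this sketch uses Python with an in-memory SQLite database as a stand-in, and every class, function, table, and record name is hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Entry:
    """A hypothetical, drastically simplified collection entry."""
    entry_id: str
    title: str


class EntryStore:
    """Persistence layer: owns the schema and all SQL."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS entry ("
            "entry_id TEXT PRIMARY KEY, title TEXT NOT NULL)"
        )

    def add(self, entry: Entry) -> None:
        # The connection context manager commits on success.
        with self.conn:
            self.conn.execute(
                "INSERT INTO entry (entry_id, title) VALUES (?, ?)",
                (entry.entry_id, entry.title),
            )

    def search(self, keyword: str) -> list:
        rows = self.conn.execute(
            "SELECT entry_id, title FROM entry "
            "WHERE title LIKE ? ORDER BY entry_id",
            (f"%{keyword}%",),
        ).fetchall()
        # Map rows back to objects, the job an ORM such as Hibernate does.
        return [Entry(*row) for row in rows]


def validate(entry: Entry) -> list:
    """Middle layer: illustrative checks run before an entry is accepted."""
    problems = []
    if not entry.entry_id.strip():
        problems.append("missing identifier")
    if not entry.title.strip():
        problems.append("missing title")
    return problems


def ingest(store: EntryStore, entry: Entry) -> list:
    """Evaluate a new entry and store it only if it passes validation."""
    problems = validate(entry)
    if not problems:
        store.add(entry)
    return problems


def render(entries: list) -> str:
    """Presentation layer: format search results for display."""
    return "\n".join(f"{e.entry_id}: {e.title}" for e in entries)


if __name__ == "__main__":
    store = EntryStore(sqlite3.connect(":memory:"))
    ingest(store, Entry("DEMO1", "HIV-1 protease example record"))
    ingest(store, Entry("DEMO2", "Unrelated example record"))
    print(render(store.search("protease")))
```

Keeping SQL inside the store and formatting inside `render` means the collection can grow, or the storage engine can change, without touching the code that evaluates or displays entries.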

In short, accommodating a request for information about HIV-1 protease from the PDB requires substantial software, hardware, human, and funding support. At a recent AAAS Panel on data,⁴ RCSB PDB Director Helen Berman estimated that in 2005 more than one billion dollars of research funding was spent to generate the data that were collected, curated and distributed by the PDB.
