Data management systems for digital collections support a variety of functions, including sharing, publishing, preserving, and analyzing data.
Data management system components may include the following:
- Authentication system for validating the identity of users
- Authorization software system for controlling updates
- Disk cache for interactive access
- Archival storage for long term preservation of data
- Databases for organizing metadata for each collection
- Data grids for managing distributed data
- High-performance networks between the disk caches and archives, capable of supporting parallel I/O streams
- Workflow software for managing curation and preservation processes such as checksum validation, synchronization of replicas, and transformative migration of encoding formats to new standards
- Portals for managing access to the collections, including digital library services for presenting the data
- Analysis systems for applying classification and categorization filters to the data
- Knowledge systems for managing the resulting relationships or inferences that have been made on the data
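As an illustration of the checksum validation step that the workflow software manages, the following is a minimal sketch, not the implementation used by any particular data grid. It assumes that each replica is a local file and that a SHA-256 checksum was recorded at ingest; the function names `sha256_of` and `find_corrupt_replicas` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archive files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_replicas(replica_paths, expected_checksum):
    """Return the replicas whose current checksum differs from the recorded one."""
    return [p for p in replica_paths if sha256_of(p) != expected_checksum]
```

A replica flagged by such a check would then be replaced from a valid copy, which is the replica synchronization step named in the list above.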
The costs of such data management environments are driven by the need for integrity (e.g., multiple replicas, validation of checksums over time, management of access controls), authenticity (management of provenance information to understand data context), scalability (management of the future number of files and amount of storage), and access (support for interactive versus batch access). By minimizing capabilities, the cost can be reduced. For a system that promotes the advancement of science, the ability to support intensive analysis is key. For a system that ensures high reliability, the risk of data loss must be minimized.
Component costs of the data management system include the costs of installation, maintenance, and evolution of the following:
- Authentication and authorization systems. Data grids are able to use the Grid Security Infrastructure to authenticate users. This requires a Certificate Authority and an associated dedicated server.
- Disk storage. Today, disk storage costs between $1000-$5000 per Terabyte per year (amortized cost of capital equipment and labor to administer the storage) depending on the type of disk and built-in redundancy (e.g., mirroring).
- Tape storage. Current tape archives cost about $400 per Terabyte per year (amortized cost of media, tape silos, archive software and labor to administer the storage). The amortized media cost is about 1/6 of this annual cost. However, note that this is the cost for a single copy. Three copies are preferred to minimize risk of data loss (original plus two replicas). If one of these copies is kept on disk, then only two tape copies are needed, preferably stored at different locations on different vendor equipment under different administrative control.
- Database. The metadata used for provenance and discovery is stored on-line in a database. A single database can support multiple database instances, allowing all of the collections to be managed using the same software. The cost of management of a database instance is about $5000 per year, for a database that supports 15-20 instances. The cost of the database software depends upon the vendor, with open-source databases requiring more local expertise to run and commercial databases requiring a service contract.
- Data grid. The software that supports distributed data is freely available to academic institutions, and the administrative support required is similar to that for the database.
- High-performance network. Note that the movement of a Terabyte of data per day is equivalent to a sustained data rate of 11.6 MB/second. Current collections at SDSC are growing in size at the rate of 1-2 Terabytes per day. Replicating and accessing this data requires networks that sustain 25-50 MB/second.
- Workflow software. In the past, workflows have run on the same compute platforms as the applications and data analyses. For the manipulation and analysis of 10 Terabytes of data per day, 10-Teraflop systems are required.
- User portals. The server supporting the portal or digital library interface is typically accessed over the web. This implies the need to support JSR 168 Java portlets as well as web servers. This is typically done on a server separate from the database server.
- Knowledge system. Modern digital library middleware enables the management of relationships on the data. This system typically runs on the same server as the database.
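The storage and network figures above can be combined into a rough estimate. The sketch below is illustrative only: it assumes decimal units (1 Terabyte = 1,000,000 MB) and a simple linear cost model. The function names are hypothetical, and the 100 Terabyte collection in the usage example is an assumed size, not a figure from this document.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(terabytes_per_day):
    """Sustained data rate in MB/second needed to move a daily volume of data."""
    # Assumes decimal units: 1 Terabyte = 1,000,000 MB.
    return terabytes_per_day * 1_000_000 / SECONDS_PER_DAY


def annual_storage_cost(terabytes, disk_copies, tape_copies,
                        disk_cost_per_tb, tape_cost_per_tb=400):
    """Annual storage cost in dollars for a given number of disk and tape copies.

    Costs are per Terabyte per year. The $400 tape default and the
    $1000-$5000 disk range come from the list above.
    """
    per_tb = disk_copies * disk_cost_per_tb + tape_copies * tape_cost_per_tb
    return terabytes * per_tb


if __name__ == "__main__":
    # One Terabyte per day, as in the network item above.
    print(round(sustained_rate_mb_per_s(1), 1))       # 11.6
    # Hypothetical 100 Terabyte collection: one disk copy at the low end
    # of the disk range ($1000) plus two tape copies.
    print(annual_storage_cost(100, 1, 2, 1000))       # 180000
```

The one-disk, two-tape configuration in the usage example follows the three-copy recommendation in the tape storage item. Database, network, and staffing costs are not included in this estimate.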