CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Jennifer M. Schopf, University of Chicago and Argonne National Laboratory
Ann Chervenak, University of Southern California
Ian Foster, University of Chicago and Argonne National Laboratory
Dan Fraser, University of Chicago and Argonne National Laboratory
Dan Gunter, Lawrence Berkeley National Laboratory
Nick LeRoy, University of Wisconsin
Brian Tierney, Lawrence Berkeley National Laboratory

2. Current Data Placement Approaches

Large quantities of data must frequently be moved among computers in a petascale computing environment, whether because there are insufficient resources to perform analysis on the platform that generated the data, because analysis requires specialized resources or involves comparison with other data, or because the data must be published, that is, moved and augmented with metadata, to facilitate use by the community.

Our data placement work addresses three classes of application requirements. First, staging to and from active computations and workflows requires placement of data at advantageous locations. By using a data placement service to perform staging operations asynchronously with respect to a workflow or execution engine, rather than explicitly staging data at run time, we hope to demonstrate improved application performance, as suggested in simulations7 and initial measurements of workflow execution.8 A current example of where these methods can be applied is the visualization of the results of a combustion simulation at NERSC, which produces 100 TB of data. Smarter placement of the data during simulation execution will enable better use of the visualization component and let scientists understand the resulting data in a more timely fashion.
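
To make the idea concrete, the sketch below shows how a workflow engine might hand staging requests to a placement service and synchronize with them only when a later step needs the data. The PlacementService class, its submit interface, and the file paths are illustrative assumptions rather than an existing CEDPS component, and a short sleep stands in for the actual wide-area transfer.

    # Minimal sketch of asynchronous staging, decoupled from the workflow engine.
    # PlacementService and the paths below are hypothetical placeholders.
    import queue
    import threading
    import time

    class PlacementService:
        """Accepts staging requests and performs them in the background."""

        def __init__(self):
            self._requests = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def submit(self, src, dest):
            """Queue a staging request and return an event to wait on later."""
            done = threading.Event()
            self._requests.put((src, dest, done))
            return done

        def _run(self):
            while True:
                src, dest, done = self._requests.get()
                # Placeholder for the real wide-area transfer (e.g., a GridFTP call).
                print(f"staging {src} -> {dest}")
                time.sleep(1.0)
                done.set()

    # Workflow engine: submit staging early, keep computing, block only when needed.
    service = PlacementService()
    staged = service.submit("/scratch/sim/plane_0042.h5", "/viz/input/plane_0042.h5")
    # ... later simulation or workflow steps run here while the data moves ...
    staged.wait()  # synchronize only when the visualization step needs the file

The point of the sketch is that the workflow blocks only at the final wait, so data movement overlaps computation rather than sitting on the critical path.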

Second, archival storage is often the final location of data products that are staged out of a running application, and better data placement services can make archiving operations more efficient. When an application runs on a compute resource such as a cluster or supercomputer, data products must often be staged off the storage system associated with that computational resource onto more permanent secondary or archival storage. These stage-out operations can limit application performance, particularly if the compute resource is storage-limited; using an asynchronous data placement service to stage out data products should improve performance. For example, the team running the CCSM climate simulation code at ORNL wants to publish its output data to the Earth System Grid (ESG).9 They must both transfer the output data to an HPSS archive at NERSC (perhaps while the model is running) and register each file in a metadata catalog for ESG.
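
The sketch below illustrates such an asynchronous stage-out in the spirit of the ESG example: each output file is handed to a small worker pool that archives it and then registers it in a metadata catalog while the model keeps running. The transfer_to_archive function, the MetadataCatalog class, and the URLs and attributes are hypothetical placeholders, not the actual ESG or HPSS interfaces.

    # Minimal sketch: archive each output file and register it in a catalog,
    # overlapping with the running model. All names here are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    def transfer_to_archive(local_path, archive_url):
        """Placeholder for a GridFTP/HPSS transfer of one output file."""
        print(f"archiving {local_path} -> {archive_url}")

    class MetadataCatalog:
        """Placeholder for an ESG-style metadata catalog client."""
        def register(self, logical_name, archive_url, attrs):
            print(f"registering {logical_name} at {archive_url}: {attrs}")

    def stage_out(local_path, archive_url, catalog, attrs):
        transfer_to_archive(local_path, archive_url)
        catalog.register(local_path.rsplit("/", 1)[-1], archive_url, attrs)

    catalog = MetadataCatalog()
    with ThreadPoolExecutor(max_workers=4) as pool:
        for step in range(3):
            local = f"/scratch/ccsm/history.{step:04d}.nc"
            remote = f"hpss://archive.nersc.gov/ccsm/history.{step:04d}.nc"
            pool.submit(stage_out, local, remote, catalog,
                        {"model": "CCSM", "step": step})

Because both the transfer and the catalog registration happen off the critical path, the compute resource's local storage can be drained without stalling the model.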

Finally, we are interested in data placement services that maintain required levels of redundancy in a distributed environment. For example, it might be the policy of the data placement service to ensure that there are always three copies of every data item stored in the system. If the number of replicas of any data item falls below this threshold, the placement service is responsible for creating additional replicas to meet this requirement. An example of where this requirement arises in practice is the data produced by the CMS experiment at the LHC (at a sustained rate of 400 MB/s), which must be delivered to a Tier-1 site in the US for further processing and then distributed among several US and 20 non-US Tier-2 sites.
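
A minimal sketch of such a redundancy policy follows: a maintenance pass scans a replica index and creates copies at additional sites until every item has at least three replicas. The index structure, the site list, and the create_replica call are illustrative assumptions standing in for a replica catalog (such as the Replica Location Service) and a transfer service.

    # Minimal sketch of a "keep at least N copies" placement policy.
    # The replica index and site names below are hypothetical placeholders.
    MIN_REPLICAS = 3
    SITES = ["fnal.gov", "wisc.edu", "ucsd.edu", "mit.edu", "purdue.edu"]

    def create_replica(replica_index, item, site):
        """Placeholder for a transfer that copies the item to a new site."""
        print(f"replicating {item} to {site}")
        replica_index[item].add(site)

    def enforce_policy(replica_index):
        for item, replicas in replica_index.items():
            for site in SITES:
                if len(replicas) >= MIN_REPLICAS:
                    break
                if site not in replicas:
                    create_replica(replica_index, item, site)

    # One dataset has fallen to a single copy; the pass restores three.
    index = {"/cms/run42/events.root": {"fnal.gov"}}
    enforce_policy(index)
    print(index)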

Such scenarios, of which we could give many other examples across a wide range of applications, can involve many of the following six elements:

  1. Data registration and metadata tagging as well as data movement;
  2. Bulk data transfer over high-speed long-haul networks between different sources and sinks;
  3. Coordinated data movement across multiple sources, destinations, and intermediate locations, including parallel file systems, virtual disks, and hierarchical storage, and among multiple users and applications;
  4. Failure reduction techniques, such as storage reservation and data replication;
  5. Failure detection and recovery techniques, including online monitoring and operation retry, to handle multiple failure modalities (a retry sketch follows this list); and
  6. Predictability and coordinated scheduling in spite of variations in load and competing use of storage space, bandwidth to the storage system, and network bandwidth.
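
As a concrete illustration of element 5, the sketch below detects a failed transfer, emits monitoring events, and retries with a simple backoff. The perform_transfer function (which here merely simulates a transient failure) and the log format are hypothetical stand-ins for a real movement tool and monitoring framework.

    # Minimal sketch of failure detection and recovery: monitor, retry, back off.
    # perform_transfer is a placeholder that simulates a transient failure.
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

    def perform_transfer(src, dest):
        raise IOError("connection reset")  # stand-in for a real transfer attempt

    def transfer_with_retry(src, dest, attempts=3, backoff=2.0):
        for attempt in range(1, attempts + 1):
            try:
                logging.info("transfer start %s -> %s (attempt %d)", src, dest, attempt)
                perform_transfer(src, dest)
                logging.info("transfer done %s -> %s", src, dest)
                return True
            except IOError as err:
                logging.warning("transfer failed: %s (attempt %d/%d)", err, attempt, attempts)
                time.sleep(backoff * attempt)
        return False

    transfer_with_retry("gsiftp://src.example.org/data", "gsiftp://dst.example.org/data")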

To summarize the motivation for CEDPS in a sentence: not only must we be able to transfer data and manage end-point storage systems and resource managers, but we must also be able to support the coordinated orchestration of data across many community resources.

Currently available tools address portions of this functionality. Basic high-performance data transfer (2) is supported by GridFTP,10 which provides fast performance through parallelism and striping across data sources (a minimal invocation is sketched at the end of this section). The Replica Location Service and associated Globus data services11 can provide basic ways to look up where a replica is stored, but metadata tagging (1) is generally handled by application-specific tools. The NeST12 and dCache13 storage management services provide disk-side support for data placement and some of the reliability and error prevention required (4), but not the broader coordinated data movement (3) needed by today’s applications. Failure detection (5) and performance prediction (6) are widely considered open areas of research. In general, these requirements go well beyond our current data transfer and storage resource management capabilities. We will discuss the ways in which our new technology addresses these six elements in the following sections.
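
As a concrete footnote to element 2, a GridFTP transfer can be driven from a script via the standard globus-url-copy client; in the sketch below the -p option requests parallel TCP streams. The endpoint URLs and stream count are illustrative assumptions, and the exact options should be checked against the installed client.

    # Minimal sketch of scripting a bulk GridFTP transfer with parallel streams.
    # Endpoint URLs and the degree of parallelism are illustrative assumptions.
    import subprocess

    def gridftp_copy(src_url, dest_url, parallelism=8):
        cmd = ["globus-url-copy", "-p", str(parallelism), src_url, dest_url]
        subprocess.run(cmd, check=True)

    gridftp_copy("gsiftp://dtn.src.example.org/scratch/run/output.h5",
                 "gsiftp://dtn.dst.example.org/archive/run/output.h5")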
