CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Jennifer M. Schopf, University of Chicago and Argonne National Laboratory
Ann Chervenak, University of Southern California
Ian Foster, University of Chicago and Argonne National Laboratory
Dan Fraser, University of Chicago and Argonne National Laboratory
Dan Gunter, Lawrence Berkeley National Laboratory
Nick LeRoy, University of Wisconsin
Brian Tierney, Lawrence Berkeley National Laboratory

4. The CEDPS Data Placement Service

CEDPS is also developing the Data Placement Service (DPS) that will perform data transfer operations using MOPS. For data-intensive scientific applications running in a distributed environment, the placement of data onto storage systems can have a significant impact on the performance of scientific computations and on the reliability and availability of data sets. These scientific applications may produce and consume terabytes or petabytes of data stored in millions of files or objects, and they may run complex computational workflows consisting of millions of interdependent tasks.

A variety of data placement algorithms could be used, depending on the requirements of a scheduler or workflow management system as well as the data distribution goals of the scientific collaboration, or Virtual Organization (VO). For example, a placement algorithm might distribute data in a way that is advantageous for application or workflow execution by placing data sets near high-performance computing resources so that they can be staged into computations efficiently; by moving data off computational resources quickly when computation is complete; and by replicating data sets for performance and reliability.

These goals might be considered policies of the workflow manager or VO, and a policy-driven data placement service is responsible for replicating and distributing data items in conformance with these policies or preferences. A data placement service could also make use of hints from a workflow management system about applications and their access patterns, for example, whether a set of files is likely to be accessed together and therefore should be replicated together on storage systems.
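The policy-driven placement described above can be sketched as a small interface: a VO policy examines candidate storage elements and selects replica targets. The names below (`PlacementPolicy`, `StorageElement`, `NearComputePolicy`) are illustrative only and are not part of the CEDPS or Globus software.

```java
import java.util.List;

// Hypothetical sketch of policy-driven data placement; the types here are
// illustrative, not the actual CEDPS/Globus interfaces.
public class PlacementSketch {

    /** A candidate storage element with a crude distance metric to a compute site. */
    record StorageElement(String url, int hopsToCompute, double freeTb) {}

    /** A VO-level placement policy decides where replicas of a data set should go. */
    interface PlacementPolicy {
        List<StorageElement> choose(List<StorageElement> candidates, int replicaCount);
    }

    /** Example policy: keep N replicas on the storage closest to the compute resources. */
    static class NearComputePolicy implements PlacementPolicy {
        public List<StorageElement> choose(List<StorageElement> candidates, int replicaCount) {
            return candidates.stream()
                    .sorted((a, b) -> Integer.compare(a.hopsToCompute(), b.hopsToCompute()))
                    .limit(replicaCount)
                    .toList();
        }
    }

    public static void main(String[] args) {
        List<StorageElement> sites = List.of(
                new StorageElement("gsiftp://se1.example.org", 3, 10.0),
                new StorageElement("gsiftp://se2.example.org", 1, 5.0),
                new StorageElement("gsiftp://se3.example.org", 2, 8.0));
        PlacementPolicy policy = new NearComputePolicy();
        for (StorageElement se : policy.choose(sites, 2)) {
            System.out.println(se.url());
        }
    }
}
```

Other policies from the text (evacuate data after computation, replicate for reliability) would be further implementations of the same interface, letting the VO swap policies by configuration.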

To demonstrate the effectiveness of intelligent data placement, we integrated the Pegasus workflow management system [14] from the USC Information Sciences Institute with the Globus Data Replication Service [15], which provides efficient replication and registration of data sets. We demonstrated [8] that using hints from the workflow management system reduced the execution time of scientific workflows when we were able to successfully prestage the necessary data onto appropriate computational resources.

This initial work has led us to design a general, asynchronous Data Placement Service (DPS) that operates on behalf of a virtual organization and accepts data placement requests from clients reflecting, for example, the grouping of files and the order of file requests. Figure 2 illustrates the operation of the DPS for stage-in requests issued by a workflow management system. We also plan to incorporate configurable policies into the service that reflect the data distribution policies of a particular VO. Our goal is to produce a placement service that manages the competing demands of VO data distribution policies, data staging requests from multiple concurrent workflows, and additional on-demand data requests from other clients.
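The asynchronous, request-based operation described above can be sketched as a service that accepts a placement request (here, a group of files that should land together on one site) and returns immediately, performing transfers in the background. All names and the single-worker queue are assumptions for illustration; this is not the actual DPS interface.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of an asynchronous placement request queue
// (illustrative names, not the real CEDPS DPS API).
public class AsyncPlacementSketch {

    /** A stage-in request: a group of files to be placed together on one target site. */
    record PlacementRequest(String requestId, List<String> fileGroup, String targetSite) {}

    static class DataPlacementService {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        /** Accept a request and return immediately; transfers run asynchronously. */
        Future<String> submit(PlacementRequest req) {
            return worker.submit(() -> {
                // In the real service this step would issue MOPS/GridFTP transfers;
                // here we only report what would be staged.
                return "staged " + req.fileGroup().size()
                        + " files of " + req.requestId() + " to " + req.targetSite();
            });
        }

        void shutdown() { worker.shutdown(); }
    }

    public static void main(String[] args) throws Exception {
        DataPlacementService dps = new DataPlacementService();
        Future<String> result = dps.submit(new PlacementRequest(
                "wf-42", List.of("f1.dat", "f2.dat", "f3.dat"), "cluster-A"));
        System.out.println(result.get()); // client blocks only when it needs the outcome
        dps.shutdown();
    }
}
```

Keeping the file group in one request is what lets a grouping hint ("these files are accessed together") translate into co-located replicas.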

Figure 2. A workflow management system acts as a client of the data placement service, issuing requests for staging of data sets. The DPS issues MOPS data transfers from appropriate storage elements to the compute cluster(s) on which workflow execution will take place.

We have implemented an initial version of the data placement service, with a planned software release in October 2007. This implementation modifies and significantly extends the existing Globus Data Replication Service. It uses several Globus components: the Java WS Core, which provides the basic infrastructure for web service deployment and generic operation support such as state management, query operations, and endpoint references; the Globus Replica Location Service, which provides registration and discovery of data items; the GridFTP data transfer service, for secure and efficient data staging operations; and the Grid Security Infrastructure, for secure access to resources in the distributed environment.
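A stage-in operation threads these components together: the replica catalog resolves a logical file name to physical replica locations, and the transfer service moves a chosen replica to the compute site. The sketch below mirrors those roles with stand-in interfaces; `ReplicaCatalog` and `TransferService` are hypothetical placeholders for the Replica Location Service and GridFTP, not the real Globus APIs.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the stage-in path through the components named above.
// The interfaces stand in for the roles of the Globus RLS and GridFTP; they are
// illustrative, not actual Globus APIs.
public class StageInSketch {

    interface ReplicaCatalog {                     // role of the Replica Location Service
        List<String> locate(String logicalName);   // logical name -> physical replica URLs
    }

    interface TransferService {                    // role of GridFTP
        void transfer(String sourceUrl, String destUrl);
    }

    static void stageIn(String logicalName, String destUrl,
                        ReplicaCatalog catalog, TransferService transfers) {
        List<String> replicas = catalog.locate(logicalName);
        if (replicas == null || replicas.isEmpty()) {
            throw new IllegalStateException("no replica registered for " + logicalName);
        }
        transfers.transfer(replicas.get(0), destUrl); // naive choice: first replica
    }

    public static void main(String[] args) {
        Map<String, List<String>> catalog = Map.of(
                "lfn://climate/run7.nc", List.of("gsiftp://se1.example.org/run7.nc"));
        stageIn("lfn://climate/run7.nc", "gsiftp://cluster-A/scratch/run7.nc",
                catalog::get,
                (src, dst) -> System.out.println("transfer " + src + " -> " + dst));
    }
}
```

A production service would add replica selection (e.g., by network proximity), security contexts from GSI, and retry on failed transfers.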


Reference this article
Schopf, J. M., Chervenak, A., Foster, I., Fraser, D., Gunter, D., LeRoy, N., Tierney, B. "End-to-End Data Solutions for Distributed Petascale Science," CTWatch Quarterly, Volume 3, Number 4, November 2007. http://www.ctwatch.org/quarterly/articles/2007/11/end-to-end-data-solutions-for-distributed-petascale-science/
