CEDPS is also developing the Data Placement Service (DPS) that will perform data transfer operations using MOPS. For data-intensive scientific applications running in a distributed environment, the placement of data onto storage systems can have a significant impact on the performance of scientific computations and on the reliability and availability of data sets. These scientific applications may produce and consume terabytes or petabytes of data stored in millions of files or objects, and they may run complex computational workflows consisting of millions of interdependent tasks. A variety of data placement algorithms could be used, depending on the requirements of a scheduler or workflow management system as well as the data distribution goals of the scientific collaboration, or Virtual Organization (VO). For example, a placement algorithm might distribute data in a way that is advantageous for application or workflow execution by placing data sets near high-performance computing resources so that they can be staged into computations efficiently; by moving data off computational resources quickly when computation is complete; and by replicating data sets for performance and reliability. These goals might be considered policies of the workflow manager or VO, and a policy-driven data placement service is responsible for replicating and distributing data items in conformance with these policies or preferences. A data placement service could also make use of hints from a workflow management system about applications and their access patterns, for example, whether a set of files is likely to be accessed together and therefore should be replicated together on storage systems.
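To make the idea of a placement policy concrete, the sketch below shows one way such a policy might be expressed in Java. All of the type and method names (StorageSite, PlacementPolicy, NearComputePolicy) are hypothetical illustrations rather than part of the actual DPS interface; the example simply encodes the preference described above: keep a fixed number of replicas of each data set, favoring storage adjacent to high-performance computing resources.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a policy-driven placement decision; none of these
// types belong to the actual CEDPS Data Placement Service API.
public class PlacementPolicyExample {

    /** A candidate storage site, flagged for proximity to compute resources. */
    record StorageSite(String url, boolean nearCompute) {}

    /** A VO-level placement policy: choose where copies of a data set should live. */
    interface PlacementPolicy {
        List<StorageSite> choose(String dataSetId, List<StorageSite> candidates);
    }

    /**
     * Example policy: keep replicaCount copies, preferring sites close to
     * high-performance computing resources so data can be staged efficiently.
     */
    static class NearComputePolicy implements PlacementPolicy {
        private final int replicaCount;

        NearComputePolicy(int replicaCount) { this.replicaCount = replicaCount; }

        @Override
        public List<StorageSite> choose(String dataSetId, List<StorageSite> candidates) {
            List<StorageSite> chosen = new ArrayList<>();
            // First pass: prefer sites adjacent to compute resources.
            for (StorageSite s : candidates) {
                if (chosen.size() < replicaCount && s.nearCompute()) chosen.add(s);
            }
            // Second pass: fill any remaining replica slots for reliability.
            for (StorageSite s : candidates) {
                if (chosen.size() < replicaCount && !chosen.contains(s)) chosen.add(s);
            }
            return chosen;
        }
    }

    public static void main(String[] args) {
        List<StorageSite> sites = List.of(
                new StorageSite("gsiftp://hpc-a.example.org/data", true),
                new StorageSite("gsiftp://archive.example.org/data", false),
                new StorageSite("gsiftp://hpc-b.example.org/data", true));
        PlacementPolicy policy = new NearComputePolicy(2);
        System.out.println(policy.choose("dataset-42", sites));
    }
}
```

Framing each policy behind a small interface like this is one way a VO could swap placement strategies without changing the service itself.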
To demonstrate the effectiveness of intelligent data placement, we integrated the Pegasus workflow management system14 from the USC Information Sciences Institute with the Globus Data Replication Service,15 which provides efficient replication and registration of data sets. We demonstrated8 that hints from the workflow management system reduced the execution time of scientific workflows when we successfully prestaged the necessary data onto appropriate computational resources.
This initial work has led us to design a general, asynchronous DPS that will operate on behalf of a VO and accept data placement requests from clients that reflect, for example, grouping of files, order of file requests, etc. Figure 2 illustrates the operation of a DPS for stage-in requests issued by a workflow management system. We also plan to incorporate configurable policies into the data placement service that reflect the data distribution policies of a particular VO. Our goal is to produce a placement service that manages the competing demands of VO data distribution policies, data staging requests from multiple competing workflows, and additional on-demand data requests from other clients.
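As a rough illustration of what such a request might carry, the following Java sketch models a stage-in request that preserves the file grouping and ordering hints supplied by a workflow manager. The field and type names are invented for exposition and do not reflect the actual DPS request schema.

```java
import java.util.List;

// Hypothetical sketch of the information a client such as a workflow manager
// might include in an asynchronous stage-in request; the field names are
// illustrative, not the actual DPS request schema.
public class PlacementRequestExample {

    /** Files likely to be accessed together, which should be co-located. */
    record FileGroup(List<String> logicalFileNames) {}

    /** A stage-in request: ordered groups of files bound for a destination site. */
    record StageInRequest(
            String requestId,
            String destinationSite,           // e.g., a GridFTP endpoint
            List<FileGroup> groupsInOrder) {} // order reflects workflow task order

    public static void main(String[] args) {
        StageInRequest req = new StageInRequest(
                "req-001",
                "gsiftp://hpc-a.example.org/scratch/",
                List.of(new FileGroup(List.of("lfn:input-part1", "lfn:input-part2")),
                        new FileGroup(List.of("lfn:calibration"))));
        // A DPS would queue this request asynchronously, resolve logical names
        // to replicas, and transfer the groups in order, keeping each group
        // together on one storage system.
        System.out.println(req);
    }
}
```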
We have implemented an initial version of the data placement service with a planned software release in October 2007. This implementation modifies and significantly extends the existing Globus Data Replication Service. The implementation uses several Globus components: the Java WS Core, which provides the basic infrastructure for web service deployment and generic operations such as state management, queries, and endpoint references; the Globus Replica Location Service, which provides registration and discovery of data items; the GridFTP data transfer service, for secure and efficient data staging operations; and the Grid Security Infrastructure, for secure access to resources in the distributed environment.
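The sketch below suggests, in simplified form, how these components might cooperate during a single stage-in operation. The ReplicaCatalog and TransferService interfaces are stand-ins for the real Globus RLS and GridFTP client APIs, whose actual signatures differ; only the overall flow, resolving a logical name to registered replicas and then transferring from a working replica, is meant to be representative.

```java
import java.util.List;

// Illustrative only: ReplicaCatalog and TransferService stand in for the
// actual Globus RLS and GridFTP client interfaces.
public class StagingSketch {

    interface ReplicaCatalog {
        // Map a logical file name to the physical replicas registered in RLS.
        List<String> lookupReplicas(String logicalFileName);
    }

    interface TransferService {
        // Perform a (possibly third-party) GridFTP transfer between URLs.
        void transfer(String sourceUrl, String destinationUrl);
    }

    // Stage one logical file to a destination, trying replicas in order.
    static void stageIn(ReplicaCatalog rls, TransferService gridFtp,
                        String lfn, String destinationUrl) {
        for (String replica : rls.lookupReplicas(lfn)) {
            try {
                gridFtp.transfer(replica, destinationUrl);
                return; // stop after the first replica that transfers cleanly
            } catch (RuntimeException e) {
                // Transfer failed; fall through and try the next replica.
            }
        }
        throw new IllegalStateException("no usable replica for " + lfn);
    }
}
```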