While largely transparent to end users, any national grid facility must be supported by a deep foundation of operational infrastructure. This need is particularly acute for facilities such as TeraGrid that operate national-scale resources purchased and supported on behalf of government agencies, which require both accountability for the use of those resources and an open peer-review process for allocating access to them. In addition to resource allocation and management, the operational services discussed here include networking, security coordination, and an operations center.
Resource Allocation and Management
Many national-scale grid consortia operate “best-effort” services that provide access to excess capacity to stakeholder user groups. In contrast, TeraGrid operates resources on behalf of broad national communities, and these resources are allocated by formal processes. Specifically, resources are allocated by a peer-review committee that meets quarterly to review user requests for allocations. (Allocations are specified in service units, analogous to CPU hours.) Supporting this nationally peer-reviewed system requires a distributed accounting system that works in concert with authentication and authorization systems: each project's allocation is debited according to the usage of the users whom the project's principal investigator has authorized. In addition, support for the allocation review process itself requires a proposal request and review infrastructure, databases for users and usage, and information exchange systems for usage data and user credentials. TeraGrid inherited much of this infrastructure from its predecessor, the NSF Partnerships for Advanced Computational Infrastructure (PACI) program, in which several million dollars of software development was invested during the past decade.
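The debit logic described above can be illustrated with a minimal sketch. It is not TeraGrid's accounting software: the names `Project`, `AllocationError`, `authorize`, and `debit` are hypothetical, and the policy of rejecting a charge that exceeds the remaining balance is an assumption made only for the example.

```python
from dataclasses import dataclass, field


class AllocationError(Exception):
    """Raised when a charge cannot be applied to a project (hypothetical)."""


@dataclass
class Project:
    # All field names are illustrative assumptions, not a real schema.
    name: str
    principal_investigator: str
    balance_su: float  # remaining allocation, in service units
    authorized_users: set = field(default_factory=set)

    def authorize(self, requester: str, user: str) -> None:
        """Add a user to the project; only the PI may do so."""
        if requester != self.principal_investigator:
            raise AllocationError(f"{requester} is not the PI of {self.name}")
        self.authorized_users.add(user)

    def debit(self, user: str, service_units: float) -> float:
        """Charge usage to the allocation and return the new balance."""
        if user != self.principal_investigator and user not in self.authorized_users:
            raise AllocationError(f"{user} is not authorized on {self.name}")
        if service_units < 0:
            raise AllocationError("usage must be non-negative")
        # Assumed policy: refuse charges beyond the remaining balance.
        if service_units > self.balance_su:
            raise AllocationError(f"{self.name} has insufficient service units")
        self.balance_su -= service_units
        return self.balance_su
```

In this sketch, authorization is checked before any balance change, so a rejected charge leaves the allocation untouched. A real deployment would take the user's identity from the authentication system rather than from a plain string.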
The operation of the TeraGrid resource allocation and management infrastructure requires four GIG FTEs for coordination along with seven FTEs at resource provider facilities to support the various databases and proposal support systems, and to perform local accounting integration with the distributed TeraGrid system.
Security Coordination
Security management in a national grid facility requires a high degree of coordination among security professionals at many sites. TeraGrid security coordination is based on a set of agreed-upon policies ranging from minimum security practices to change management and protocols for incident response and notification.
The GIG team coordinates the distributed security team for general communication, incident response management, and analysis of the security impact of system changes (e.g., new software or new systems). In addition, the provision of distributed authentication and authorization services for individual users and groups (or “virtual organizations” [27]), as is required in grid facilities, is a significant part of the security coordination effort.
Security coordination across TeraGrid requires two GIG FTEs working with three FTEs at resource provider sites, with participation from additional security operations staff from each resource provider organization. While participation in a national grid security coordination team requires investment of time on the part of local security staff, the benefits to the site are high in terms of training, assistance, and early notification of events that might impact the local site.
Networking
Many national-scale grid facilities rely on existing Internet connectivity. In contrast, TeraGrid operates a dedicated network. Irrespective of the networking strategy, effort is needed to optimize services over networks between resource provider locations, particularly with respect to data movement over networks with a high bandwidth-delay product. In addition, distributed applications and services often require assistance from networking experts at multiple sites. Thus, a national-scale grid facility such as TeraGrid requires a networking team consisting of contacts from each resource provider site. As with the security team, the benefits to the site far outweigh the time investment on the part of local networking staff.
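To see why high bandwidth-delay product paths need tuning, recall that the bandwidth-delay product (link bandwidth multiplied by round-trip time) is the amount of data that must be in flight to keep a path full, and that a window-limited transfer cannot exceed one window per round trip. The sketch below works through that arithmetic. The function names are hypothetical, and the 10 Gb/s bandwidth, 60 ms round-trip time, and 64 KiB window are assumed example values, not measurements of the TeraGrid network.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill a path of this bandwidth and RTT."""
    return bandwidth_bits_per_s * rtt_s / 8


def window_limited_throughput_bits_per_s(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    # Assumed example values for illustration only.
    bandwidth = 10e9          # 10 Gb/s path
    rtt = 0.060               # 60 ms round-trip time
    small_window = 64 * 1024  # 64 KiB window

    bdp = bandwidth_delay_product_bytes(bandwidth, rtt)
    limit = window_limited_throughput_bits_per_s(small_window, rtt)

    print(f"Bandwidth-delay product: {bdp / 1e6:.1f} MB")
    print(f"Throughput with a 64 KiB window: {limit / 1e6:.1f} Mb/s")
```

With these assumed values, the path needs roughly 75 MB in flight to stay full, while a single stream limited to a 64 KiB window tops out below 10 Mb/s. That gap is the reason long-distance data movement typically requires larger buffers or parallel streams.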
In the case of TeraGrid, this component of the support infrastructure comprises a network architect/coordinator within the GIG to oversee the networking team, which includes five FTEs from resource provider facilities along with general networking contacts at all sites. The networking working group coordinates the operation of the TeraGrid network. Participants also assist in user support, such as diagnosing problems and optimizing performance of distributed services and applications.
Operations
TeraGrid provides a distributed operations center, leveraging the 24/7 operations centers at two of the resource provider facilities (NCSA and SDSC) to provide around-the-clock support. The distributed operations center plays several essential roles in the TeraGrid facility, including the management of a common trouble-ticket system and ongoing measurement of key metrics related to the health and performance of the facility. TeraGrid operations requirements also include management of the distributed accounting system, which involves the collection of usage information into a central usage database. The TeraGrid GIG funds two FTEs for various aspects of operations, along with two FTEs at resource provider facilities.
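The collection of per-site usage information into a central usage database can be sketched as follows. This is an illustration under stated assumptions, not the TeraGrid accounting system: the table name `usage_records`, its columns, the record fields, and the function names are all hypothetical, and an in-memory SQLite database stands in for the central store.

```python
import sqlite3


def create_central_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the central usage database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE usage_records ("
        "site TEXT, project TEXT, username TEXT, service_units REAL)"
    )
    return conn


def ingest(conn: sqlite3.Connection, site: str, records: list) -> None:
    """Load one site's usage records (a list of dicts) into the central table."""
    rows = [(site, r["project"], r["user"], r["service_units"]) for r in records]
    conn.executemany("INSERT INTO usage_records VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def usage_by_project(conn: sqlite3.Connection) -> dict:
    """Total service units charged to each project across all sites."""
    cursor = conn.execute(
        "SELECT project, SUM(service_units) FROM usage_records "
        "GROUP BY project ORDER BY project"
    )
    return dict(cursor.fetchall())
```

Tagging every record with its originating site keeps local accounting traceable while still allowing totals per project across the whole facility.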