CTWatch
May 2006
Designing and Supporting Science-Driven Infrastructure
Introduction
Fran Berman, Director, San Diego Supercomputer Center
Thom H. Dunning, Jr., Director, National Center for Supercomputing Applications
Professor and Distinguished Chair for Research Excellence, Department of Chemistry, University of Illinois at Urbana-Champaign

The seminal 2003 Report from the Blue Ribbon Advisory Panel on Cyberinfrastructure (the “Atkins Report”) states that

“The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function. Although good infrastructure is often taken for granted and noticed only when it stops functioning, it is among the most complex and expensive things that society creates. The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy.”

No one would question that bridges, roads, and telephones require maintenance, regular upgrades, support staff, long-range planning, and continuous monitoring, just as no one would question their importance to the successful functioning of society.

In the cyber-world, compute, data, networking, software, and other kinds of infrastructure are equally important as enablers of education and new discovery. Information technologies are now ubiquitous in the work of scientists and engineers, and the ability to use relevant resources in a coordinated way is critical for progress.

Analogous to public infrastructure, cyberinfrastructure requires maintenance, regular refresh and upgrade, support staff, long-range planning, and monitoring. For our nationally allocated resources, help desks, bug fixes, community codes, public data, and other components are all part of the researcher’s toolkit. Unlike research, which is characterized by innovation, flexibility, and risk, successful cyberinfrastructure must be characterized by usefulness, usability, functionality, and stability.

In this issue of Cyberinfrastructure Technology Watch, we look behind the scenes to understand what is required to develop and deploy successful cyberinfrastructure for compute, data, software, and grids. We have asked a distinguished set of colleagues and authors – some of the most experienced individuals in the community – to contribute to this issue on the “real” costs of cyberinfrastructure. We thank our colleagues for their efforts, and hope that CTWatch readers enjoy this glimpse behind the scenes of cyberinfrastructure.

Charlie Catlett, Pete Beckman, Dane Skow and Ian Foster, The Computation Institute, University of Chicago and Argonne National Laboratory

1. Introduction

The term “cyberinfrastructure” is broadly defined to include computer applications, services, data, networks, and many other components supporting science.1 Here we discuss the underlying resources and integrative systems and software that together comprise a grid “facility” offering a variety of services to users and applications. These services can range from application execution services to data management and analysis services, presented in such a way that end-user applications can access these services separately or in combination (e.g., in a workflow).
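As a concrete, if schematic, illustration of this pattern, the sketch below chains independent execution, data-management, and analysis services into a simple workflow. It is a minimal sketch in Python; the class and method names are hypothetical stand-ins for facility services, not actual TeraGrid or Globus interfaces.

```python
# Minimal sketch of composing independent facility services into a workflow.
# All names here are hypothetical illustrations, not real TeraGrid APIs.

class StorageService:
    def stage_in(self, source_url, work_dir):
        # A real facility would perform a wide-area transfer here.
        return f"{work_dir}/input.dat"

class ComputeService:
    def submit(self, executable, input_path):
        return "job-0001"                      # opaque job identifier

    def wait(self, job_id):
        return "/scratch/run01/output.dat"     # location of job output

class AnalysisService:
    def summarize(self, output_path):
        return {"records": 0, "source": output_path}

def run_workflow(dataset_url):
    # Each service can also be used on its own; here they are chained into
    # a stage-in / execute / analyze workflow.
    storage, compute, analysis = StorageService(), ComputeService(), AnalysisService()
    local_input = storage.stage_in(dataset_url, work_dir="/scratch/run01")
    output_path = compute.wait(compute.submit("simulate", local_input))
    return analysis.summarize(output_path)

if __name__ == "__main__":
    print(run_workflow("gsiftp://archive.example.org/dataset.dat"))
```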

We use the TeraGrid2 project to illustrate the functions and costs of providing national cyberinfrastructure. Developed and deployed in its initial configuration between 2001 and 2004, the TeraGrid is a persistent, reliable, production national facility that today integrates eighteen distinct resources at eight “resource provider” facilities.3 This facility supports over 1000 projects and several thousand users (Fig. 1) across the sciences. TeraGrid architecture, planning, coordination, operation, and common software and services are provided through the Grid Infrastructure Group (GIG), led by the University of Chicago. TeraGrid staff work with end-users, both directly and through surveys and interviews, to drive the technical design and evolution of the TeraGrid facility in support of science. In addition, TeraGrid is developing partnerships with major science facilities and communities to provide needed computational, information management, data analysis, and other services and resources, thus allowing those communities to focus on their science rather than on the creation and operation of services.

TeraGrid supports a variety of use scenarios, ranging from traditional supercomputing to advanced Grid workflow and distributed applications. In general terms, TeraGrid emphasizes two complementary types of use. TeraGrid “Deep” involves harnessing TeraGrid’s integrated high-capability resources to enable scientific discovery that would not otherwise be possible. TeraGrid “Wide” is an initiative to adapt TeraGrid services and capabilities so that they can be readily used by the broader scientific community through interfaces such as web portals and desktop applications. All of these use scenarios, including traditional supercomputing, benefit from the common services operated across the participating organizations, such as uniform access to storage, common data-movement mechanisms, facility-wide authentication, and the distributed accounting and allocation systems that provide the basis for authorization.

Creating and operating a grid facility involves integrating resources, software, and user support services into a coherent set of services for users and applications. Resources are explored by Roskies,4 while Killeen and Simon5 discuss user and community support. We discuss here the software infrastructure and policies required to integrate these diverse components to create a persistent, reliable national-scale facility. While the federation of multiple, independent computing centers requires carefully designed federation, governance, and sociological policies and processes, in this article we focus only on the functional and technical costs of operating a national grid infrastructure.

Figure 1. TeraGrid allocations by science discipline, April 2006 (1000 projects). Data from David Hart, SDSC.

Ralph Roskies, Pittsburgh Supercomputing Center
Thomas Zacharia, Oak Ridge National Laboratory

In this article, we outline the types of activities required in designing and supporting high-end computational facilities, along with an estimate of their cost. The major categories are facility costs, system software, and the human effort involved in designing the systems and keeping them running. This discussion does not include costs associated with direct user support, application software, application support, or the development of new technology. Nor does it include the networking issues involved in connecting outside the machine room. Those are covered elsewhere in this volume.

Facility Issues

The principal considerations in planning an HPC facility are sufficient space, power, and cooling. Equally important, though often easier to improve after the fact, are physical security, water and fire protection, pathways to the space, and automatic monitoring systems.

In the provision of space there is more to consider than the required number of square feet. This is especially true for today’s air-cooled clusters, which were not designed to be used together in the large quantities found in leading HPC centers. Today’s dense, air-cooled systems require large volumes of air for cooling. The depth of the plenum under the floor, i.e., the space between the solid subfloor and the bottom of the raised floor tiles, is an important measure of the ability to deliver adequate air. Distribution is also an issue. Masses of under-floor cable tend to create air dams that impede the ability to deliver air where it is needed. Conversely, moving large volumes of air through a barely adequate plenum will tend to cause streamlining, particularly when vents are located close to air handling units. The optimal location of the air handling units within the space often seems counter-intuitive. For example, one might think that placing air handlers close to the machine is better and more efficient, but that is likely to cause problems with streamlining and result in low-pressure areas. Establishing the correct flow of air is an iterative process no matter what your CFD study says. These issues become much simpler with liquid-cooled systems.
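To give a feel for the volumes involved, the following back-of-the-envelope sketch applies the standard sensible-heat relation for air, CFM ≈ BTU/hr ÷ (1.08 × ΔT °F), to an assumed rack; the 15 kW load and 20 °F temperature rise are illustrative values, not figures for any particular system.

```python
# Rough airflow estimate for an air-cooled rack using the sensible-heat
# relation CFM = BTU/hr / (1.08 * delta_T_F). Inputs are assumed examples.

BTU_PER_KW_HR = 3412.0   # 1 kW of IT load rejects roughly 3412 BTU/hr
AIR_CONSTANT = 1.08      # sensible-heat constant for air near sea level

def required_cfm(rack_kw, delta_t_f):
    """Airflow (cubic feet per minute) needed to remove rack_kw of heat
    with a delta_t_f (degrees Fahrenheit) rise across the rack."""
    return rack_kw * BTU_PER_KW_HR / (AIR_CONSTANT * delta_t_f)

if __name__ == "__main__":
    # An assumed 15 kW rack with a 20 F rise needs roughly 2,400 CFM,
    # which the plenum and tile placement must actually deliver.
    print(f"{required_cfm(15.0, 20.0):,.0f} CFM per rack")
```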

There are also many mundane problems to attend to. Subfloors should be sealed to prevent cement dust from spreading. Floor drains are needed to dispose of moisture condensing on air handler coils. Floor tiles should be selected carefully to avoid the problem of “zinc whiskers,” the shedding of tiny metallic slivers from the undersides of older tiles that causes seemingly random hardware reliability problems. Since computer equipment, air handlers, PDUs, and the like are both large and heavy, it is of great benefit to have a level pathway between the computer room and the loading dock where the equipment will be delivered. Be sure to account for hallway corners so that aisles are wide enough for equipment to turn them. Also note sprinkler heads that hang below ceiling height along the path, as well as door locking mechanisms and door jambs on the floor that reduce the effective clearance. Some equipment is heavy enough that metal plates are needed to avoid floor damage or collapse during delivery to the computer room. Systems that require extensive cooling depend on very large pipes carrying very large volumes of water; these pipes may run under the floor or overhead. Smoke detectors and moisture detectors must be correctly installed. Most modern detection systems interface to a site management/security system. It is important to make sure the detection system is integrated so that the proper people are notified in a timely manner.

Planning for power begins with the ability of the utility company to deliver adequate power to the site from its substations. Be prepared for a shocked reaction from your utility company the first time you call to make your request, especially if you have never done this before. During installation, it is wise to label and record every path that the electrical supply follows, to enable quick traceback in the event of problems or questions about electrical capacity.
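As a rough illustration of how such a request might be sized, the sketch below folds an assumed cooling overhead and growth margin into the raw machine load; the fractions are example assumptions, not recommendations for any specific site.

```python
# Rough sizing of the electrical service to request from the utility.
# The overhead and margin fractions below are assumed example values.

def required_service_kw(it_load_kw, cooling_overhead=0.5, margin=0.2):
    """Estimate total service (kW) for a given IT load.

    cooling_overhead: power for chillers and air handlers, as a fraction
                      of the IT load (0.5 assumes ~0.5 W of cooling per W
                      of computing).
    margin:           headroom for growth and peak draw above nameplate.
    """
    return it_load_kw * (1.0 + cooling_overhead) * (1.0 + margin)

if __name__ == "__main__":
    # An assumed 1 MW system would translate into roughly a 1.8 MW request.
    print(f"{required_service_kw(1000.0):,.0f} kW service")
```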

Fran Berman and Reagan Moore, San Diego Supercomputer Center

1. Introduction

The 20th century brought about an “information revolution” that has forever altered the way we work, communicate, and live. In the 21st century, data is ubiquitous. Available in digital format via the web, desktop, personal device, and other venues, data collections both directly and indirectly enable a tremendous number of advances in modern science and engineering.

Today’s data collections span the spectrum in discipline, usage characteristics, size, and purpose. The life science community utilizes the continually expanding Protein Data Bank1 as a worldwide resource for studying the structures of biological macromolecules and their relationships to sequence, function, and disease. The Panel Study of Income Dynamics (PSID),2 a longitudinal study initiated in 1968, provides social scientists detailed information about more than 65,000 individuals spanning as many as 36 years of their lives. The National Virtual Observatory3 is providing an unprecedented resource for aggregating and integrating data from a wide variety of astronomical catalogs, observation logs, image archives, and other resources for astronomers and the general public. Such collections have broad impact, are used by tens of thousands of individuals on a regular basis, and constitute critical and valuable community resources.

However, the collection, management, distribution, and preservation of such digital resources do not come without cost. Curation of digital data requires real support in the form of hardware infrastructure, software infrastructure, expertise, human infrastructure, and funding. In this article, we look beyond digital data to its supporting infrastructure and provide a holistic view of the software, hardware, human infrastructure, and costs required to support modern data-oriented applications in research, education, and practice.

Thom H. Dunning, Jr, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Robert J. Harrison and Jeffrey A. Nichols, Computing and Computational Sciences Directorate, Oak Ridge National Laboratory

1. Background

In the 1980s, it became clear that decommissioning and rehabilitating the nuclear weapons complex operated by contractors of the U.S. Department of Energy (DOE) posed a monumental challenge. The weapons sites contained tens of millions of gallons of high-level radioactive wastes and hundreds of cubic kilometers of contaminated soils, as well as thousands of contaminated facilities. Towards the end of the 1980s, Robert S. Marianelli, Director of the Chemical Sciences Division in DOE’s Office of Science (DOE-SC), and William R. Wiley, Director of the Pacific Northwest National Laboratory, began laying plans for a major new laboratory that would focus on gaining the fundamental understanding needed to tackle these problems. Their work eventually led to the construction of the Environmental Molecular Sciences Laboratory (EMSL), a national user facility dedicated to molecular research related to environmental science and waste processing.

The molecular systems involved in environmental science (e.g., aqueous solutions) and high-level wastes (e.g., transuranic compounds and metal-ion chelating agents) were considerably larger than those that could be studied with the molecular modeling software and computing resources available at the time. A workshop was convened in February 1990 to discuss the approach to be taken. The report from the workshop recommended that DOE-SC establish a major new computing facility in the EMSL and, simultaneously, make a major investment in the development of new quantum chemistry software designed explicitly for massively parallel computing systems. Thus began the development of the Northwest Chemistry package (NWChem).1,2,3,4 Although the official start of the project would be delayed for another couple of years, work began soon thereafter exploring technologies that could be used for a new, scalable quantum chemistry application that included the major atomic and molecular electronic structure methods (e.g., Hartree-Fock, perturbation theory, and coupled cluster theory) as well as molecular dynamics simulations with empirical, semiempirical, or ab initio potentials.

One of the authors (Dunning) instigated the NWChem project, while the other authors were the chief architect (Harrison) and project manager (Nichols).

Timothy L. Killeen, National Center for Atmospheric Research
Horst D. Simon, NERSC Center Division, Ernest Orlando Lawrence Berkeley National Laboratory, University of California

1. Introduction

The National Energy Research Scientific Computing Center (NERSC) and the National Center for Atmospheric Research (NCAR) are two computing centers that have traditionally supported large national user communities. Both centers have developed responsive approaches to support these communities and their changing needs by providing end-to-end computing solutions. In this report we provide a short overview of the strategies used at our centers in supporting our scientific users, with an emphasis on some examples of effective programs and future needs.

2. Science-Driven Computing at NERSC
2.1 NERSC’s Mission

The mission of NERSC is to accelerate the pace of scientific discovery by providing high performance computing, information, data, and communications services for research sponsored by the DOE Office of Science (DOE-SC). NERSC is the principal provider of high performance computing services for the capability needs of Office of Science programs — Fusion Energy Sciences, High Energy Physics, Nuclear Physics, Basic Energy Sciences, Biological and Environmental Research, and Advanced Scientific Computing Research.

Computing is a tool as vital as experimentation and theory in solving the scientific challenges of the 21st century. Fundamental to the mission of NERSC is enabling computational science of scale, in which large, interdisciplinary teams of scientists attack fundamental problems in science and engineering that require massive calculations and have broad scientific and economic impacts. Examples of these problems include global climate modeling, combustion modeling, magnetic fusion, astrophysics, computational biology, and many more. NERSC uses the Greenbook process1 to collect user requirements and drive its future development.

Lawrence Berkeley National Laboratory (Berkeley Lab) operates and has stewardship responsibility for NERSC, which, as a national resource, serves about 2,400 scientists annually throughout the United States. These researchers work at DOE laboratories, other Federal agencies, and universities (over 50% of the users are from universities). Computational science conducted at NERSC covers the entire range of scientific disciplines but is focused on research that supports DOE’s missions and scientific goals.
