NWChem: Development of a Modern Quantum Chemistry Program
May 2006
In the 1980s, it became clear that decommissioning and rehabilitation of the nuclear weapons complex operated by contractors of the U.S. Department of Energy (DOE) was a monumental challenge. The weapons sites contained tens of millions of gallons of high-level radioactive wastes and hundreds of cubic kilometers of contaminated soils, as well as thousands of contaminated facilities. Towards the end of the 1980s, Robert S. Marianelli, Director of the Chemical Sciences Division in DOE’s Office of Science (DOE-SC), and William R. Wiley, Director of the Pacific Northwest National Laboratory, began laying plans for a major new laboratory that would focus on gaining the fundamental understanding needed to tackle these problems. Their work eventually led to the construction of the Environmental Molecular Sciences Laboratory (EMSL), a national user facility dedicated to molecular research related to environmental science and waste processing.
The sizes of the molecular systems involved in environmental science (e.g., aqueous solutions) and high-level wastes (e.g., trans-uranic compounds and metal ion chelating agents) were considerably beyond those that could be studied with the molecular modeling software and computing resources available at the time. A workshop was convened in February 1990 to discuss the approach to be taken. The report from the workshop recommended that the DOE-SC establish a major new computing facility in the EMSL and, simultaneously, make a major investment in the development of new quantum chemistry software designed explicitly for massively parallel computing systems. Thus began the development of the Northwest Chemistry package (NWChem).1 2 3 4 Although the official start of the project would be delayed for another couple of years, work began soon thereafter exploring technologies that could be used for a new, scalable quantum chemistry application that included the major atomic and molecular electronic structure methods (e.g., Hartree-Fock, perturbation theory, coupled cluster theory, etc.) as well as molecular dynamics simulations with empirical, semiempirical, or ab initio potentials.
One of the authors (Dunning) instigated the NWChem project, while the other authors were the chief architect (Harrison) and project manager (Nichols).
In the early 1990s, the development of quantum chemistry software that would scale to hundreds, if not thousands, of processors was a challenging task. It was not known how to parallelize some of the basic mathematical algorithms used in molecular computations, let alone how to parallelize the many algorithms specific to these computations. In addition, many basic software technologies (e.g., interprocessor communication) were still evolving. To address these issues, the NWChem Project Team included theoretical and computational chemists, computer scientists, and applied mathematicians. This long-term partnership, which continues to this day, was critical to meeting the goals of the NWChem Project and led to the development of new mathematical algorithms (e.g., PEIGS for diagonalizing matrices) as well as to new computer system technologies (e.g., Global Arrays for interprocessor communication). The multidisciplinary NWChem Project served as a model for the approach taken in DOE-SC’s Scientific Discovery through Advanced Computing (SciDAC) program.
The NWChem project was supported by the Office of Biological and Environmental Research (OBER) in DOE-SC as an integral part of the EMSL Project. The EMSL Project provided approximately $2 million per year for the period FY1992-7. During this same period, DOE-SC’s Office of Advanced Scientific Computing Research provided another $0.5 million per year to support a Grand Challenge project in computational chemistry. The NWChem project leveraged many of the results from the Grand Challenge project. In addition, the initial exploratory work was funded by the Laboratory Director’s Research and Development program at the Pacific Northwest Laboratory. The “core” NWChem project team involved five computational chemists and three computer scientists and applied mathematicians. In addition, 14 postdoctoral fellows were involved in the project. All told, it is estimated that approximately 100 person-years and $12 million were devoted to the development of NWChem v1.0, not including the effort required to develop the technology incorporated in NWChem from external sources.
Several goals were set for the NWChem software package. These included:
Achieving these goals required a combination of research to determine the best solutions to the above problems, modern software engineering practices to implement these solutions, and a world-wide set of collaborators to provide expertise and experience missing in the core NWChem software development team. Fifteen external collaborators were involved in the development of NWChem; seven from the US, seven from Europe, and one from Australia.
A number of basic computing issues had to be addressed to optimize the performance and scalability of NWChem. These included: processor architecture, node memory latency and bandwidth, interprocessor communications latency and bandwidth, and load balancing. Solving the associated problems often required rewriting and restructuring the software, explorations that were carried out by the postdoctoral fellows associated with the NWChem project. Another issue that was always in the foreground was the portability of the software. Computational chemists typically have access to a wide range of computer hardware, from various brands of desktop workstations, to various brands of departmental computers, to some of the world’s largest supercomputers. To most effectively support their work, it was important that NWChem run on all of these machines, if possible.
The process for designing, developing and implementing NWChem used modern software engineering practices. The process can be summarized as follows:
Although the above is a far more rigorous process than is followed in most scientific software development projects, we found it to be critical to meeting the goals set for NWChem and for managing a distributed software development effort. The above cycle was actually performed at least twice for each type of NWChem method implemented (e.g., classical, uncorrelated quantum, highly correlated quantum, density functional, etc.). Going through the cycle multiple times generated “beta” software that could be released to users for feedback and refinement of user requirements.
Although the combination of an on-site core team plus off-site collaborators provided the range of technical capabilities needed to develop NWChem, there are lessons to be learned about managing such a highly distributed project. For example
Our experience suggests that a distributed software development team can be successful if the core team is large enough to develop all of the software components on the critical path and if sufficient guidance is provided to the collaborators on the format and content for their contributions and their progress is carefully monitored.
In addition to achieving high performance and being scalable to large numbers of processors, scientific codes must be carefully designed so that they can easily accommodate new mathematical models and algorithms as knowledge advances. If scientific codes cannot evolve as new knowledge is gained, they will rapidly become outdated. It must also be possible to move the codes from one generation of computers to the next without undue difficulty as computer technology advances—the lifetime of scientific codes is measured in decades, the lifetime of computers in years.
For the above reasons, NWChem is best thought of as a framework or environment for chemical computation rather than a single, fully integrated application. The framework defines and supports a “virtual machine” model and mandates a certain structure for new modules. A well-defined virtual machine model hides details of the underlying hardware and encourages programmers to focus on the essentials: correctness, good sequential performance, expressing concurrency, and minimizing data motion. The high-level structure ensures correct operation, provides a consistent look-and-feel for users, and enables code reuse. New chemical functionality can then be developed using a well-defined set of capabilities, which are described below. Testimony to the success of this framework (and perhaps to the inadequacy of our current programmer’s manual) is a recent comment from a developer new to NWChem – “my program is running correctly in parallel but I don’t know how.”
The key to achieving the above goals is a carefully designed architecture that emphasizes layering and modularity (Fig. 1). At each layer of NWChem, subroutine interfaces or styles were specified in order to control critical characteristics of the code, such as ease of restart, the interaction between different tasks in the same job, and reliable parallel execution. Object-oriented design concepts were used extensively within NWChem. Basis sets, molecular geometries, chunks of dynamically allocated local memory, and shared parallel arrays are all examples of “objects” within NWChem. NWChem is implemented in a mixture of C and Fortran-77, since neither C++ nor Fortran-90/95 was suitable at the start of the project. Since we did not employ a true object-oriented language, and in particular did not support inheritance, NWChem does not have “objects” in the strict sense of the word. However, careful design with consideration of both the data and the actions performed upon the data, and the use of data hiding and abstraction, permits us to realize many of the benefits of an object-oriented design.
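The flavor of this handle-based, data-hiding style can be conveyed with a small C sketch. The routines below (geom_create, geom_natoms, and so on) are purely illustrative stand-ins, not actual NWChem interfaces: callers manipulate an opaque integer handle, while the representation stays private to the implementation file, much as NWChem hides data behind its APIs.

    /* Illustrative sketch (not NWChem source) of a handle-based "object":
     * the data layout is hidden behind an opaque handle and all access
     * goes through a small API. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef int geom_handle_t;              /* opaque handle seen by callers  */

    struct geom {                           /* hidden representation          */
        int     natoms;
        double *coords;                     /* 3*natoms Cartesian coordinates */
    };

    #define MAX_GEOMS 16
    static struct geom *geom_table[MAX_GEOMS];   /* private to this file      */

    /* Create an empty geometry and return a handle; 0 on success, -1 on failure. */
    int geom_create(geom_handle_t *handle, int natoms)
    {
        for (int h = 0; h < MAX_GEOMS; h++) {
            if (geom_table[h] == NULL) {
                struct geom *g = malloc(sizeof *g);
                if (!g) return -1;
                g->natoms = natoms;
                g->coords = calloc(3 * (size_t)natoms, sizeof(double));
                geom_table[h] = g;
                *handle = h;
                return 0;
            }
        }
        return -1;                           /* no free slots */
    }

    /* Accessors: callers never see struct geom directly. */
    int     geom_natoms(geom_handle_t h) { return geom_table[h]->natoms; }
    double *geom_coords(geom_handle_t h) { return geom_table[h]->coords; }

    int geom_destroy(geom_handle_t h)
    {
        free(geom_table[h]->coords);
        free(geom_table[h]);
        geom_table[h] = NULL;
        return 0;
    }

    int main(void)
    {
        geom_handle_t h;
        if (geom_create(&h, 3) == 0) {       /* e.g., a water molecule */
            printf("geometry with %d atoms\n", geom_natoms(h));
            geom_destroy(h);
        }
        return 0;
    }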
The very bottom layer of NWChem is the Software Development Toolkit. It includes the Memory Allocator (MA), Global Arrays (GA), and Parallel IO (ParIO). The “Software Development Toolkit” was (and still is) the responsibility of the computer scientists involved in NWChem. It essentially defines a “hardware abstraction” layer that provides a machine-independent interface to the upper layers of NWChem. When NWChem is ported from one computer system to another, nearly all changes occur in this layer, with most of the changes elsewhere being for tuning or to accommodate machine-specific problems such as compiler flaws. The “Software Development Toolkit” contains only a small fraction of the code in NWChem, less than 2%, and only a small fraction of the code in the Toolkit is machine dependent (notably the address-translation and transport mechanisms for the one-sided memory operations).
The next layer, the “Molecular Modeling Toolkit,” provides the functionality commonly required by computational chemistry algorithms. This functionality is provided through “objects” and application programmer interfaces (APIs). Examples of objects include basis sets and geometries. Examples of the APIs include those for the integrals, quadratures, and a number of basic mathematical routines (e.g., linear algebra and Fock-matrix construction). Nearly everything that might be used by more than one type of computational method is exposed through a subroutine interface. Common blocks are not used for passing data across APIs, but are used to support data hiding behind APIs.
The runtime database (RTDB) is a key component of NWChem, tying together all of its layers. Arrays of typed data are stored in the database using simple ASCII strings for keys (or names), and the database may be accessed either sequentially or in parallel.
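The usage pattern is essentially that of a typed, string-keyed store that outlives any one module. The toy C example below illustrates the idea; it is not the NWChem RTDB interface itself, and the helper names and the stored value are assumptions made for illustration (only the task:energy key, discussed below, is taken from the text).

    /* Toy illustration (not NWChem code) of the run-time database idea:
     * typed arrays stored under ASCII string keys, written by one module
     * and read back later by another. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_ENTRIES 32
    #define MAX_DATA    16

    struct rtdb_entry {
        char   key[64];                /* e.g., "task:energy"            */
        int    nelem;                  /* number of doubles stored       */
        double data[MAX_DATA];
    };

    static struct rtdb_entry db[MAX_ENTRIES];
    static int n_entries = 0;

    /* Store nelem doubles under the given key; 0 on success. */
    int rtdb_put_dbl(const char *key, const double *values, int nelem)
    {
        if (n_entries == MAX_ENTRIES || nelem > MAX_DATA) return 1;
        struct rtdb_entry *e = &db[n_entries++];
        strncpy(e->key, key, sizeof e->key - 1);
        e->key[sizeof e->key - 1] = '\0';
        e->nelem = nelem;
        memcpy(e->data, values, nelem * sizeof(double));
        return 0;
    }

    /* Retrieve doubles previously stored under key; 0 on success. */
    int rtdb_get_dbl(const char *key, double *values, int nelem)
    {
        for (int i = 0; i < n_entries; i++) {
            if (strcmp(db[i].key, key) == 0 && db[i].nelem >= nelem) {
                memcpy(values, db[i].data, nelem * sizeof(double));
                return 0;
            }
        }
        return 1;                       /* not found */
    }

    int main(void)
    {
        /* A module records its result ...                               */
        double e_scf = -76.0267;        /* hartrees (illustrative value) */
        rtdb_put_dbl("task:energy", &e_scf, 1);

        /* ... and a later consumer (a geometry optimizer, say) reads it
         * back by name, with no direct coupling to the producer.        */
        double e;
        if (rtdb_get_dbl("task:energy", &e, 1) == 0)
            printf("task:energy = %.4f\n", e);
        return 0;
    }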
The next layer within NWChem, the “Molecular Calculation Modules,” comprises independent modules that communicate with other modules only via the RTDB or other persistent forms of information. This design ensures that, when a module completes, all persistent information is in a consistent state. Some of the inputs and outputs of modules (via the database) are also prescribed. Thus, all modules that compute an energy store it in a consistently named database entry.
The highest layer within NWChem is the “task” layer, sometimes called the “generic-task” layer. Functions at this level are also modules—all of their inputs and outputs are communicated via the RTDB, and they have prescribed inputs and outputs. However, these capabilities are no longer tied to specific types of wave functions or other computational details. Thus, regardless of the type of wave function requested by the user, the energy may always be computed by invoking task_energy() and retrieving the energy from the database entry named task:energy. This greatly simplifies the use of generic capabilities such as optimization, numeric differentiation of energies or gradients, and molecular dynamics. It is the responsibility of the “task”-layer routines to determine the appropriate module to invoke.
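The dispatch logic of such a task-level routine can be sketched as follows. Only task_energy() and the task:energy key are taken from the description above; the task:theory key, the module entry points, and the database helpers are hypothetical stubs included so the sketch is self-contained.

    /* Illustrative sketch (not NWChem source) of the generic "task" layer:
     * the requested theory is read from the database, the matching module
     * is invoked, and the result is stored under the generic key
     * task:energy, where generic capabilities (optimizers, numerical
     * derivatives, molecular dynamics, ...) can always find it.          */
    #include <stdio.h>
    #include <string.h>

    static double scf_energy(void) { return -76.03; }   /* stand-in modules */
    static double dft_energy(void) { return -76.42; }

    static double stored_energy;                         /* stand-in database */
    static void rtdb_get_str(const char *key, char *val, int maxlen)
    { (void)key; strncpy(val, "dft", maxlen - 1); val[maxlen - 1] = '\0'; }
    static void rtdb_put_dbl(const char *key, const double *v, int n)
    { (void)key; (void)n; stored_energy = v[0]; }

    int task_energy(void)
    {
        /* "task:theory" is an assumed key name used for illustration.    */
        char theory[32];
        rtdb_get_str("task:theory", theory, sizeof theory);

        double energy;
        if      (strcmp(theory, "scf") == 0) energy = scf_energy();
        else if (strcmp(theory, "dft") == 0) energy = dft_energy();
        else return 1;                                   /* unknown theory */

        /* Consumers read the result from here regardless of which module
         * produced it.                                                    */
        rtdb_put_dbl("task:energy", &energy, 1);
        return 0;
    }

    int main(void)
    {
        if (task_energy() == 0)
            printf("task:energy = %.2f\n", stored_energy);
        return 0;
    }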
NWChem was designed to be extensible in several senses. First, the clearly defined task and module layers make it easy to add substantial new capabilities to NWChem. Second, the wide selection of lower-level APIs makes it easier to develop new capabilities within NWChem than within codes in which these capabilities are not easy to access. Finally, having a standard API means that an improvement to an implementation is immediately available throughout the code.
Virtual Machine Model – Non-uniform Memory Access (NUMA)
By the late 1980s it was apparent that distributed-memory computers were the only path to truly scalable computational power, and the only portable programming model available for these systems was message passing. Although NWChem initially adopted the TCGMSG message passing interface, members of the NWChem team participated in development of the message passing interface (MPI) standard,5 and the official NWChem message-passing interface has been MPI for several years. It can be said without fear of contradiction that the MPI standard has been the most significant advance in practical parallel programming in over a decade, and it is the foundation of the vast majority of modern parallel programs. The vision and tireless efforts of those who initiated and led this communal effort must be acknowledged. It has also been pointed out that the existence of such a standard was a prerequisite to the emergence of very successful application frameworks such as PETSc.6
A completely consistent (and deliberately provocative) viewpoint is that MPI is evil. The emergence of MPI coincided with an almost complete cessation of parallel programming tool/paradigm research. This was due to many factors, but in particular to the very public and very expensive failure of HPF. The downsides of MPI are that it standardized (in order to be successful itself) only the primitive and already old communicating sequential processes7 (CSP) programming model, and that its success further stifled adoption of advanced parallel programming techniques, since any new method was by definition not going to be as portable. Since one of the major goals of NWChem was to enable calculations larger than would fit in the memory of a single processor, it was essential to manage distributed data structures. Scalable algorithms also demand dynamic load balancing to accommodate the very problem-dependent sparsity in matrix elements and the wide-ranging cost of evaluating integrals. Both of these tasks are difficult to accomplish using only simple message passing, and a more powerful solution was required.
The Global Arrays (GA) toolkit8 9 10 provides an efficient and portable “shared-memory” programming interface for distributed-memory computers. Each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed, dense multi-dimensional arrays, without need for explicit cooperation by other processes (Fig. 2). Unlike other shared-memory environments, the GA model exposes to the programmer the non-uniform memory access (NUMA) characteristics of the high performance computers and acknowledges that access to a remote portion of the shared data is slower than to the local portion. Locality information for the shared data is available, and direct access to local portions of shared data is provided. The GA toolkit has been in the public domain since 1994 and is fully compatible with MPI.
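The one-sided programming style is illustrated by the short C program below, written against the commonly documented GA C interface (GA_Initialize, NGA_Create, NGA_Put, NGA_Get, GA_Sync). Header names and the memory-allocator setup vary between GA releases, so the details should be read as a sketch rather than a build recipe.

    /* Minimal sketch of one-sided access to a distributed array with the
     * Global Arrays toolkit (C interface). Each process writes one row of
     * a shared array and then reads a row that may live on another
     * process, with no matching receive posted anywhere.                  */
    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);   /* local buffer space (illustrative sizes) */

        int me = GA_Nodeid(), nproc = GA_Nnodes();

        int dims[2]  = {nproc, 100};        /* nproc rows of 100 doubles        */
        int chunk[2] = {1, 100};            /* keep each whole row on one process */
        int g_a = NGA_Create(C_DBL, 2, dims, "work", chunk);

        double row[100];
        for (int j = 0; j < 100; j++) row[j] = (double)me;

        /* One-sided put of row "me" into the shared array, wherever it lives. */
        int lo[2] = {me, 0}, hi[2] = {me, 99}, ld[1] = {100};
        NGA_Put(g_a, lo, hi, row, ld);
        GA_Sync();

        /* One-sided get of a neighbor's row -- no cooperation is needed
         * from the process that owns the data.                            */
        int src = (me + 1) % nproc;
        int lo2[2] = {src, 0}, hi2[2] = {src, 99};
        NGA_Get(g_a, lo2, hi2, row, ld);
        if (me == 0) printf("row %d starts with %g\n", src, row[0]);

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }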
Essentially all chemistry functionality within NWChem is written using GA. MPI is only employed in those sections of code that benefit from the weak synchronization implied by passing messages between processes, for instance to handle the task dependencies in classical linear algebra routines or to coordinate data flow in a highly optimized parallel fast Fourier transform. This success is due to combining the correct abstraction (multi-dimensional arrays of distributed data) with the programming ease and scalability of one-sided access to remote data. Performance comes from algorithms (Fig. 3) designed to accommodate the NUMA machine characteristics, e.g., Hartree-Fock,11 four-index transformation,12 and multi-reference CI.13
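One NUMA-friendly idiom underlying this dynamic load balancing is a shared task counter that any process can atomically fetch and increment, which GA provides through NGA_Read_inc. The sketch below, with a stand-in for the real work, shows the pattern; as above, treat the build details as assumptions.

    /* Sketch of task-counter load balancing over a pool of independent
     * work items, using GA's atomic read-and-increment. Whichever process
     * is ready next claims the next task, so expensive and cheap tasks
     * even out across processes automatically.                            */
    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    static double do_task(int itask)   /* stand-in for a unit of work,     */
    {                                  /* e.g., a block of integrals       */
        return (double)itask;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 100000, 100000);

        const int ntasks = 1000;

        int dims1[1] = {1};            /* a single shared counter          */
        int g_counter = NGA_Create(C_LONG, 1, dims1, "counter", NULL);
        GA_Zero(g_counter);

        double local_sum = 0.0;
        int idx[1] = {0};
        long itask;
        /* Atomically fetch-and-add the counter until the work runs out.   */
        while ((itask = NGA_Read_inc(g_counter, idx, 1)) < ntasks)
            local_sum += do_task((int)itask);

        GA_Dgop(&local_sum, 1, "+");   /* global sum of partial results    */
        if (GA_Nodeid() == 0) printf("sum = %g\n", local_sum);

        GA_Destroy(g_counter);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }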
When discussing computer software, it is impossible to ignore the on-going cost associated with maintenance and evolution of the software. This phenomenon is well understood by anyone who uses computers in their daily life. Even for our personal computers, there is a continuing level of effort associated with fixing bugs in, as well as adding new features to, the software. The need for maintenance and evolution of scientific and engineering software is several-fold greater. This is a result of the extraordinary complexity of the software coupled with the continual development of new methods resulting from increased scientific understanding and the technological upheaval associated with the often rapid evolution of cutting-edge computing technologies.
In the early formative years of NWChem development, prior to the first official release of version 1.0 in 1997, it was critical to get the software out to users as quickly as possible, as often as possible, and always “for free.” This allowed feedback from the users, including bug reports and fixes, as well as establishing a large user base. In addition, early usage generated revised user requirements; e.g., a very important element of computational chemistry (Density Functional Theory) was added to the development effort (in 1993) after the project had already been initiated. The first beta releases of NWChem occurred in 1994, with subsequent trial releases occurring annually until the first official release in 1997. This also allowed the team to gain experience with the mechanisms eventually deployed for maintenance and operations.
The initial NWChem development platform was a KSR-2 – this system was appropriate for exploration of programming models for both shared and distributed memory implementations. The first actual production hardware (an IBM SP2) and, in fact, all subsequent production hardware systems were purchased based on NWChem requirements and benchmarks. Since the initial development of NWChem, the program has been ported to a broad range of computer systems, from IBM systems running AIX, SGI IRIX and Altix systems, and HP systems running HP-UX, Tru64, and Linux, to Apple personal computers running OS X. In addition, the performance and capabilities of NWChem have increased substantially since version 1.0 was released in 1997. The latest version (4.7) includes many improvements to the algorithms used in NWChem as well as the ability to perform many new types of calculations.
Unlike many other supercomputing facilities, EMSL’s Molecular Science Computing Facility supports a software development effort for NWChem and related software as well as supercomputer operations. The High Performance Software Development Group is presently led by Theresa Windus. Within this group, the Molecular Science Software project is responsible for the evolution, distribution, and support of Ecce, an extensible computational chemistry environment, and ParSoft, a set of software tools for massively parallel computers, as well as NWChem. In the High Performance Software Development Group, there are five computational chemists associated with evolution, distribution and support of NWChem—the same number (but not necessarily the same people) that were originally involved in the development of the software. It should be noted that the integration of NWChem with Ecce (the user interface) was much more difficult to achieve than originally anticipated. The integration process should probably have been initiated in 1995 (two years prior to the first official release in 1997).
Petascale computing is now a realizable goal that will impact all of science and engineering, not just those applications requiring the highest capability. But the optimum pathway to petascale science and engineering—the pathway that will realize the full potential of petascale computers to drive science and engineering—is unclear. Future computers cannot rely on continuing increases in clock speed to drive performance increases—heat dissipation problems will limit these increases. Instead, tomorrow’s computing systems will include processors with multiple “processor cores” on each chip, special application accelerators, and reprogrammable logic devices (FPGAs). In addition, all of these types of processors may be included in a single system, interconnected by a high-performance communications fabric. Individual processors may even have heterogeneous “processor cores” in the fashion of the new Cell processor from IBM, Sony and Toshiba.14 These technologies have the potential to dramatically increase the fidelity and range of computational simulations as well as the scope and responsiveness of data mining, analysis, and visualization applications. However, they also pose significant technical problems that must be addressed before their full potential can be realized.
So, the advances promised by petascale computers will not come gratis. The problems encountered in developing scientific codes for supercomputers with a performance exceeding 100 teraflops are technically complex, and their resolution will (once again) require an in-depth understanding of both the scientific algorithms and the computer hardware and systems software. Hardware problems to be overcome range from the memory bandwidth limitations of multicore microprocessor-based compute nodes to the utilization of “exotic” computing technologies (e.g., FPGAs) to the bandwidth and latency limitations of the interprocessor communications fabric. Software problems to be overcome range from the choice of programming model to the development of numerical algorithms that scale to (at least!) tens of thousands of processors. And, in the end, we want a code that is extensible, portable, and maintainable. As the NWChem project illustrated, these goals can be met by teams that include all of the needed expertise and that draw on talent both near and far. The pacing item for petascale science and engineering, as opposed to petascale computing, will be the state of the art in scientific applications.
As daunting as the above problems seem, it will be worth it! Combining the computing advances described above with advances in mathematical models and computational algorithms will lead to revolutionary new modeling and simulation capabilities. Problems that currently seem intractable will not only become doable, they will become routine. In chemistry, computational studies will become an integral and irreplaceable part of studies aimed at understanding the chemical processes involved in the environment, the burning of hydrocarbon fuels, and the industrial production of chemicals. The fidelity of modeling complex biomolecules will also take a major step forward, greatly increasing the contributions of computational chemistry to the life sciences. To realize these opportunities, however, the federal agencies must make investments in scientific simulation software, computing system software, and mathematical libraries necessary to capitalize on the potential of petascale computing.