Designing and Supporting High-end Computational Facilities
Ralph Roskies, Pittsburgh Supercomputing Center
Thomas Zacharia, Oak Ridge National Laboratory
CTWatch Quarterly
May 2006

In this article, we outline the types of activities required (and an estimate of their cost) in designing and supporting high-end computational facilities. The major categories are facility costs, system software, and the human effort involved in designing and keeping the systems running. This discussion does not include any costs associated with direct user support, application software, application support, or the development of new technology. Nor does it include the networking issues related to connecting outside the machine room; those are covered elsewhere in this volume.

Facility Issues

The principal points to be included in planning an HPC facility are sufficient space, power, and cooling. Equally important, but often more easily amenable to improvement, are physical security, water and fire protection, pathways to the space, and automatic monitoring systems.

In the provision of space there is more to consider than the required number of square feet. This is especially true for today’s air-cooled clusters, which were not designed to be used together in the large quantities found in leading HPC centers. Today’s dense, air-cooled systems require large volumes of air for cooling. The size of the plenum under the floor, i.e., the space between the solid subfloor and the bottom of the raised floor tiles, is an important measure of the ability to deliver adequate air. Distribution is also an issue. Masses of under-floor cable tend to cause air dams, which impede the ability to deliver air where it is needed. Conversely, moving large volumes of air through a barely adequate plenum will tend to cause streamlining, particularly when vents are located close to air handling units. Optimal location of the air handling units within the space often seems counter-intuitive. For example, one might think that placing air handlers close to the machine is better and more efficient, but that is likely to cause problems with streamlining and result in low-pressure areas. Establishing the correct flow of air is an iterative process, no matter what your CFD study says. These issues get a lot simpler with liquid-cooled systems.

There are also many mundane problems to attend to. Subfloors should be sealed to prevent cement dust from proliferating. Floor drains are needed to dispose of moisture condensing on air handler coils. Floor tiles should be carefully selected to avoid the problem of “zinc whiskers,” the dispersion of tiny metallic slivers from the undersides of older tiles that cause seemingly random hardware reliability problems. Since computer equipment, air handlers, PDUs, etc. are both large and heavy, it is of great benefit to have a level pathway between the computer room and the loading dock where the equipment will be delivered. Be sure to take into account any hallway corners to ensure that aisles are sufficiently wide to enable corners to be turned. Also, make note of sprinkler heads that will be below ceiling height on the path, as well as door locking mechanisms and door jambs on the floor that will reduce the effective clearance. Some equipment is sufficiently heavy that metal plates must be used to avoid floor damage or collapse during delivery to the computer room. Systems requiring a great deal of cooling need very large pipes carrying very large volumes of water; these pipes may be under the floor or overhead. Smoke detectors and moisture detectors must be correctly installed. Most modern detection systems interface to a site management/security system. It is important to make sure the detection system is integrated so that the proper people are notified in a timely manner.

Power considerations begin with the ability of the utility company to deliver adequate power to the site from its substations. Be prepared for a shocked reaction from your utility company the first time you call and make your request, especially if you have never done this before. During installation, it is wise to label and record every path that the electrical supply will follow, to enable quick traceback in the event of problems or questions about electrical capacity.

Ongoing non-personnel expenses

The power costs must take into account not only the power needs of the computer, but also the cost of the cooling. As a rule of thumb, multiply the power consumption of the system alone by 35-40% to estimate the additional power consumed by the required cooling. Today’s rates for power vary substantially across the country, ranging from under 3 cents/kWh to over 10 cents/kWh.
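
To make the arithmetic concrete, the short Python sketch below estimates annual electricity cost from the system’s power draw, the cooling overhead, and the local rate. The specific numbers (a 1 MW system, 6 cents/kWh) are purely illustrative assumptions, not measurements from any particular site.

    # Rough annual power-cost estimate for an HPC system, following the rule
    # of thumb above: cooling adds roughly 35-40% to the system's own draw.
    # All numbers here are illustrative assumptions.

    def annual_power_cost(system_kw, cooling_overhead=0.375, rate_per_kwh=0.06,
                          hours_per_year=8760):
        """Estimated yearly electricity cost, in dollars."""
        total_kw = system_kw * (1.0 + cooling_overhead)   # system plus cooling
        return total_kw * hours_per_year * rate_per_kwh

    # Example: a hypothetical 1 MW system at 6 cents/kWh.
    print(f"${annual_power_cost(1000):,.0f} per year")    # roughly $723,000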

First year maintenance may be included in the price of a new system. After that, unless the purchase has explicitly included multi-year maintenance, annual maintenance costs seem to range between 4% and 8% of the purchase price of the machine. It is not necessary to get a maintenance contract with extremely rapid response. For a system with a large node count, it is much more important to be able to remove a node from the system rapidly, reconfigure (preferably with spares), and continue. Next-day service may be adequate for the vendor to then do any required hardware maintenance on the removed nodes. It is almost always better to negotiate maintenance options with the vendor while negotiating for the original system, for that is when you have the most leverage with the vendor. It is wise to structure these as annual options so that you can cancel the maintenance contract with the vendor if you can find a better deal.

Operating expenses can be kept down by developing operator-free systems. For this, you need an extensive alerting infrastructure, which relays system events to system administrators via pagers or text messaging on their cell phones. Underlying it is a monitoring system extensive and reliable enough to report any of the anomalies that system operators would be likely to catch. You actually need a hierarchy of monitoring, from simple pass/fail checks on individual low-level devices, like nodes and disks, to high-level testing that exercises several components in sequence and verifies that the end-to-end results are correct.
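
The Python sketch below illustrates the kind of two-level hierarchy we have in mind: simple pass/fail probes on individual devices plus an end-to-end check, with alerts relayed to administrators. The node names, the ping-based probe, and the notify() hook are hypothetical placeholders; a production alerting infrastructure is far more elaborate.

    # Sketch of a two-level monitoring hierarchy: pass/fail probes on
    # low-level devices plus an end-to-end check of several components in
    # sequence. Assumes a Unix-style ping command is available.

    import subprocess

    def node_alive(node):
        """Low-level probe: a single ping to the node."""
        return subprocess.run(["ping", "-c", "1", node],
                              capture_output=True).returncode == 0

    def end_to_end_ok():
        """High-level probe: submit a tiny test job and verify its output.
        Only sketched here; details depend on the local scheduler."""
        return True

    def notify(message):
        """Relay an alert to the on-call administrator (pager, SMS, e-mail)."""
        print("ALERT:", message)   # placeholder for the real alerting path

    def sweep(nodes):
        for n in nodes:
            if not node_alive(n):
                notify(f"node {n} is unreachable")
        if not end_to_end_ok():
            notify("end-to-end test job failed")

    sweep(["node001", "node002", "node003"])   # hypothetical node names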

A newer trend: the four- to five-year operating cost of a major computer, including maintenance, space, power, and cooling, which for many years was a small part of the total cost of ownership of a system, is now becoming a much more significant factor and may even exceed the original capital investment.

Software

Increasingly, system software such as debuggers, mathematical libraries, job scheduling, performance analysis tools, and even compilers is provided by companies other than the hardware vendor. The cost of this required third-party software can be substantial, and often the suppliers do not have early access to hardware from the vendors. Make certain that you understand exactly what software will be supplied with the system, and what arrangements the vendor has with the independent software vendors who will supply these other needed tools. The cost of these licenses can be large. However, it is not always necessary to license tools such as debuggers for the full system. For example, debugging tools are not very effective above 100-200 tasks, so don’t bother to license the debugger for 2000 nodes. This can save a substantial amount of money. There are high-quality, robust mathematical libraries available for free from universities and government laboratories, the result of many years of development supported by the NSF and DOE. Often, vendors have optimized versions of these libraries available for their systems.


Systems and operations personnel tasks

There are a large number of different tasks that get lumped into systems and operations. We break them down into Core System Software, Machine Room Networking, User Access, Resilience, and Management. We briefly describe each of them, and return at the end to estimate the FTE effort required to carry them out.

Core System Software – This includes support for the operating system (OS), as well as for tools layered on top of the OS, including debuggers, scientific libraries, system monitoring displays, and many more. Get used to the idea that this work is never done. You will continually be installing new versions and patches. Best practices in version control are a necessity. New versions often introduce new bugs, and you will want the ability to fall back easily on the previous version. There are really two aspects to OS and tool support: some of it runs on the individual nodes of the system, while other parts concern system-wide behavior. Both need attention. Moreover, most HPC centers run multiple computing systems, each with a different OS, and each, of course, needs attention. For large systems, be prepared to have a system larger than anything the vendor has for internal testing of software. This implies that patches and new system software versions may never have been tested on a large machine before being delivered. You should negotiate with the vendor to provide test time on your system to run validation and regression testing before installing new software.
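
As one illustration of how such negotiated test time might be used, the Python sketch below runs a small validation suite and reports failures before an upgrade proceeds. The test names, commands, and expected outputs are hypothetical assumptions; every site’s suite and pass criteria will differ.

    # Sketch of a validation/regression pass to run during negotiated test
    # time before a new OS version or patch goes into production.

    import subprocess

    TESTS = [
        # (name, command, expected fragment of stdout) -- hypothetical tests
        ("mpi_hello", ["mpirun", "-np", "4", "./hello"], "hello from 4 ranks"),
        ("io_bandwidth", ["./io_check", "--min-mbps", "500"], "PASS"),
    ]

    def run_suite():
        failures = []
        for name, cmd, expected in TESTS:
            try:
                out = subprocess.run(cmd, capture_output=True, text=True,
                                     timeout=600).stdout
            except (OSError, subprocess.TimeoutExpired):
                out = ""
            if expected not in out:
                failures.append(name)
        return failures

    if __name__ == "__main__":
        failed = run_suite()
        if failed:
            print("Do not install; failing tests:", ", ".join(failed))
        else:
            print("Suite passed; proceed with the upgrade window.")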

Rather distinct are issues related to file systems. Often, there are at least three different file systems, which we can call home directories, local files, and system-wide files. Home directories are associated with individual users or projects. It is here that users store selected information long term. These are usually backed up and often subject to quotas. Local file systems are local to the individual nodes, while system-wide file systems are globally accessible and can often be written and read in parallel. Both of these are viewed as temporary; they are associated with running jobs, or with jobs which have recently run. Permanent file storage is found in the mass store system, which usually has a disk cache and a tape back end, with system algorithms that determine when to move files from the rapidly accessible but expensive disk cache to the less rapidly accessible but much less expensive tape system. Data on the mass storage system is also usually backed up. Monitoring and managing the file systems is necessarily an ongoing operational requirement.
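
The Python sketch below illustrates, under the assumption of a POSIX-visible disk cache, the simplest form of the migration decision described above: files that have not been accessed for some time become candidates for migration to tape. The cache path, the age threshold, and the migrate() hook are hypothetical; real hierarchical storage managers use far more sophisticated policies.

    # Sketch of an age-based migration policy for a mass-store disk cache:
    # files not accessed for N days become candidates for migration to tape.

    import os, time

    CACHE_ROOT = "/mass_store/cache"      # hypothetical disk-cache mount point
    AGE_LIMIT = 30 * 24 * 3600            # 30 days without access

    def migrate(path):
        """Placeholder: hand the file to the tape back end."""
        print("would migrate", path)

    def sweep_cache():
        now = time.time()
        for dirpath, _dirs, files in os.walk(CACHE_ROOT):
            for name in files:
                path = os.path.join(dirpath, name)
                if now - os.stat(path).st_atime > AGE_LIMIT:
                    migrate(path)

    if __name__ == "__main__":
        sweep_cache()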

Machine Room Networking – The tasks include engineering support for the design, development, installation, operation, testing, and debugging of all network infrastructure. At a minimum, one needs a good background in network protocols, including but not limited to TCP/IP, as well as in network diagnostics, end-to-end network performance, and network routing.
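
As a small example of the end-to-end flavor of such diagnostics, the Python sketch below measures TCP connect latency to a few service hosts. The host names are hypothetical, and real diagnostics would add throughput, routing, and protocol-level checks.

    # Sketch of a simple end-to-end check: TCP connect latency from the
    # machine room to a service host. Host names are hypothetical.

    import socket, time

    def connect_latency(host, port, timeout=2.0):
        """Seconds to complete a TCP connect, or None if it fails."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return time.monotonic() - start
        except OSError:
            return None

    for host in ["login1.example.org", "fileserver.example.org"]:
        latency = connect_latency(host, 22)
        status = f"{latency * 1000:.1f} ms" if latency is not None else "unreachable"
        print(f"{host}: {status}")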

User Access – The issues here can be categorized as user accounts, job processing, and reporting. User accounts have to be continually created, monitored, and shut down. Authentication mechanisms have to be installed, which usually also means maintaining a Kerberos server. Job processing may include such things as developing and maintaining a scheduler, which implements scheduling policy, as well as exposing a queuing system that the users see. Reporting means processing individual job records and creating a database that can easily answer questions such as how many users have used what resources in what discipline, what fraction of the usage has gone to what number of processors, how many new users have been added in the past year, and what the demographics of the users are. Today’s management usually wants a web interface to be able to easily query the database and directly extract the kinds of information it needs.
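
The Python sketch below illustrates the reporting side with a hypothetical job-records schema in SQLite: once scheduler accounting records land in a database, questions about usage by discipline or by processor count become simple queries. The schema, sample data, and column names are assumptions for illustration only.

    # Sketch of usage reporting against a job-records database; the schema
    # and sample rows are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE jobs (
                        user TEXT, discipline TEXT, processors INTEGER,
                        cpu_hours REAL, submitted DATE)""")
    conn.execute("INSERT INTO jobs VALUES ('alice', 'chemistry', 256, 1024.0, '2006-03-01')")
    conn.execute("INSERT INTO jobs VALUES ('bob',   'physics',   512, 4096.0, '2006-03-02')")

    # How many users consumed what resources in each discipline.
    for row in conn.execute("""SELECT discipline, COUNT(DISTINCT user), SUM(cpu_hours)
                               FROM jobs GROUP BY discipline"""):
        print(row)

    # Fraction of usage by processor-count range.
    for row in conn.execute("""SELECT CASE WHEN processors >= 512 THEN '512+'
                                           ELSE '<512' END AS size,
                                      SUM(cpu_hours)
                               FROM jobs GROUP BY size"""):
        print(row)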

Resilience means keeping the system as secure as possible, testing that it is operating reliably, and taking the necessary steps if it is not. It includes such matters as security, monitoring system status, node management, and regression testing. Security is an increasing concern at all large HPC sites. Items include reliable authentication of users and developing and enforcing best practices for staff, including one-time passwords for those with system privileges. Knowing the availability of processors for jobs is not nearly as simple as it sounds. Usually it is felt that hardware errors should be dealt with by the manufacturer, while software errors can be cured by a system reboot, but deciding whether errors are hardware or software is actually not clear-cut. Nodes can seem to be available to the scheduler but in fact are not. Most common is the case where processes from a previous job don’t get cleaned up; another job started on that node will then not perform properly. Node management and analysis means both calling field service in case of hardware issues and maintaining an easily queried database to identify problematic nodes, those that fail repeatedly in hardware or software. Although most manufacturers support error logging, those logs are often flat files scattered in many different places. Effort needs to be expended to collect that data into a useful database. Then the data has to be analyzed, so that one can spot problem nodes before they actually fail. Moreover, you cannot safely assume that nodes returned by field service are in good working order, or even that the problem you had detected has been solved. After installation of new software or hardware, regression testing must re-establish that the system gives acceptable answers and performance for all the tests in the test suite.
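
The Python sketch below shows the simplest version of collecting scattered error logs into a per-node failure count, so that repeatedly failing nodes stand out before they fail hard. The log location and the line format (timestamp, node name, error text) are hypothetical assumptions; a real site would load the results into a proper database.

    # Sketch of turning flat error-log files into a per-node failure count.
    # Assumes hypothetical log files whose lines look like
    # "<timestamp> <node> <error text>".

    import collections, glob

    def failure_counts(log_glob="/var/log/hpc/errors-*.log"):
        counts = collections.Counter()
        for path in glob.glob(log_glob):
            with open(path) as f:
                for line in f:
                    parts = line.split()
                    if len(parts) >= 2:
                        counts[parts[1]] += 1       # second field: node name
        return counts

    if __name__ == "__main__":
        for node, n in failure_counts().most_common(10):
            print(f"{node}: {n} logged errors")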

Management – This includes the staff time to manage the people (including hiring), project management of the individual subtasks outlined above, management reporting, and keeping an up-to-date inventory of equipment. It also includes vendor relationships, not only with your current suppliers but with others as well, so that you can keep abreast of new technological developments that might be relevant for your next endeavors. For your current vendors, there is often significant effort in keeping track of progress on problems you have reported and they have promised to fix.

Personnel effort required

The FTE effort required to carry out the systems and operations functions clearly depends on many factors, such as the number of different systems, the size (processor count) of each, the number of users, and whether you take early systems or wait until they are mature (i.e., others have worked out the inevitable bugs that are encountered in early systems). Estimates are that it takes 8-12 FTEs for a single, large, stand-alone HPC system to provide systems administration, security, networking, storage, and 24x7 monitoring and operational support. After that, when you add more people for each new system, the growth is much slower. However, Tim Thomas of the University of New Mexico reports (private communication) that he has looked at many systems and found the following amusing rule of thumb. For a small installation (fewer than 50 processors), you don’t usually find a dedicated system person (this means that you are exploiting graduate students or the PI). After that, he finds that sites add one FTE for roughly every 250 processors. Of course, this rule shouldn’t work, because it is insensitive to the number of systems, the number of users, and whether these are early systems or not. A quick look at other TeraGrid sites indicates that this rule gives a fairly good estimate of the systems effort expended. However, the more systems involved and the richer the underlying infrastructure, the more personnel are needed to provide stable support and assistance to users, so these numbers are highly dependent on the facility.
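
Purely to illustrate the arithmetic of that rule of thumb, the Python sketch below encodes it directly; as noted, it deliberately ignores the number of systems, the number of users, and system maturity.

    # The rule of thumb quoted above, expressed as arithmetic: below about
    # 50 processors there is typically no dedicated system person; above
    # that, roughly one FTE per 250 processors.

    def estimated_ftes(processors):
        if processors < 50:
            return 0
        return max(1, round(processors / 250))

    for p in (32, 250, 1000, 2500):
        print(p, "processors ->", estimated_ftes(p), "FTE(s)")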

Acknowledgements
We received valuable input from Lynn Layman, J. Ray Scott, and Wendy Huntoon.

URL to article: http://www.ctwatch.org/quarterly/articles/2006/05/designing-and-supporting-high-end-computational-facilities/