CTWatch
May 2006
Designing and Supporting Science-Driven Infrastructure
Ralph Roskies, Pittsburgh Supercomputing Center
Thomas Zacharia, Oak Ridge National Laboratory

Systems and operations personnel tasks

There are a large number of different tasks that get lumped into systems and operations. We break them down into Core System Software, Machine Room Networking, User Access, Resilience, and Management. We briefly describe each of them, and return at the end to estimate the FTE effort required to carry them out.

Core System Software – This includes support for the operating system (OS), as well as for tools layered on top of the OS, including debuggers, scientific libraries, system monitoring displays, and many more. Get used to the idea that this work is never done. You will continually be installing new versions and patches. Best practices in version control are a necessity. New versions often introduce new bugs, and you will want the ability to fall back easily to the previous version. There are really two aspects to OS and tool support: some components run on the individual nodes of the system, while others are concerned with system-wide behavior. Both need attention. Moreover, most HPC centers run multiple computing systems, each with a different OS, and each, of course, needs attention. For large systems, be prepared to have a system larger than anything the vendor has for internal software testing. This implies that patches and new system software versions may never have been tested on a large machine before being delivered. You should negotiate with the vendor for test time on your system to run validation and regression testing before installing new software.
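To make the fallback idea concrete, the sketch below shows one common way to keep several versions of a layered tool installed side by side and switch between them atomically. The tool name, paths, and version numbers are our own illustrative assumptions, not a description of any particular center's setup.

#!/usr/bin/env python3
"""Sketch only: versioned installs activated through a symlink, so that
falling back to the previous version is a one-line operation. The tool
name, paths, and version numbers are illustrative assumptions."""
import os

TOOL_ROOT = "/opt/hpc/debugger"          # hypothetical install area
CURRENT = os.path.join(TOOL_ROOT, "current")

def activate(version: str) -> None:
    """Point the 'current' symlink at the requested versioned install."""
    target = os.path.join(TOOL_ROOT, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"no such install: {target}")
    tmp = CURRENT + ".new"
    if os.path.lexists(tmp):             # clear out any stale link from an earlier run
        os.remove(tmp)
    os.symlink(target, tmp)              # build the new link beside the old one
    os.replace(tmp, CURRENT)             # atomic swap; the old install stays on disk

if __name__ == "__main__":
    activate("2.4.1")                    # upgrading; activate("2.4.0") would fall back

Because the previous versioned directory is never touched, rolling back after a bad patch is simply a matter of re-pointing the link.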

Rather distinct are issues related to file systems. Often, there are at least three different file systems, which we can call home directories, local files, and system-wide files. Home directories are associated with individual users or projects. It is here that users store selected information long term. These directories are usually backed up and often subject to quotas. Local file systems are local to the individual nodes, while system-wide file systems are globally accessible and can often be written and read in parallel. Files in both are viewed as temporary; they are associated with running jobs, or with jobs that have recently run. Permanent file storage is provided by the mass store system, which usually has a disk cache and a tape back end, with system algorithms that determine when to move files from the rapidly accessible but expensive disk cache to the less rapidly accessible but much less expensive tape system. Data on the mass storage system is also usually backed up. Monitoring and managing the file systems is necessarily an ongoing operational requirement.
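As a concrete illustration of what "temporary" means in practice, the sketch below scans a scratch file system and reports files that have not been accessed within a cutoff period. The mount point, the retention window, and the report-only behavior are assumptions chosen for the example, not a policy described here.

#!/usr/bin/env python3
"""Sketch only: sweep a scratch file system and flag files untouched for
longer than a cutoff period. Mount point and retention window are assumed."""
import os
import time

SCRATCH = "/scratch"            # hypothetical system-wide scratch mount
CUTOFF_DAYS = 14                # hypothetical retention policy

def stale_files(root: str, cutoff_days: int):
    """Yield paths whose last access time is older than the cutoff."""
    cutoff = time.time() - cutoff_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue        # file vanished mid-scan; skip it

if __name__ == "__main__":
    for path in stale_files(SCRATCH, CUTOFF_DAYS):
        print(path)             # report only; actual purging would need site review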

Machine Room Networking – The tasks include engineering support for the design, development, installation, operation, testing, and debugging of all network infrastructure. At a minimum, one needs a good background in network protocols (including, but not limited to, TCP/IP), as well as in network diagnostics, end-to-end network performance, and network routing.
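To give a flavor of the end-to-end performance testing involved, the sketch below streams data over a single TCP connection between two hosts and reports the achieved rate. The port number, transfer size, and command-line convention are illustrative assumptions; production sites would normally rely on dedicated measurement tools.

#!/usr/bin/env python3
"""Sketch only: a bare-bones end-to-end TCP throughput probe.
Run 'serve' on one host and pass that host's name on the other."""
import socket
import sys
import time

def serve(port: int) -> None:
    """Sink: accept one connection and discard whatever arrives."""
    with socket.create_server(("", port)) as srv:
        conn, _addr = srv.accept()
        with conn:
            while conn.recv(1 << 16):
                pass

def probe(host: str, port: int, total_mb: int = 256) -> None:
    """Source: stream total_mb mebibytes and report the achieved rate."""
    payload = b"\0" * (1 << 20)                  # 1 MiB per send
    start = time.time()
    with socket.create_connection((host, port)) as s:
        for _ in range(total_mb):
            s.sendall(payload)
    elapsed = time.time() - start
    print(f"{total_mb} MiB in {elapsed:.2f} s = {total_mb / elapsed:.1f} MiB/s")

if __name__ == "__main__":
    if sys.argv[1] == "serve":
        serve(5001)
    else:
        probe(sys.argv[1], 5001)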

User Access – The issues here can be categorized as user accounts, job processing, and reporting. User accounts have to be continually created, monitored, and shut down. Authentication mechanisms have to be installed, which usually also means maintaining a Kerberos server. Job processing may include such things as developing and maintaining a scheduler, which implements scheduling policy, as well as exposing the queuing system that users see. Reporting means processing individual job records and creating a database that can easily answer questions such as how many users have used what resources in what discipline, what fraction of the usage has run at what processor counts, how many new users have been added in the past year, and what the demographics of the user base are. Today's management usually wants a web interface so it can easily query the database and directly extract the kinds of information it needs.
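Once the job records are in a database, the reporting questions above reduce to straightforward queries that a web interface can issue on demand. The sketch below uses an assumed SQLite database with hypothetical jobs and users tables; the schema and column names are illustrative, not drawn from any particular accounting system.

#!/usr/bin/env python3
"""Sketch only: example reporting queries against an assumed job-record
database. Table and column names are hypothetical."""
import sqlite3

conn = sqlite3.connect("accounting.db")   # hypothetical job-record database

# Resource usage broken down by scientific discipline.
usage_by_discipline = conn.execute("""
    SELECT users.discipline, SUM(jobs.cpu_hours)
    FROM jobs JOIN users ON jobs.user_id = users.user_id
    GROUP BY users.discipline
""").fetchall()

# Fraction of total usage delivered at each processor count.
usage_by_width = conn.execute("""
    SELECT processors,
           SUM(cpu_hours) * 1.0 / (SELECT SUM(cpu_hours) FROM jobs)
    FROM jobs GROUP BY processors
""").fetchall()

# New user accounts created in the past year.
new_users = conn.execute("""
    SELECT COUNT(*) FROM users
    WHERE created >= DATE('now', '-1 year')
""").fetchone()[0]

print(usage_by_discipline, usage_by_width, new_users)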

Resilience – This means keeping the system as secure as possible, testing that it is operating reliably, and taking the necessary steps if it is not. It includes such matters as security, monitoring system status, node management, and regression testing. Security is an increasing concern at all large HPC sites; items include reliable authentication of users and developing and enforcing best practices for staff, including one-time passwords for those with system privileges. Knowing the availability of processors for jobs is not nearly as simple as it sounds. Usually it is felt that hardware errors should be dealt with by the manufacturer, while software errors can be cured by a system reboot, but deciding whether errors are hardware or software is actually not clear-cut. Nodes can seem to be available to the scheduler but in fact are not. Most common is the case where processes from a previous job do not get cleaned up; a job started on such a node will not perform properly. Node management and analysis means both calling field service in case of hardware issues and maintaining an easily queried database to identify problematic nodes – those that fail repeatedly in hardware or software. Although most manufacturers support error logging, those logs are often flat files scattered across many different places. Effort needs to be expended to collect that data into a useful database. Then the data has to be analyzed, so that one can spot problem nodes before they actually fail. Moreover, you cannot safely assume that nodes returned by field service are in good working order, or even that the problem you had detected has been solved. After installation of new software or hardware, regression testing must re-establish that the system gives acceptable answers and performance for all the tests in the test suite.
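As an illustration of pulling scattered error logs into something queryable, the sketch below loads flat-file log records into a small database and flags nodes with repeated failures. The log location, record format, and failure threshold are assumptions chosen for the example.

#!/usr/bin/env python3
"""Sketch only: fold scattered flat-file error logs into one table so that
nodes which fail repeatedly stand out before their next failure. Log paths,
record format, and threshold are hypothetical."""
import glob
import re
import sqlite3

conn = sqlite3.connect("node_health.db")   # hypothetical event database
conn.execute("""CREATE TABLE IF NOT EXISTS events
                (node TEXT, timestamp TEXT, message TEXT)""")

# Assume one 'node=... time=... msg=...' record per line; real logs vary widely.
PATTERN = re.compile(r"node=(\S+)\s+time=(\S+)\s+msg=(.*)")

for path in glob.glob("/var/log/cluster/*.log"):   # hypothetical log location
    with open(path) as fh:
        for line in fh:
            m = PATTERN.match(line)
            if m:
                conn.execute("INSERT INTO events VALUES (?, ?, ?)", m.groups())
conn.commit()

# Nodes with three or more logged failures are flagged for closer inspection.
for node, count in conn.execute("""
        SELECT node, COUNT(*) FROM events
        GROUP BY node HAVING COUNT(*) >= 3
        ORDER BY COUNT(*) DESC"""):
    print(f"{node}: {count} events")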

Management – This includes the staff time needed to manage the group: hiring, project management of the individual subtasks outlined above, management reporting, and keeping an up-to-date inventory of equipment. It also includes vendor relationships, not only with your current suppliers but with others as well, so that you can keep abreast of new technological developments that might be relevant for your next endeavors. For your current vendors, there is often significant effort in keeping track of progress on the problems you have reported and that they have promised to fix.


Reference this article
Roskies, R., Zacharia, T. "Designing and Supporting High-end Computational Facilities," CTWatch Quarterly, Volume 2, Number 2, May 2006. http://www.ctwatch.org/quarterly/articles/2006/05/designing-and-supporting-high-end-computational-facilities/
