The FTE effort required to carry out the systems and operations functions clearly depends on many factors, such as the number of different systems, the size (processor count) of each, the number of users, and whether you take early systems or wait until they are mature (i.e. until others have worked out the inevitable bugs encountered in early systems). Estimates are that a single, large, stand-alone HPC system requires 8-12 FTEs to provide systems administration, security, networking, storage, and 24x7 monitoring and operational support. Beyond that, the number of people added for each new system grows much more slowly. Tim Thomas of the University of New Mexico reports (private communication) that he has looked at many systems and found the following amusing rule of thumb. A small installation (fewer than 50 processors) usually has no dedicated systems person (this means that you are exploiting graduate students or the PI). Above that size, he finds that sites add one FTE for roughly every 250 processors. Of course, this rule shouldn’t work, because it is insensitive to the number of systems, the number of users, and whether or not these are early systems. Nevertheless, a quick look at other TeraGrid sites indicates that it gives a fairly good estimate of the systems effort expended. Still, the more systems involved and the richer the underlying infrastructure, the more personnel are needed to provide stable support and assistance to users, so these numbers are highly dependent on the facility.
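
As a rough illustration, the rule of thumb above can be written as a small calculation. This is only a sketch: the function name `estimate_systems_fte` and its parameters are hypothetical, and the text does not say how the rule behaves just above the 50-processor threshold, so the sketch assumes the estimate is simply the processor count divided by 250, unrounded.

```python
def estimate_systems_fte(processors, per_fte=250, small_site=50):
    """Rough systems-staff estimate from the rule of thumb quoted above.

    Assumption (not stated in the text): at or above the small-site
    threshold the estimate is simply processors / per_fte, unrounded.
    """
    if processors < 0:
        raise ValueError("processors must be non-negative")
    if processors < small_site:
        # Small installations usually have no dedicated systems person.
        return 0.0
    return processors / per_fte


if __name__ == "__main__":
    for n in (40, 250, 1000, 2500):
        print(f"{n:>5} processors -> about {estimate_systems_fte(n):.1f} FTE")
```

Like the rule itself, the sketch ignores the number of systems, the number of users, and system maturity, so its output is at best a starting point for a facility-specific estimate.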
We received valuable input from Lynn Layman, J. Ray Scott, and Wendy Huntoon.






