The studies above identified several system and usage trends. This section draws conclusions about the conjectures made in the introduction of this paper, and discusses how the studies shaped our understanding of HPC centers and users and what they imply for both.
While it is clear that users have a wide range of system demands, as seen in the variety of allocation and job sizes, it is not uncommon to assume that as long as users share a system, they share the same challenges staying productive. However, as details of individual usage patterns emerged, we found that not all users experienced the same problems with the same systems and tools, and some problems affected some users more than others.
Three distinct classes of users emerged based on HPC resource utilization, project scale, and the problems encountered. While these classes are clearly related to allocation size, they are defined by characteristics of their system usage.
Marquee Users: Marquee users run at very large scale, often using the full system and stressing site policies and system resources. For these reasons they are the users most affected by node failures, and they are generally unable to use the interactive nodes for debugging or to benefit from backfill opportunities in the queue. We have named this group “Marquee” users because such projects are often used to publicize HPC centers.
Marquee users often have a consultant, or are themselves the consultant, working on the application to improve its performance, to port it to a new system, or to scale it to larger numbers of processors. The marquee users are generally represented in our study through the job logs and personal interviews.
Normal Users: The largest class, “Normal” users, tend to run jobs using between 128 and 512 processors. Their problem size is not necessarily limited by the available resources of the systems they use. This greater flexibility in resource usage allows them to run on smaller systems that may not be as heavily loaded. Normal users are less likely to have been forced to tune their applications for performance, or to have used performance tools. Normal users are generally represented in our study in the form of user surveys and job logs.
Small Users: Small users have a minor impact on system resources, do not tend to use large allocations, and generally run jobs with fewer than 16 processors, which are often started quickly by backfilling schedulers. In general, small users are learning parallel programming, and their productivity challenges are more likely due to unfamiliarity with HPC concepts. Small users are generally represented in our study in the form of help tickets and job logs.
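As a rough illustration only, the job-size ranges quoted above can be turned into a simple labeling rule for job-log analysis. The sketch below is not the procedure used in the studies: the classes are defined by broader usage characteristics, not by job size alone. The function name `classify_user`, the parameter `system_procs`, and the `marquee_fraction` cutoff of 0.5 are hypothetical choices made for this example. Job sizes that fall between the quoted ranges are left unlabeled rather than guessed.

```python
from typing import Optional


def classify_user(typical_procs: int,
                  system_procs: int,
                  marquee_fraction: float = 0.5) -> Optional[str]:
    """Label a user from the typical processor count of their jobs.

    The 16 and 128-512 thresholds come from the class descriptions in
    the text. `marquee_fraction` is an assumed stand-in for "very large
    scale, often using the full system"; the text gives no number.
    Returns None when the job size falls outside every quoted range.
    """
    if typical_procs < 1 or system_procs < 1:
        raise ValueError("processor counts must be positive")

    if typical_procs < 16:
        return "small"      # "fewer than 16 processors"
    if 128 <= typical_procs <= 512:
        return "normal"     # "between 128 and 512 processors"
    if typical_procs >= marquee_fraction * system_procs:
        return "marquee"    # assumed cutoff, not from the text
    return None             # range not characterized in the text


if __name__ == "__main__":
    # Hypothetical 10,000-processor system.
    for procs in (8, 64, 256, 8000):
        print(procs, classify_user(procs, system_procs=10_000))
```

A labeling of this kind would only be a first pass; the interview, survey, and help-ticket evidence described above would still be needed to place a user in a class.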