CTWatch
November 2006 A
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Nicole Wolter, San Diego Supercomputing Center
Michael O. McCracken, San Diego Supercomputing Center
Allen Snavely, San Diego Supercomputing Center
Lorin Hochstein, University of Nebraska, Lincoln
Taiga Nakamura, University of Maryland, College Park
Victor Basili, University of Maryland, College Park


The differences between normal and marquee users are clearly visible when the verbal interviews are compared with the survey results. In some cases the interviewees and survey respondents were using the same systems and the same software, but because of differing processor counts and memory requirements, their opinions of how system traits affect productivity diverged sharply.

The interviews with SDSC performance consultants showed that memory or I/O was the primary bottleneck for their codes. However, eight of our 12 survey respondents believed that processor floating-point performance was their top bottleneck. Since some survey respondents were using the same code as our interview subjects, we attribute this discrepancy to the scale at which the marquee users run. In many cases performance bottlenecks only become apparent when system resources are stressed. The survey respondents, representing our normal users, report standard production runs requesting 256-512 processors, while interview subjects often use more than 1,024 processors.

The interactive running capabilities of a system provide another opportunity to distinguish user classes. Most survey respondents were able to successfully debug their codes on 1-8 processors running for less than one hour, which is feasible on current interactive nodes. Marquee users, on the other hand, must submit to the batch queue and are subject to long wait times to run very short jobs to reproduce bugs and test fixes. The marquee users expressed their frustration with the lack of on-demand computing for a large number of processors, specifically for debugging purposes.

This implies that different system policies are appropriate for different users. For example, resource demands may vary across the phases of a project. Once major development and tuning has ceased and production runs have begun, a policy that allows the entire system to be reserved could be a major improvement to productivity. One marquee user gave the example that it would take a week of running around the clock to complete one simulation, and that repeatedly waiting in the queue is wasteful and can extend this process by at least a factor of four. In contrast, another researcher was uncomfortable with having dedicated time, due to the risk of wasting allocation while fixing problems, and preferred a high-priority queue reservation policy instead. System design and site policies should reflect the different types of users and stages of development.

Conjecture 2: Users with the largest allocations and most experience are the most productive.

A common policy of HPC centers is to give preference in queue priority to jobs with higher node requirements, which encourages users to make full use of rare high capacity systems. Large users may also receive more personal attention, even to the point of having a dedicated consultant working on their programs. Does this conclusively imply that larger users are the most productive?

Through our interviews it emerged that productivity, in terms of generating scientific results, is at least as difficult for large users to achieve as it is for smaller users, if not more so. Queue wait time, reliability, and porting issues all cause major problems for large users. Large-scale programs often push the limits of systems and therefore run into problems rarely seen at lower scale, such as system and code performance degradation and system reliability problems.

Evaluating the queue based on job logs can be complicated, because a number of factors affect queue priority. Among the most influential variables are requested runtime, requested processor count, the size of a user's unspent allocation, and site administrators' ability to change the priority of individual jobs directly.


Reference this article
Wolter, N., McCracken, M. O., Snavely, A., Hochstein, L., Nakamura, T., Basili, V. "What's Working in HPC: Investigating HPC User Behavior and Productivity," CTWatch Quarterly, Volume 2, Number 4A, November 2006 A. http://www.ctwatch.org/quarterly/articles/2006/11/whats-working-in-hpc-investigating-hpc-user-behavior-and-productivity/
