November 2006 A
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Lorin Hochstein, University of Nebraska
Taiga Nakamura, University of Maryland, College Park
Victor R. Basili, University of Maryland, College Park; Fraunhofer Center, Maryland
Sima Asgari, University of Maryland, College Park
Marvin V. Zelkowitz, University of Maryland, College Park; Fraunhofer Center, Maryland
Jeffrey K. Hollingsworth, University of Maryland, College Park
Forrest Shull, Fraunhofer Center, Maryland
Jeffrey Carver, Mississippi State University
Martin Voelp, University of Maryland, College Park
Nico Zazworka, University of Maryland, College Park
Philip Johnson, University of Hawaii


Additional classroom results include the following:

  1. Measuring productivity in the HPC domain is part of understanding HPC workflows. But what does productivity mean in this domain? 6 Figure 3 shows one model, derived from the observation that the critical attribute of an HPC program is the speedup it achieves on a multiprocessor HPC machine relative to a single processor. 7 Productivity is defined as the relative speedup of a program on an HPC machine (compared to a single processor) divided by the relative effort to produce the HPC version of the program (compared to a single-processor version).
    Program              1      2*     3      4      5
    Serial effort (hrs)  3      7      5      15     –
    Total effort (hrs)   16     29     10     34.5   22
    Serial exec (sec)    123.2  75.2   101.5  80.1   31.1
    Parallel exec (sec)  47.7   15.8   12.8   11.2   8.5
    Speedup              1.58   4.76   5.87   6.71   8.90
    Relative effort      2.29   4.14   1.43   4.93   3.14
    Productivity         0.69   1.15   4.11   1.36   2.83
    * Reference serial implementation
    Table 3. Productivity experiment: Game of Life.
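    The definition can be checked against Table 3 directly. A minimal Python sketch, assuming program 2's serial implementation (7 hours of effort, 75.2 seconds of execution) is the reference point for both relative speedup and relative effort:

```python
# Productivity = relative speedup / relative effort (see the definition above).
# Assumption for this sketch: program 2's serial implementation is the
# reference for both speedup and effort, which reproduces Table 3's figures.
SERIAL_EFFORT = 7.0   # hours, reference serial implementation
SERIAL_EXEC = 75.2    # seconds, reference serial implementation

# program -> (total effort in hours, parallel execution time in seconds)
programs = {1: (16.0, 47.7), 2: (29.0, 15.8), 3: (10.0, 12.8),
            4: (34.5, 11.2), 5: (22.0, 8.5)}

for prog, (total_effort, parallel_exec) in programs.items():
    speedup = SERIAL_EXEC / parallel_exec       # relative speedup
    rel_effort = total_effort / SERIAL_EFFORT   # relative effort
    productivity = speedup / rel_effort
    print(f"program {prog}: speedup {speedup:.2f}, "
          f"relative effort {rel_effort:.2f}, productivity {productivity:.2f}")
```

    Running this closely reproduces the Speedup, Relative effort, and Productivity rows of Table 3 (e.g., program 1: speedup 1.58, relative effort 2.29, productivity 0.69).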

    Table 3 shows the results for one group of students programming the Game of Life, a simple nearest-neighbor cellular automaton in which the next generation of “life” depends upon the surrounding cells in a grid, and a popular first parallel program for HPC classes. 8 In this data, our definition of productivity correlated negatively with both total effort and HPC execution time, and positively with relative speedup. While the sample size is too small for a test of significance, these relationships all indicate that the measure behaves as a productivity measure should for HPC programs: good productivity means lower total effort, lower HPC execution time, and higher speedup.
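    For readers unfamiliar with the assignment, one generation of the Game of Life rule can be sketched as follows; the grid representation and boundary handling here are illustrative choices, not the students' code:

```python
def life_step(grid):
    """One generation of the Game of Life on a 2-D list of 0/1 cells.

    Each cell's next state depends only on its eight nearest neighbors,
    which is what makes the problem a natural first parallel exercise:
    rows (or blocks) of the grid can be updated independently.
    This sketch simply treats out-of-range neighbors as dead.
    """
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            live = sum(grid[i + di][j + dj]
                       for di in (-1, 0, 1) for dj in (-1, 0, 1)
                       if (di, dj) != (0, 0)
                       and 0 <= i + di < rows and 0 <= j + dj < cols)
            nxt[i][j] = 1 if live == 3 or (grid[i][j] and live == 2) else 0
    return nxt

# A "blinker" oscillates between a vertical and a horizontal bar:
blinker = [[0, 1, 0],
           [0, 1, 0],
           [0, 1, 0]]
print(life_step(blinker))  # → [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
```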

    Serial MPI OpenMP Co-Array Fortran StarP XMT
    Nearest-Neighbor Type Problems
    Game of Life C3A3 C3A3
    Grid of Resistors C2A2 C2A2 C2A2 C2A2
    Sharks & Fishes C6A2 C6A2 C6A2
    Laplace’s Eq. C2A3 C2A3
    SWIM C0A2
    Broadcast Type Problems
    LU Decomposition C4A1
    Parallel Mat-vec C3A4
    Quantum Dynamics C7A1
    Embarrassingly Parallel Type Problems
    Buffon-Laplace Needle C2A1
    Parallel Sorting C3A2 C3A2 C3A2
    Array Compaction C5A1
    Randomized Selection C5A2
    Table 4. Some of the early classroom experiments on specific architectures.

    Table 4 shows the distribution of programming assignments across different programming models for the first seven classes, using the same CxAy coding as Table 2. Multiple instances of the same programming assignment make the results amenable to meta-analysis, so that larger populations of students can be considered. 2

    Dataset  Programming Model  Application   Lines of Code
    C3A3     Serial             Game of Life  mean 175, sd 88, n=10
    C3A3     MPI                Game of Life  mean 433, sd 486, n=13
    C3A3     OpenMP             Game of Life  mean 292, sd 383, n=14
    C2A2     Serial             Resistors     42 (given)
    C2A2     MPI                Resistors     mean 174, sd 75, n=9
    C2A2     OpenMP             Resistors     mean 49, sd 3.2, n=10
    Table 5. MPI program size compared to OpenMP program size.

    For example, we can use this data to partially address an earlier stated hypothesis (Hyp. 4: An MPI implementation will require more code than an OpenMP implementation). Table 5 shows the relevant data, which lends credibility to this hypothesis, although the early data is not yet statistically significant.
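    The significance caveat can be examined from the summary statistics in Table 5 alone. As an illustrative check (not the analysis used in the study), Welch's t statistic for the C3A3 MPI-versus-OpenMP comparison:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic from summary statistics (unequal variances)."""
    return (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# C3A3 lines of code from Table 5: MPI (433, 486, 13) vs. OpenMP (292, 383, 14)
t = welch_t(433, 486, 13, 292, 383, 14)
print(f"t = {t:.2f}")  # ≈ 0.83
```

    With t ≈ 0.83 on roughly 23 degrees of freedom (well below the ≈2.07 needed for p < .05), the large variance among the MPI programs keeps the difference in means well short of significance, consistent with the caution above.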

  2. An alternative parallel programming model is the PRAM model, which supports fine-grained parallelism and has a substantial history of algorithmic theory. 9 XMT-C is an extension of the C language with parallel directives that present a PRAM-like model to the programmer, and a prototype compiler generates code that runs on a simulator for an XMT architecture. We conducted a classroom feasibility study comparing the effort required to solve a particular problem in XMT-C and in MPI. On average, students required approximately 50% less effort with XMT-C than with MPI, a reduction that was statistically significant at p < .05 using a t-test. 10
  3. While OpenMP generally required less effort than MPI (Figure 4), the comparison of defects between the two did not yield statistically significant results, and thus did not support the common belief that shared-memory programs are harder to debug. However, our defect data was collected from programmer-supplied effort forms, which we know are not very accurate. This motivated the defect analysis mentioned previously, 5 in which we intend to analyze the defects made more thoroughly.


    Figure 4. Time saved using OpenMP over MPI for 10 programs. (MPI used less time only in case 1 above).
  4. We are collecting low-level behavioral data from developers in order to understand the "workflows" that exist during HPC software development. A useful representation of HPC workflow could both help characterize the bottlenecks that occur during development and support a comparative analysis of how different tools and technologies affect workflow. One hypothesis we are studying is that the workflow can be divided into five states: serial coding, parallel coding, testing, debugging, and optimization.

    In a pilot study at the University of Hawaii in Spring of 2006, students worked on the Gauss-Seidel iteration problem using C and PThreads in a development environment that included automated collection of editing, testing, and command line data using Hackystat. We were able to automatically infer the "serial coding" workflow state as the editing of a file not containing any parallel constructs (such as MPI, OpenMP, or PThread calls), and the "parallel coding" workflow state as the editing of a file containing these constructs. We were also able to automatically infer the "testing" state as the occurrence of unit test invocation using the CUTest tool. In our pilot study, we were not able to automatically infer the debugging or optimization workflow states, as students were not provided with tools to support either of these activities that we could instrument.
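    The inference rules just described are simple enough to sketch in code. The event format and keyword lists below are assumptions made for illustration; Hackystat's actual sensor data is richer than this:

```python
# Illustrative sketch of the workflow-state inference described above.
# The event dictionaries and marker strings are assumptions for this
# example, not Hackystat's real data format.
PARALLEL_MARKERS = ("MPI_", "omp_", "#pragma omp", "pthread_")

def infer_state(event):
    """Map a raw development event to a coarse workflow state."""
    if event["type"] == "test":            # e.g., a CUTest invocation
        return "testing"
    if event["type"] == "edit":
        if any(m in event["text"] for m in PARALLEL_MARKERS):
            return "parallel coding"       # file contains parallel constructs
        return "serial coding"             # no parallel constructs seen
    return "unknown"  # debugging/optimization could not be instrumented

events = [
    {"type": "edit", "text": "for (i = 0; i < n; i++) a[i] = b[i];"},
    {"type": "edit", "text": "pthread_create(&tid, NULL, worker, NULL);"},
    {"type": "test", "text": "run CUTest suite"},
]
print([infer_state(e) for e in events])
# → ['serial coding', 'parallel coding', 'testing']
```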

    Our analysis of these results leads us to conclude that workflow inference may be possible in an HPC context. We hypothesize that it may actually be easier to infer these kinds of workflow states in a professional setting, since more sophisticated tool support is often available that can help reveal the intent of a development activity. Our analyses also lead us to question whether the five states we initially selected are appropriate for all HPC development contexts. There may be no "one size fits all" set of workflow states, and we may need to define a custom set of states for different HPC organizations in order to achieve our goals.

    Additional early classroom results are given in Hochstein, et al. 11.


Reference this article
Hochstein, L., Nakamura, T., Basili, V. R., Asgari, S., Zelkowitz, M. V., Hollingsworth, J. K., Shull, F., Carver, J., Voelp, M., Zazworka, N., Johnson, P. "Experiments to Understand HPC Time to Development," CTWatch Quarterly, Volume 2, Number 4A, November 2006. http://www.ctwatch.org/quarterly/articles/2006/11/experiments-to-understand-hpc-time-to-development/
