Additional classroom results include the following:
- Measuring productivity in the HPC domain is part of understanding HPC workflows. But what does productivity mean in this domain [6]? Figure 3 shows one model, derived from the fact that the critical property of an HPC program is the speedup achieved by using a multiprocessor HPC machine over a single processor [7]. We define productivity as the relative speedup of a program on an HPC machine (compared to a single processor) divided by the relative effort to produce the HPC version of the program (compared to a single-processor version).
| Program | 1 | 2* | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Serial effort (hrs) | 3 | 7 | 5 | 15 | — |
| Total effort (hrs) | 16 | 29 | 10 | 34.5 | 22 |
| Serial exec. (sec) | 123.2 | 75.2 | 101.5 | 80.1 | 31.1 |
| Parallel exec. (sec) | 47.7 | 15.8 | 12.8 | 11.2 | 8.5 |
| Speedup | 1.58 | 4.76 | 5.87 | 6.71 | 8.90 |
| Relative effort | 2.29 | 4.14 | 1.43 | 4.93 | 3.14 |
| Productivity | 0.69 | 1.15 | 4.11 | 1.36 | 2.83 |

*Reference serial implementation

Table 3 shows the results for one group of students programming the Game of Life (a simple nearest-neighbor cellular automaton problem in which the next generation of "life" depends upon the surrounding cells in a grid, and a popular first parallel program for HPC classes) [8]. The data show that our definition of productivity correlated negatively with both total effort and HPC execution time, and positively with relative speedup. While the sample size is too small for a test of significance, these relationships all indicate that the measure behaves as we would want a productivity measure to behave for HPC programs, i.e., good productivity means lower total effort, lower HPC execution time, and higher speedup.
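The definition can be made concrete with a short calculation. The sketch below (hypothetical variable names) reproduces the Table 3 figures for program 1, assuming, as the table's arithmetic indicates, that both speedup and relative effort are computed against the reference serial implementation (program 2*):

```python
# Productivity = relative speedup / relative effort, per the definition above.
# Baseline values come from the reference serial implementation (program 2*).
ref_serial_exec = 75.2    # reference serial execution time (sec)
ref_serial_effort = 7.0   # reference serial development effort (hrs)

# Program 1 from Table 3.
parallel_exec = 47.7      # parallel execution time (sec)
total_effort = 16.0       # total development effort (hrs)

speedup = ref_serial_exec / parallel_exec            # 75.2 / 47.7 ~= 1.58
relative_effort = total_effort / ref_serial_effort   # 16 / 7 ~= 2.29
productivity = speedup / relative_effort             # ~= 0.69

print(round(speedup, 2), round(relative_effort, 2), round(productivity, 2))
```

Running the same calculation for the other four programs reproduces the remaining Speedup, Relative effort, and Productivity rows of Table 3.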
| Problem | Serial | MPI | OpenMP | Co-Array Fortran | StarP | XMT |
|---|---|---|---|---|---|---|
| Nearest-neighbor type problems | | | | | | |
| Game of Life | C3A3 | C3A3, C0A1, C1A1 | C3A3 | | | |
| Grid of Resistors | C2A2 | C2A2 | C2A2 | | C2A2 | |
| Sharks & Fishes | C6A2 | C6A2 | C6A2 | | | |
| Laplace's Eq. | | C2A3 | C2A3 | | | |
| SWIM | | C0A2 | | | | |
| Broadcast type problems | | | | | | |
| LU Decomposition | | C4A1 | | | | |
| Parallel Mat-vec | | C3A4 | | | | |
| Quantum Dynamics | | C7A1 | | | | |
| Embarrassingly parallel type problems | | | | | | |
| Buffon-Laplace Needle | C2A1, C3A1 | C2A1, C3A1 | C2A1, C3A1 | | | |
| Other | | | | | | |
| Parallel Sorting | C3A2 | C3A2 | C3A2 | | | |
| Array Compaction | | | | | | C5A1 |
| Randomized Selection | | | | | | C5A2 |

Table 4 shows the distribution of programming assignments across different programming models for the first seven classes (using the same CxAy coding used in Table 2). Multiple instances of the same programming assignment lend the results to meta-analysis, allowing us to consider larger populations of students [2].
| Dataset | Programming Model | Application | Lines of Code |
|---|---|---|---|
| C3A3 | Serial | Game of Life | mean 175, sd 88, n=10 |
| C3A3 | MPI | Game of Life | mean 433, sd 486, n=13 |
| C3A3 | OpenMP | Game of Life | mean 292, sd 383, n=14 |
| C2A2 | Serial | Resistors | 42 (given) |
| C2A2 | MPI | Resistors | mean 174, sd 75, n=9 |
| C2A2 | OpenMP | Resistors | mean 49, sd 3.2, n=10 |

For example, we can use these data to partially address an earlier stated hypothesis (Hyp. 4: an MPI implementation will require more code than an OpenMP implementation). Table 5 shows the relevant data, which lend credibility to this hypothesis, although the early results are not yet statistically significant.
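One way to check significance from Table 5's summary statistics alone is a Welch's t-test computed from the means, standard deviations, and sample sizes. The sketch below is an illustration, not part of the original analysis; it uses the C3A3 Game of Life lines-of-code figures for MPI versus OpenMP:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and approximate degrees of freedom
    (Welch-Satterthwaite) from summary statistics."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# C3A3 Game of Life lines of code (Table 5): MPI vs. OpenMP
t, df = welch_t(433, 486, 13, 292, 383, 14)
print(round(t, 2), round(df, 1))  # t ~ 0.83, far below the ~2.07 needed at p < .05
```

The large standard deviations relative to the mean difference are what keep the result short of significance, consistent with the text's caveat.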
- An alternative parallel programming model is the PRAM model, which supports fine-grained parallelism and has a substantial history of algorithmic theory [9]. XMT-C is an extension of the C language with parallel directives that provide a PRAM-like model to the programmer. A prototype compiler exists that generates code for a simulator of an XMT architecture. We conducted a feasibility study in a class to compare the effort required to solve a particular problem in XMT-C and in MPI. On average, students required approximately 50% less effort to solve the problem using XMT-C than using MPI, a reduction that was statistically significant at the p < .05 level using a t-test [10].
- While OpenMP generally required less effort than MPI (Figure 4), the comparison of defects between MPI and OpenMP did not yield statistically significant results, contradicting a common belief that shared-memory programs are harder to debug. However, our defect data collection was based upon programmer-supplied effort forms, which we know are not very accurate. This led to the defect analysis mentioned previously [5], in which we intend to perform a more thorough analysis of the defects made.
- We are collecting low-level behavioral data from developers in order to understand the "workflows" that exist during HPC software development. A useful representation of HPC workflow could both help characterize the bottlenecks that occur during development and support a comparative analysis of the impact of different tools and technologies on workflow. One hypothesis we are studying is that the workflow can be divided into five states: serial coding, parallel coding, testing, debugging, and optimization.
In a pilot study at the University of Hawaii in Spring of 2006, students worked on the Gauss-Seidel iteration problem using C and PThreads in a development environment that included automated collection of editing, testing, and command line data using Hackystat. We were able to automatically infer the "serial coding" workflow state as the editing of a file not containing any parallel constructs (such as MPI, OpenMP, or PThread calls), and the "parallel coding" workflow state as the editing of a file containing these constructs. We were also able to automatically infer the "testing" state as the occurrence of unit test invocation using the CUTest tool. In our pilot study, we were not able to automatically infer the debugging or optimization workflow states, as students were not provided with tools to support either of these activities that we could instrument.
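The file-based inference rule described above can be sketched as a simple classifier. This is an illustrative reconstruction, not the actual Hackystat analysis; the construct patterns and function name are assumptions:

```python
import re

# Hypothetical markers of parallel code; the real analysis similarly
# looked for MPI, OpenMP, and PThread constructs in the edited file.
PARALLEL_CONSTRUCTS = re.compile(
    r"MPI_\w+"             # MPI calls, e.g. MPI_Send
    r"|#\s*pragma\s+omp"   # OpenMP directives
    r"|pthread_\w+"        # PThread calls
)

def infer_coding_state(file_contents: str) -> str:
    """Classify an editing event as 'parallel coding' if the edited file
    contains any parallel construct, otherwise 'serial coding'."""
    if PARALLEL_CONSTRUCTS.search(file_contents):
        return "parallel coding"
    return "serial coding"

print(infer_coding_state("for (i = 0; i < n; i++) sum += a[i];"))   # serial coding
print(infer_coding_state("#pragma omp parallel for\nfor (...) ...")) # parallel coding
```

A testing state would be inferred analogously from tool invocations (unit-test runs) rather than file contents, which is why debugging and optimization, lacking instrumented tools in the pilot, could not be inferred.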
Our analysis of these results leads us to conclude that workflow inference may be possible in an HPC context. We hypothesize that it may actually be easier to infer these kinds of workflow states in a professional setting, since more sophisticated tool support is often available that can help infer the intent of a development activity. Our analyses also lead us to question whether the five states we initially selected are appropriate for all HPC development contexts. There may be no "one size fits all" set of workflow states; we may need to define a custom set of states for different HPC organizations in order to achieve our goals.
Additional early classroom results are given in Hochstein et al. [11].