CTWatch
November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Sadaf R. Alam, Oak Ridge National Laboratory
Nikhil Bhatia, Oak Ridge National Laboratory
Jeffrey S. Vetter, Oak Ridge National Laboratory


The validation of an MA performance model is a two-stage process. When a model is initially being created, validation plays an important role in guiding the resolution of the model at various phases of the application. Later, the same model and validation technique can be used to validate against historical data and across the parameter space. The model verification output enables us to identify the most floating-point-intensive loop block in the CG benchmark. This loop block, shown in Figure 3, is called twice during each conjugate gradient iteration. The symbolic floating-point operation cost of the loop is approximately 2*na/(num_proc_cols*nonzer*ceiling(nonzer/nprows)).


    ! Sparse submatrix-vector multiply: w = A_local * p
    do j=1,lastrow-firstrow+1
       sum = 0.d0
       do k=rowstr(j),rowstr(j+1)-1
          sum = sum + a(k)*p(colidx(k))
       enddo
       w(j) = sum
    enddo

Figure 3. The partitioned submatrix-vector multiply loop.
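As a rough illustration of where the factor of two in the symbolic cost comes from, the sketch below runs the Figure 3 kernel on a tiny CSR matrix and counts one multiply and one add per inner iteration, so the empirical count can be compared against the model's "2 x local nonzeros" term. The matrix sizes, values, and the explicit flop counter are illustrative assumptions, not part of the benchmark or the MA toolkit.

    ! Self-contained sketch (not part of the benchmark or the MA toolkit):
    ! run the Figure 3 kernel on a tiny CSR matrix and count one multiply
    ! and one add per inner iteration. All sizes and values are illustrative.
    program matvec_flop_count
      implicit none
      integer, parameter :: nrows = 3, nnz = 6
      integer :: rowstr(nrows+1), colidx(nnz), j, k, flop_count
      double precision :: a(nnz), p(nrows), w(nrows), sum

      ! A small 3x3 sparse matrix in CSR form (illustrative values)
      rowstr = (/ 1, 3, 5, 7 /)
      colidx = (/ 1, 2, 1, 3, 2, 3 /)
      a = (/ 4.d0, 1.d0, 2.d0, 3.d0, 1.d0, 5.d0 /)
      p = (/ 1.d0, 2.d0, 3.d0 /)

      flop_count = 0
      do j = 1, nrows
         sum = 0.d0
         do k = rowstr(j), rowstr(j+1) - 1
            sum = sum + a(k) * p(colidx(k))
            flop_count = flop_count + 2      ! one multiply, one add
         enddo
         w(j) = sum
      enddo

      print '(a,i3,a,i3)', 'nonzeros:', nnz, '   counted flops:', flop_count
    end program matvec_flop_count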

Using the MA models, we generated the scaling of the floating-point operation cost of the loop block in Figure 3 relative to the other loop blocks within a conjugate gradient iteration. The model predictions are shown in Figure 4. The total cost of the two invocations of the submatrix-vector multiply operation accounts for a large fraction of the total floating-point operation cost. Loop L1 is the first loop block in the CG timestep iteration and L2 is the second. Figure 4 shows that the workload is not evenly distributed among the different loop blocks (or phases of calculation), and that the submatrix-vector multiply loop can be a serious bottleneck. Furthermore, as we scale the problem to a large number of processors, we begin to identify loops that either represent the Amdahl's-law (serial) fraction of the code or whose loop counts are directly proportional to the number of MPI tasks in the system. We found that the loop counts of loops 3 and 8 depend on the number of MPI tasks (log2(log2(MPI_tasks))), while loops 1 and 8 scale at a slower rate than loops 2 and 7 (the submatrix-vector multiply loops), since the cost of loops 2 and 7 is divided twice by the scaling parameters whereas the cost of loops 1 and 8 is divided only once. Another interesting feature is the scaling pattern, which is not linear because the mapping and distribution of the workload depend on ceiling(log2(MPI_tasks)).
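The standalone sketch below illustrates these scaling trends under assumed parameters, not output of the MA toolkit: a per-task cost divided once by the process-grid dimensions falls more slowly than one divided twice, while the iteration count of reduction-style loops grows with log2 of the number of MPI tasks. The base cost, the loop labels in the comments, and the grid decomposition are assumptions for illustration only.

    ! Standalone sketch of the scaling trends discussed above (assumed
    ! decomposition and costs): a cost term divided once by the process-grid
    ! dimensions falls more slowly than one divided twice, and
    ! reduction-style loop counts grow with log2(MPI_tasks).
    program ma_scaling_sketch
      implicit none
      integer :: k, ntasks, npcols, nprows, reduce_iters
      real    :: base_cost, cost_div_once, cost_div_twice

      base_cost = 1.0e9                       ! illustrative single-task flop count
      do k = 1, 10                            ! 2, 4, ..., 1024 MPI tasks
         ntasks = 2**k
         ! CG-style near-square grid: npcols = 2**ceiling(log2(ntasks)/2)
         npcols = 2**((k + 1) / 2)
         nprows = ntasks / npcols

         cost_div_once  = base_cost / real(npcols)         ! e.g. loops 1 and 8
         cost_div_twice = base_cost / real(npcols*nprows)  ! e.g. loops 2 and 7
         reduce_iters   = k                                ! grows with log2(ntasks)

         print '(i6,2f16.1,i6)', ntasks, cost_div_once, cost_div_twice, reduce_iters
      enddo
    end program ma_scaling_sketch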

We collected runtime data for the loop blocks in the CG timestep iteration on Cray XT3 [8] and IBM Blue Gene/L [9] systems to validate our workload distribution and scaling patterns. Figure 5 shows the percentage of runtime spent in individual loop blocks. Comparing it with the workload distribution in Figure 4, we observe not only a similar workload distribution but also a similar scaling pattern. Note that message-passing communication times are not included in these runtime measurements. We also collected data for the Class D CG benchmark on the XT3 system, which further validates the floating-point and message count distribution and scaling characteristics generated by the symbolic MA models.
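The comparison behind Figures 4 and 5 can be sketched as follows: compute each loop block's share of the modeled floating-point work and its share of the measured runtime, then check that the two distributions agree. The loop count, modeled flop totals, and timings below are placeholder values, not measured data, and the workflow is an assumed simplification of the validation step.

    ! Minimal validation sketch (assumed workflow, placeholder numbers):
    ! compare each loop block's share of modeled floating-point work with
    ! its share of measured runtime, as in comparing Figures 4 and 5.
    program ma_validate_sketch
      implicit none
      integer, parameter :: nloops = 4
      real :: model_flops(nloops)  = (/ 5.0e8, 4.0e9, 1.0e7, 4.0e9 /)  ! modeled flops per loop
      real :: measured_sec(nloops) = (/ 0.11,  0.92,  0.01,  0.95  /)  ! measured seconds per loop
      real :: flop_share, time_share
      integer :: i

      do i = 1, nloops
         flop_share = 100.0 * model_flops(i)  / sum(model_flops)
         time_share = 100.0 * measured_sec(i) / sum(measured_sec)
         print '(a,i2,2(a,f6.1),a)', 'loop ', i, ': model ', flop_share, &
               '% of flops, measured ', time_share, '% of runtime'
      enddo
    end program ma_validate_sketch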

Figure 4. Distribution of floating-point operation cost within a time step iteration in the NAS CG benchmark.

Figure 5. Percentage of total runtime spent in individual loop blocks in a CG iteration.


Reference this article
"Symbolic Performance Modeling of HPCS Applications," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/symbolic-performance-modeling-of-hpcs-applications/
