We now present preliminary experimental results of the performance expectation and sensitivity analysis for a synthetic code, UMT2K, modeled after a real engineering application: a 3D, deterministic, multigroup photon transport code for unstructured meshes.6 We first describe the methodology followed in these experiments and then present and discuss our findings using our analysis approach.
We have built the basic analysis described above in Section 2 using the Open64 compilation infrastructure.10 Our implementation takes an input source program file and focuses on its computationally intensive basic blocks. We used the analysis to extract the data-flow graph (DFG) and to generate performance expectation metrics for various combinations of architectural elements. We also applied manual unrolling to the significant loops in the kernel code as a way to compare the expected performance of the various code variants, given the potential increase in instruction-level parallelism.
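To make the notion of a performance expectation concrete, the sketch below shows one simplified way such a metric could be derived from a basic block's DFG: the larger of the latency-weighted critical path (a dependence bound) and a resource-constrained bound for a given number of non-pipelined functional units. This is an illustrative sketch only; the data structures and function names are hypothetical and do not reflect the actual Open64-based implementation.

```c
/*
 * Illustrative sketch only: a simplified performance-expectation estimate
 * for a basic block, taken as the larger of (a) the latency-weighted
 * critical path through the DFG and (b) a resource-constrained bound for
 * a given number of non-pipelined functional units.
 */
#include <stdio.h>

typedef struct {
    int latency;     /* operation latency (notional cycles)          */
    int npred;       /* number of DFG predecessors                   */
    int pred[8];     /* indices of predecessor nodes                 */
} DfgNode;

/* Longest latency-weighted path; nodes assumed in topological order. */
static int critical_path(const DfgNode *dfg, int n)
{
    int finish[512], best = 0;
    for (int i = 0; i < n; i++) {
        int start = 0;
        for (int p = 0; p < dfg[i].npred; p++)
            if (finish[dfg[i].pred[p]] > start)
                start = finish[dfg[i].pred[p]];
        finish[i] = start + dfg[i].latency;
        if (finish[i] > best)
            best = finish[i];
    }
    return best;
}

/* Throughput bound: total busy time divided across non-pipelined units. */
static int resource_bound(const DfgNode *dfg, int n, int units)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += dfg[i].latency;
    return (total + units - 1) / units;
}

int main(void)
{
    /* Tiny 3-node chain: load -> FP multiply -> FP divide. */
    DfgNode dfg[3] = {
        { .latency = 2,  .npred = 0 },
        { .latency = 50, .npred = 1, .pred = { 0 } },
        { .latency = 60, .npred = 1, .pred = { 1 } },
    };
    int cp = critical_path(dfg, 3);
    int rb = resource_bound(dfg, 3, 1);
    printf("expectation (cycles): %d\n", cp > rb ? cp : rb);
    return 0;
}
```

One plausible way to combine the two quantities, as shown above, is to take their maximum: the dependence structure limits performance when resources are plentiful, while the unit count limits it when independent operations abound.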
The computationally intensive section of the UMT2K code is the angular loop in the snswp3D subroutine. This loop contains a long basic block spanning about 300 lines of C code. This basic block, executed at each iteration of the loop, is the "core" of the computation and contains many high-latency floating-point operations, such as divides (4) and multiplies (41), as depicted in Table 1.
Table 1. Operation mix of the angular-loop basic block.

| Total Operations | FP Operations | Integer Operations | Load/Store Operations | FP Multiplies | FP Divides | Integer Multiplies |
|---|---|---|---|---|---|---|
| 272 | 93 | 95 | 84 | 41 | 4 | 22 |
In the next section we review some of the experimental results for this basic block, both in its unmodified form and in manually unrolled versions, to explore the impact of data dependences and of the number of arithmetic and load/store units on the projected performance. The unrolled versions expose many opportunities to explore pipelined and non-pipelined execution of the various functional units and individual operations.
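The sketch below illustrates the kind of manual unrolling applied; it is not the actual snswp3D loop, and the array names (psi, src, sigt) and loop body are hypothetical, chosen only to show how unrolling enlarges the basic block and exposes independent high-latency operations.

```c
/* Illustrative only: not the actual snswp3D code. */
void sweep(double *psi, const double *src, const double *sigt, int n)
{
    /* Original form: one FP divide per iteration; with a single,
     * non-pipelined divide unit the iterations effectively serialize. */
    for (int i = 0; i < n; i++)
        psi[i] = src[i] / sigt[i];
}

void sweep_unrolled2(double *psi, const double *src, const double *sigt, int n)
{
    /* Unrolled by two: the two divides in the enlarged basic block are
     * independent, so a pipelined or duplicated divide unit can overlap them. */
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        psi[i]     = src[i]     / sigt[i];
        psi[i + 1] = src[i + 1] / sigt[i + 1];
    }
    if (i < n)                      /* remainder iteration for odd n */
        psi[i] = src[i] / sigt[i];
}
```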
Table 2 below lists the latencies of the individual operations used in our approach. These latencies are notional rather than representative of a real system.
Table 2. Notional operation latencies used in the analysis.

| | Load (cache miss) | Load Address | 32-bit Int. Add | 32-bit Int. Multiply | 32-bit FP Multiply | 32-bit FP Divide | Array Address Calculation (Non-affine) | Array Address Calculation (Affine) |
|---|---|---|---|---|---|---|---|---|
| Latency | 20 | 2 | 1 | 12 | 50 | 60 | 33 | 13 |
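As a back-of-the-envelope illustration of how the operation mix of Table 1 and the notional latencies of Table 2 feed such an analysis, the sketch below computes a crude serialized bound for the arithmetic operations alone. It assumes one non-pipelined unit per operation class, treats the notional latencies as cycles, and ignores the dependence structure entirely; it is not the paper's actual metric.

```c
/* Back-of-the-envelope sketch (not the paper's metric): serialized cost of
 * the Table 1 arithmetic mix under the notional Table 2 latencies, assuming
 * one non-pipelined unit per operation class. */
#include <stdio.h>

int main(void)
{
    struct { const char *op; int count, latency; } mix[] = {
        { "FP multiply",      41, 50 },   /* counts from Table 1,   */
        { "FP divide",         4, 60 },   /* latencies from Table 2 */
        { "integer multiply", 22, 12 },
    };
    int total = 0;
    for (int i = 0; i < 3; i++) {
        int cost = mix[i].count * mix[i].latency;
        total += cost;
        printf("%-16s %3d x %2d = %5d cycles\n",
               mix[i].op, mix[i].count, mix[i].latency, cost);
    }
    printf("serialized arithmetic total: %d cycles\n", total);
    return 0;
}
```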