We now present preliminary experimental results of the performance expectation and sensitivity analysis for a synthetic code, UMT2K, modeled after a real engineering application: a 3D, deterministic, multigroup photon transport code for unstructured meshes.6 We first describe the methodology followed in these experiments and then present and discuss our findings using our analysis approach.
We have built the basic analysis described above in Section 2 using the Open64 compilation infrastructure.10 Our implementation takes an input source program file and focuses on its computationally intensive basic blocks. We used the analysis to extract the data-flow graph (DFG) and to generate performance expectation metrics for various combinations of architectural elements. We also applied manual unrolling to the significant loops in the kernel code as a way to compare the expected performance of the various code variants, given the potential increase in instruction-level parallelism.
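To make the notion of a performance expectation concrete, the sketch below shows one simplified way such a metric could be derived from a basic block's DFG: the larger of the latency-weighted critical path (a dependence bound) and a resource-constrained bound for a given number of non-pipelined functional units. This is an illustrative sketch only; the data structures and function names are hypothetical and do not reflect the actual Open64-based implementation.

```c
/*
 * Illustrative sketch only: a simplified performance-expectation estimate
 * for a basic block, taken as the larger of (a) the latency-weighted
 * critical path through the DFG and (b) a resource-constrained bound for
 * a given number of non-pipelined functional units.
 */
#include <stdio.h>

typedef struct {
    int latency;     /* operation latency (notional cycles)          */
    int npred;       /* number of DFG predecessors                   */
    int pred[8];     /* indices of predecessor nodes                 */
} DfgNode;

/* Longest latency-weighted path; nodes assumed in topological order. */
static int critical_path(const DfgNode *dfg, int n)
{
    int finish[512], best = 0;
    for (int i = 0; i < n; i++) {
        int start = 0;
        for (int p = 0; p < dfg[i].npred; p++)
            if (finish[dfg[i].pred[p]] > start)
                start = finish[dfg[i].pred[p]];
        finish[i] = start + dfg[i].latency;
        if (finish[i] > best)
            best = finish[i];
    }
    return best;
}

/* Throughput bound: total busy time divided across non-pipelined units. */
static int resource_bound(const DfgNode *dfg, int n, int units)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += dfg[i].latency;
    return (total + units - 1) / units;
}

int main(void)
{
    /* Tiny 3-node chain: load -> FP multiply -> FP divide. */
    DfgNode dfg[3] = {
        { .latency = 2,  .npred = 0 },
        { .latency = 50, .npred = 1, .pred = { 0 } },
        { .latency = 60, .npred = 1, .pred = { 1 } },
    };
    int cp = critical_path(dfg, 3);
    int rb = resource_bound(dfg, 3, 1);
    printf("expectation (cycles): %d\n", cp > rb ? cp : rb);
    return 0;
}
```

One plausible way to combine the two quantities, as shown above, is to take their maximum: the dependence structure limits performance when resources are plentiful, while the unit count limits it when independent operations abound.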
The computationally intensive section of the UMT2K code is the angular loop in the snswp3D subroutine. This loop contains a long basic block spanning about 300 lines of C code. This basic block, executed at each iteration of the loop, is the "core" of the computation and contains many high-latency floating-point operations, such as divides (4) and multiplies (41), as depicted in Table 1.
Table 1. Operation mix of the angular-loop basic block.

| Total Operations | FP Operations | Integer Operations | Load/Store Operations | FP Multiplies | FP Divides | Integer Multiplies |
|---|---|---|---|---|---|---|
| 272 | 93 | 95 | 84 | 41 | 4 | 22 |
In the next section we review some of the experimental results for this basic block, both in its unmodified form and in manually unrolled versions, to explore the impact of data dependences and of the number of arithmetic and load/store units on the projected performance. The unrolled versions expose many opportunities to explore pipelined and non-pipelined execution of the various functional units and individual operations.
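The sketch below illustrates the kind of manual unrolling applied; it is not the actual snswp3D loop, and the array names (psi, src, sigt) and loop body are hypothetical, chosen only to show how unrolling enlarges the basic block and exposes independent high-latency operations.

```c
/* Illustrative only: not the actual snswp3D code. */
void sweep(double *psi, const double *src, const double *sigt, int n)
{
    /* Original form: one FP divide per iteration; with a single,
     * non-pipelined divide unit the iterations effectively serialize. */
    for (int i = 0; i < n; i++)
        psi[i] = src[i] / sigt[i];
}

void sweep_unrolled2(double *psi, const double *src, const double *sigt, int n)
{
    /* Unrolled by two: the two divides in the enlarged basic block are
     * independent, so a pipelined or duplicated divide unit can overlap them. */
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        psi[i]     = src[i]     / sigt[i];
        psi[i + 1] = src[i + 1] / sigt[i + 1];
    }
    if (i < n)                      /* remainder iteration for odd n */
        psi[i] = src[i] / sigt[i];
}
```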
Table 2 below lists the latencies of the individual operations used in our approach. These latencies are notional rather than representative of a real system.
Table 2. Notional operation latencies used in the analysis.

| | Load (cache miss) | Load Address | 32-bit Int. Add | 32-bit Int. Multiply | 32-bit FP Multiply | 32-bit FP Divide | Array Address Calculation (Non-affine) | Array Address Calculation (Affine) |
|---|---|---|---|---|---|---|---|---|
| Latency | 20 | 2 | 1 | 12 | 50 | 60 | 33 | 13 |
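As a back-of-the-envelope illustration of how the operation mix of Table 1 and the notional latencies of Table 2 feed such an analysis, the sketch below computes a crude serialized bound for the arithmetic operations alone. It assumes one non-pipelined unit per operation class, treats the notional latencies as cycles, and ignores the dependence structure entirely; it is not the paper's actual metric.

```c
/* Back-of-the-envelope sketch (not the paper's metric): serialized cost of
 * the Table 1 arithmetic mix under the notional Table 2 latencies, assuming
 * one non-pipelined unit per operation class. */
#include <stdio.h>

int main(void)
{
    struct { const char *op; int count, latency; } mix[] = {
        { "FP multiply",      41, 50 },   /* counts from Table 1,   */
        { "FP divide",         4, 60 },   /* latencies from Table 2 */
        { "integer multiply", 22, 12 },
    };
    int total = 0;
    for (int i = 0; i < 3; i++) {
        int cost = mix[i].count * mix[i].latency;
        total += cost;
        printf("%-16s %3d x %2d = %5d cycles\n",
               mix[i].op, mix[i].count, mix[i].latency, cost);
    }
    printf("serialized arithmetic total: %d cycles\n", total);
    return 0;
}
```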