CTWatch Quarterly » SLOPE - A Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes

SLOPE - A Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes

Pedro C. Diniz, Information Sciences Institute, University of Southern California
Jeremy Abramson, Information Sciences Institute, University of Southern California

2. Technical Approach

We now describe the details of our technical approach to the problem of performance prediction and sensitivity analysis using high-level source code information using static compiler data and control- dependence analysis techniques.

2.1 Basic Analyses: Data-Flow Graph (CFG) Crtical Path (CP) Analysis

This analysis extracts the basic blocks at the source code level by inspection of the compiler intermediate representations in Whirl.⁷ Because the front-end of the compiler does perform some deconstruction of high-level constructs, most notably while-loops, it is not always possible to map the intermediate representation constructs back to source code constructs. Despite some of these shortcomings, the front-end does keep track of line number information that allows the tool to provide reasonably accurate feedback to the programmer.

For each basic block, the compiler extracts a data-flow graph accurately keeping track of the dependencies (true-, anti-, input- and output- dependences) via scalar variables and conservatively by considering that any reference to an array variable may induce a dependency. In some cases, the compiler can use data dependence analysis techniques (see e.g., ⁸) to disambiguate the references to arrays and thus eliminate false dependences in the DFG. In the current implementation, we make the optimistic assumption that arrays with distinct symbolic names are unaliased. While this assumption is clearly not realistic in the general case, it holds for the kernel code in our controlled experimental results.

Finally, we identify the loops of the code across the various basic blocks of a procedure to uncover basic and derived induction variables. This information is vital in determining array access stride information as explained below.

2.2 Data Access Pattern Analysis

In this analysis, the compiler extracts the affine relations between scalar variables in the array indexing functions whenever possible taking into account the basic and derived induction variables. For example, knowing the fact that the scalar variables i and j are both loop induction variables, the array reference a[i][j+1] will have as affine coefficients the first and second dimension, respectively the values (1,0,0) and (0,1,1), where the last element in each tuple corresponds to the constants 0 and 1 in the expressions i+0 and j+1. Using this access information and the layout of the array (i.e., either column-wise and row-wise), the analysis determines the stride information for each access and thus makes judgments about the latency of the corresponding memory operations. Regular memory accesses are very likely to hit the cache or reside in registers as a result of an aggressive pre-fetching algorithm⁹ whereas an irregular or random memory reference is very likely to miss the cache. The construction of the DFG uses this knowledge to decorate the latency of the individual array accesses as either regular or irregular thus taking into account, to some extent, part of the memory hierarchy effects.

2.3 Scheduling Analysis

Using the extracted DFG, we then develop our own operation scheduler for determining the latency of the execution of each basic block. In this scheduler we can program the latencies of the individual operation as well as if they are executed in a pipelined fashion or not. Our scheduler also allows us to specify the number of functional units for either each individual type of operations or for a generic functional unit. For example, we can segregate the arithmetic and floating-point operations in a single functional unit or allow all of them to be executed in a generic functional unit with integer and floating-point operations. We can also specify multiple load and store units thus modeling the available bandwidth of the target architecture. Finally, we assume the scheduler is an on-line as-soon-as-possible scheduling algorithm with zero-time overhead in scheduling of the various operations in the various functional units.

For simplicity, in this first implementation, we assume the target architecture has enough registers available to support the manipulation of the various arithmetic and addressing operations. This assumption, while not unrealistic for basic blocks with a small number of statements, is clearly unrealistic for larger basic blocks that result from the application of program transformations such as loop unrolling. As such, the performance expectations derived using these assumptions should be viewed as an upper bound. For the same reason, we have folded the information about the data access pattern of array accesses in the scheduling by assuming that accesses that exhibit a very regular, strided data access pattern will have a low memory latency and those that are very irregular in nature will exhibit a high latency. This is because the first set is likely to hit the cache while the second is not.

Our scheduler, though simple, allows us to anticipate the completion time of the operations corresponding to a given basic block, along with various efficiency metrics, such as the number of clock cycles a given computation was stalled by while awaiting an available functional unit or awaiting a data dependency of the code to be satisfied. In our reporting, we use generic and architecture independent performance metrics such as the number of operations per clock cycle or number of floating point operations per clock cycle, rather then focusing on FLOPs or CPIs.

Pages: 1 2 3 4 5

CTWatch is a collaborative effort				Sponsored By