CTWatch
November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Pedro C. Diniz, Information Sciences Institute, University of Southern California
Tejus Krishna, Information Sciences Institute, University of Southern California

Abstract

The standard approach to the problem of application behavior understanding relies on instruction-level instrumentation. This approach generates enormous volumes of data that overwhelm programmers trying to understand the cause-effect relationships in their applications and thereby improve the performance of their codes. This article describes an integrated compiler and run-time approach that extracts relevant program behavior information by judiciously instrumenting the source code and deriving performance metrics such as the range of array reference addresses, array access strides, or data reuse characteristics. Because these metrics are tied to source-level program constructs that programmers can reason about, they ultimately allow programmers to understand why the performance on a given machine is what it is. We describe the overall organization of our compiler and run-time instrumentation system and present preliminary results for a selected set of kernel codes. The approach allows programmers to derive a wealth of information about program behavior with a run-time overhead of less than 15% of the original code's execution time, making it attractive for instrumenting and analyzing codes with extremely long running times, where binary-level approaches are simply impractical.

1. Introduction and Motivation

Understanding the performance of modern high-performance computing machines has become notoriously difficult. These architectures expose to programmers different hardware mechanisms that often interact in unpredictable ways. To improve performance, programmers must resort to low-level instrumentation techniques, such as binary rewriting, to capture run-time execution data from which they hope to understand the performance behavior of the program. This approach, however, generates huge volumes of raw data at a level of abstraction that is seldom adequate for programmers to relate to the source code of their applications. Programmers are thus left guessing which style of coding is adequate for the compiler and the target architecture at hand. Worse, when either of these two components changes, the learning investment is wasted, as the process must be restarted.

In this work we describe an alternative approach that relies on high-level source code instrumentation guided by compiler analysis techniques. The basic idea is to judiciously instrument the application source code to extract various execution metrics that can be related to the source code. While some aspects of the compiler-architecture interaction are not directly modeled, the fundamental program behavior is retained at a level of abstraction that allows the compiler to relate the observed metrics to source-level program constructs programmers can understand. An added benefit is that the approach is isolated from compiler effects and idiosyncrasies, and thus provides a path to predicting program behavior on future architectures for which a compiler, or the machine itself, does not yet exist.

Consider as an illustrative example the code in Figure 1. For each invocation of the innermost loop, the computation accesses m elements of the array A with a stride of one and m elements of the array B with a stride dictated by the size of the rows of B. There are several key observations in this example. First, the accesses to the arrays A and B depend on the values of n and m, which are not known at compile time. Second, the range of addresses accessed through A is repeated across the iterations of the i loop. Finally, although some values are not known at compile time, such as the actual value of k, k is not modified throughout the execution of either loop and can therefore be safely considered loop invariant with respect to these two loops.


for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    ... = A[k][j];
    ... = B[j][i];
  }
}

saveLoopBounds(loop1, 0, n-1);
saveLoopBounds(loop2, 0, m-1);
saveAddrRange(Ref1, A[k][0], A[k][m-1]);   /* invariant with respect to i; hoisted outside both loops */
for (i = 0; i < n; i++) {
  saveAddrRange(Ref2, B[0][i], B[m-1][i]); /* range depends on i; recorded once per outer iteration */
  for (j = 0; j < m; j++) {
    ... = A[k][j];
    ... = B[j][i];
  }
}
Figure 1. Illustrative example: original source code (left) and instrumented source code (right).

A way to capture the behavior of the program in more detail is to instrument it at the source-code level, as shown on the right-hand side of Figure 1. Here we have augmented the original code with calls to instrumentation functions with the appropriate arguments. For instance, saveAddrRange(Ref1, A[k][0], A[k][m-1]) records that, at that particular execution point in the program, the range defined by the two arguments A[k][0] and A[k][m-1], spanning m elements, will be accessed. In this particular instrumentation, we take advantage of the fact that this range of addresses is clearly invariant with respect to the i loop and can thus be hoisted outside the two loops. In this example the loop bounds are symbolically constant and the array access functions are affine with respect to the loop index variables. With the results obtained by this instrumentation, we can learn valuable information regarding the range of addresses the computation accesses and its access patterns, and hence its locality of reference.
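For concreteness, below is a minimal sketch of what the run-time recording layer behind these calls might look like, assuming each instrumentation point is identified by an integer handle and that the generated calls pass the addresses of the first and last elements of each range (e.g., &A[k][0] and &A[k][m-1]); the record_t structure and the reportRanges helper are illustrative only, not part of the actual system.

#include <stdio.h>
#include <stdint.h>

enum { loop1, loop2, Ref1, Ref2, NUM_POINTS };   /* instrumentation-point handles (illustrative) */

typedef struct {
  uintptr_t lo, hi;   /* smallest and largest address (or loop bound) recorded so far */
  long      visits;   /* number of times this instrumentation call has executed       */
} record_t;

static record_t points[NUM_POINTS];

/* Record the address range covered by one array reference at this execution point. */
void saveAddrRange(int id, const void *first, const void *last) {
  record_t *r = &points[id];
  uintptr_t lo = (uintptr_t)first, hi = (uintptr_t)last;
  if (r->visits == 0 || lo < r->lo) r->lo = lo;
  if (r->visits == 0 || hi > r->hi) r->hi = hi;
  r->visits++;
}

/* Record the observed bounds of a loop. */
void saveLoopBounds(int id, long lower, long upper) {
  record_t *r = &points[id];
  if (r->visits == 0 || (uintptr_t)lower < r->lo) r->lo = (uintptr_t)lower;
  if (r->visits == 0 || (uintptr_t)upper > r->hi) r->hi = (uintptr_t)upper;
  r->visits++;
}

/* Dump the collected ranges and visit counts, e.g., at program exit. */
void reportRanges(void) {
  for (int id = 0; id < NUM_POINTS; id++)
    printf("point %d: range [%#lx, %#lx], %ld visit(s)\n",
           id, (unsigned long)points[id].lo,
           (unsigned long)points[id].hi, points[id].visits);
}

Combined with the element size and the recorded loop bounds, the range endpoints and visit counts are enough to reconstruct the stride and the number of distinct elements touched by each reference.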

Many other authors have addressed the problem of understanding and quantifying the amount of reuse in applications. In the context of scientific application codes, such as the SPECfp benchmark suite, Sherwood et al. [1] developed a low-level instruction instrumentation approach that identifies the phases of a given computation in terms of the similarities between executed instructions. Other authors have focused on cache locality metrics (e.g., [2][3][4]), whereas others use temporal reuse distance or stack distances [5][6]. Weinberg et al. [7] have implemented the locality scores described in this document using a very low-level instrumentation approach.

All these efforts have in common the fact that they directly instrument the addresses and instructions issued by a running executable at a very low level [8][9][10][11]. They offer the promise of high precision, provided one can cope with the enormous data volumes they generate or the substantial application slowdown. The approach described in this document explores the opposite end of the instrumentation spectrum: it does not rely on low-level instrumentation, but on high-level source code instrumentation. This approach is faster but comes at a loss in precision, with the additional handicap that it does not account for compiler effects, i.e., how the compiler has modified the generated code to include hardware-oriented execution transformations such as pre-fetching or speculation. The fundamental premise of this work is that it is possible, with little source-level instrumentation overhead, to capture a large amount of program execution information.
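As a rough, hypothetical illustration of this premise for the loop nest of Figure 1 (the numbers and the deriveStride helper below are illustrative, not taken from the system itself): a binary-level tracer would log on the order of 2·n·m individual data addresses, whereas the instrumented source executes only n+3 recording calls, yet the stride and volume of each reference can still be recovered from the recorded range endpoints and loop bounds.

#include <stdio.h>
#include <stdint.h>

/* Average stride (in bytes) of a reference whose recorded range spans the
   addresses [lo, hi] over 'elems' distinct elements. */
long deriveStride(uintptr_t lo, uintptr_t hi, long elems) {
  return (elems > 1) ? (long)((hi - lo) / (uintptr_t)(elems - 1)) : 0;
}

int main(void) {
  /* Hypothetical range recorded for Ref1 with m = 100 elements of type double:
     the two endpoints alone reveal the unit (8-byte) stride. */
  uintptr_t lo = 0x10000;
  uintptr_t hi = lo + 99 * sizeof(double);
  printf("Ref1 stride = %ld bytes\n", deriveStride(lo, hi, 100));   /* prints 8 */
  return 0;
}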


Reference this article
"A Compiler-guided Instrumentation for Application Behavior Understanding," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/a-compiler-guided-instrumentation-for-application-behavior-understanding/
