November 2006 B
High Productivity Computing Systems and the Path Towards Usable Petascale Computing
Pedro C. Diniz, Information Sciences Institute, University of Southern California
Tejus Krishna, Information Sciences Institute, University of Southern California

4.3. Discussion

The results presented here, obtained by combining static and dynamic program behavior information, are very encouraging. First, the spatial and temporal locality scores are qualitatively identical to the results reported for the same kernel codes by Snavely et al.,7 as depicted by the black diamonds in figure 4. The small difference for the random-access kernel stems from the fact that different implementations, built with different compilers, handle table (array) accesses slightly differently. Second, these results rely on high-level instrumentation with much lower overhead, making it feasible to instrument larger-scale applications and run them to completion in far more realistic settings than is possible today with low-level instrumentation approaches. Lastly, this approach yields a much clearer picture of program behavior without cluttering or overwhelming the user with the volume of data that low-level approaches generate. For instance, using the approach described here we can determine not only the values of the spatial and temporal locality metrics but also which dynamic references contribute to them. While low-level instrumentation approaches can also identify the sources of program behavior, they lack the connection to the source code that this approach inherently preserves.
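To make the locality scoring concrete, the following is a minimal sketch of how a spatial locality score might be computed from the stride information that source-level instrumentation records for each array reference. The function name and the 1/stride weighting are illustrative assumptions, not the authors' actual implementation; the idea is simply that unit-stride (contiguous) access scores near 1 and scattered access scores near 0.

```python
# Hypothetical sketch: scoring spatial locality from recorded strides.
# Assumes instrumentation has logged, per array reference, the stride
# (in elements) between consecutive accesses.
from collections import Counter

def spatial_locality_score(strides):
    """Weight each observed stride by 1/stride, so unit-stride
    access contributes 1.0 and large strides contribute near 0."""
    counts = Counter(abs(s) for s in strides if s != 0)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n / s for s, n in counts.items()) / total

# A mostly unit-stride sweep scores high...
print(spatial_locality_score([1] * 9 + [64]))
# ...while a scattered (random-access) pattern scores low.
print(spatial_locality_score([512, 97, 1024, 33]))
```

Because the instrumentation operates at the source level, each stride sample can be tagged with the array name and source line that produced it, which is precisely the source-code connection the discussion above highlights.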

5. Conclusion

Understanding the behavior of a program is a long-standing and sought-after goal. The traditional approach relies on low-level instrumentation techniques to extract performance or execution metrics that can indicate why a given program performs as it does. The high overhead of this approach, the inherent Heisenberg effects of instrumentation, and the sheer volume of generated data make it difficult to correlate program behavior with source code constructs. In this article we have described an integrated compiler and run-time approach that extracts relevant program behavior information by judiciously instrumenting the source code and deriving performance metrics such as the range of array reference addresses, stride, and reuse distance information. We have illustrated the application of this approach by instrumenting several kernel codes for spatial and temporal locality scoring. For these kernel codes, the approach derives concrete values for the locality metrics that are qualitatively identical to the scores obtained by low-level instrumentation of the code, but at a much lower execution-time cost. This suggests the approach can be extremely valuable in allowing large codes to execute in realistic settings.
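As a worked illustration of one of the metrics named above, the sketch below computes reuse distances from a short address trace. The reuse distance of an access is the number of distinct addresses touched since the previous access to the same address; small distances indicate good temporal locality. This quadratic formulation is for clarity only and is an assumption of this sketch, not the authors' instrumentation code, which would use a far more efficient online data structure.

```python
# Hypothetical sketch: deriving reuse distances from an address trace.
# Reuse distance = number of distinct addresses referenced between two
# consecutive accesses to the same address.

def reuse_distances(trace):
    last_seen = {}   # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses touched since the previous access.
            window = set(trace[last_seen[addr] + 1 : i])
            distances.append(len(window))
        last_seen[addr] = i
    return distances

trace = [0x10, 0x20, 0x10, 0x30, 0x40, 0x20]
print(reuse_distances(trace))  # -> [1, 3]
```

A histogram of these distances, gathered per source-level array reference, is what allows temporal locality to be scored and attributed back to specific program constructs.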

1Sherwood, T., Sair, S., Calder, B. “Phase Tracking and Prediction,” In Proceedings of the 30th Intl. Symp. on Computer Architecture (ISCA), ACM Press, June 2003.
2Saavedra, R., Smith, A. “Analysis of Benchmark Characteristics and Benchmark Performance Prediction,” ACM Trans. on Computer Systems, Vol. 14, No. 4, pp. 344-384, 1996.
3Agrawala, A., Mohr, J. “A Model for Workload Characterization,” in Proceedings of the 1975 Symposium on Simulation of Computer Systems, August 1975.
4Wong, W., Morris, R. “Benchmark Synthesis Using the LRU Cache Hit Function,” IEEE Trans. on Computers, Vol 37, No. 6, June 1988.
5Strohmaier, E., Shan, H. “Architecture Independent Performance Characterization and Benchmarking for Scientific Applications,” In Proceedings of the Intl. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’04), IEEE Computer Society Press, pp. 467-474, 2004.
6Brehob, M., Enbody, R. “An Analytical Model of Locality and Caching,” Technical Report, Michigan State University, MSU-CSE-99-31.
7Weinberg, J., McCracken, M., Strohmaier, E., Snavely, A. “Quantifying Locality In The Memory Access Patterns of HPC Applications,” In Proceedings of the ACM/IEEE Supercomputing Conference (SC’05), 2005.
8Malony, A., Shende, S., Trebon, N., Ray, J., Armstrong, R., Rasmussen, C., Sottile, M. "Performance Technology for Parallel and Distributed Component Software," Concurrency and Computation: Practice and Experience, Vol. 17, Issue 2-4, pp. 117-141, John Wiley & Sons, Ltd., Feb - Apr, 2005.
9Wang, M., Ailamaki, A., Faloutsos, C. “Capturing the Spatio-Temporal Behavior of Real Traffic Data Performance,” 2002 IFIP Intl. Symp. on Comp. Performance Modeling, Measurement, and Evaluation, Rome, Italy.
10Nagel, W., Arnold, A., Weber, M., Hoppe, H. C., Solchenbach, K. “VAMPIR: Visualization and Analysis of MPI Resources,” Supercomputer 1996; 12(1): 69–80.
11Shende, S., Malony, A. D. "The TAU Parallel Performance System," International Journal of High Performance Computing Applications, 20(2): 287-331, Summer 2006.
12Open64 Open Source Compiler Infrastructure – open64.sourceforge.net/


Reference this article
"A Compiler-guided Instrumentation for Application Behavior Understanding," CTWatch Quarterly, Volume 2, Number 4B, November 2006 B. http://www.ctwatch.org/quarterly/articles/2006/11/a-compiler-guided-instrumentation-for-application-behavior-understanding/
