CTWatch Quarterly » Performance Engineering: Understanding and Improving the Performance of Large-Scale Codes

Performance Engineering: Understanding and Improving the Performance of Large-Scale Codes

David H. Bailey, Lawrence Berkeley National Laboratory
Robert Lucas, University of Southern California
Paul Hovland, Argonne National Laboratory
Boyana Norris, Argonne National Laboratory
Kathy Yelick, Lawrence Berkeley National Laboratory
Dan Gunter, Lawrence Berkeley National Laboratory
Bronis de Supinski, Lawrence Livermore National Laboratory
Dan Quinlan, Lawrence Livermore National Laboratory
Pat Worley, Oak Ridge National Laboratory
Jeff Vetter, Oak Ridge National Laboratory
Phil Roth, Oak Ridge National Laboratory
John Mellor-Crummey, Rice University
Allan Snavely, University of California, San Diego
Jeff Hollingsworth, University of Maryland
Dan Reed, University of North Carolina
Rob Fowler, University of North Carolina
Ying Zhang, University of North Carolina
Mary Hall, University of Southern California
Jacque Chame, University of Southern California
Jack Dongarra, University of Tennessee, Knoxville
Shirley Moore, University of Tennessee, Knoxville

4. Application Engagement

The key long-term research objective of PERI is to automate as much of the performance tuning process as possible. Ideally, in five years we will produce a prototype of the kind of system that will free scientific programmers from the burden of tuning their codes, especially when simply porting from one system to another. While this may offer today’s scientific programmers hope for a brighter future, it does little to help with the immediate problems they face as they try to ready their codes for Petascale. PERI has therefore created a third activity that we are calling application engagement, wherein PERI researchers will bring their tools and skills to bear in order both to help DOE meet its performance objectives and to ground our own research in practical experience. This section discusses the current status of our application engagement activities.

PERI has a two-pronged application engagement strategy. Our first strategy is establishing long-term liaison relationships with many of the application teams. PERI liaisons who work with application teams without significant, immediate performance optimization needs provide these teams with advice on how to collect performance data and track performance evolution, and ensure that PERI becomes aware of any changes in these needs. For application teams with immediate performance needs, the PERI liaison works actively with the team to help them meet their needs, utilizing other PERI personnel as needed. The status of a PERI liaison activity, passive or active, changes over time as the performance needs of the application teams change. As of June 2007, PERI is working actively with six application teams and passively with ten others. The nature of each interaction is specific to each application team.

The other primary PERI application engagement strategy is tiger teams. A tiger team works directly with application teams with immediate, high-profile performance requirements. Our tiger teams, consisting of several PERI researchers, strive to improve application performance by applying the full range of PERI capabilities, including not only performance modeling and automated tuning research but also in-depth familiarity with today’s state-of-the-art performance analysis tools. Tiger team assignments are of a relatively short duration, lasting between 6 and 12 months. As of June 2007, PERI tiger teams are working with two application codes that will be part of the 2007 JOULE report: S3D¹⁸ and GTC_s.¹⁹ We have already identified significant opportunities for performance improvements for both applications. Current work is focused on providing these improvements through automated tools that support the continuing code evolution required by the JOULE criteria.

5. Summary

The Performance Engineering Research Institute was created to focus on the increasingly difficult problem of achieving high scientific throughput on large-scale computing systems. These performance challenges arise not only from the scale and complexity of leadership class computers, but also from the increasing sophistication of today’s scientific software. Experience has shown that scientists want to focus their programming efforts on discovery and do not want to be burdened by the need to constantly refine their codes to maximize performance. Performance tools that they can use themselves are not embraced, but rather viewed as a necessary evil.

To relieve scientists of the burden of performance tuning, PERI has embarked on a research program addressing three different aspects of performance tuning: performance modeling of applications and systems; automatic performance tuning; and application engagement and tuning. Our application engagement activities are intended to help scientists address today’s performance-related problems. We hope that our automatic performance tuning research will lead to technology that, in the future, will significantly reduce this burden. Performance modeling informs both of these activities.

While PERI is a new project, as are all SciDAC-2 efforts, it builds on five years of SciDAC-1 research and decades of prior art. We believe that PERI is off to a good start, and that its investigators have already made contributions to SciDAC-2 and to DOE’s 2007 Joule codes. We confidently look forward to an era of Petascale computing in which scientific codes migrate amongst a variety of leadership class computing systems without their developers being overly burdened by the need to continually refine them so as to achieve acceptable levels of throughput.

References

¹ DOE SciDAC Program – www.scidac.gov/
² Bailey, D. H., Snavely, A. “Performance Modeling: Understanding the Present and Predicting the Future,” Euro-Par 2005, September 2005, Lisbon.
³ Hoisie, A., Lubeck, O., Wasserman, H. “Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications,” International Journal of High Performance Computing Applications, Vol. 14 (2000), no. 4, pg 330-346.
⁴ Ipek, E., de Supinski, B. R., Schulz, M., McKee, S. A. “An Approach to Performance Prediction for Parallel Applications,” Euro-Par 2005, Lisbon, Portugal, Sept. 2005.
⁵ Weinberg, J., McCracken, M. O., Snavely, A., Strohmaier, E. “Quantifying Locality In The Memory Access Patterns of HPC Applications,” Proceedings of SC2005, Seattle, WA, Nov. 2005.
⁶ Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, S., Subramonian, R., von Eicken, T. “LogP: Towards a Realistic Model of Parallel Computation,” Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM Press, 1993, pg. 1-12.
⁷ Alexandrov, A., Ionescu, M. F., Schauser, K. E., Scheiman, C. “LogGP: Incorporating long messages into the LogP model,” In Proceedings of the 7th annual ACM Symposium on Parallel Algorithms and Architectures, ACM Press, 1995, pg. 95-105.
⁸ Kielmann, T., Bal, H. E., Verstoep, K. “Fast Measurement of LogP Parameters for Message Passing Platforms,” in Jose D.P. Rolim, editor, IPDPS Workshops, volume 1800 of Lecture Notes in Computer Science, pp. 1176-1183, Cancun, Mexico, May 2000. Springer-Verlag.
⁹ Culler, D., Liu, L. T., Martin, R. P., Yoshikawa, C. “Assessing Fast Network Interfaces,” IEEE Micro, Vol. 16 (1996), pg. 35-43.
¹⁰ Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. “Performance Analysis of MPI Collective Operations,” 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS 2005), Denver, CO, Apr. 2005.
¹¹ Alam, S. R., Vetter, J. S. “A Framework to Develop Symbolic Performance Models of Parallel Applications,” Proc. 5th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS 2006), 2006.
¹² Buck, B. R., Hollingsworth, J. K. “An API for Runtime Code Patching,” Journal of High Performance Computing Applications, Vol. 14 (2000) no. 4.
¹³ Whaley, C., Petitet, A., Dongarra, J. “Automated Empirical Optimizations of Software and the ATLAS Project,” Parallel Computing, Vol. 27 (2001), no. 1, pg. 3-25.
¹⁴ Frigo, M., Johnson, S. “FFTW: An Adaptive Software Architecture for the FFT,” Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.
¹⁵ Bilmes, J., Asanovic, K., Chin, C. W., Demmel, J. “Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology,” Proceedings of the International Conference on Supercomputing, Vienna, Austria, ACM SIGARCH, July 1997.
¹⁶ Vuduc, R., Demmel, J., Yelick, K. “OSKI: A Library of Automatically Tuned Sparse Matrix Kernels,” Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.
¹⁷ Chen, C., Chame, J., Hall, M. “Combining Models and Guided Empirical Search to Optimize for Multiple Levels of the Memory Hierarchy,” Proceedings of the Conference on Code Generation and Optimization, March, 2005.
¹⁸ Echekki, T., Chen, J. H. “DNS of Autoignition in Nonhomogeneous Hydrogen-Air Mixtures,” Combust. Flame, Vol. 134 (2003), pg. 169-191.
¹⁹ Lin, Z., Hahm, T. S., Lee, W. W., Tang, W. M., White, R. B. “Turbulent Transport Reduction by Zonal Flows: Massively Parallel Simulations,” Science, 281 (1998), pg. 1835-1837.
