The goal of performance modeling is to understand the performance of an application on a computer system via measurement and analysis. This information can be used for a variety of tasks: evaluating architectural tradeoffs early in the system design cycle, validating performance of a new system installation, guiding algorithm choice when developing a new application, improving optimization of applications on specific platforms, and guiding the application of techniques for automated tuning and optimization.2 Modeling is now an integral part of many high-end system procurements,3 thus making performance research useful beyond the confines of performance tuning. For performance engineering, modeling analyses (when coupled with empirical data) can inform us when tuning is needed, and just as importantly, when we are done. Naturally, if they are to support automatic performance tuning, then the models themselves must be automatically generated.
Traditional performance modeling and prediction have been done via some combination of three methods: (1) analytical modeling; (2) statistical modeling derived from measurement; and (3) simulation. In the earlier SciDAC-1 Performance Evaluation Research Center (PERC), researchers developed a semi-automatic yet accurate methodology based on application signatures, machine profiles and convolutions. This methodology allows us to predict performance to within reasonable tolerances for an important set of applications on traditional clusters of SMPs for specific inputs and processor counts.
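To make the convolution idea concrete, the following minimal sketch combines an application signature (operation counts gathered by tracing) with a machine profile (measured rates) to estimate runtime. The operation categories, data layout, and numbers are illustrative assumptions, not PERC's actual formats.

```python
# Illustrative convolution-style prediction: combine an application signature
# (operation counts) with a machine profile (achievable rates) to estimate
# execution time. All names and values are hypothetical.

def predict_time(app_signature, machine_profile):
    """Estimate runtime as the sum of (operation count / achievable rate)."""
    seconds = 0.0
    for op, count in app_signature.items():
        rate = machine_profile[op]          # operations per second on the target
        seconds += count / rate
    return seconds

# Hypothetical inputs: per-level memory references and floating-point work.
signature = {"l1_loads": 4.0e10, "l2_loads": 6.0e9, "mem_loads": 8.0e8, "flops": 2.5e10}
profile   = {"l1_loads": 2.0e9,  "l2_loads": 8.0e8, "mem_loads": 1.5e8, "flops": 4.0e9}

print(f"predicted time: {predict_time(signature, profile):.1f} s")
```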
PERI is extending these techniques not only to account for the effects of emerging architectures, but also to model scaling of input and processor counts. It has been shown that modeling the response of a system's memory hierarchy to an application's workload is crucial for accurately predicting its performance on today's systems with their deep memory hierarchies. The current state of the art works well for weak scaling (i.e., increasing the processor count in proportion to the input size). PERI is developing advanced schemes for modeling application performance, such as using neural networks.4 We are also exploring variations of existing techniques and parameterized statistical models built from empirical observations to predict application scaling. In addition, we are pursuing methods for automated extrapolation of scaling models as a function of increasing processor count while holding the input constant.5 One of our goals is to provide the ability to reliably forecast the performance of a code on a machine size that has not yet been built.
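As a minimal sketch of the statistical approach, the example below fits an assumed parameterized model t(p) = a + b/p + c*log2(p) to hypothetical strong-scaling timings and extrapolates to an unmeasured processor count. Both the model form and the data are illustrative assumptions, not PERI's actual scheme.

```python
# Fit a parameterized scaling model to empirical timings and extrapolate to a
# processor count that has not been measured. Model form and data are assumed.
import numpy as np

procs = np.array([64, 128, 256, 512, 1024], dtype=float)
times = np.array([410.0, 221.0, 127.0, 82.0, 61.0])   # hypothetical strong-scaling runs

# Design matrix for the assumed model t(p) = a + b/p + c*log2(p).
X = np.column_stack([np.ones_like(procs), 1.0 / procs, np.log2(procs)])
coeffs, *_ = np.linalg.lstsq(X, times, rcond=None)

def predict(p):
    """Evaluate the fitted model at processor count p."""
    return coeffs @ np.array([1.0, 1.0 / p, np.log2(p)])

# Extrapolate to a machine size that has not yet been built.
print(f"predicted time at 4096 processes: {predict(4096.0):.1f} s")
```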
Within PERI, we are also extending our framework to model communication performance as a function of the type, size, and frequency of application messages, and the characteristics of the interconnect. Several parallel communication models have been developed that predict the performance of message-passing operations based on system parameters.6 7 8 Measuring the parameters of these models within local area networks is relatively straightforward, and methods to approximate them are well established.8 9 Our models, which are similar to PlogP, capture the effects of network bandwidth and latency; however, a more robust model must also account for noise, contention, and concurrency limits. We are developing performance models directly from observed characteristics of applications on existing architectures. Predictions from such models can serve as the basis to optimize collective MPI operations,10 and permit us to predict network performance in a very general way. This work will require us to develop a new open-source network simulator to analyze communication performance.
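The sketch below illustrates, in a simplified LogP/LogGP-flavored form, how such parameters can be turned into point-to-point and collective time estimates. The parameter values, the binomial-tree broadcast, and the omission of noise and contention are all assumptions made for illustration.

```python
# Simplified LogP/LogGP-style estimate of message-passing cost from assumed
# network parameters. A realistic model would also account for noise,
# contention, and concurrency limits.
import math

LATENCY  = 2.0e-6    # L: end-to-end latency in seconds (assumed)
OVERHEAD = 1.0e-6    # o: per-message CPU overhead in seconds (assumed)
GAP_BYTE = 1.0e-9    # G: per-byte gap, i.e., 1 / bandwidth (assumed)

def p2p_time(msg_bytes):
    """Estimated time for one point-to-point message of msg_bytes bytes."""
    return LATENCY + 2 * OVERHEAD + msg_bytes * GAP_BYTE

def binomial_bcast_time(msg_bytes, nprocs):
    """Estimated time for a binomial-tree broadcast: ceil(log2(P)) rounds."""
    return math.ceil(math.log2(nprocs)) * p2p_time(msg_bytes)

print(f"1 MiB point-to-point: {p2p_time(2**20) * 1e6:.1f} us")
print(f"1 MiB broadcast on 1024 ranks: {binomial_bcast_time(2**20, 1024) * 1e3:.2f} ms")
```

Estimates of this kind are what allow a tuner to compare, say, alternative broadcast algorithms analytically before timing them on the machine.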
Finally, we will reduce the time needed to develop models, since automated tuning requires on-the-fly model modification. For example, a compiler, or application, may propose a code change in response to a performance observation and need an immediate forecast of the performance impact of the change. Dynamic tracing, the foundation of current modeling methods, requires running existing codes and can be quite time-consuming. Static analysis of binary executables can make trace acquisition much faster by limiting it to only those features that are not known before execution. User annotations11 can broaden the reach of modeling by specifying at a high level the expected characteristics of code fragments. Application phase modeling can reduce the amount of data required to form models. We are exploring less expensive techniques to identify dynamic phases through statistical sampling and time-series cluster analysis. For on-the-fly observation, we are using DynInst to attach to a running application, slow it down momentarily to take measurements, and then detach.12 In PERI, we will advance automated, rapid, machine-independent model formation to push the efficacy of performance modeling down into lower levels of the application and architecture lifecycle.
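As a rough sketch of phase identification through sampling and cluster analysis, the example below clusters synthetic hardware-counter samples with k-means and reports where the phase changes occur. The metrics, the synthetic data, and the choice of k-means (via scikit-learn) are illustrative assumptions, not the exact techniques PERI uses.

```python
# Identify dynamic application phases by clustering periodically sampled
# performance metrics. Synthetic data and k-means are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical samples: each row is (instructions per cycle, cache miss rate)
# measured over one sampling interval of a running application.
compute_phase = rng.normal(loc=[1.8, 0.02], scale=0.05, size=(50, 2))
memory_phase  = rng.normal(loc=[0.6, 0.15], scale=0.05, size=(50, 2))
samples = np.vstack([compute_phase, memory_phase, compute_phase])

# Contiguous runs with the same cluster label approximate phases; one
# representative interval per phase can then be traced in detail instead of
# tracing the entire execution.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)
phase_boundaries = np.flatnonzero(np.diff(labels)) + 1
print("detected phase boundaries at sample indices:", phase_boundaries.tolist())
```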