Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements

Title: Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements
Publication Type: Conference Proceedings
Year of Publication: To appear
Authors: Barry, D., H. Jagode, A. Danalis, and J. Dongarra
Conference Name: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Conference Location: St. Petersburg, Florida
Keywords: GPU power, High Performance Computing, network traffic, papi, performance analysis, Performance Counters

Some of the most important categories of performance events count the data traffic between the processing cores and the main memory. However, since these counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP); however, doing so adds an indirection layer that involves querying the PCP daemon. This paper performs a quantitative study of the accuracy of the measurements obtained through this component on the Summit supercomputer. We use two linear algebra kernels (a generalized matrix multiply and a modified matrix-vector multiply) as benchmarks, together with a distributed, GPU-accelerated 3D-FFT mini-app (using cuFFT), to compare the measurements obtained through the PAPI PCP component against the expected values across different problem sizes. We also compare our measurements against those from an in-house machine whose architecture is very similar to Summit's, where elevated privileges allow PAPI to access the hardware counters directly (without using PCP), to show that measurements taken via PCP are as accurate as those taken directly. Finally, using both QMCPACK and the 3D-FFT, we demonstrate the diverse hardware activities that can be monitored simultaneously via PAPI hardware components.
