Performance Tuning Using Hardware Counter Data
Outline
HPC Architecture
Hardware Counters
Pipelined Functional Units
Super-scalar Processors
Super-scalar Processors (cont.)
Super-scalar Processors (cont.)
Out of Order Execution
Speculative Execution
Instruction Counts and Functional Unit Status
Cache and Memory Hierarchy
Cache and Memory Hierarchy (cont.)
Cache Structure
Cache Contention
Cache Contention (cont.)
TLB and Virtual Memory
Memory Latencies
Steps of Optimization
Goals
Overview of PAPI
PAPI Counter Interfaces
PAPI Implementation
PAPI Preset Events
PAPI Release
PAPI Release (cont.)
High-level Interface
High-level API
Setting up the High-level Interface
Controlling the Counters
PAPI_flops
PAPI High-level Example
Return codes
Low-level Interface
Low-level Functionality
Event sets
Event set Operations
Simple Example
Overlapping Counters
Counter Domains
Counter Granularity
Using PAPI with Threads
Using PAPI with Multiplexing
Issues with Multiplexing
Multiplex Code Examples
Native Events
Native Event Examples
Callbacks on Counter Overflow
PAPI_overflow
Overflow Code Examples
Statistical Profiling
PAPI_profil
Profiling Code Examples
Perfometer
Perfometer Display
Perfometer Parallel Interface
Third-party Tools that use PAPI
DEEP/PAPI
SvPablo
TAU
vprof
Code Examples
Particle-particle simulator
Algorithm used
Reversed neighborlist
Final performance
Wall clock time per time step
Explanation
Frequency domain MHD
Expected behaviour
Observed behaviour
Obtained speed up vs. streams
PAPI measurements — IBM
Raw results
Deduced results
For More Information