Performance Tuning Using Hardware Counter Data

Outline

HPC Architecture

Hardware Counters

Pipelined Functional Units

Super-scalar Processors

Super-scalar Processors (cont.)

Super-scalar Processors (cont.)

Out of Order Execution

Speculative Execution

Instruction Counts and Functional Unit Status

Cache and Memory Hierarchy

Cache and Memory Hierarchy (cont.)

Cache Structure

Cache Contention

Cache Contention (cont.)

TLB and Virtual Memory

Memory Latencies

Steps of Optimization

Goals

Overview of PAPI

PAPI Counter Interfaces

PAPI Implementation

PAPI Preset Events

PAPI Release

PAPI Release (cont.)

High-level Interface

High-level API

Setting up the High-level Interface

Controlling the Counters

PAPI_flops

PAPI High-level Example

Return codes

Low-level Interface

Low-level Functionality

Event sets

Event set Operations

Simple Example

Overlapping Counters

Counter Domains

Counter Granularity

Using PAPI with Threads

Using PAPI with Multiplexing

Issues with Multiplexing

Multiplex Code Examples

Native Events

Native Event Examples

Callbacks on Counter Overflow

PAPI_overflow

Overflow Code Examples

Statistical Profiling

PAPI_profil

Profiling Code Examples

Perfometer

Perfometer Display

Perfometer Parallel Interface

Third-party Tools that use PAPI

DEEP/PAPI

SvPablo

TAU

vprof

Code Examples

Particle - particle simulator

Algorithm used

Reversed neighborlist

Final performance: wall clock time per time step

Explanation

Frequency domain MHD

Expected behaviour

Observed behaviour

Obtained speed up vs. streams

PAPI measurements — IBM

Raw results

Deduced results

For More Information