Using PAPI for hardware performance monitoring on Linux systems
Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci and Daniel Terpstra
Tools for Performance Evaluation
Timing and performance evaluation has been an art
Resolution of the clock
Issues about cache effects
Different interfaces and tools on different systems
Can be cumbersome and inefficient with traditional tools
Situation changing
Today’s processors have internal counters
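The clock-resolution issue above can be checked empirically. The following is a minimal sketch, assuming Python's standard time module stands in for a platform timer; the function name observed_tick_ns is hypothetical and is not part of PAPI.

```python
import time


def observed_tick_ns(clock=time.perf_counter_ns, samples=200):
    """Return the smallest nonzero step seen between successive clock reads."""
    smallest = None
    for _ in range(samples):
        start = clock()
        now = clock()
        # Spin until the clock visibly advances.
        while now == start:
            now = clock()
        step = now - start
        if smallest is None or step < smallest:
            smallest = step
    return smallest


if __name__ == "__main__":
    info = time.get_clock_info("perf_counter")
    print("advertised resolution (s):", info.resolution)
    print("observed smallest step (ns):", observed_tick_ns())
```

A code region shorter than a few observed steps cannot be timed reliably with that clock, which is one motivation for cycle-level hardware counters.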
Performance Counters
Intel Pentium, IA-64, AMD Athlon and future processors contain performance counters.
Counters exist as a small set of registers that count events.
On most platforms the counter APIs, if they exist, are neither appropriate for the end user nor well documented.
Performance Data That May Be Available
Cycle count
Floating point instruction count
Integer instruction count
Instruction count
Load/store count
Branch taken / not taken count
Branch mispredictions
Pipeline stalls due to memory subsystem
Pipeline stalls due to resource conflicts
I/D cache misses for different levels
TLB misses
TLB invalidations
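Raw counts like those listed above are usually combined into ratios before they are interpreted. The sketch below shows a few common derived metrics; the dictionary keys and the function name derived_metrics are hypothetical labels chosen for illustration, not PAPI event names.

```python
def ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def derived_metrics(counts):
    """Combine raw event counts into commonly used ratios."""
    branches = counts["branches_taken"] + counts["branches_not_taken"]
    return {
        # Instructions completed per cycle.
        "ipc": ratio(counts["instructions"], counts["cycles"]),
        # Share of all instructions that are floating point.
        "fp_fraction": ratio(counts["fp_instructions"], counts["instructions"]),
        # Mispredicted branches per executed branch.
        "branch_mispredict_rate": ratio(counts["branch_mispredictions"], branches),
        # Level-1 data cache misses per load/store.
        "l1d_miss_rate": ratio(counts["l1_dcache_misses"], counts["load_stores"]),
    }


if __name__ == "__main__":
    sample = {
        "cycles": 1000, "instructions": 1500, "fp_instructions": 300,
        "branches_taken": 120, "branches_not_taken": 80,
        "branch_mispredictions": 10, "load_stores": 400,
        "l1_dcache_misses": 20,
    }
    for name, value in derived_metrics(sample).items():
        print(name, value)
```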
Overview of PAPI
Parallel Tools Consortium (PTools) sponsored project
Performance Application Programming Interface
The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
Implementation
PAPI provides two interfaces to the underlying counter hardware:
The low level interface manages hardware events in user defined groups called EventSets.
The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
Timers and system information
C and Fortran bindings
PAPI Architecture
Low Level API
Increased efficiency and functionality over the high level PAPI interface and access to native events
About 40 functions
Obtain information about the executable and the hardware
Set options for multiplexing and overflow handling
Thread safe
High Level API
Meant for application programmers wanting simple but accurate measurements
Calls the lower level API
Currently not thread safe
Allows only PAPI preset events
Multiplexing Support
Incorporated MPX code written by John May at LLNL
Allows simultaneous use of more counters than are supported by hardware
Call PAPI_multiplex_init
Then use normal PAPI routines
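One way to picture multiplexing is as time-slicing: groups of events take turns on the physical counters, and each count is scaled up by the fraction of time its group was active. The following is a toy simulation under that assumption. It is not the MPX code, and the name multiplex_estimate is hypothetical.

```python
def multiplex_estimate(per_slice_counts, num_hw_counters):
    """Estimate total counts when events share counters round-robin.

    per_slice_counts maps an event name to its true count in each
    time slice; all lists must have the same length.
    """
    events = list(per_slice_counts)
    num_slices = len(per_slice_counts[events[0]])
    # Split the events into groups that fit on the hardware at once.
    groups = [events[i:i + num_hw_counters]
              for i in range(0, len(events), num_hw_counters)]
    observed = {e: 0 for e in events}
    active = {e: 0 for e in events}
    for s in range(num_slices):
        for e in groups[s % len(groups)]:
            observed[e] += per_slice_counts[e][s]
            active[e] += 1
    # Scale each observed count by total slices / slices counted.
    return {e: (observed[e] * num_slices / active[e]) if active[e] else None
            for e in events}


if __name__ == "__main__":
    steady = {name: [100] * 8 for name in ("a", "b", "c", "d")}
    print(multiplex_estimate(steady, num_hw_counters=2))
```

With steady event rates the scaled estimate matches the true total. For bursty events it can be far off, so multiplexed counts are best treated as estimates over sufficiently long runs.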
Statistical Profiling
Ability to call user-defined handlers when an event count overflows a specified threshold
PAPI_profil() call creates a histogram of overflow counts for a specified region of the application code
PAPI supports execution profiling based on any counter event
Making use of hardware and OS support where available
Accuracy issues
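The overflow-and-histogram idea can be sketched in a few lines. This is a simplified model, assuming a recorded trace of (program counter, event count) pairs; the helper names sample_on_overflow and profile_histogram are hypothetical and do not reflect how PAPI_profil() is implemented.

```python
def sample_on_overflow(trace, threshold):
    """Return the PCs at which the running event count crossed the threshold.

    trace is an iterable of (pc, events) pairs in execution order.
    """
    samples = []
    count = 0
    for pc, events in trace:
        count += events
        while count >= threshold:
            samples.append(pc)   # the "overflow handler" records the PC
            count -= threshold
    return samples


def profile_histogram(overflow_pcs, start, end, bucket_bytes):
    """Bin sampled PCs that fall in [start, end) into fixed-size buckets."""
    num_buckets = (end - start + bucket_bytes - 1) // bucket_bytes
    hist = [0] * num_buckets
    for pc in overflow_pcs:
        if start <= pc < end:
            hist[(pc - start) // bucket_bytes] += 1
    return hist


if __name__ == "__main__":
    trace = [(0x1000, 3), (0x1004, 3), (0x1010, 6), (0x1030, 9)]
    pcs = sample_on_overflow(trace, threshold=5)
    print(profile_histogram(pcs, start=0x1000, end=0x1040, bucket_bytes=16))
```

Buckets with high counts mark code regions where the chosen event occurs most often, subject to the accuracy caveats noted above.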
PAPI - Supported Processors
Pentium Pro, II, III, P6
Linux 2.4 and perfctr kernel patch
AMD Athlon
Linux 2.4 and perfctr kernel patch
IBM Power 3, 604, 604e
AIX 4.3 and pmtoolkit (laderose@us.ibm.com)
Sun UltraSparc
Solaris 8
MIPS R10K, R12K
Cray T3E, SV1, SV2
Underway: Alpha EV6, EV67, IA-64, Microsoft Windows
IA-32 Counters
Introduced by Intel with the Pentium processor
Two 40-bit counters, except on the P4, which has 18 40-bit counters
Not guaranteed to be fully accurate
Need an event-monitoring device driver to allow user level access
Fine grain timing is provided by TSC (time stamp counter) and the RDTSC instruction.
AMD Athlon counters
Four 48-bit performance counters
Not guaranteed to be fully accurate
Fine grain timing is provided by TSC (time stamp counter) and the RDTSC instruction.
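Counters of the 40-bit and 48-bit widths cited above eventually wrap around, so software that wants a wider running total has to accumulate deltas modulo the hardware width. The sketch below illustrates that arithmetic. It assumes at most one wrap between reads, and the names WideCounter and seconds_to_wrap are hypothetical.

```python
class WideCounter:
    """Accumulate an unbounded total from a narrow, wrapping raw counter."""

    def __init__(self, width_bits, initial_raw=0):
        self.mask = (1 << width_bits) - 1
        self.last_raw = initial_raw & self.mask
        self.total = 0

    def update(self, raw):
        """Fold a new raw reading into the total and return the total."""
        raw &= self.mask
        # Modular subtraction gives the right delta across one wrap.
        delta = (raw - self.last_raw) & self.mask
        self.total += delta
        self.last_raw = raw
        return self.total


def seconds_to_wrap(width_bits, events_per_second):
    """Time until a counter of the given width wraps at a steady rate."""
    return (1 << width_bits) / events_per_second


if __name__ == "__main__":
    counter = WideCounter(width_bits=40, initial_raw=(1 << 40) - 10)
    print(counter.update(5))            # reading taken after a wrap
    print(seconds_to_wrap(40, 1e9))     # cycle counter at 1 GHz
```

At one event per nanosecond a 40-bit counter wraps in roughly 18 minutes, so long runs need this kind of accumulation or an overflow interrupt.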
Access to counters under Linux/x86
 Linux/x86 Performance Monitoring Counters Driver by Mikael Pettersson (perfctr)
 http://www.csd.uu.se/~mikpe/linux/perfctr/
Provides per-process 64-bit memory-mapped virtual counters
Interrupt-mode support using the local APIC interrupt on the P6
Provides per-process virtual Time Stamp Counter (TSC)
IA-64 Counters
All IA-64 architectures are to provide
At least four performance counters
Corresponding counter overflow status registers
Basic events:  clock cycles, retired instructions
Architected support for context switching of performance counters, which must be implemented by the operating system
Itanium processor
Three-level memory hierarchy
Events can be qualified for monitoring based on:
Instruction address range
Particular instruction opcode
Data address range
Privilege level
Itanium Profiling Support
Dedicated performance monitor overflow interrupt mechanism
Due to out-of-order execution, etc., the sampled program counter may not point to the instruction that caused an event.
The Itanium provides a set of Event Address Registers (EARs) that record instruction and/or data addresses for some events
Access to counters under Linux/IA-64
Latest IA-64 Linux patch 2.4.5 (http://www.kernel.org/)
pfmlib by Stephane Eranian, Hewlett-Packard
Included with PAPI distribution
Provides functions to control monitoring from user level
Handles EAR events
PAPI Predefined Events
Common set of events deemed relevant and useful for application performance tuning
papiStdEventDefs.h
accesses to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit and pipeline status
Run PAPI avail utility to determine which predefined events are available on a given platform
Semantics may differ on different platforms
Access to Native Events
PAPI provides access to native events on all supported platforms through the low-level interface.
Explained in platform-specific README files in the papi/src directory
papi/src/tests includes examples of using native events.
Details in processor architecture manuals
PAPI Timers
Use most accurate timers available on the platform
Real time (i.e., wall clock time)
PAPI_get_real_cyc
PAPI_get_real_usec
Virtual time – time accrued when processor is running in user mode
PAPI_get_virt_cyc
PAPI_get_virt_usec
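The difference between real and virtual time shows up clearly when a process sleeps. The sketch below uses Python's standard library as a stand-in for the PAPI timers: time.perf_counter() approximates real time and os.times().user approximates user-mode virtual time. The function name measure is hypothetical.

```python
import os
import time


def measure(func):
    """Run func and return (wall_seconds, user_cpu_seconds) consumed."""
    wall_start = time.perf_counter()
    user_start = os.times().user
    func()
    user_end = os.times().user
    wall_end = time.perf_counter()
    return wall_end - wall_start, user_end - user_start


if __name__ == "__main__":
    wall, user = measure(lambda: time.sleep(0.05))
    print("sleeping:  wall=%.3f s  user=%.3f s" % (wall, user))
    wall, user = measure(lambda: sum(i * i for i in range(200000)))
    print("computing: wall=%.3f s  user=%.3f s" % (wall, user))
```

While sleeping, wall time advances but user time barely moves. For a compute-bound region on an otherwise idle machine the two are close.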
PAPI System Information
PAPI_get_executable_info
returns the executable’s address space information, such as the start and end addresses of the text, data, and bss segments
PAPI_get_hardware_info
returns information about the hardware on which the program is running, such as the number of CPUs, CPU model information, and the cycle time of the CPU
Accuracy of Hardware Counter Data
Granularity of the measured code
If the measured region is not sufficiently large, overhead of the counter interfaces may dominate
Problems with out-of-order processors
Developing microbenchmarks to determine accuracy of the hardware counters
PAPI’s common interface should help in determining the accuracy of the hardware
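The granularity point can be made concrete by measuring the cost of the measurement calls themselves and comparing it with the size of the region being measured. This is a minimal sketch, assuming Python's time.perf_counter_ns() stands in for a counter read; the helper names timer_overhead_ns and overhead_fraction are hypothetical.

```python
import time


def timer_overhead_ns(trials=1000):
    """Return the smallest gap seen between two back-to-back timer reads."""
    best = None
    for _ in range(trials):
        first = time.perf_counter_ns()
        second = time.perf_counter_ns()
        gap = second - first
        if best is None or gap < best:
            best = gap
    return best


def overhead_fraction(measured_ns, overhead_ns):
    """Share of a measurement that is attributable to the timer itself."""
    if measured_ns <= 0:
        return float("inf")
    return overhead_ns / measured_ns


if __name__ == "__main__":
    overhead = timer_overhead_ns()
    for region_ns in (100, 10000, 1000000):
        print(region_ns, "ns region:",
              "%.4f" % overhead_fraction(region_ns + overhead, overhead))
```

When the fraction approaches 1, the measurement mostly reflects the instrumentation rather than the code under test, so the measured region should be enlarged.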
Graphical Tool: Perfometer
Application is instrumented with PAPI
call perfometer()
At the call to perfometer, a signal handler and timer are set to collect and send the information to a Java applet containing the graphical view.
Sections of code that are of interest can be designated with specific colors
call set_perfometer('color')
Perfometer Features
Platform independent visualization of PAPI metrics
Flexible interface
Quick interpretation of complex results
Small footprint
(compiled code size < 15k)
Color coding to highlight selected procedures
Trace file generation or real time viewing
Perfometer GUI
Perfometer Parallel Interface
Conclusions
PAPI provides a window on the processor that can provide accurate and relevant information for application performance tuning.
PAPI to be installed on Alliance clusters and included in Cluster in a Box
PAPI may be used directly as a library or via end-user tools
Conclusions (cont.)
PAPI becoming widely supported and adopted by vendors and third-party tool developers.
Working on documentation
Future work
Memory utilization information
Dynamic instrumentation
For More Information
PAPI home page: http://icl.cs.utk.edu/papi/
Download software and documentation
PTools home page: http://www.ptools.org/
Projects and tools using PAPI
Intel architecture manuals: http://developer.intel.com/
PAPI Mailing Lists
ptools-perfapi@ptools.org is a general discussion list for the PAPI software.  Send bug reports here.
perfapi-devel@ptools.org is a mailing list for developers of PAPI, performance tools and kernel patches. Interested hackers are welcomed. All the CVS log messages go here.
To subscribe to these mailing lists, send a message with a blank subject to majordomo@ptools.org. In the body of the message, include 'subscribe <mailing_list>' without the single quotes.