|
|
|
Jack Dongarra, Kevin London, Shirley Moore,
Philip Mucci and Daniel Terpstra |
|
|
|
|
|
Timing and performance evaluation has been an
art |
|
Resolution of the clock |
|
Issues about cache effects |
|
Different interfaces and tools on different
systems |
|
Can be cumbersome and inefficient with
traditional tools |
|
Situation changing |
|
Today’s processors have internal counters |
|
|
|
|
|
|
Intel Pentium, IA-64, AMD Athlon and future processors contain performance
counters. |
|
Counters exist as a small set of registers that
count events. |
|
On most platforms the counter APIs, if they
exist, are neither appropriate for the end
user nor well documented. |
|
|
|
|
|
|
|
|
|
|
|
Cycle count |
|
Floating point instruction count |
|
Integer instruction count |
|
Instruction count |
|
Load/store count |
|
Branch taken / not taken count |
|
Branch mispredictions |
|
Pipeline stalls due to memory subsystem |
|
Pipeline stalls due to resource conflicts |
|
I/D cache misses for different levels |
|
TLB misses |
|
TLB invalidations |
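
Raw counts like those listed above are usually combined into ratios before they are interpreted. The following is a minimal sketch of that arithmetic only. The dictionary keys and the `derived_metrics` helper are hypothetical names chosen for illustration; they are not PAPI identifiers, and the sample numbers are made up.

```python
def derived_metrics(counts):
    """Turn raw event counts into ratios that are easier to compare.

    `counts` is a plain dict with hypothetical keys; it is not a PAPI
    data structure.
    """
    instructions = counts["instructions"]
    # Total branches = taken + not taken, as listed on the slide.
    branches = counts["branches_taken"] + counts["branches_not_taken"]
    return {
        "ipc": instructions / counts["cycles"],
        "fp_fraction": counts["fp_instructions"] / instructions,
        "branch_mispredict_rate": counts["branch_mispredictions"] / branches,
        "l1d_misses_per_1k_instr": 1000.0 * counts["l1d_misses"] / instructions,
    }


# Made-up sample values, for illustration only.
sample = {
    "cycles": 2_000_000,
    "instructions": 3_000_000,
    "fp_instructions": 600_000,
    "branches_taken": 300_000,
    "branches_not_taken": 100_000,
    "branch_mispredictions": 20_000,
    "l1d_misses": 45_000,
}
metrics = derived_metrics(sample)
```

Ratios such as instructions per cycle are only meaningful when all of the underlying counts were collected over the same region of code.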
|
|
|
|
|
|
Parallel Tools Consortium (PTools) sponsored
project |
|
Performance Application Programming Interface |
|
The purpose of the PAPI project is to design,
standardize and implement a portable and efficient API to access the
hardware performance monitor counters found on most modern microprocessors. |
|
|
|
|
|
|
|
PAPI provides two interfaces to the underlying
counter hardware: |
|
The low level interface manages hardware events
in user defined groups called EventSets. |
|
The high level interface simply provides the
ability to start, stop and read the counters for a specified list of
events. |
|
Timers and system information |
|
C and Fortran bindings |
|
|
|
|
|
|
|
|
Increased efficiency and functionality over the
high level PAPI interface and access to native events |
|
About 40 functions |
|
Obtain information about the executable and the
hardware |
|
Set options for multiplexing and overflow
handling |
|
Thread safe |
|
|
|
|
Meant for application programmers wanting simple
but accurate measurements |
|
Calls the lower level API |
|
Currently not thread safe |
|
Allows only PAPI preset events |
|
|
|
|
Incorporated MPX code written by John May at
LLNL |
|
Allows simultaneous use of more counters than
are supported by hardware |
|
Call PAPI_multiplex_init |
|
Then use normal PAPI routines |
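
The idea behind counting more events than the hardware has counters for can be sketched as time slicing: each event is counted for only part of the run, and its count is scaled up by the fraction of time it was active. The code below illustrates that general estimate under stated assumptions. It is not the MPX implementation, and `estimate_multiplexed_counts` and its input format are hypothetical.

```python
def estimate_multiplexed_counts(slices, total_time):
    """Estimate full-run counts from time-sliced observations.

    `slices` is a list of (event_name, observed_count, active_time)
    tuples, one per time slice in which that event held a counter.
    Each event's total is scaled by total_time / time_it_was_active.
    """
    observed = {}
    active = {}
    for name, count, duration in slices:
        observed[name] = observed.get(name, 0) + count
        active[name] = active.get(name, 0.0) + duration
    return {
        name: observed[name] * total_time / active[name]
        for name in observed
    }


# Two events share one counter, alternating in 1-second slices.
slices = [
    ("cycles", 1000, 1.0),
    ("l1_misses", 50, 1.0),
    ("cycles", 1200, 1.0),
    ("l1_misses", 70, 1.0),
]
estimates = estimate_multiplexed_counts(slices, total_time=4.0)
```

Because the result is an extrapolation, it is a statistical estimate: short runs, or code whose behavior changes faster than the slice length, can give misleading values.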
|
|
|
|
Ability to call user-defined handlers when an
event count overflows a specified threshold |
|
PAPI_profil() call creates a histogram of
overflow counts for a specified region of the application code |
|
PAPI supports execution profiling based on any
counter event |
|
Making use of hardware and OS support where
available |
|
Accuracy issues |
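
The histogram described above can be pictured as an array of buckets covering an address range, with one bucket incremented each time the counter overflows. This is a minimal sketch of that bookkeeping, not the PAPI_profil() implementation. The function name, the bucket size, and the sample addresses are assumptions made for illustration.

```python
def profile_histogram(overflow_pcs, start, end, bucket_bytes):
    """Bucket sampled program-counter values over [start, end).

    `overflow_pcs` holds the program counter recorded at each overflow.
    Samples outside the profiled region are ignored.
    """
    n_buckets = (end - start + bucket_bytes - 1) // bucket_bytes
    hist = [0] * n_buckets
    for pc in overflow_pcs:
        if start <= pc < end:
            hist[(pc - start) // bucket_bytes] += 1
    return hist


# Hypothetical region of 64 bytes split into four 16-byte buckets.
samples = [0x1004, 0x1008, 0x1010, 0x103F, 0x2000]
hist = profile_histogram(samples, start=0x1000, end=0x1040, bucket_bytes=16)
```

Each bucket count approximates (threshold x hits) events attributed to that address range, subject to the accuracy issues noted on the slide.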
|
|
|
|
|
Pentium Pro, II, III, P6 |
|
Linux 2.4 and perfctr kernel patch |
|
AMD Athlon |
|
Linux 2.4 and perfctr kernel patch |
|
IBM Power 3, 604, 604e |
|
AIX 4.3 and pmtoolkit (laderose@us.ibm.com) |
|
Sun UltraSparc |
|
Solaris 8 |
|
MIPS R10K, R12K |
|
Cray T3E, SV1, SV2 |
|
Underway: Alpha EV6, EV67, IA-64, Microsoft
Windows |
|
|
|
|
|
Introduced by Intel with the Pentium processor |
|
Two 40-bit counters, except on the P4, which has 18 40-bit counters |
|
Not guaranteed to be fully accurate |
|
Need an event-monitoring device driver to allow
user level access |
|
Fine grain timing is provided by TSC (time stamp
counter) and the RDTSC instruction. |
|
|
|
|
|
Four 48-bit performance counters |
|
Not guaranteed to be fully accurate |
|
Fine grain timing is provided by TSC (time stamp
counter) and the RDTSC instruction. |
|
|
|
|
|
|
Linux/x86 Performance Monitoring Counters Driver by Mikael
Pettersson (perfctr) |
|
http://www.csd.uu.se/~mikpe/linux/perfctr/ |
|
Provides per-process 64-bit memory-mapped
virtual counters |
|
Interrupt-mode support using the local APIC
interrupt on the P6 |
|
Provides per-process virtual Time Stamp Counter
(TSC) |
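
One way to think about per-process 64-bit virtual counters on top of narrower hardware counters is to accumulate deltas only while the process is scheduled, taking each delta modulo the hardware width so that wraparound is handled. The class below is an illustrative sketch of that idea, assuming a 40-bit hardware counter and at most one wrap between samples. It is not how the perfctr driver is written, and every name in it is hypothetical.

```python
class VirtualCounter:
    """Accumulate a wide per-process count from a narrow hardware counter."""

    def __init__(self, hw_bits=40):
        self.mask = (1 << hw_bits) - 1
        self.total = 0       # Python ints are unbounded; a driver would use 64 bits
        self._last = None    # hardware value when the process was scheduled in

    def schedule_in(self, hw_value):
        """Record the hardware counter when the process starts running."""
        self._last = hw_value & self.mask

    def schedule_out(self, hw_value):
        """Add the events counted while the process was running."""
        if self._last is None:
            raise RuntimeError("schedule_out called without schedule_in")
        # Masked subtraction gives the right delta even if the counter wrapped once.
        self.total += ((hw_value & self.mask) - self._last) & self.mask
        self._last = None


vc = VirtualCounter(hw_bits=40)
vc.schedule_in(2**40 - 10)   # counter is about to wrap
vc.schedule_out(5)           # wrapped: 10 events before the wrap, 5 after
vc.schedule_in(100)          # events from other processes in between are skipped
vc.schedule_out(250)
```

Events counted between `schedule_out` and the next `schedule_in` belong to other processes and are deliberately excluded from `total`.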
|
|
|
|
|
All IA-64 architectures are to provide |
|
At least four performance counters |
|
Corresponding counter overflow status registers |
|
Basic events: clock cycles, retired instructions |
|
Architected support for context switching of
performance counters, which must be implemented by the operating system |
|
|
|
|
|
Three-level memory hierarchy |
|
Events can be qualified for monitoring based on: |
|
Instruction address range |
|
Particular instruction opcode |
|
Data address range |
|
Privilege level |
|
|
|
|
Dedicated performance monitor overflow interrupt
mechanism |
|
Due to out-of-order execution, etc., the sampled
program counter may not point to the instruction that caused an event. |
|
The Itanium provides a set of Event Address
Registers (EARs) that record instruction and/or data addresses for some
events |
|
|
|
|
|
|
|
Latest IA-64 Linux patch 2.4.5 (http://www.kernel.org/) |
|
pfmlib by Stephane Eranian, Hewlett-Packard |
|
Included with PAPI distribution |
|
Provides functions to control monitoring from
user level |
|
Handles EAR events |
|
|
|
|
|
Common set of events deemed relevant and useful
for application performance tuning |
|
papiStdEventDefs.h |
|
accesses to the memory hierarchy, cache
coherence protocol events, cycle and instruction counts, functional unit
and pipeline status |
|
Run the PAPI avail utility to determine which
predefined events are available on a given platform |
|
Semantics may differ on different platforms |
|
|
|
|
|
|
PAPI provides access to native events on all
supported platforms through the low-level interface. |
|
Explained in platform-specific README files in
the papi/src directory |
|
papi/src/tests includes examples of using native
events. |
|
Details in processor architecture manuals |
|
|
|
|
|
Use most accurate timers available on the
platform |
|
Real time (i.e., wall clock time) |
|
PAPI_get_real_cyc |
|
PAPI_get_real_usec |
|
Virtual time – time accrued when the processor is
running in user mode |
|
PAPI_get_virt_cyc |
|
PAPI_get_virt_usec |
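
The difference between real and virtual time can be seen without PAPI by comparing a wall clock against a CPU-time clock. The sketch below uses Python's standard `time` module as an analogy, not the PAPI timer routines. Note one assumption that differs from the slide: `time.process_time_ns()` reports user plus system CPU time for the process, not user-mode time alone. The `time_region` helper is a hypothetical name.

```python
import time


def time_region(fn):
    """Run fn() and return (real_usec, cpu_usec) for the call.

    real_usec comes from a monotonic wall clock; cpu_usec is CPU time
    charged to this process (user + system), so it does not advance
    while the process sleeps or waits.
    """
    real_start = time.perf_counter_ns()
    cpu_start = time.process_time_ns()
    fn()
    cpu_end = time.process_time_ns()
    real_end = time.perf_counter_ns()
    return (real_end - real_start) / 1000.0, (cpu_end - cpu_start) / 1000.0


if __name__ == "__main__":
    real_us, cpu_us = time_region(lambda: time.sleep(0.05))
    print(f"sleep: real={real_us:.0f} us, cpu={cpu_us:.0f} us")
```

For a region that mostly sleeps or waits on I/O, the real time is much larger than the CPU time; for a compute-bound region on an otherwise idle machine the two are close.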
|
|
|
|
|
PAPI_get_executable_info |
|
returns the executable’s address space
information, such as the start and end addresses of the text, data, and bss
segments |
|
PAPI_get_hardware_info |
|
returns information about the hardware on which
the program is running, such as the number of CPUs, CPU model information,
and the cycle time of the CPU |
|
|
|
|
|
Granularity of the measured code |
|
If not sufficiently large, overhead of
the counter interfaces may dominate |
|
Problems with out-of-order processors |
|
Developing microbenchmarks to determine accuracy
of the hardware counters |
|
PAPI’s common interface should help in
determining the accuracy of the hardware |
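
A simple way to judge whether a measured region is large enough is to estimate the cost of the measurement calls themselves and compare it with the measured interval. The sketch below does this for a timer from Python's standard library, as a stand-in for a counter interface. It is an illustration of the granularity concern, not a PAPI microbenchmark, and both function names are hypothetical.

```python
import time


def timer_overhead_ns(samples=10000):
    """Average gap between two back-to-back timer reads, in nanoseconds."""
    total = 0
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        total += t1 - t0
    return total / samples


def overhead_fraction(measured_ns, overhead_ns):
    """Fraction of a measured interval that is measurement overhead."""
    if measured_ns <= 0:
        return float("inf")
    return overhead_ns / measured_ns


if __name__ == "__main__":
    cost = timer_overhead_ns()
    print(f"approximate cost of one start/stop pair: {cost:.1f} ns")
```

If `overhead_fraction` is more than a few percent for the region of interest, enlarging the region (for example, by timing many iterations together) reduces the distortion.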
|
|
|
|
|
Application is instrumented with PAPI |
|
call perfometer() |
|
At the call to perfometer, a signal handler and timer are set up
to collect and send the information to a Java
applet containing the graphical view. |
|
Sections of code that are of interest can be
designated with specific colors |
|
call set_perfometer('color') |
|
|
|
|
|
|
|
Platform independent visualization of PAPI
metrics |
|
Flexible interface |
|
Quick interpretation of complex results |
|
Small footprint |
|
(compiled code size < 15k) |
|
Color coding to highlight selected procedures |
|
Trace file generation or real time viewing |
|
|
|
|
|
|
|
|
PAPI provides a window on the processor that can
provide accurate and relevant information for application performance
tuning. |
|
PAPI to be installed on Alliance clusters and
included in Cluster in a Box |
|
PAPI may be used directly as a library or via
end-user tools |
|
|
|
|
|
PAPI becoming widely supported and adopted by
vendors and third-party tool developers. |
|
Working on documentation |
|
Future work |
|
Memory utilization information |
|
Dynamic instrumentation |
|
|
|
|
|
|
|
PAPI home page: http://icl.cs.utk.edu/papi/ |
|
Download software and documentation |
|
PTools home page: http://www.ptools.org/ |
|
Projects and tools using PAPI |
|
Intel architecture manuals: http://developer.intel.com/ |
|
|
|
|
|
|
ptools-perfapi@ptools.org is a general
discussion list for the PAPI software.
Send bug reports here. |
|
perfapi-devel@ptools.org is a mailing list for
developers of PAPI, performance tools and kernel patches. Interested
hackers are welcomed. All the CVS log messages go here. |
|
To subscribe to these mailing lists, send a
message with a blank subject to majordomo@ptools.org. In the body of the
message, include 'subscribe <mailing_list>' without the single
quotes. |
|
|
|