C AND FORTRAN CALLING INTERFACES
INITIALIZING THE HIGH-LEVEL API
READING, ACCUMULATING, AND STOPPING COUNTERS
INITIALIZATION OF THE LOW-LEVEL API
STARTING, READING, ADDING, AND STOPPING EVENTS IN AN EVENT SET
RESETTING EVENTS IN AN EVENT SET
REMOVING EVENTS IN AN EVENT SET
EMPTYING AND DESTROYING AN EVENT SET
USING PAPI WITH PARALLEL PROGRAMS
BEGINNING OVERFLOWS IN EVENT SETS
WHAT IS STATISTICAL PROFILING?
CONVERTING ERROR CODES TO ERROR MESSAGES
This document is intended to provide the PAPI user with a discussion of how to use the different components and functions of PAPI. The intended users are application developers and performance tool writers who need to access performance data to tune and model application performance. The user is expected to have some level of familiarity with either the C or Fortran programming language.
This section provides an introduction to PAPI by describing the project, its motivation, and its architecture.
This section provides an installation guide for PAPI. It states the necessary steps in order to install PAPI on the various supported operating systems.
This section states the header files in which function calls are defined and the form of the function calls for both the C and Fortran calling interfaces. Also, it provides a table that shows the relation between certain pseudo-types and Fortran variable types.
This section provides an explanation of events as well as an explanation of native and preset events. The preset query and translation functions are also discussed in this section. There are code examples using native events, preset query, and preset translation with the corresponding output.
This section discusses the high-level and low-level interfaces in detail. The initialization and functions of these interfaces are also discussed. Code examples along with the corresponding output are included as well.
This section explains the PAPI functions associated with obtaining real and virtual time from the platform’s timers. Code examples along with the corresponding output are included as well.
This section explains the PAPI functions associated with obtaining hardware and executable information. Code examples along with the corresponding output are included as well.
This section discusses the advanced features of PAPI, which include multiplexing, threads, MPI, overflows, and statistical profiling. The functions that are used to implement these features are also discussed. Code examples along with the corresponding output are included as well.
This section discusses the various negative error codes that are returned by the PAPI functions. A table with the names, values, and descriptions of the return codes is given, followed by a discussion of the PAPI function that converts error codes to error messages and a code example with the corresponding output.
This section provides information on two PAPI mailing lists for the users to ask various questions about the project.
These appendices provide various listings and tables, such as a table of preset events and the platforms on which they are supported, a table of PAPI supported tools, and more information on native events, multiplexing, and overflow.
handle_error(1)
A placeholder function, called here with the argument 1. The user should provide this function to handle errors.
PAPI is an acronym for Performance Application Programming Interface. The PAPI Project is being developed at the University of Tennessee’s Innovative Computing Laboratory in the Computer Science Department. This project was created to design, standardize, and implement a portable and efficient API (Application Programming Interface) to access the hardware performance counters found on most modern microprocessors.
Platform | PAPI_read() – PAPI 2.3.4 | PAPI_read() – PAPI 3.0 |
Altix (Itanium 2 - Madison Chip) | 2352 Cycles/Call | 1357 Cycles/Call |
IBM Power 4 | 6715 Cycles/Call | 4034 Cycles/Call |
Itanium 2 (libpfm 2.0) | 2929 Cycles/Call | |
Pentium 3 (perfctr 2.4.5) | 3023 Cycles/Call | 324 Cycles/Call |
Pentium 4 (perfctr 2.4.5) | 332 Cycles/Call * | 401 Cycles/Call |
SGI R12k | 9636 Cycles/Call | 3681 Cycles/Call |
Ultrasparc II | 3378 Cycles/Call | 2150 Cycles/Call |
* Implemented as PAPI-3.0 like structure without overlapping eventsets.
If you intend to convert your code to use PAPI 3 (which we heartily recommend!) there are a few other things you should be aware of:
If you want to convert your application to PAPI 3, a separate document is available on the PAPI website that carefully details this process. This document also includes details on the use of the PAPIvi layer to provide version independent use of PAPI calls.
That document is available in a variety of formats as the PAPI Conversion Cookbook at:
http://icl.cs.utk.edu/papi/custom/index.html?lid=49&slid=79
Hardware counters exist on every major processor today, such as Intel Pentium, IA-64, AMD Athlon, and IBM POWER series. These counters can provide performance tool developers with a basis for tool development and application developers with valuable information about sections of their code that can be improved. However, there are only a few APIs that allow access to these counters, and most of them are poorly documented, unstable, or unavailable. In addition, performance metrics may have different definitions and different programming interfaces on different platforms.
These considerations motivated the development of the PAPI Project. Some goals of the PAPI Project are as follows:
· To present a set of standard definitions for performance metrics on all platforms
The Figure below shows the internal design of the PAPI architecture. In this figure, we can see the two layers of the architecture:
The Portable Layer consists of the API (low level and high level) and machine independent support functions.
The Machine Specific Layer defines and exports a machine independent interface to machine dependent functions and data structures. These functions are defined in the substrate layer, which uses kernel extensions, operating system calls, or assembly language to access the hardware performance counters. PAPI uses the most efficient and flexible of the three, depending on what is available.
PAPI strives to provide a uniform environment across platforms. However, this is not always possible. Where hardware support for features such as overflows and multiplexing is not available, PAPI implements the features in software where possible. Also, processors do not all support the same metrics, so the events you can monitor depend on the processor in use. Therefore, the interface remains constant, but how it is implemented can vary. Throughout this guide, implementation decisions are documented where they can make a difference to the user, such as in overhead costs and sampling.
On some of the systems that PAPI supports (see Appendix D), you can install PAPI right out of the box without any additional setup. Others require drivers or patches to be installed first.
The general installation steps are below, but first find your particular Operating System’s section of the /papi/INSTALL file for current information on any additional steps that may be necessary.
General Installation
1. Pick the appropriate Makefile.<arch> for your system in the papi source distribution, edit it (if necessary) and compile.
% make -f Makefile.<arch>
2. Check for errors. Look for the libpapi.a and libpapi.so in the current directory. Optionally, run the test programs in the ‘ftests’ and ‘ctests’ directories.
Not all tests will succeed on all platforms.
% ./run_tests.sh
This will run the tests in quiet mode, which will print PASSED, FAILED, or SKIPPED. Tests are SKIPPED if the functionality being tested is not supported by that platform.
3. Create a PAPI binary distribution or install PAPI directly.
To directly install PAPI from the build tree:
% make -f Makefile.<arch> DESTDIR=<install-dir> install
Please use an absolute pathname for <install-dir>, not a relative pathname.
To create a binary kit, papi-<arch>.tgz:
% make -f Makefile.<arch> dist
PAPI is written in C. The function calls in the C interface are declared in the header file papi.h and have the following form:
<returned data type> PAPI_function_name(arg1, arg2,…)
The function calls in the Fortran interface are declared in the header file fpapi.h and have the following form:
PAPIF_function_name(arg1, arg2, …, check)
Most C function calls have an equivalent Fortran call (PAPI_<call> becomes PAPIF_<call>). The exceptions are the functions that return C pointers to structures, such as PAPI_get_opt and PAPI_get_executable_info, which are either not implemented in the Fortran interface or implemented with different calling semantics. In the Fortran interface, the return code of the corresponding C routine is returned in the argument check.
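The calling convention described above can be sketched in C. This is an illustration only, not PAPI source: demo_query_event, demof_query_event, the event code 42, and the error value -1 are all hypothetical. The one idea taken from the text is that the Fortran-callable routine hands back the C return code through its trailing check argument; the sketch also assumes the usual Fortran practice of passing arguments by reference.

```c
/* Hypothetical return codes for this sketch only. */
#define DEMO_OK     0
#define DEMO_ERROR (-1)

/* Hypothetical C routine: returns DEMO_OK on success, a negative
 * code on error. Here we pretend that only event 42 exists. */
int demo_query_event(int event_code)
{
    return (event_code == 42) ? DEMO_OK : DEMO_ERROR;
}

/* Hypothetical Fortran-callable wrapper. Fortran passes arguments by
 * reference, so every parameter is a pointer, and the C return value
 * is stored in the last argument, check, instead of being returned. */
void demof_query_event(const int *event_code, int *check)
{
    *check = demo_query_event(*event_code);
}
```

A Fortran caller would then test check after the call, in the same way a C caller tests the function's return value.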
For most architectures, the following relation holds between the pseudo-types listed and Fortran variable types:
Pseudo-type | Fortran type | Description |
C_INT | INTEGER | Default Integer type |
C_FLOAT | REAL | Default Real type |
C_LONG_LONG | INTEGER*8 | Extended size integer |
C_STRING | CHARACTER*(PAPI_MAX_STR_LEN) | Fortran string |
C_INT FUNCTION | EXTERNAL INTEGER FUNCTION | Fortran function returning integer result |
Array arguments must be of sufficient size to hold the input/output from/to the subroutine for predictable behavior. The array length is indicated either by the accompanying argument or by internal PAPI definitions.
Subroutines accepting C_STRING as an argument are on most implementations capable of reading the character string length as provided by Fortran. In these implementations, the string is truncated or space padded as necessary. For other implementations, the length of the character array is assumed to be of sufficient size. No character string longer than PAPI_MAX_STR_LEN is returned by the PAPIF interface.
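The truncate-or-pad behavior described above can be pictured with a small C helper. This is a sketch under assumptions, not the PAPIF implementation: the function name demo_c_to_fortran_string is hypothetical, and it assumes the destination is a fixed-length Fortran CHARACTER buffer whose length is known to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Copy a NUL-terminated C string into a fixed-length Fortran CHARACTER
 * buffer: truncate if the source is too long, pad with spaces if it is
 * too short. Fortran strings carry no terminating NUL, so none is
 * written. */
void demo_c_to_fortran_string(const char *src, char *dst, size_t dst_len)
{
    size_t n = strlen(src);
    if (n > dst_len)
        n = dst_len;                     /* truncate */
    memcpy(dst, src, n);
    memset(dst + n, ' ', dst_len - n);   /* space pad the remainder */
}
```

Declaring Fortran character variables as CHARACTER*(PAPI_MAX_STR_LEN), as the table above suggests, avoids truncation on implementations where the length is read from the caller.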
Events are occurrences of specific signals related to a processor’s function. Hardware performance counters exist as a small set of registers that count events, such as cache misses and floating point operations while the program executes on the processor. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. Each processor has a number of events that are native to that architecture. PAPI provides a software abstraction of these architecture-dependent native events into a collection of preset events that are accessible through the PAPI interface.
Native events comprise the set of all events that are countable by the CPU. There are generally far more native events available than can be mapped onto PAPI preset events. Even if no preset event is available that exposes a given native event, native events can still be accessed directly. To use native events effectively you should be very familiar with the particular platform in use. PAPI provides access to native events on all supported platforms through the low-level interface. Native events use the same interface as used when setting up a preset event, but since a PAPI preset event definition is not available for native events, a native event name must often be translated into an event code before it can be used.
Native event codes and names are platform dependent, so native codes for one platform most likely will not work for any other platform. To determine the native events for your platform, see the native event lists for the various platforms in the processor architecture manual. Every attempt is made to keep native event names used by PAPI as similar as possible to those used in the vendor documentation. This is not always possible. The test code native_avail.c provides insight into the names of the native events for a specific platform.
Native events are specified as arguments to the low-level function, PAPI_add_event in a manner similar to adding PAPI preset events. In the following code example, a native event name is converted to an event code and added to an eventset by using PAPI_add_event:
For more code examples using native events, see ctests/native.c and ctests/native_avail.c in the papi source distribution.
Preset events, also known as predefined events, are a common set of events deemed relevant and useful for application performance tuning. These events are typically found in many CPUs that provide performance counters and give access to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit, and pipeline status. Furthermore, preset events are mappings from symbolic names (PAPI preset name) to machine specific definitions (native countable events) for a particular hardware resource. For example, Total Cycles (in user mode) is PAPI_TOT_CYC. Also, PAPI supports presets that may be derived from the underlying hardware metrics. For example, Total L1 Cache Misses (PAPI_L1_TCM) might be the sum of L1 Data Misses and L1 Instruction Misses on a given platform. A preset can be either directly available as a single counter, derived using a combination of counters, or unavailable on any particular platform.
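The three cases named above (directly available, derived, or unavailable) can be modeled in a few lines of C. This is a toy model, not PAPI's internal representation: every type and function name below is hypothetical, and the only derivation it handles is a sum of two native counts, as in the L1 cache miss example.

```c
/* How a preset maps onto native counters in this sketch. */
enum demo_preset_kind { DEMO_DIRECT, DEMO_DERIVED_SUM, DEMO_UNAVAILABLE };

struct demo_preset {
    enum demo_preset_kind kind;
    int native_index[2];   /* indices into the native counter array */
};

/* Compute a preset's value from raw native counts.
 * Returns 0 on success, -1 if the preset is unavailable. */
int demo_preset_value(const struct demo_preset *p,
                      const long long *native_counts,
                      long long *value)
{
    switch (p->kind) {
    case DEMO_DIRECT:
        *value = native_counts[p->native_index[0]];
        return 0;
    case DEMO_DERIVED_SUM:
        *value = native_counts[p->native_index[0]]
               + native_counts[p->native_index[1]];
        return 0;
    default:
        return -1;   /* preset cannot be counted on this platform */
    }
}
```

A derived preset consumes more than one hardware counter, which is one reason the number of presets that can be counted at the same time varies between platforms.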
The PAPI library names approximately 100 preset events, which are defined in the header file, papiStdEventDefs.h. For a given platform, a subset of these preset events can be counted through either a simple high-level programming interface or a more complete C or Fortran low-level interface. For a representative list of all the preset events on some supported platforms, see Appendix A. Note that processors and software are revised over time, and this list may not be up to date. To determine exactly which preset events are available on a specific platform, run ctests/avail.c in the papi source distribution.
The exact semantics of an event counter are platform dependent. PAPI preset names are mapped onto available events so as to map as many countable events as possible on different platforms. Due to hardware implementation differences, it is not necessarily feasible to directly compare the counts of a particular PAPI preset event obtained on different hardware platforms.
The following low-level functions can be called to query about the existence of a preset or native event (in other words, if the hardware supports that certain event), and to get details about that event:
C:
PAPI_query_event(EventCode)
PAPI_get_event_info(EventCode, &info)
PAPI_enum_event(&EventCode, modifier)
Fortran:
PAPIF_query_event(EventCode, check)
PAPIF_get_event_info(EventCode, symbol, longDescr, shortDescr, count, note, flags, check)
PAPIF_enum_event(EventCode, modifier, check)
EventCode -- a defined event, such as PAPI_TOT_INS.
symbol -- the event symbol, or name, such as the preset name, PAPI_BR_CN.
longDescr -- a descriptive string for the event of length less than PAPI_MAX_STR_LEN.
shortDescr -- a short descriptive string for the event of length less than 18 characters.
count -- zero if the event CANNOT be counted.
note -- additional text information about an event (if available).
flags -- provides additional information about an event, e.g., PAPI_DERIVED for an event derived from 2 or more other events.
modifier -- modifies the search criteria; for preset events, returns all events or only available events; for native events, the definition is platform dependent.
PAPI_query_event asks the PAPI library if the preset or native event can be counted on this architecture. If the event CAN be counted, the function returns PAPI_OK. If the event CANNOT be counted, the function returns an error code.
PAPI_get_event_info asks the PAPI library for a copy of an event descriptor. This descriptor can then be used to investigate the details about the event. In Fortran, the individual fields in the descriptor are returned as parameters.
PAPI_enum_event asks the PAPI library to return an event code for the next sequential event based on the current event code and the modifier. This function can be used to enumerate all preset or native events on any platform. See ctests/avail.c or ctests/native_avail.c for details.
In the above code example, PAPI_query_event is used to see if a preset (PAPI_TOT_INS) exists, PAPI_get_event_info is used to query details about the event, and PAPI_enum_event is used to count the number of events in the preset list after this preset.
On success, all three of these functions return PAPI_OK, and on error, a non-zero error code is returned.
A preset or native event can be referenced by name or by event code. Most PAPI functions require an event code, while most user input and output is in terms of names. Two low-level functions are provided to translate between these formats:
C:
PAPI_event_name_to_code(EventName, EventCode)
PAPI_event_code_to_name(EventCode, EventName)
Fortran:
PAPIF_event_name_to_code(EventName, EventCode, check)
PAPIF_event_code_to_name(EventCode, EventName, check)
EventCode -- a preset or native event of integer type, such as PAPI_TOT_INS.
EventName -- the event name string, such as the preset name, “PAPI_BR_CN”.
Note that the preset does not actually have to be available on a given platform to call these functions. Native event names are platform specific and where feasible match those given in the vendor documentation.
PAPI_event_name_to_code is used to translate an ASCII PAPI preset or native event name into an integer PAPI event code.
PAPI_event_code_to_name is used to translate an integer PAPI event code into an ASCII PAPI preset or native event name.
Using PAPI_event_code_to_name in conjunction with PAPI_enum_event is a good way to explore the names of native events on a specific platform, as shown in the following code example:
The output will vary depending on the platform. This was generated on an Intel Pentium III processor.
On success, all the functions return PAPI_OK and on error, a non-zero error code is returned.
The high-level API (Application Programming Interface) provides the ability to start, stop, and read the counters for a specified list of events. It is meant for programmers wanting simple event measurements using only PAPI preset events. Earlier versions of the high-level API were not thread safe, but this restriction has been removed in PAPI 3. Some of the benefits of using the high-level API rather than the low-level API are that it is easier to use and requires less setup (additional calls). This ease of use comes with somewhat higher overhead and loss of flexibility.
It should also be noted that the high-level API can be used in conjunction with the low-level API and in fact does call the low-level API. However, the high-level API by itself is only able to access those events countable simultaneously by the underlying hardware.
There are eight functions that represent the high-level API that allow the user to access and count specific hardware events. Note that these functions can be accessed from both C and Fortran. For a code example of using the high-level interface, see Simple Code Examples: High Level API or ctests/high-level.c in the PAPI source distribution.
For full details on the calling semantics of these functions, please refer to the PAPI Programmer’s Reference.
The PAPI library is initialized implicitly by several high-level API calls. In addition to the three rate calls discussed later, either of the following two functions also implicitly initializes the library:
C:
PAPI_num_counters()
PAPI_start_counters(*events, array_length)
Fortran:
PAPIF_num_counters(check)
PAPIF_start_counters(*events, array_length, check)
*events -- an array of codes for events such as PAPI_INT_INS or a native event code.
array_length -- the number of items in the events array.
PAPI_num_counters returns the optimal length of the values array for high-level functions. This value corresponds to the number of hardware counters supported by the current substrate. PAPI_num_counters initializes the PAPI library using PAPI_library_init if necessary.
PAPI_start_counters initializes the PAPI library (if necessary) and starts counting the events named in the events array. This function implicitly stops and initializes any counters running as a result of a previous call to PAPI_start_counters. It is the user’s responsibility to choose events that can be counted simultaneously by reading the vendor’s documentation. The length of the events array should be no longer than the value returned by PAPI_num_counters.
In the following code example, PAPI_num_counters is used to initialize the library and to get the number of hardware counters available on the system. Also, PAPI_start_counters is used to start counting events:
On success, PAPI_num_counters returns the number of hardware counters available on the system and on error, a non-zero error code is returned.
Optionally, the PAPI library can be initialized explicitly by using PAPI_library_init. This can be useful if you wish to call PAPI low-level API functions before using the high-level functions.
Three PAPI high-level functions are available to measure floating point or total instruction rates. These three calls are shown below:
C:
PAPI_flips(*real_time, *proc_time, *flpins, *mflips)
PAPI_flops(*real_time, *proc_time, *flpins, *mflops)
PAPI_ipc(*real_time, *proc_time, *ins, *ipc)
Fortran:
PAPIF_flips(real_time, proc_time, flpins, mflips, check)
PAPIF_flops(real_time, proc_time, flpins, mflops, check)
PAPIF_ipc(real_time, proc_time, ins, ipc, check)
*real_time -- the total real (wallclock) time since the first rate call.
*proc_time -- the total process time since the first rate call.
*flpins -- the total floating point instructions since the first rate call.
*mflips, *mflops -- millions of floating point instructions or operations per second achieved since the latest rate call.
*ins -- the total instructions executed since the first PAPI_ipc call.
*ipc -- instructions per cycle achieved since the latest PAPI_ipc call.
The first execution rate call initializes the PAPI library if needed, sets up the counters to monitor either PAPI_FP_INS, PAPI_FP_OPS or PAPI_TOT_INS (depending on the call), and PAPI_TOT_CYC events, and starts the counters. Subsequent calls to the same rate function will read the counters and return total real time, total process time, total instructions or operations, and the appropriate rate of execution since the last call. A call to PAPI_stop_counters will reinitialize all values to 0. Sequential calls to different execution rate functions will return an error.
Note that on many platforms there may be subtle differences between floating point instructions and operations. Instructions are typically those execution elements most directly measured by the hardware counters. They may include floating point load and store instructions, and may count instructions such as FMA as one, even though two floating point operations have occurred. Consult the hardware documentation for your system for more details. Operations represent a derived value where an attempt is made, when possible, to more closely map to the expected definition of a floating point event.
On success, the rate calls return PAPI_OK and on error, a non-zero error code is returned.
For a code example, see ctests/flops.c or ctests/ipc.c in the papi source distribution.
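The arithmetic behind the rate calls can be sketched without PAPI itself. The helpers below are hypothetical and are not part of the library; they assume that time is given in seconds and that each fused multiply-add is counted as one instruction but two operations, following the FMA remark above. The rate they compute applies to whatever interval the count and the time cover.

```c
/* Millions of floating point instructions (or operations) per second,
 * given a count and the elapsed process time in seconds. */
double demo_mrate(long long count, double proc_time_s)
{
    if (proc_time_s <= 0.0)
        return 0.0;                 /* avoid dividing by zero */
    return (double)count / proc_time_s / 1.0e6;
}

/* Instructions per cycle. */
double demo_ipc(long long instructions, long long cycles)
{
    if (cycles <= 0)
        return 0.0;
    return (double)instructions / (double)cycles;
}

/* Operation count when every fused multiply-add is counted as one
 * instruction but performs two floating point operations. */
long long demo_ops_from_ins(long long fp_instructions,
                            long long fma_instructions)
{
    return fp_instructions + fma_instructions;
}
```

For instance, 100 floating point instructions of which 40 are FMAs correspond to 140 operations under this assumption, so an operation rate can exceed the instruction rate measured for the same code.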
Counters can be read, accumulated, and stopped by calling the following high-level functions, respectively:
C:
PAPI_read_counters(*values, array_length)
PAPI_accum_counters(*values, array_length)
PAPI_stop_counters(*values, array_length)
Fortran:
PAPIF_read_counters(*values, array_length, check)
PAPIF_accum_counters(*values, array_length, check)
PAPIF_stop_counters(*values, array_length, check)
*values -- an array in which to store the counter values.
array_length -- the number of items in the *values array.
PAPI_read_counters, PAPI_accum_counters and PAPI_stop_counters all capture the values of the currently running counters into the array, values. Each of these functions behaves somewhat differently.
PAPI_read_counters copies the current counts into the elements of the values array and leaves the counters running.
PAPI_accum_counters adds the current counts into the elements of the values array and resets the counters to zero, leaving the counters running. Care should be exercised not to mix calls to PAPI_accum_counters with calls to the execution rate functions. Such intermixing is likely to produce unexpected results.
PAPI_stop_counters stops the counters and copies the current counts into the elements of the values array. This call can also be used to reset the rate functions if used with a NULL pointer to the values array.
In the following code example, PAPI_read_counters and PAPI_stop_counters are used to read and to stop the event counters, respectively, storing the counts in an array:
On success, all of these functions return PAPI_OK and on error, a non-zero error code is returned.
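The difference between reading, accumulating, and stopping can be seen in a toy model that does not call PAPI at all. Everything below is hypothetical: struct demo_counters stands in for the running hardware counts, and the three functions mirror only what the descriptions above state. Whether a read also resets the counts is not stated above, so the model leaves them untouched; consult the PAPI Programmer’s Reference for the exact behavior.

```c
#include <stddef.h>

#define DEMO_NUM 2   /* number of counters in this toy model */

/* 'hw' stands in for the running hardware counts. */
struct demo_counters {
    long long hw[DEMO_NUM];
    int running;
};

/* read: copy the current counts, leave the counters running. */
void demo_read(const struct demo_counters *c, long long *values)
{
    for (size_t i = 0; i < DEMO_NUM; i++)
        values[i] = c->hw[i];
}

/* accum: add the current counts into values, then reset the counters
 * to zero; they keep running. */
void demo_accum(struct demo_counters *c, long long *values)
{
    for (size_t i = 0; i < DEMO_NUM; i++) {
        values[i] += c->hw[i];
        c->hw[i] = 0;
    }
}

/* stop: stop counting and copy the final counts. values may be NULL
 * when only the stop itself is wanted. */
void demo_stop(struct demo_counters *c, long long *values)
{
    c->running = 0;
    if (values != NULL)
        for (size_t i = 0; i < DEMO_NUM; i++)
            values[i] = c->hw[i];
}
```

The key point is that accumulate adds to whatever is already in the values array, so that array must be zeroed before the first accumulating call if a clean total is wanted.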
The low-level API (Application Programming Interface) manages hardware events in user-defined groups called Event Sets. It is meant for experienced application programmers and tool developers wanting fine-grained measurement and control of the PAPI interface. Unlike the high-level interface, it allows both PAPI preset and native events. Other features of the low-level API are the ability to obtain information about the executable and the hardware and to set options for multiplexing and overflow handling. Some of the benefits of using the low-level API rather than the high-level API are that it increases efficiency and functionality.
It should also be noted that the low-level interface can be used in conjunction with the high-level interface, as long as care is taken to ensure that the PAPI library is initialized prior to the first low-level PAPI call.
The low-level API is only as powerful as the substrate upon which it is built. Thus, some features may not be available on every platform. The converse may also be true, that more advanced features may be available on every platform and defined in the header file. Therefore, the user is encouraged to read the documentation for each platform carefully. There are approximately 50 functions that represent the low-level API. For a code example of using the low-level interface, see Simple Code Examples: Low-Level API or ctests/low_level.c in the PAPI source distribution.
Note that most functions are implemented in both C and Fortran, but some are implemented in only one of these two languages. For full details on the calling semantics of these functions, please refer to the PAPI Programmer’s Reference.
The PAPI library must be initialized before it can be used. It can be initialized explicitly by calling the following low-level function:
C:
PAPI_library_init(version)
Fortran:
PAPIF_library_init(check)
version -- upon initialization, PAPI checks the argument against the internal value of PAPI_VER_CURRENT when the library was compiled. This guards against portability problems when updating the PAPI shared libraries on your system.
Note that this function must be called before calling any other low-level PAPI function.
On success, this function returns PAPI_VER_CURRENT.
On error, a positive return code other than PAPI_VER_CURRENT indicates a library version mismatch and a negative return code indicates an initialization error.
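The three possible outcomes of initialization can be separated with a small helper. This is a sketch, not part of PAPI: the enum and the function demo_classify_init are hypothetical, and a return value of zero, which the text above does not describe, is treated as an error here.

```c
enum demo_init_status {
    DEMO_INIT_OK,
    DEMO_INIT_VERSION_MISMATCH,
    DEMO_INIT_ERROR
};

/* Classify the value returned by the initialization call.
 * 'expected' is the version constant the program was compiled against. */
enum demo_init_status demo_classify_init(int retval, int expected)
{
    if (retval == expected)
        return DEMO_INIT_OK;
    if (retval > 0)
        return DEMO_INIT_VERSION_MISMATCH;  /* positive but different */
    return DEMO_INIT_ERROR;                 /* negative (or zero, by assumption) */
}
```

In a real program, retval would be the result of PAPI_library_init(PAPI_VER_CURRENT) and expected would be PAPI_VER_CURRENT. Distinguishing the mismatch case is useful because it points to an installation problem and not to a failure of the counters themselves.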
Beginning with PAPI 3.0, there are a number of options for examining the current version number of PAPI:
· PAPI_VERSION produces an integer containing the complete current version including MAJOR, MINOR, and REVISION components. Typically the REVISION component changes with bug fixes or minor enhancements, the MINOR component changes with feature additions or API changes, and the MAJOR component changes with significant API structural changes.
· PAPI_VER_CURRENT contains the MAJOR and MINOR components and is useful for determining library compatibility changes.
· PAPI_VERSION_MAJOR, PAPI_VERSION_MINOR, and PAPI_VERSION_REVISION are macros that extract the specified component from the version number.
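One way such a packed version number can work is sketched below. The bit layout is an assumption made for illustration (MAJOR in bits 16 and up, MINOR in bits 8 to 15, REVISION in bits 0 to 7) and all DEMO_ names are hypothetical; check papi.h for the definitions your PAPI version actually uses.

```c
/* Assumed layout for this sketch only: MAJOR in bits 16 and up,
 * MINOR in bits 8-15, REVISION in bits 0-7. */
#define DEMO_VERSION_NUMBER(maj, min, rev) \
    (((maj) << 16) | ((min) << 8) | (rev))
#define DEMO_VERSION_MAJOR(x)    (((x) >> 16) & 0xffff)
#define DEMO_VERSION_MINOR(x)    (((x) >> 8) & 0xff)
#define DEMO_VERSION_REVISION(x) ((x) & 0xff)

/* MAJOR and MINOR only, with the REVISION bits masked off. */
#define DEMO_VER_CURRENT(x)      ((x) & 0xffffff00)

/* Two versions are treated as compatible when MAJOR and MINOR match,
 * regardless of REVISION. */
int demo_compatible(int a, int b)
{
    return DEMO_VER_CURRENT(a) == DEMO_VER_CURRENT(b);
}
```

Masking off the revision is what lets a bug-fix release of the library remain compatible with programs compiled against an earlier revision of the same MAJOR and MINOR version.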
The following is a code example of using PAPI_library_init to initialize the PAPI library: