PAPI v3.0 Cookbook
April 11, 2003
Philip J. Mucci
DRAFT
This document is intended to provide the PAPI developers with some guidelines
as to how to implement a substrate for PAPI 3.0. It is NOT intended as a
definitive specification for PAPI 3.0. Much of the new functionality has
yet to be implemented on any platform, and some may not make it into the
first release of PAPI 3. This document, like PAPI 3, is a work in progress.
Please review it and send comments to: papi@cs.utk.edu.
NOTE TO PROSPECTIVE DEVELOPERS:
If you see a portion of this project to which you would like to contribute,
let us know by sending mail to: papi@cs.utk.edu
In order to make accurate and sustained progress on PAPI 3.0, an iterative
approach will be required. As a first step, the platform-independent code
will need to be restructured. At this point, all substrates will be broken
and will need to be brought into conformance with the platform-independent
code. Once all substrates are functional again, new functionality will need
to be added, which may again break all substrates. It may be appropriate
at this stage to focus on a full implementation of a single substrate, such
as Pentium 4/PerfCtr, to develop proof-of-principle for other substrates.
Once complete, the development team should have enough knowledge to
develop other substrates.
Some fundamental changes in the PAPI library:
1) All thread handling routines have been moved to threads.[ch].
2) All PAPI API calls exist in api.c
3) All PAPI API support routines exist in papi.c
4) All internal PAPI prototypes live in papi_protos.h
Some fundamental changes in the substrates:
1) All substrates will support 'third party' operation. This
means that PAPI can 'attach' to another running process, and possibly to
another thread. Many kernel counter APIs require different function calls
for this mode of operation. This extends the granularities PAPI will support:
a. PAPI_THREAD
b. PAPI_PROCESS
c. PAPI_REMOTE
d. PAPI_CPU
e. PAPI_SYSTEM
2) All substrates will provide for querying for support of the following,
via fields in the papi_mdi_t structure.
a. High resolution wall clock timer, returning true only
if the timer is HIGH RESOLUTION, i.e. not gettimeofday()
b. High resolution virtual clock timer, returning true only if the timer
is HIGH RESOLUTION, i.e. not getrusage()
c. OS-level multiplexing, like IRIX.
d. OS-level data profiling, like the IA64/P4
e. OS-level branch profiling, like the IA64/P4
f. OS-level text profiling, like IRIX and maybe IA64
3) All substrates will no longer statically define the fields in papi_mdi_t.
While the structure will be statically declared, all fields will be filled
in at runtime after initializing the structure to all zeroes.
4) All substrates will support FULL NATIVE event support through a single
_papi3_hwd_add_event function.
5) All substrates will provide a wrapper function that will encode native
events for the user.
6) All allocated memory shall use the papi_malloc() wrapper which will
be defined in the machine independent source files.
7) ALL FUNCTIONS NOT DECLARED STATIC WILL BE PREFIXED WITH _papi3_hwd.
8) All code shall be fully commented.
9) Any code deemed 'generalizable' shall be moved out of the substrate.
10) All substrates will use HAIHANG'S NEW REGISTER ALLOCATOR.
11) All memory functions will be included in the substrate file.
12) All substrates will compute a MHz value only when the system does not
provide one.
13) Any functions suitable for implementation as macros shall be implemented
that way in the substrate header file. This is especially true for
_papi3_hwd_read, _papi3_hwd_stop, _papi3_hwd_start and _papi3_hwd_reset.
14) All substrates shall be aware of GRANULARITY, either thread, process,
cpu or system.
15) Derived events should be handled outside of the substrate.
16) All return statements indicating an error will be wrapped with a statement
like error_return. It will have 2 arguments: first the actual return code, and
second an understandable message.
17) Aggregate values from counters used for anything other than counting
will not be supported, i.e. you don't get valid values from PAPI_read()ing
an EventSet that has overflowing enabled.
18) All PAPI event definitions files will be made external to the substrate.
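Item 16 above does not fix the form of error_return. One possible shape is sketched here as a macro, under stated assumptions: the return codes, the message text, the sketch_last_error variable and the sketch_stop function are all hypothetical names used only for illustration, not part of PAPI.

```c
#include <stdio.h>

/* Hypothetical return codes for this sketch; real values would come from papi.h. */
#define SKETCH_OK     0
#define SKETCH_ESYS  (-3)

/* Last message recorded by error_return, kept so callers can inspect it. */
static const char *sketch_last_error = NULL;

/* error_return(code, msg): record and print the message, then return the
   code from the enclosing function. The do/while(0) wrapper makes the macro
   behave as one statement, so it is safe inside an unbraced if/else. */
#define error_return(code, msg)                              \
    do {                                                     \
        sketch_last_error = (msg);                           \
        fprintf(stderr, "%s:%d: %s (code %d)\n",             \
                __FILE__, __LINE__, (msg), (code));          \
        return (code);                                       \
    } while (0)

/* Example use: fail when a (hypothetical) kernel call reports an error. */
static int sketch_stop(int kernel_result)
{
    if (kernel_result < 0)
        error_return(SKETCH_ESYS, "counter stop failed");
    return SKETCH_OK;
}
```

Because the macro contains a return, it can only be used inside functions whose return type matches the code being returned.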
Functions to be removed:
_papi_hwd_merge
_papi_hwd_unmerge
_papi_hwd_add_prog_event
Functions to be added:
_papi3_hwd_start
_papi3_hwd_stop
EventSet Management
The substrates will no longer be required to merge/unmerge different event
sets.
IN PAPI 3.0, ONLY ONE EVENTSET PER THREAD OR CONTEXT CAN BE IN USE AT ANY
TIME.
Some fundamentals:
The EventInfo_t structure will now contain MACHINE DEPENDENT information
in addition to MACHINE INDEPENDENT.
The current EventInfo_t structure has:
int code;               /* Preset or native code for this event PAPI_add_event() */
unsigned int selector;  /* Counter select bits used in the lower level */
int command;            /* Counter derivation command used in the lower level */
int operand_index;      /* Counter derivation data used in the lower level */
int index;              /* added to indicate the position in the array */
The new EventInfo_t structure has:
unsigned code;          /* Code passed to PAPI_add_event() or PAPI_NATIVE */
unsigned index;         /* Index used to access the counters from the hw/os */
papi3_eventinfo_t info; /* Machine dependent information */
This new structure, papi3_eventinfo_t, will be defined in the substrate's
header file. This structure will contain all the information necessary to:
1) Map a counter from the kernel/hardware
to the software counter when PAPI_read() is called.
2) Remove a counter fully from the EventSet.
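The mapping in point 1 could look roughly like the sketch below. It assumes that the index field selects a slot in the raw counter array returned by the hardware or OS. The contents of papi3_eventinfo_t, the counter limit and the function name sketch_distribute are hypothetical; each substrate would define its own.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_COUNTERS 4   /* hypothetical number of hardware counters */

/* Hypothetical machine dependent part; a real substrate defines its own. */
typedef struct {
    int hw_register;            /* hardware counter the event was placed on */
} papi3_eventinfo_t;

/* New EventInfo_t layout as described above. */
typedef struct {
    unsigned code;              /* code passed to PAPI_add_event() or PAPI_NATIVE */
    unsigned index;             /* slot of this event in the hw/os counter array */
    papi3_eventinfo_t info;     /* substrate-private data */
} EventInfo_t;

/* Copy raw values (hardware order) into user values (EventSet order). */
static void sketch_distribute(const EventInfo_t *events, size_t n,
                              const long long *hw_values,
                              long long *user_values)
{
    assert(events != NULL && hw_values != NULL && user_values != NULL);
    for (size_t i = 0; i < n; i++) {
        assert(events[i].index < SKETCH_MAX_COUNTERS);
        user_values[i] = hw_values[events[i].index];
    }
}
```

With this layout the machine independent layer can reorder counter values without knowing anything about the substrate-private info field.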
Adding events:
Adding hardware events in PAPI 3 will follow a different path. The new add_event
call will allow the substrate to do full register reallocation and optimization.
Exactly how this will operate is to be determined by the integration of
the new register allocation scheme.
Some sample arguments are included here for discussion:
_papi3_hwd_add_events()
As input:
The entire EventInfo_t array and its length
Space for the machine control structure
As output:
A filled machine control structure suitable for passing to _papi3_hwd_start.
Modified EventInfo_t array entries corresponding to the new register mapping.
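To make the inputs and outputs above concrete, here is a minimal sketch of such a call. It is only a placeholder: events are placed on registers in order, where a real substrate would call the new register allocator. The control structure, the return codes and the name sketch_hwd_add_events are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NUM_REGS  4      /* hypothetical number of hardware counters */
#define SKETCH_OK        0
#define SKETCH_ECNFLCT (-1)     /* hypothetical "cannot place events" code */

typedef struct { int hw_register; } papi3_eventinfo_t;   /* hypothetical */

typedef struct {
    unsigned code;
    unsigned index;
    papi3_eventinfo_t info;
} EventInfo_t;

/* Hypothetical machine control structure handed on to _papi3_hwd_start. */
typedef struct {
    int      num_used;
    unsigned reg_event[SKETCH_NUM_REGS];  /* event code programmed per register */
} sketch_control_t;

/* Placeholder allocation: event i goes on register i. */
static int sketch_hwd_add_events(EventInfo_t *events, int count,
                                 sketch_control_t *ctl)
{
    assert(events != NULL && ctl != NULL);
    if (count < 0 || count > SKETCH_NUM_REGS)
        return SKETCH_ECNFLCT;
    memset(ctl, 0, sizeof(*ctl));         /* start from an all-zero structure */
    for (int i = 0; i < count; i++) {
        ctl->reg_event[i] = events[i].code;
        events[i].index = (unsigned)i;    /* where PAPI_read() finds the value */
        events[i].info.hw_register = i;
    }
    ctl->num_used = count;
    return SKETCH_OK;
}
```

The point of the sketch is the data flow: the whole array goes in, and both the control structure and the per-event mapping come out updated.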
Adding Native Events:
Native Events that can be added by simply specifying a register code that
fits in the appropriate number of bits will still be supported in PAPI 3.0.
This will be accomplished with a new call: PAPI_encode_native(). This will
allow user code to use system specific native definitions where appropriate,
WITHOUT having PAPI redefine every single event for an architecture.
For example, on the IA64:
PAPI_add_event(EventSet,PAPI_encode_native(PFM_CYCLES));
This call will simply set 'event_code' to PAPI_NATIVE and _papi3_hwd_add_events
will fill in the machine dependent structure as appropriate.
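One way such an encoding could work is to reserve a flag bit that marks the code as native and to carry the register code in the remaining bits. The bit layout and the sketch_ names below are assumptions for illustration; the actual PAPI 3 encoding may differ.

```c
#include <assert.h>

/* Hypothetical layout: bit 30 marks a native event and the low 30 bits
   carry the platform's register code. */
#define SKETCH_NATIVE_FLAG  0x40000000u
#define SKETCH_NATIVE_MASK  0x3FFFFFFFu

/* Wrap a platform register code so it can be passed as an event code. */
static unsigned sketch_encode_native(unsigned register_code)
{
    /* The register code must fit in the payload bits. */
    assert((register_code & ~SKETCH_NATIVE_MASK) == 0);
    return SKETCH_NATIVE_FLAG | register_code;
}

/* Nonzero when the event code carries the native flag. */
static int sketch_is_native(unsigned event_code)
{
    return (event_code & SKETCH_NATIVE_FLAG) != 0;
}

/* Recover the platform register code from an encoded event code. */
static unsigned sketch_decode_native(unsigned event_code)
{
    return event_code & SKETCH_NATIVE_MASK;
}
```

With a scheme like this the platform-independent code only needs to test the flag, and the substrate decodes the payload when it fills in its machine dependent structure.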
Adding Programmable Events:
Programmable Events that can be added by simply specifying a register code
that fits in the appropriate number of bits will still be supported in PAPI
3.0. This will be accomplished with a new call: PAPI_encode_native_prog().
This will allow user code to use system specific definitions.
For example, on the IA64:
/* Set up the programmable structure for counting cache misses that take
longer than 10 cycles. */
PAPI_native_prog_t space;
PAPI_encode_native_prog(&space, PFM_L1_MISSES,THRESHOLD,10,...)
PAPI_add_prog_event(EventSet,&space);
On the x86 with Perfctr:
PAPI_encode_native_prog(&space, pmc_map, evntsel, evntsel_aux, ...)
PAPI_add_prog_event(EventSet,&space);
This call will simply set 'event_code' to PAPI_NATPROG and _papi3_hwd_add_events
will fill in the machine dependent structure as appropriate.
How to port a PAPI 2.x substrate to PAPI 3.0
1) Rename all non-static functions to _papi3_hwd_xxx
2) Change the merge/unmerge calls to their equivalent start/stop routines.
If this code consumes more than a dozen lines or so, you have done something
wrong. For instance, on the x86/perfctr port, the call looks like the
following (slow) pseudo code:
int _papi3_hwd_stop(P4_perfctr_context_t *ctx, P4_perfctr_control_t *state)
{
  /* Braces are required: without them each else binds to the inner if. */
  if (ctx->granularity == PAPI_THREAD) {
    if (vperfctr_stop(ctx->perfctr) < 0)
      error_return(PAPI_ESYS,VCNTRL_ERROR);
  } else if (ctx->granularity == PAPI_REMOTE) {
    if (rperfctr_stop(ctx->perfctr) < 0)
      error_return(PAPI_ESYS,VCNTRL_ERROR);
  } else if (ctx->granularity == PAPI_SYSTEM) {
    if (gperfctr_stop(ctx->perfctr) < 0)
      error_return(PAPI_ESYS,VCNTRL_ERROR);
  }
  return(PAPI_OK);
}
Optionally, this could be implemented with custom function pointers in the
P4_perfctr_context_t structure. This makes for much cleaner and faster code.
When the granularity is changed, the function pointers in the context are
changed.
int _papi3_hwd_stop(P4_perfctr_context_t *ctx, P4_perfctr_control_t *state)
{
  if (ctx->perfctr_stop(ctx->perfctr) == 0)
    return(PAPI_OK);
  else
    error_return(PAPI_ESYS,VCNTRL_ERROR);
}
Note that for perfctr, PROCESS level granularity will be implemented in
the machine independent layer by looping over all the threads and calling
_papi3_hwd_stop on each one.
3) Most substrate calls will take 2 arguments. The first is a pointer to
the current context of the performance counter hardware. The second is the
argument to be passed to the call.
4) Change add_event and remove_event calls to call the new register allocation
routines.
5) Remove all static initialization of the papi_mdi_t structure.
6) Change all DBG calls to SUBDBG calls.
7) Remove the code from _papi3_hwd_read that distributes the counters.
This will be done in the machine independent layer.
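For step 2 above, the function pointer variant depends on the pointers being swapped whenever the granularity changes. A minimal sketch of that swap follows. The context layout, the granularity constants and the three stand-in stop routines are hypothetical. They take the place of P4_perfctr_context_t, the PAPI granularity values and the vperfctr/rperfctr/gperfctr calls.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OK      0
#define SKETCH_EINVAL (-1)

/* Hypothetical granularity values; real constants would come from papi.h. */
enum { SKETCH_THREAD, SKETCH_REMOTE, SKETCH_SYSTEM };

typedef struct sketch_context {
    int   granularity;
    void *perfctr;                        /* opaque kernel handle */
    int (*perfctr_stop)(void *perfctr);   /* chosen on each granularity change */
} sketch_context_t;

/* Stand-ins for vperfctr_stop, rperfctr_stop and gperfctr_stop. */
static int sketch_vstop(void *p) { (void)p; return 0; }
static int sketch_rstop(void *p) { (void)p; return 0; }
static int sketch_gstop(void *p) { (void)p; return 0; }

/* Select the stop routine once, so the hot path needs no branching. */
static int sketch_set_granularity(sketch_context_t *ctx, int granularity)
{
    assert(ctx != NULL);
    switch (granularity) {
    case SKETCH_THREAD: ctx->perfctr_stop = sketch_vstop; break;
    case SKETCH_REMOTE: ctx->perfctr_stop = sketch_rstop; break;
    case SKETCH_SYSTEM: ctx->perfctr_stop = sketch_gstop; break;
    default: return SKETCH_EINVAL;        /* leave the context unchanged */
    }
    ctx->granularity = granularity;
    return SKETCH_OK;
}
```

The same pattern would extend to start, read and reset by adding one pointer per operation to the context.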
Internal signal chaining.
PAPI will support programs that install their own signal handlers for signals
that PAPI uses. This will allow PAPI to coexist transparently with applications
that make use of SIGPROF.
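A minimal sketch of chaining with POSIX sigaction() is shown below. It covers only the case where the application installed a plain (non-SA_SIGINFO) handler before the library installs its own. The library saves the previous action and forwards the signal to it. All sketch_ names are hypothetical, and sketch_app_handler stands in for the application's handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

static struct sigaction sketch_prev;            /* action found at install time */
static volatile sig_atomic_t sketch_lib_hits;   /* times the library handler ran */
static volatile sig_atomic_t sketch_app_hits;   /* times the app handler ran */

/* Stand-in for a handler installed by the application. */
static void sketch_app_handler(int signo)
{
    (void)signo;
    sketch_app_hits++;
}

/* Library handler: do the library's work, then forward to the saved handler. */
static void sketch_lib_handler(int signo)
{
    sketch_lib_hits++;
    if (!(sketch_prev.sa_flags & SA_SIGINFO) &&
        sketch_prev.sa_handler != SIG_DFL &&
        sketch_prev.sa_handler != SIG_IGN)
        sketch_prev.sa_handler(signo);
}

/* Install the library handler for signo, remembering what was there before. */
static int sketch_chain_install(int signo)
{
    struct sigaction sa;
    assert(signo > 0);
    sa.sa_handler = sketch_lib_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, &sketch_prev);
}
```

An application that installs its handler after the library has done so would replace the library's handler; handling that order is outside this sketch.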
PAPI thread functions.
PAPI_list_threads(&listptr, &len) will return a list of registered
threads, i.e. those that have been detected by the PAPI library. This will
call a function in the threads.c file, _papi_hwd_list_threads, that returns
an array of thread IDs (TIDs) and its length.
PAPI_get_thread_specific(&from, int index) will return one of PAPI_MAX_TLS_POINTERS
pointers to a THREAD SPECIFIC data area. This will be implemented
with THREAD LOCAL STORAGE (TLS on gcc and IRIX, for example). This allows
monitoring programs to easily maintain per-thread storage without knowing
anything about the thread library. The same applies to
PAPI_set_thread_specific(to, int index).
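A minimal sketch of the get/set pair, assuming gcc's __thread storage qualifier is available, is given below. The slot count, the return codes and the sketch_ function names are hypothetical stand-ins for PAPI_MAX_TLS_POINTERS and the real calls.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_TLS_POINTERS 4   /* stand-in for PAPI_MAX_TLS_POINTERS */
#define SKETCH_OK      0
#define SKETCH_EINVAL (-1)

/* One slot array per thread; __thread is the gcc thread-local qualifier.
   Static storage means every slot starts out as NULL in each thread. */
static __thread void *sketch_tls[SKETCH_MAX_TLS_POINTERS];

/* Store a pointer in the calling thread's slot. */
static int sketch_set_thread_specific(void *to, int index)
{
    if (index < 0 || index >= SKETCH_MAX_TLS_POINTERS)
        return SKETCH_EINVAL;
    sketch_tls[index] = to;
    return SKETCH_OK;
}

/* Fetch the pointer stored in the calling thread's slot. */
static int sketch_get_thread_specific(void **from, int index)
{
    assert(from != NULL);
    if (index < 0 || index >= SKETCH_MAX_TLS_POINTERS)
        return SKETCH_EINVAL;
    *from = sketch_tls[index];
    return SKETCH_OK;
}
```

Since each thread sees its own copy of the array, no locking is needed and the caller never touches the thread library directly.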
PAPI_lock/PAPI_unlock
PAPI_lock/PAPI_unlock will now take an argument of a pointer to a lock
variable. This will allow finer grained locking and better performance on
SMPs. Using NULL will cause the default behavior of locking the entire library.
To create a lock variable, we will use PAPI_lock_create(&lockvar) to
get back an address and PAPI_lock_destroy(lockvar) to clean it up.
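The locking scheme described above can be sketched on top of POSIX mutexes as follows. This assumes pthreads is the underlying mechanism, which the text does not state. The sketch_ names and return codes are hypothetical, and plain malloc() is used where the library itself would use its papi_malloc() wrapper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define SKETCH_OK      0
#define SKETCH_ENOMEM (-2)
#define SKETCH_ESYS   (-3)

typedef pthread_mutex_t sketch_lock_t;

/* Library-wide lock, used when the caller passes NULL. */
static pthread_mutex_t sketch_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate and initialize a lock variable; its address comes back in *lockvar. */
static int sketch_lock_create(sketch_lock_t **lockvar)
{
    assert(lockvar != NULL);
    sketch_lock_t *l = malloc(sizeof(*l));
    if (l == NULL)
        return SKETCH_ENOMEM;
    if (pthread_mutex_init(l, NULL) != 0) {
        free(l);
        return SKETCH_ESYS;
    }
    *lockvar = l;
    return SKETCH_OK;
}

/* Release a lock variable obtained from sketch_lock_create. */
static void sketch_lock_destroy(sketch_lock_t *lockvar)
{
    if (lockvar == NULL)
        return;
    pthread_mutex_destroy(lockvar);
    free(lockvar);
}

/* NULL selects the library-wide lock (the default behavior). */
static int sketch_lock(sketch_lock_t *lockvar)
{
    pthread_mutex_t *m = lockvar ? lockvar : &sketch_global_lock;
    return pthread_mutex_lock(m) == 0 ? SKETCH_OK : SKETCH_ESYS;
}

static int sketch_unlock(sketch_lock_t *lockvar)
{
    pthread_mutex_t *m = lockvar ? lockvar : &sketch_global_lock;
    return pthread_mutex_unlock(m) == 0 ? SKETCH_OK : SKETCH_ESYS;
}
```

Code that protects only its own data can then hold a private lock without blocking every other thread that enters the library.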
PAPI_INHERIT
For systems that support the BAD IDEA of inheritance of counter data by
forks, we will support the PAPI_INHERIT option. Note this is a special case
of a PAPI_REMOTE operation. We'll have to figure out how to read it.
Source code structure:
PAPI 3.0 will have a revamped source code structure. The base tree will look
as follows:
Papi
- tools
- - dynaprof
- - perfometer
- - psrun/papiperfex
- - papirun/papiprof
- - vprof
- src
- - <substrate_name>
- doc
- man
- examples
- tests
- - ctests
- - ftests
- - jtests
OS level, Substrate Unification
Some ports of PAPI target similar operating systems but slightly
different architectures. Examples include AIX/Power3/4, Linux/Itanium/2
and Linux/x86/PIV. In PAPI 3.0, these implementations will be grouped into
three libraries.
A New Makefile Configuration System
With this simplification, we will be able to determine the architecture
at 'make config' time. As a result, users will no longer have to manually
guess the makefile that is appropriate for their platform. The new makefile
system will build the tree in architecture specific directories, so all
ports can be compiled/tested in the same directory. This will greatly simplify
the build process.