PAPI v3.0 Cookbook

April 11, 2003

Philip J. Mucci

DRAFT

This document is intended to provide the PAPI developers with some guidelines as to how to implement a substrate for PAPI 3.0. It is NOT intended as a definitive specification for PAPI 3.0. Much of the new functionality has yet to be implemented on any platform, and some may not make it into the first release of PAPI 3. This document, like PAPI 3, is a work in progress. Please review it and send comments to: papi@cs.utk.edu

NOTE TO PROSPECTIVE DEVELOPERS:
If you see a portion of this project to which you would like to contribute, let us know by sending mail to: papi@cs.utk.edu

In order to make accurate and sustained progress on PAPI 3.0, an iterative approach will be required. As a first step, the platform-independent code will need to be restructured. Once that is done, all substrates will be broken and will need to be brought into conformance with the platform-independent code. Once all substrates are functional again, new functionality will need to be added, which may again break all substrates. It may be appropriate at this stage to focus on a full implementation of a single substrate, such as Pentium 4/PerfCtr, to develop proof-of-principle for other substrates. Once that is complete, the development team should have enough knowledge to develop other substrates.

Some fundamental changes in the PAPI library:

1) All thread handling routines have been moved to threads.[ch].
2) All PAPI API calls exist in api.c
3) All PAPI API support routines exist in papi.c
4) All internal PAPI prototypes live in papi_protos.h

Some fundamental changes in the substrates:

1) All substrates will support 'third party' operation. This means that PAPI can 'attach' to another running process, and possibly to an individual thread. Many kernel counter APIs require different function calls for this mode of operation. This extends the granularities PAPI will support to:
a. PAPI_THREAD
b. PAPI_PROCESS
c. PAPI_REMOTE
d. PAPI_CPU
e. PAPI_SYSTEM
2) All substrates will provide for querying for support of the following, via fields in the papi_mdi_t structure.
a. High resolution wall clock timer, returning true only if the timer is HIGH RESOLUTION, i.e. not gettimeofday()
b. High resolution virtual clock timer, returning true only if the timer is HIGH RESOLUTION, i.e. not getrusage()
c. OS-level multiplexing, like IRIX.
d. OS-level data profiling, like the IA64/P4
e. OS-level branch profiling, like the IA64/P4
f. OS-level text profiling, like IRIX and maybe IA64
3) Substrates will no longer statically define the fields in papi_mdi_t. While the structure will be statically declared, all fields will be filled in at runtime after initializing the structure to all zeroes.
4) All substrates will support FULL NATIVE event support through a single _papi3_hwd_add_event function.
5) All substrates will provide a wrapper function that will encode native events for the user.
6) All allocated memory shall use the papi_malloc() wrapper which will be defined in the machine independent source files.
7) ALL FUNCTIONS NOT DECLARED STATIC WILL BE PREFIXED WITH _papi3_hwd.
8) All code shall be fully commented.
9) Any code deemed 'generalizable' shall be moved out of the substrate.
10) All substrates will use HAIHANG'S NEW REGISTER ALLOCATOR.
11) All memory functions will be included in the substrate file.
12) Substrates will no longer attempt to compute a valid MHz value unless one is not available from the system.
13) Any functions suitable for implementation as macros shall be implemented that way in the substrate header file. This is especially true for _papi3_hwd_read, _papi3_hwd_stop, _papi3_hwd_start and _papi3_hwd_reset.
14) All substrates shall be aware of GRANULARITY, either thread, process, cpu or system.
15) Derived events should be handled outside of the substrate.
16) All return statements indicating an error will be wrapped with a statement like error_return. It will take 2 arguments: first the actual return code, and second an understandable message.
17) Aggregate values from counters used for anything other than counting will not be supported, i.e. you don't get valid values from PAPI_read()ing an EventSet that has overflowing enabled.
18) All PAPI event definitions files will be made external to the substrate.
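
Items 3 and 16 above can be sketched together as follows. This is only an illustration under stated assumptions: demo_mdi_t, its fields, demo_hwd_init and the DEMO_ return codes are hypothetical stand-ins, not the real papi_mdi_t layout or PAPI error codes, and the body of error_return is one possible shape for the wrapper, not a settled definition.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-ins: the real return codes and papi_mdi_t fields are
 * defined by PAPI itself, not by this sketch. */
#define DEMO_OK    0
#define DEMO_ESYS  (-3)

typedef struct {
    int has_hires_wall_timer;  /* item 2a */
    int has_hires_virt_timer;  /* item 2b */
    int has_os_multiplex;      /* item 2c */
    int num_counters;          /* filled in from the hardware probe */
} demo_mdi_t;

/* Statically declared, never statically initialized (item 3). */
demo_mdi_t demo_system_info;

/* Item 16: report an understandable message, then return the code from the
 * enclosing function. */
#define error_return(code, msg)                                    \
    do {                                                           \
        fprintf(stderr, "%s:%d: %s (code %d)\n",                   \
                __FILE__, __LINE__, (msg), (code));                \
        return (code);                                             \
    } while (0)

/* Hypothetical substrate init: zero the structure, then fill it at runtime. */
int demo_hwd_init(demo_mdi_t *mdi, int detected_counters)
{
    assert(mdi != NULL);
    memset(mdi, 0, sizeof(*mdi));          /* start from all zeroes */
    if (detected_counters <= 0)
        error_return(DEMO_ESYS, "no hardware counters detected");
    mdi->num_counters = detected_counters;
    mdi->has_hires_wall_timer = 1;         /* pretend a cycle counter exists */
    return DEMO_OK;
}
```

A substrate would call something like demo_hwd_init(&demo_system_info, n) once at library initialization, with n coming from whatever probe the platform provides.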

Functions to be removed:

_papi_hwd_merge
_papi_hwd_unmerge
_papi_hwd_add_prog_event

Functions to be added:

_papi3_hwd_start
_papi3_hwd_stop

EventSet Management

The substrates will no longer be required to merge/unmerge different event sets.
IN PAPI 3.0, ONLY ONE EVENTSET PER THREAD OR CONTEXT CAN BE IN USE AT ANY TIME.

Some fundamentals:

The EventInfo_t structure will now contain MACHINE DEPENDENT information in addition to MACHINE INDEPENDENT.

The current EventInfo_t structure has:
  int code;                        /* Preset or native code for this event PAPI_add_event() */
  unsigned int selector;    /* Counter select bits used in the lower level */
  int command;               /* Counter derivation command used in the lower level */
  int operand_index;      /* Counter derivation data used in the lower level */
  int index;                    /* added to indicate the position in the array */
The new EventInfo_t structure has:
  unsigned code;          /* Code passed to PAPI_add_event() or PAPI_NATIVE */
  unsigned index;         /* Accessing the counters from the hw/os, this is the one */
  papi3_eventinfo_t info;
This new structure, papi3_eventinfo_t, will be defined in the substrate's header file. This structure will contain all the information necessary to:
1) Map a counter from the kernel/hardware to the software counter when PAPI_read() is called.
2) Remove a counter fully from the EventSet.
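
A minimal sketch of how the new structure might be declared and used to map hardware counters back onto EventSet order is shown here. The two members of papi3_eventinfo_t (hw_register, event_select) and the helper demo_distribute are assumptions made for illustration; the real contents of papi3_eventinfo_t are left to each substrate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical machine-dependent part; each substrate defines its own. */
typedef struct {
    unsigned hw_register;   /* physical counter that holds the event */
    unsigned event_select;  /* raw event selection bits */
} papi3_eventinfo_t;

/* The new EventInfo_t as listed above. */
typedef struct {
    unsigned code;          /* code passed to PAPI_add_event() or PAPI_NATIVE */
    unsigned index;         /* position of this event in the hw/os counter array */
    papi3_eventinfo_t info;
} EventInfo_t;

/* Copy each event's value out of the raw hardware counter array, so that
 * out[i] corresponds to the i-th event in the EventSet. */
void demo_distribute(const EventInfo_t *events, size_t n_events,
                     const long long *hw_counters, size_t n_hw,
                     long long *out)
{
    for (size_t i = 0; i < n_events; i++) {
        assert(events[i].index < n_hw);
        out[i] = hw_counters[events[i].index];
    }
}
```

Keeping the index in the machine-independent part is what would let a loop of this kind live outside the substrate.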

Adding events:

Adding hardware events in PAPI 3 will follow a different path. The new add_event call will allow the substrate to do full register reallocation and optimization. Exactly how this will operate is to be determined by the integration of the new register allocation scheme.

Some sample arguments are included here for discussion:

_papi3_hwd_add_events()
As input:
The entire EventInfo_t array and its length
Space for the machine control structure
As output:
A filled machine control structure suitable for passing to _papi3_hwd_start.
Modified EventInfo_t array entries corresponding to the new register mapping.
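
To make the sample arguments concrete, here is a sketch under heavy assumptions. demo_event_t is a trimmed stand-in for EventInfo_t, demo_control_t is an invented machine control structure, and the allocation is a naive "event i goes to counter i" placement. It is not the new register allocator, whose interface is still to be determined.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_COUNTERS 4
#define DEMO_OK        0
#define DEMO_ECNFLCT  (-8)   /* illustrative "cannot place events" code */

/* Trimmed, hypothetical stand-in for EventInfo_t. */
typedef struct {
    unsigned code;          /* event code requested by the user */
    unsigned index;         /* output: counter assigned to this event */
    unsigned event_select;  /* output: machine-dependent selection bits */
} demo_event_t;

/* Hypothetical machine control structure handed to the start routine. */
typedef struct {
    unsigned n_active;
    unsigned evntsel[DEMO_MAX_COUNTERS];
} demo_control_t;

/* Input: the whole event array, its length, and space for the control
 * structure. Output: a filled control structure and updated event entries. */
int demo_hwd_add_events(demo_event_t *events, size_t n, demo_control_t *ctl)
{
    assert(events != NULL && ctl != NULL);
    if (n > DEMO_MAX_COUNTERS)
        return DEMO_ECNFLCT;
    ctl->n_active = (unsigned)n;
    for (size_t i = 0; i < n; i++) {
        events[i].index = (unsigned)i;                   /* naive placement */
        events[i].event_select = events[i].code & 0xffu; /* made-up encoding */
        ctl->evntsel[i] = events[i].event_select;
    }
    return DEMO_OK;
}
```

Passing the entire array on every call, rather than a single new event, is what gives the substrate room to reallocate registers for events that were added earlier.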

Adding Native Events:

Native Events that can be added by simply specifying a register code that fits in the appropriate number of bits will still be supported in PAPI 3.0. This will be accomplished with a new call: PAPI_encode_native(). This will allow user code to use system specific native definitions where appropriate, WITHOUT having PAPI redefine every single event for an architecture.

For example, on the IA64:

PAPI_add_event(EventSet,PAPI_encode_native(PFM_CYCLES));

This call will simply set 'event_code' to PAPI_NATIVE and _papi3_hwd_add_events will fill in the machine dependent structure as appropriate.

Adding Programmable Events:

Programmable Events that can be added by simply specifying a register code that fits in the appropriate number of bits will still be supported in PAPI 3.0. This will be accomplished with a new call: PAPI_encode_native_prog(). This will allow user code to use system specific definitions.

For example, on the IA64:

/* Set up the programmable structure for counting cache misses that take longer than 10 cycles. */

PAPI_native_prog_t space;
PAPI_encode_native_prog(&space, PFM_L1_MISSES,THRESHOLD,10,...);
PAPI_add_prog_event(EventSet,&space);

On the x86 with Perfctr:

PAPI_encode_native_prog(&space, pmc_map, evntsel, evntsel_aux, ...);
PAPI_add_prog_event(EventSet,&space);

This call will simply set 'event_code' to PAPI_NATPROG and _papi3_hwd_add_events will fill in the machine dependent structure as appropriate.

How to port a PAPI 2.x substrate to PAPI 3.0

1) Rename all non-static functions to _papi3_hwd_xxx
2) Change the merge/unmerge calls to their equivalent start/stop routines. If this code consumes more than a dozen lines or so, you have done something wrong. For instance, on the x86/perfctr port, the call looks like this (slow) pseudo code:
int _papi3_hwd_stop(P4_perfctr_context_t *ctx, P4_perfctr_control_t *state)
{
  /* Dispatch on the granularity recorded in the context. */
  if (ctx->granularity == PAPI_THREAD) {
      if (vperfctr_stop(ctx->perfctr) < 0)
            error_return(PAPI_ESYS,VCNTRL_ERROR);
  } else if (ctx->granularity == PAPI_REMOTE) {
      if (rperfctr_stop(ctx->perfctr) < 0)
            error_return(PAPI_ESYS,VCNTRL_ERROR);
  } else if (ctx->granularity == PAPI_SYSTEM) {
      if (gperfctr_stop(ctx->perfctr) < 0)
            error_return(PAPI_ESYS,VCNTRL_ERROR);
  }

  return(PAPI_OK);
}
Optionally, this could be implemented with custom function pointers in the P4_perfctr_context_t structure. This makes for much cleaner and faster code. When the granularity is changed, the function pointers in the context are changed.

int _papi3_hwd_stop(P4_perfctr_context_t *ctx, P4_perfctr_control_t *state)
{
   if (ctx->perfctr_stop(ctx->perfctr) == 0)
      return(PAPI_OK);
   else
      error_return(PAPI_ESYS,VCNTRL_ERROR);
}

Note that for perfctr, PROCESS level granularity will be implemented in the machine independent layer by looping over all the threads and calling _papi3_hwd_stop on each one.
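
The function pointer approach and the process-level loop can be sketched together as below. Everything prefixed demo_ or DEMO_ is hypothetical: the three stub stop routines merely stand in for the thread, remote and global perfctr calls, and the third is made to fail so the error path is visible.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_OK      0
#define DEMO_EINVAL  (-1)
#define DEMO_ESYS    (-3)

enum demo_granularity { DEMO_THREAD, DEMO_REMOTE, DEMO_SYSTEM };

typedef struct {
    enum demo_granularity granularity;
    void *perfctr;                        /* opaque kernel handle */
    int (*perfctr_stop)(void *perfctr);   /* chosen when granularity is set */
} demo_context_t;

/* Stubs standing in for the per-mode kernel calls. */
static int demo_vstop(void *h) { (void)h; return 0; }
static int demo_rstop(void *h) { (void)h; return 0; }
static int demo_gstop(void *h) { (void)h; return -1; } /* pretend it fails */

/* Changing the granularity swaps the function pointer in the context. */
int demo_set_granularity(demo_context_t *ctx, enum demo_granularity g)
{
    assert(ctx != NULL);
    switch (g) {
    case DEMO_THREAD: ctx->perfctr_stop = demo_vstop; break;
    case DEMO_REMOTE: ctx->perfctr_stop = demo_rstop; break;
    case DEMO_SYSTEM: ctx->perfctr_stop = demo_gstop; break;
    default:          return DEMO_EINVAL;
    }
    ctx->granularity = g;
    return DEMO_OK;
}

/* The stop path no longer tests the granularity at all. */
int demo_ctx_stop(demo_context_t *ctx)
{
    return ctx->perfctr_stop(ctx->perfctr) == 0 ? DEMO_OK : DEMO_ESYS;
}

/* Process granularity, as the machine-independent layer might do it:
 * stop every thread's context and report the first failure. */
int demo_stop_process(demo_context_t *threads, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int rc = demo_ctx_stop(&threads[i]);
        if (rc != DEMO_OK)
            return rc;
    }
    return DEMO_OK;
}
```

Whether a process-level stop should abort on the first failing thread, as in this sketch, or continue and stop the remaining threads is a design choice the text above leaves open.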

3) Most substrate calls will take 2 arguments. The first is a pointer to the current context of the performance counter hardware. The second is the argument that the call itself operates on.
4) Change add_event and remove_event calls to call the new register allocation routines.
5) Remove all static initialization of the papi_mdi_t structure.
6) Change all DBG calls to SUBDBG calls.
7) Remove the code from _papi3_hwd_read that distributes the counters. This will be done in the machine independent layer.

Internal signal chaining.

PAPI will support programs that install their own signal handlers for signals that PAPI uses. This will allow PAPI to coexist transparently with applications that make use of SIGPROF.
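
One common way to chain handlers on POSIX systems is to save the previously installed action when installing a new one and to forward each signal to it. The sketch below shows that technique with sigaction(); it is not necessarily how PAPI will implement chaining, and all demo_ names are hypothetical. demo_app_handler stands in for a handler the application installed first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static struct sigaction demo_prev_action;     /* what was installed before us */
static volatile sig_atomic_t demo_papi_hits;  /* signals seen by the library */
static volatile sig_atomic_t demo_app_hits;   /* signals seen by the application */

/* Stand-in for the application's own handler. */
void demo_app_handler(int sig)
{
    (void)sig;
    demo_app_hits++;
}

/* Library handler: do the library's work, then forward to the saved handler. */
void demo_papi_handler(int sig, siginfo_t *info, void *uctx)
{
    demo_papi_hits++;
    if (demo_prev_action.sa_flags & SA_SIGINFO) {
        demo_prev_action.sa_sigaction(sig, info, uctx);
    } else if (demo_prev_action.sa_handler != SIG_DFL &&
               demo_prev_action.sa_handler != SIG_IGN) {
        demo_prev_action.sa_handler(sig);
    }
}

/* Install the library handler and remember the previous action. */
int demo_install_chained(int sig)
{
    struct sigaction sa;

    assert(sig > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = demo_papi_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, &demo_prev_action);
}
```

This only covers handlers installed before the library's own. An application that installs a handler afterwards would replace the library's handler, so full transparency would need additional work that this sketch does not attempt.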

PAPI thread functions.

PAPI_list_threads(&listptr, &len) will return a list of registered threads, i.e. those that have been detected by the PAPI library. This will call a function in the threads.c file, _papi_hwd_list_threads, that returns an array of thread TIDs and its length.

PAPI_get_thread_specific(&from, int index) will return one of PAPI_MAX_TLS_POINTERS pointers to a THREAD SPECIFIC data area. This will be implemented with THREAD LOCAL STORAGE (TLS on gcc and IRIX, for example). This allows monitoring programs to easily maintain per-thread storage without knowing anything about the thread library. The same applies to PAPI_set_thread_specific(to, int index).
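
A minimal sketch of the get/set pair on top of compiler-supported thread local storage follows. The demo_ function names, the return codes and the limit of 4 slots are assumptions, and __thread is the GCC/Clang spelling (C11 spells it _Thread_local); platforms without such a keyword would need a different mechanism.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_TLS_POINTERS 4   /* stand-in for PAPI_MAX_TLS_POINTERS */
#define DEMO_OK      0
#define DEMO_EINVAL  (-1)

/* One array of user pointers per thread; zero-initialized for each thread. */
static __thread void *demo_tls[DEMO_MAX_TLS_POINTERS];

/* Store a pointer in the calling thread's slot 'index'. */
int demo_set_thread_specific(void *to, int index)
{
    if (index < 0 || index >= DEMO_MAX_TLS_POINTERS)
        return DEMO_EINVAL;
    demo_tls[index] = to;
    return DEMO_OK;
}

/* Fetch the pointer stored in the calling thread's slot 'index'. */
int demo_get_thread_specific(void **from, int index)
{
    assert(from != NULL);
    if (index < 0 || index >= DEMO_MAX_TLS_POINTERS)
        return DEMO_EINVAL;
    *from = demo_tls[index];
    return DEMO_OK;
}
```

A monitoring tool could, for example, keep a per-thread statistics record in slot 0 and fetch it inside an overflow handler without calling into the thread library.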

PAPI_lock/PAPI_unlock

PAPI_lock/PAPI_unlock will now take an argument of a pointer to a lock variable. This will allow finer grained locking and better performance on SMPs. Using NULL will cause the default behavior of locking the entire library. To create a lock variable, we will use PAPI_lock_create(&lockvar) to get back an address and PAPI_lock_destroy(lockvar) to clean it up.
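
The intended semantics could be sketched on top of POSIX mutexes as below. This is an assumption about one possible implementation, not the planned one: the demo_ names and return codes are hypothetical, and plain malloc() is used only to keep the sketch self-contained where the library itself would go through its papi_malloc() wrapper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_OK      0
#define DEMO_ENOMEM  (-2)
#define DEMO_ESYS    (-3)

typedef pthread_mutex_t *demo_lock_t;

/* The library-wide lock used when the caller passes NULL. */
static pthread_mutex_t demo_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Create a fine-grained lock and hand its address back to the caller. */
int demo_lock_create(demo_lock_t *lockvar)
{
    pthread_mutex_t *m;

    assert(lockvar != NULL);
    m = malloc(sizeof(*m));
    if (m == NULL)
        return DEMO_ENOMEM;
    if (pthread_mutex_init(m, NULL) != 0) {
        free(m);
        return DEMO_ESYS;
    }
    *lockvar = m;
    return DEMO_OK;
}

/* Destroy a lock obtained from demo_lock_create(). */
int demo_lock_destroy(demo_lock_t lockvar)
{
    assert(lockvar != NULL);
    if (pthread_mutex_destroy(lockvar) != 0)
        return DEMO_ESYS;
    free(lockvar);
    return DEMO_OK;
}

/* NULL selects the default behavior of locking the entire library. */
int demo_lock(demo_lock_t lockvar)
{
    pthread_mutex_t *m = lockvar ? lockvar : &demo_global_lock;
    return pthread_mutex_lock(m) == 0 ? DEMO_OK : DEMO_ESYS;
}

int demo_unlock(demo_lock_t lockvar)
{
    pthread_mutex_t *m = lockvar ? lockvar : &demo_global_lock;
    return pthread_mutex_unlock(m) == 0 ? DEMO_OK : DEMO_ESYS;
}
```

Because a created lock and the global lock are distinct mutexes, a thread holding one does not block threads that only need the other, which is where the finer granularity would come from.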

PAPI_INHERIT

PAPI will support a PAPI_INHERIT option for systems that support the BAD IDEA of inheritance of counter data by forks. Note this is a special case of a PAPI_REMOTE operation. We'll have to figure out how to read it.

Source code structure:

PAPI 3.0 will have a revamped source code structure. The base tree will look as follows:

Papi
- tools
- - dynaprof
- - perfometer
- - psrun/papiperfex
- - papirun/papiprof
- - vprof
- src
- - <substrate_name>
- doc
- man
- examples
- tests
- - ctests
- - ftests
- - jtests

OS level, Substrate Unification

There are some ports of PAPI that target similar operating systems but slightly different architectures. Examples of this include AIX/Power3/4, Linux/Itanium/2 and Linux/x86/PIV. In PAPI 3.0, these implementations will be grouped into three libraries.

A New Makefile Configuration System

With this simplification, we will be able to determine the architecture at 'make config' time. As a result, the user will no longer have to manually guess which makefile is appropriate for his platform. The new makefile system will build the tree in architecture specific directories, so all ports can be compiled/tested in the same directory. This will greatly simplify the build process.