Abstract
The Performance API (PAPI) project specifies a standard application programming
interface (API) for accessing hardware performance counters available on most
modern microprocessors. These counters exist as a small set of registers that
count events, that is, occurrences of specific signals related to the processor's
function. Monitoring these events facilitates correlation between the structure
of source/object code and the efficiency of the mapping of that code to the
underlying architecture. This correlation has a variety of uses in performance
analysis including hand tuning, compiler optimization, debugging, benchmarking,
monitoring and performance modeling. In addition, it is hoped that this information
will prove useful in the development of new compilation technology as well as
in steering architectural development towards alleviating commonly occurring
bottlenecks in high performance computing.
Description
PAPI provides two interfaces to the underlying counter hardware: a simple,
high level interface for the acquisition of simple measurements and a fully
programmable, low level interface directed towards users with more sophisticated
needs. The low level PAPI interface deals with hardware events in groups called
EventSets. EventSets reflect how the counters are most frequently used,
such as taking simultaneous measurements of different hardware events and relating
them to one another. For example, relating cycles to memory references or flops
to level 1 cache misses can indicate poor locality and memory management. In
addition, EventSets allow a highly efficient implementation which translates
to more detailed and accurate measurements. EventSets are fully programmable
and have features such as guaranteed thread safety, writing of counter values,
multiplexing and notification on threshold crossing, as well as processor specific
features. The high level interface simply provides the ability to start, stop
and read specific events, one at a time.
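As a rough illustration of why events are measured together and related to one another, the sketch below derives ratios from counts that are assumed to have been read at the same time. The event names and the numbers are invented for illustration; they are not PAPI identifiers or measured data.

```python
def ratio(counts, numerator, denominator):
    """Return counts[numerator] / counts[denominator], or None if the denominator is zero."""
    denom = counts[denominator]
    if denom == 0:
        return None
    return counts[numerator] / denom


# Hypothetical counts, as if read together from one group of events.
sample = {
    "cycles": 8_000_000,
    "memory_refs": 2_000_000,
    "flops": 1_000_000,
    "l1_misses": 250_000,
}

cycles_per_ref = ratio(sample, "cycles", "memory_refs")
misses_per_flop = ratio(sample, "l1_misses", "flops")
```

Because both values in each ratio come from the same measurement interval, a high cycles-per-reference or misses-per-flop figure can be attributed to the same stretch of code.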
PAPI provides portability across different platforms. It uses the same routines
with similar argument lists to control and access the counters for every architecture.
As part of PAPI, we have predefined a set of events that we feel represents
the lowest common denominator of every good counter implementation. Our
intent is that the same source code will count similar and possibly comparable
events when run on different platforms. If the programmer chooses to use this
set of standardized events, then the source code need not be changed and only
a fresh compilation and link is necessary. However, should the developer wish
to access machine specific events, the low level API provides access to all
available events and counting modes. If an event or feature does not exist on
the current platform, PAPI returns an appropriate error code. This significantly
reduces the porting effort of code using PAPI because the semantics of each
call to PAPI remain the same; only the argument lists need updating. In addition
to the standard set, each PAPI implementation supports all native events
through the ability to directly accept platform specific counter numbers. Definitions
for most, if not all of these, are included as conditional macros in the header
file. In this way, PAPI avoids having inefficient code to translate all events
for all platforms into a uniform representation and back again. This translation
is only done for the relatively few events defined in the standardized set.
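The lookup described above can be sketched as a small per-platform table: standardized event names are translated, native counter numbers pass through untouched, and a missing event yields an error code. This is a toy model, not PAPI's implementation; the platform names, event names, counter numbers, and the error value are all hypothetical.

```python
# Hypothetical error code; not one of PAPI's actual constants.
ERR_NO_EVENT = -1

# Invented per-platform tables mapping standardized names to native counter numbers.
PRESET_TABLES = {
    "platform_a": {"TOTAL_CYCLES": 0x01, "L1_DCACHE_MISSES": 0x12},
    "platform_b": {"TOTAL_CYCLES": 0x40},
}


def resolve_event(platform, event):
    """Translate a standardized event name to a native counter number.

    Integers are treated as native counter numbers and returned unchanged,
    so only the standardized set pays the cost of a table lookup.
    """
    if isinstance(event, int):
        return event
    return PRESET_TABLES[platform].get(event, ERR_NO_EVENT)
```

Code that sticks to the standardized names calls `resolve_event` the same way on every platform and only has to check for the error value.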
Some processors, like those in the POWER series, have counter groups. They enable
access to specific groups of counters, instead of individual events. This presents
a serious portability problem, so PAPI abstracts hardware counters from their
groups with a packed naming scheme. Each counter control value or event is made
up of the counter group number and the number of the specific counter in that
group.
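One way to picture such a packed value is as two bit fields in a single integer. The sketch below assumes an 8-bit field for the counter number with the group number in the bits above it; the field widths and function names are assumptions made for illustration and are not taken from PAPI.

```python
COUNTER_BITS = 8  # assumed width of the counter-number field
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack_event(group, counter):
    """Combine a group number and a counter number into one event code."""
    if group < 0 or not 0 <= counter <= COUNTER_MASK:
        raise ValueError("group or counter out of range")
    return (group << COUNTER_BITS) | counter


def unpack_event(code):
    """Split an event code back into (group, counter)."""
    return code >> COUNTER_BITS, code & COUNTER_MASK
```

With this layout, counter 5 in group 3 is carried around as the single value `0x305`, so calling code never has to handle the group separately.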
PAPI can be divided into two layers of software. The upper layer consists
of the API and machine independent support functions. The lower layer defines
and exports a machine independent interface to machine dependent functions and
data structures. These functions access the substrate, which may consist
of the operating system, a kernel extension or assembly functions to directly
access the processor's registers. PAPI tries to use the most efficient and flexible
of the three, depending on what is available. Naturally, the functionality of
the upper layers heavily depends on that provided by the substrate. In cases
where the substrates do not provide highly desirable features, PAPI attempts
to emulate them as described below.
PAPI makes sure the underlying operating system or library guards against
overflow of counter values. Each counter can potentially be incremented multiple
times in a single clock cycle. This combined with increasing clock speeds and
the small precision of some of the physical counters means that overflow is
likely to occur.
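A common way to guard against this kind of overflow is to sample the narrow hardware counter periodically and accumulate the differences into a wider software total. The sketch below illustrates that general technique; it is not a description of how any particular operating system or PAPI substrate does it. The class name is hypothetical, and the approach assumes the counter is sampled at least once per wrap-around.

```python
class WideCounter:
    """Accumulate a wrapping hardware counter into an unbounded software total."""

    def __init__(self, bits=32):
        self.modulus = 1 << bits  # number of distinct raw values before wrap-around
        self.last_raw = 0
        self.total = 0

    def update(self, raw):
        """Fold a new raw reading into the running total and return the total."""
        # Modular subtraction gives the correct delta even if the counter wrapped once.
        delta = (raw - self.last_raw) % self.modulus
        self.total += delta
        self.last_raw = raw
        return self.total
```

If the raw counter wraps more than once between two calls to `update`, the lost counts cannot be recovered, which is why the sampling interval has to shrink as clock speeds rise.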
One of the more advanced features of PAPI is to provide a portable implementation
of asynchronous notification when counters exceed a user specified value.
This functionality provides the basis for PAPI's SVR4 compatible profiling
calls, which generate an accurate histogram of performance interrupts based
on hardware metrics, not on time. Such functionality provides the basis for
all line level performance analysis software, from the antiquated days of
AT&T's prof to SGI's SpeedShop. Thus for any architecture with even the most
rudimentary access to hardware performance counters, PAPI provides the foundation
for a truly portable, source level, performance analysis tool based on real
processor statistics.
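The idea of a histogram driven by counter thresholds rather than by time can be sketched with a small simulation. Each time the accumulated event count crosses the threshold, the current program counter is mapped to a bucket and that bucket is incremented. The function name, the trace format, and the bucket layout are assumptions for illustration; a real implementation would take the program counter from an interrupt context rather than from a list.

```python
def profile(trace, threshold, base, bucket_size, n_buckets):
    """Build a histogram of threshold crossings by code address.

    trace is an iterable of (program_counter, events) pairs, where events is
    the number of hardware events observed while executing at that address.
    """
    hist = [0] * n_buckets
    pending = 0
    for pc, events in trace:
        pending += events
        # One simulated notification per threshold crossing.
        while pending >= threshold:
            pending -= threshold
            index = (pc - base) // bucket_size
            if 0 <= index < n_buckets:
                hist[index] += 1
    return hist
```

Buckets with high counts mark the address ranges where the chosen hardware event occurs most often, regardless of how much wall-clock time was spent there.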
Supercomputing 2001 Poster
Innovative Computing Laboratory 2001 Report