%0 Conference Proceedings %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2023 %T Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements %A Daniel Barry %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K GPU power %K High Performance Computing %K network traffic %K papi %K performance analysis %K Performance Counters %X Some of the most important categories of performance events count the data traffic between the processing cores and the main memory. However, since these counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP); however, doing so adds an indirection layer that involves querying the PCP daemon. This paper performs a quantitative study of the accuracy of the measurements obtained through this component on the Summit supercomputer. We use two linear algebra kernels---a generalized matrix multiply and a modified matrix-vector multiply---as benchmarks, along with a distributed, GPU-accelerated 3D-FFT mini-app (using cuFFT), to compare the measurements obtained through the PAPI PCP component against the expected values across different problem sizes. We also compare our measurements against those taken on an in-house machine with an architecture very similar to Summit's, where elevated privileges allow PAPI to access the hardware counters directly (without using PCP), to show that measurements taken via PCP are as accurate as those taken directly. Finally, using both QMCPACK and the 3D-FFT, we demonstrate the diverse hardware activities that can be monitored simultaneously via PAPI hardware components. %B 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C St. Petersburg, Florida %8 2023-08 %G eng %U https://ieeexplore.ieee.org/document/10196656 %R 10.1109/IPDPSW59300.2023.00070 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2018 %T Investigating Power Capping toward Energy-Efficient Scientific Applications %A Azzam Haidar %A Heike Jagode %A Phil Vaccaro %A Asim YarKhan %A Stanimire Tomov %A Jack Dongarra %K energy efficiency %K High Performance Computing %K Intel Xeon Phi %K Knights Landing %K papi %K performance analysis %K Performance Counters %K power efficiency %X The emergence of power efficiency as a primary constraint in processor and system design poses new challenges concerning power and energy awareness for numerical libraries and scientific applications. Power consumption also plays a major role in the design of data centers, which may house petascale- or exascale-level computing systems. At these extreme scales, understanding and improving the energy efficiency of numerical libraries and their related applications becomes a crucial part of the successful implementation and operation of the computing system. In this paper, we investigate the practice of controlling a compute system's power usage, and we explore how different power caps affect the performance of numerical algorithms with different computational intensities. Further, we determine the impact, in terms of performance and energy usage, that these caps have on a system running scientific applications. This analysis will enable us to characterize the types of algorithms that benefit most from these power management schemes.
Our experiments are performed using a set of representative kernels and several popular scientific benchmarks. We quantify a number of power and performance measurements and draw observations and conclusions that can be viewed as a roadmap to achieving energy efficiency in the design and execution of scientific algorithms. %B Concurrency and Computation: Practice and Experience %V 2018 %P 1-14 %8 2018-04 %G eng %U https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4485 %N e4485 %R 10.1002/cpe.4485 %0 Conference Proceedings %B International Workshop on Power-Aware Systems and Architectures %D 2012 %T Measuring Energy and Power with PAPI %A Vincent M Weaver %A Matt Johnson %A Kiran Kasichayanula %A James Ralph %A Piotr Luszczek %A Dan Terpstra %A Shirley Moore %K papi %X Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher-level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure. %B International Workshop on Power-Aware Systems and Architectures %C Pittsburgh, PA %8 2012-09 %G eng %R 10.1109/ICPPW.2012.39 %0 Journal Article %J CloudTech-HPC 2012 %D 2012 %T PAPI-V: Performance Monitoring for Virtual Machines %A Matt Johnson %A Heike McCraw %A Shirley Moore %A Phil Mucci %A John Nelson %A Dan Terpstra %A Vincent M Weaver %A Tushar Mohan %K papi %X This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments. %B CloudTech-HPC 2012 %C Pittsburgh, PA %8 2012-09 %G eng %R 10.1109/ICPPW.2012.29 %0 Generic %D 2012 %T Performance Counter Monitoring for the Blue Gene/Q Architecture %A Heike McCraw %K papi %B University of Tennessee Computer Science Technical Report %8 2012 %G eng %0 Conference Paper %B International Conference on Parallel Processing (ICPP'11) %D 2011 %T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs %A Allen D. Malony %A Scott Biersdorff %A Sameer Shende %A Heike Jagode %A Stanimire Tomov %A Guido Juckeland %A Robert Dietrich %A Duncan Poole %A Christopher Lamb %K magma %K mumi %K papi %X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs.
A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support. %B International Conference on Parallel Processing (ICPP'11) %I ACM %C Taipei, Taiwan %8 2011-09 %@ 978-0-7695-4510-3 %G eng %R 10.1109/ICPP.2011.71 %0 Journal Article %J Procedia Computer Science %D 2011 %T User-Defined Events for Hardware Performance Monitoring %A Shirley Moore %A James Ralph %K mumi %K papi %X PAPI is a widely used cross-platform interface to hardware performance counters. PAPI currently supports native events, which are those provided by a given platform, and preset events, which are pre-defined events thought to be common across platforms. Presets are currently mapped and defined at the time that PAPI is compiled and installed. The idea of user-defined events is to allow users to define their own metrics and to have those metrics mapped to events on a platform without the need to re-install PAPI. User-defined events can be defined in terms of native, preset, and previously defined user-defined events. The user can combine events and constants in an arbitrary expression to define a new metric and give a name to the new metric. This name can then be specified as a PAPI event in a PAPI library call the same way as native and preset events. End-user tools such as TAU and Scalasca that use PAPI can also use the user-defined metrics. Users can publish their metric definitions so that other users can use them as well. We present several examples of how user-defined events can be used for performance analysis and modeling. %B Procedia Computer Science %I Elsevier %V 4 %P 2096-2104 %8 2011-05 %G eng %R 10.1016/j.procs.2011.04.229 %0 Conference Proceedings %B 3rd Workshop on Functionality of Hardware Performance Monitoring %D 2010 %T Can Hardware Performance Counters Produce Expected, Deterministic Results? %A Vincent M Weaver %A Jack Dongarra %K papi %B 3rd Workshop on Functionality of Hardware Performance Monitoring %C Atlanta, GA %8 2010-12 %G eng %0 Journal Article %J Tools for High Performance Computing 2009 %D 2010 %T Collecting Performance Data with PAPI-C %A Dan Terpstra %A Heike Jagode %A Haihang You %A Jack Dongarra %K mumi %K papi %X Modern high performance computer systems continue to increase in size and complexity. Tools to measure application performance in these increasingly complex environments must also increase the richness of their measurements to provide insights into the increasingly intricate ways in which software and hardware interact. PAPI (the Performance API) has provided consistent platform and operating system independent access to CPU hardware performance counters for nearly a decade. Recent trends toward massively parallel multi-core systems with often heterogeneous architectures present new challenges for the measurement of hardware performance information, which is now available not only on the CPU core itself, but scattered across the chip and system. We discuss the evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface.
Several examples of components and component data measurements are discussed. We explore the challenges to hardware performance measurement in existing multi-core architectures. We conclude with an exploration of future directions for the PAPI interface. %B Tools for High Performance Computing 2009 %I Springer Berlin / Heidelberg %C 3rd Parallel Tools Workshop, Dresden, Germany %P 157-173 %8 2010-05 %G eng %R 10.1007/978-3-642-11261-4_11 %0 Conference Proceedings %B Proceedings of Parallel Computing 2005 (ParCo) %D 2005 %T Analysis and Optimization of Yee_Bench using Hardware Performance Counters %A Ulf Andersson %A Phil Mucci %K papi %X In this paper, we report on our analysis and optimization of a serial Fortran 90 benchmark called Yee_bench. This benchmark has been run on a variety of architectures and its performance is reasonably well understood. However, on AMD Opteron-based machines, we found unexpected dips in the delivered MFLOPS of the code for a seemingly random set of problem sizes. Through the use of the Opteron’s on-chip hardware performance counters and PapiEx, a PAPI-based tool, we discovered that these drops were directly related to high L1 cache miss rates for these problem sizes. The high miss rates could be attributed to the fact that in the two core regions of the code we have references to three dynamically allocated arrays which compete for the same set in the Opteron’s 2-way set-associative cache. We validated this conclusion by accurately predicting those problem sizes that exhibit this problem. We were able to alleviate these performance anomalies using variable intra-array padding to effectively accomplish inter-array padding. We conclude with some comments on the general applicability of this method, as well as how one might improve the implementation of the Fortran 90 ALLOCATE intrinsic to handle this case. %B Proceedings of Parallel Computing 2005 (ParCo) %C Malaga, Spain %8 2005-01 %G eng %0 Conference Paper %B European Conference on Parallel Processing (Euro-Par 2005) %D 2005 %T PerfMiner: Cluster-Wide Collection, Storage and Presentation of Application Level Hardware Performance Data %A Phil Mucci %A Daniel Ahlin %A Johan Danielsson %A Per Ekman %A Lars Malinowski %K papi %X We present PerfMiner, a system for the transparent collection, storage and presentation of thread-level hardware performance data across an entire cluster. Every sub-process/thread spawned by the user through the batch system is measured with near-zero overhead and no dilation of run-time. Performance metrics are collected at the thread level using a tool built on top of the Performance Application Programming Interface (PAPI). As the hardware counters are virtualized by the OS, the resulting counts are largely unaffected by other kernel or user processes. PerfMiner correlates this performance data with metadata from the batch system and places it in a database. Through a command line and web interface, the user can make queries to the database to report information on everything from overall workload characterization and system utilization to the performance of a single thread in a specific application. This is in contrast to other monitoring systems that report aggregate system-wide metrics sampled over a period of time. In this paper, we describe our implementation of PerfMiner as well as present some results from the test deployment of PerfMiner across three different clusters at the Center for Parallel Computers at The Royal Institute of Technology in Stockholm, Sweden.
%B European Conference on Parallel Processing (Euro-Par 2005) %I Springer %C Monte de Caparica, Portugal %8 2005-09 %G eng %R 10.1007/11549468_1 %0 Conference Paper %B Proceedings of DoD HPCMP UGC 2005 %D 2005 %T Performance Profiling and Analysis of DoD Applications using PAPI and TAU %A Shirley Moore %A David Cronk %A Felix Wolf %A Avi Purkayastha %A Patricia J. Teller %A Robert Araiza %A Gabriela Aguilera %A Jamie Nava %K papi %B Proceedings of DoD HPCMP UGC 2005 %I IEEE %C Nashville, TN %8 2005-06 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2004) %D 2004 %T Accurate Cache and TLB Characterization Using Hardware Counters %A Jack Dongarra %A Shirley Moore %A Phil Mucci %A Keith Seymour %A Haihang You %K gco %K lacsi %K papi %X We have developed a set of microbenchmarks for accurately determining the structural characteristics of data cache memories and TLBs. These characteristics include cache size, cache line size, cache associativity, memory page size, number of data TLB entries, and data TLB associativity. Unlike previous microbenchmarks that used time-based measurements, our microbenchmarks use hardware event counts to more accurately and quickly determine these characteristics while requiring fewer limiting assumptions. %B International Conference on Computational Science (ICCS 2004) %I Springer %C Krakow, Poland %8 2004-06 %G eng %R 10.1007/978-3-540-24688-6_57 %0 Conference Paper %B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004) %D 2004 %T Automatic Blocking of QR and LU Factorizations for Locality %A Qing Yi %A Ken Kennedy %A Haihang You %A Keith Seymour %A Jack Dongarra %K gco %K papi %K sans %X QR and LU factorizations for dense matrices are important linear algebra computations that are widely used in scientific applications. To efficiently perform these computations on modern computers, the factorization algorithms need to be blocked when operating on large matrices to effectively exploit the deep cache hierarchy prevalent in today's computer memory systems. Because both QR (based on Householder transformations) and LU factorization algorithms contain complex loop structures, few compilers can fully automate the blocking of these algorithms. Although linear algebra libraries such as LAPACK provide manually blocked implementations of these algorithms, automatically generating blocked versions of the computations offers additional benefits, such as automatic adaptation of different blocking strategies. This paper demonstrates how to apply an aggressive loop transformation technique, dependence hoisting, to produce efficient blockings for both QR and LU with partial pivoting. We present different blocking strategies that can be generated by our optimizer and compare the performance of auto-blocked versions with manually tuned versions in LAPACK, both using reference BLAS, ATLAS BLAS, and native BLAS specially tuned for the underlying machine architectures.
%B 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004) %I ACM %C Washington, DC %8 2004-06 %G eng %R 10.1145/1065895.1065898 %0 Conference Paper %B 5th LCI International Conference on Linux Clusters: The HPC Revolution %D 2004 %T Automating the Large-Scale Collection and Analysis of Performance %A Phil Mucci %A Jack Dongarra %A Rick Kufrin %A Shirley Moore %A Fengguang Song %A Felix Wolf %K kojak %K papi %B 5th LCI International Conference on Linux Clusters: The HPC Revolution %C Austin, Texas %8 2004-05 %G eng %0 Conference Paper %B PADTAD Workshop, IPDPS 2003 %D 2003 %T Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters %A Jack Dongarra %A Kevin London %A Shirley Moore %A Phil Mucci %A Dan Terpstra %A Haihang You %A Min Zhou %K lacsi %K papi %X The PAPI project has defined and implemented a cross-platform interface to the hardware counters available on most modern microprocessors. The interface has gained widespread use and acceptance from hardware vendors, users, and tool developers. This paper reports on experiences with the community-based open-source effort to define the PAPI specification and implement it on a variety of platforms. Collaborations with tool developers who have incorporated support for PAPI are described. Issues related to interpretation and accuracy of hardware counter data and to the overheads of collecting this data are discussed. The paper concludes with implications for the design of the next version of PAPI. %B PADTAD Workshop, IPDPS 2003 %I IEEE %C Nice, France %8 2003-04 %@ 0-7695-1926-1 %G eng %0 Journal Article %J Advances in Parallel Computing %D 2003 %T Hardware-Counter Based Automatic Performance Analysis of Parallel Programs %A Felix Wolf %A Bernd Mohr %K kojak %K papi %X The KOJAK performance-analysis environment identifies a large number of performance problems on parallel computers with SMP nodes. The current version concentrates on parallelism-related performance problems that arise from an inefficient usage of the parallel programming interfaces MPI and OpenMP, while ignoring individual CPU performance. This chapter describes an extended design of KOJAK capable of diagnosing low individual-CPU performance based on hardware-counter information and of integrating the results with those of the parallelism-centered analysis. The performance of parallel applications is determined by a variety of different factors. Performance of single components frequently influences the overall behavior in unexpected ways. Application programmers on current parallel machines have to deal with numerous performance-critical aspects: different modes of parallel execution, such as message passing, multi-threading, or even a combination of the two, and performance on an individual CPU, which is determined by the interaction of different functional units. The KOJAK analysis process is composed of two parts: a semi-automatic instrumentation of the user application followed by an automatic analysis of the generated performance data. KOJAK's instrumentation software runs on most major UNIX platforms and works on multiple levels, including source-code, compiler, and linker. %B Advances in Parallel Computing %I Elsevier %C Dresden, Germany %V 13 %P 753-760 %8 2004-01 %G eng %R 10.1016/S0927-5452(04)80092-3 %0 Conference Paper %B ICCS 2003 Terascale Workshop %D 2003 %T Performance Instrumentation and Measurement for Terascale Systems %A Jack Dongarra %A Allen D. Malony
%A Shirley Moore %A Phil Mucci %A Sameer Shende %K papi %X As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs. Instrumentation and measurement strategies, developed over the last several years, must evolve together with performance analysis infrastructure to address the challenges of new scalable parallel systems. %B ICCS 2003 Terascale Workshop %I Springer, Berlin, Heidelberg %C Melbourne, Australia %8 2003-06 %G eng %R 10.1007/3-540-44864-0_6 %0 Conference Paper %B International Conference on Computational Science (ICCS 2002) %D 2002 %T A Comparison of Counting and Sampling Modes of Using Performance Monitoring Hardware %A Shirley Moore %K papi %X Performance monitoring hardware is available on most modern microprocessors in the form of hardware counters and other registers that record data about processor events. This hardware may be used in counting mode, in which aggregate event counts are accumulated, and/or in sampling mode, in which time-based or event-based sampling is used to collect profiling data. This paper discusses uses of these two modes and considers the issues of efficiency and accuracy raised by each. Implications for the PAPI cross-platform hardware counter interface are also discussed. %B International Conference on Computational Science (ICCS 2002) %I Springer %C Amsterdam, Netherlands %8 2002-04 %G eng %R 10.1007/3-540-46080-2_95 %0 Conference Paper %B International Conference on Parallel and Distributed Computing Systems %D 2001 %T End-user Tools for Application Performance Analysis, Using Hardware Counters %A Kevin London %A Jack Dongarra %A Shirley Moore %A Phil Mucci %A Keith Seymour %A T. Spencer %K papi %X One purpose of the end-user tools described in this paper is to give users a graphical representation of performance information that has been gathered by instrumenting an application with the PAPI library. PAPI is a project that specifies a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events", which are occurrences of specific signals and states related to a processor’s function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. The perfometer tool developed by the PAPI project provides a graphical view of this information, allowing users to quickly see where performance bottlenecks are in their application. Only one function call has to be added by the user to their program to take advantage of perfometer. This makes it quick and simple to add and remove instrumentation from a program. Also, perfometer allows users to change the "event" they are monitoring. Add the ability to monitor parallel applications and set alarms, plus a Java front-end that can run anywhere, and the user has a powerful tool for quickly discovering where and why a bottleneck exists. A number of third-party tools for analyzing the performance of message-passing and/or threaded programs have also incorporated support for PAPI so as to be able to display and analyze hardware counter data from their interfaces.
%B International Conference on Parallel and Distributed Computing Systems %C Dallas, TX %8 2001-08 %G eng %0 Conference Paper %B Department of Defense Users' Group Conference Proceedings %D 2001 %T The PAPI Cross-Platform Interface to Hardware Performance Counters %A Kevin London %A Shirley Moore %A Phil Mucci %A Keith Seymour %A Richard Luczak %K papi %X The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events," which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has developed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been implemented for a number of Shared Resource Center platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools. %B Department of Defense Users' Group Conference Proceedings %C Biloxi, Mississippi %8 2001-06 %G eng %0 Journal Article %J European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131 %D 2001 %T Review of Performance Analysis Tools for MPI Parallel Programs %A Shirley Moore %A David Cronk %A Kevin London %A Jack Dongarra %K papi %X In order to produce MPI applications that perform well on today’s parallel architectures, programmers need effective tools for collecting and analyzing performance data. A variety of such tools, both commercial and research, are becoming available. This paper reviews and evaluates the available cross-platform MPI performance analysis tools. %B European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131 %I Springer Verlag, Berlin %C Greece %P 241-248 %8 2001-09 %G eng %R 10.1007/3-540-45417-9_34 %0 Conference Paper %B Conference on Linux Clusters: The HPC Revolution %D 2001 %T Using PAPI for Hardware Performance Monitoring on Linux Systems %A Jack Dongarra %A Kevin London %A Shirley Moore %A Phil Mucci %A Dan Terpstra %K papi %X PAPI is a specification of a cross-platform interface to hardware performance counters on modern microprocessors. These counters exist as a small set of registers that count events, which are occurrences of specific signals related to a processor's function. Monitoring these events has a variety of uses in application performance analysis and tuning. The PAPI specification consists of a standard set of events deemed most relevant for application performance tuning, as well as both high-level and low-level sets of routines for accessing the counters. The high-level interface simply provides the ability to start, stop, and read sets of events, and is intended for the acquisition of simple but accurate measurements by application engineers.
The fully programmable low-level interface provides sophisticated options for controlling the counters, such as setting thresholds for interrupt on overflow, as well as access to all native counting modes and events, and is intended for third-party tool writers or users with more sophisticated needs. PAPI has been implemented on a number of platforms, including Linux/x86 and Linux/IA-64. The Linux/x86 implementation requires a kernel patch that provides a driver for the hardware counters. The driver memory-maps the counter registers into user space and allows virtualizing the counters on a per-process or per-thread basis. The kernel patch is being proposed for inclusion in the main Linux tree. The PAPI library provides access on Linux platforms not only to the standard set of events mentioned above but also to all the Linux/x86 and Linux/IA-64 native events. PAPI has been installed and is in use, either directly or through incorporation into third-party end-user performance analysis tools, on a number of Linux clusters, including the New Mexico LosLobos cluster and Linux clusters at NCSA and the University of Tennessee being used for the GrADS (Grid Application Development Software) project. %B Conference on Linux Clusters: The HPC Revolution %I Linux Clusters Institute %C Urbana, Illinois %8 2001-06 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2000 %T A Portable Programming Interface for Performance Evaluation on Modern Processors %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A George Ho %A Phil Mucci %K papi %B The International Journal of High Performance Computing Applications %V 14 %P 189-204 %8 2000-09 %G eng %R 10.1177/109434200001400303 %0 Conference Proceedings %B Proceedings of Supercomputing 2000 (SC'00) %D 2000 %T A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters %A Shirley Browne %A Jack Dongarra %A Nathan Garner %A Kevin London %A Phil Mucci %K papi %B Proceedings of Supercomputing 2000 (SC'00) %C Dallas, TX %8 2000-11 %G eng %0 Conference Proceedings %B Proceedings of Department of Defense HPCMP Users Group Conference %D 1999 %T PAPI: A Portable Interface to Hardware Performance Counters %A Shirley Browne %A Christine Deane %A George Ho %A Phil Mucci %K papi %B Proceedings of Department of Defense HPCMP Users Group Conference %8 1999-06 %G eng
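
Note: the start/stop/read usage pattern that the abstract above attributes to PAPI's high-level interface maps onto only a few calls. The following is a minimal sketch in C, assuming the classic high-level API (PAPI_start_counters/PAPI_stop_counters) that shipped with the PAPI versions contemporary with these papers; later PAPI releases replaced it with a region-based high-level API, so treat this as illustrative rather than current.

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        /* Two preset events: total cycles and total instructions completed. */
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
        long long values[2];

        /* Start counting; the classic high-level API initializes the
           PAPI library internally on first use. */
        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... region of interest (the instrumented workload) ... */

        /* Stop counting and collect the accumulated event counts. */
        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;

        printf("cycles=%lld instructions=%lld\n", values[0], values[1]);
        return 0;
    }

The low-level interface described in the same abstract (event sets, overflow thresholds, native events) requires explicit PAPI_library_init and event-set management instead of the two calls shown here.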