%0 Conference Paper %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %D 2023 %T Elastic deep learning through resilient collective operations %A Li, Jiali %A Bosilca, George %A Bouteiller, Aurélien %A Nicolae, Bogdan %X We present a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI resilient capabilities, i.e., User-Level Failure Mitigation (ULFM), this novel approach promotes efficient and lightweight failure management and encourages smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of the MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management, minimizing downtime, and thereby enhancing the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications. %B SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis %I ACM %C Denver, CO %8 2023-11 %@ 9798400707858 %G eng %U https://dl.acm.org/doi/abs/10.1145/3624062.3626080 %R 10.1145/3624062.3626080 %0 Conference Paper %B Smoky Mountains Computational Sciences and Engineering Conference %D 2023 %T Preconditioners for Batched Iterative Linear Solvers on GPUs %A Aggarwal, Isha %A Nayak, Pratik %A Kashi, Aditya %A Anzt, Hartwig %E Kothe, Doug %E Geist, Al %E Pophale, Swaroop %E Liu, Hong %E Parete-Koon, Suzanne %X Batched iterative solvers can be an attractive alternative to batched direct solvers if the linear systems allow for fast convergence. In non-batched settings, iterative solvers are often enhanced with sophisticated preconditioners to improve convergence. In this paper, we develop preconditioners for batched iterative solvers that improve the iterative solver convergence without incurring detrimental resource overhead, while preserving much of the iterative solver flexibility. We detail the design and implementation considerations, present a user-friendly interface to the batched preconditioners, and demonstrate the convergence and runtime benefits over non-preconditioned batched iterative solvers on state-of-the-art GPUs for a variety of benchmark problems from finite difference stencil matrices, the SuiteSparse matrix collection, and a computational chemistry application. %B Smoky Mountains Computational Sciences and Engineering Conference %I Springer Nature Switzerland %V 169075 %P 38 - 53 %8 2023-01 %@ 978-3-031-23605-1 %G eng %U https://link.springer.com/chapter/10.1007/978-3-031-23606-8_3 %R 10.1007/978-3-031-23606-8_3 %0 Conference Paper %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2022 %T Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations %A Kashi, Aditya %A Nayak, Pratik %A Kulkarni, Dhruva %A Scheinberg, Aaron %A Lin, Paul %A Anzt, Hartwig %X Batched linear solvers, which solve many small related but independent problems, are important in several applications.
This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to keep them operating efficiently, and solving smaller problems one-by-one is not an option. Because of the small size of each problem, the task of coming up with a parallel partitioning scheme and mapping the problem to hardware is not trivial. In recent history, significant attention has been given to batched dense linear algebra. However, there is also an interest in utilizing sparse iterative solvers in a batched form, and this presents further challenges. An example use case is found in a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created for facilitating optimizations and porting to GPUs. The current collision kernel linear solver does not run on the GPU, which is a major bottleneck. As these matrices are well-conditioned, batched iterative sparse solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Comparisons for the solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs with one dual-socket Intel Xeon Skylake CPU node with 40 OpenMP threads are presented for matrices representative of those required in the collision kernel of XGC. The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution on exascale-oriented heterogeneous architectures at the various leadership supercomputing facilities. %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Lyon, France %8 2022-07 %G eng %U https://ieeexplore.ieee.org/document/9820663 %R 10.1109/IPDPS53621.2022.00024 %0 Conference Paper %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2022 %T Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment %A Schuchart, Joseph %A Nookala, Poornima %A Javanmard, Mohammad Mahdi %A Herault, Thomas %A Valeev, Edward F. %A Bosilca, George %A Harrison, Robert J. %X We present and evaluate TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control and data flow-graph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently.
TTG supports distributed memory execution over 2 different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to the state-of-the-art implementations. %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Lyon, France %8 2022-07 %G eng %U https://ieeexplore.ieee.org/abstract/document/9820613 %R 10.1109/IPDPS53621.2022.00086 %0 Journal Article %J ACM Transactions on Mathematical Software %D 2022 %T Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing %A Anzt, Hartwig %A Cojean, Terry %A Flegar, Goran %A Göbel, Fritz %A Grützmacher, Thomas %A Nayak, Pratik %A Ribizel, Tobias %A Tsai, Yuhsiang Mike %A Quintana-Ortí, Enrique S %X In this article, we present Ginkgo, a modern C++ math library for scientific high performance computing. While classical linear algebra libraries act on matrix and vector objects, Ginkgo’s design principle abstracts all functionality as “linear operators,” motivating the notation of a “linear operator algebra library.” Ginkgo’s current focus is oriented toward providing sparse linear algebra functionality for high performance graphics processing unit (GPU) architectures, but given the library design, this focus can be easily extended to accommodate other algorithms and hardware architectures. We introduce this sophisticated software architecture that separates core algorithms from architecture-specific backends and provide details on extensibility and sustainability measures. We also demonstrate Ginkgo’s usability by providing examples on how to use its functionality inside the MFEM and deal.ii finite element ecosystems. Finally, we offer a practical demonstration of Ginkgo’s high performance on state-of-the-art GPU architectures. %B ACM Transactions on Mathematical Software %V 48 %P 1 - 33 %8 2022-03 %G eng %U https://dl.acm.org/doi/10.1145/3480935 %N 12 %! ACM Trans. Math. Softw. %R 10.1145/3480935 %0 Conference Proceedings %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %D 2022 %T Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach %A Whitlock, Matthew %A Morales, Nicolas %A George Bosilca %A Bouteiller, Aurélien %A Nicolae, Bogdan %A Teranishi, Keita %A Giem, Elisabeth %A Sarkar, Vivek %K checkpointing %K Fault tolerance %K Fenix %K HPC %K Kokkos %K MPI-ULFM %K resilience %B 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022) %C Heidelberg, Germany %8 2022-09 %G eng %U https://hal.archives-ouvertes.fr/hal-03772536 %0 Conference Paper %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %D 2022 %T Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG %A Schuchart, Joseph %A Nookala, Poornima %A Herault, Thomas %A Valeev, Edward F. %A George Bosilca %K Dataflow graph %K Hardware %K Instruction sets %K Memory management %K PaR-SEC %K parallel programming %K runtime %K scalability %K Task analysis %K task-based programming %K Template Task Graph %K TTG %X Shared memory parallel programming models strive to provide low-overhead execution environments. 
Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. There is general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we will discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and is able to compete with OpenMP in shared memory. %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Heidelberg, Germany %8 2022-09 %G eng %U https://ieeexplore.ieee.org/document/9912704/ %R 10.1109/CLUSTER51413.2022.00026 %0 Conference Paper %B 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS) %D 2022 %T A Python Library for Matrix Algebra on GPU and Multicore Architectures %A Nance, Delario %A Tomov, Stanimire %A Wong, Kwai %B 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS) %I IEEE %C Denver, CO %8 2022-12 %G eng %U https://ieeexplore.ieee.org/document/9973474/ %R 10.1109/MASS56207.2022.00121 %0 Conference Proceedings %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %D 2022 %T Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications %A Cao, Qinglei %A Abdulah, Sameh %A Alomairy, Rabab %A Pei, Yu %A Nag, Pratik %A Bosilca, George %A Dongarra, Jack %A Genton, Marc G. %A Keyes, David %A Ltaief, Hatem %A Sun, Ying %K climate/weather prediction %K dynamic runtime systems %K high performance computing %K low-rank matrix approximations %K mixed-precision computations %K space-time geospatial statistics %K Task-based programming models %X We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
%B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %I IEEE Press %C Dallas, TX %8 2022-11 %@ 9784665454445 %G eng %U https://dl.acm.org/doi/abs/10.5555/3571885.3571888 %0 Journal Article %J Parallel Computing %D 2021 %T Callback-based completion notification using MPI Continuations %A Schuchart, Joseph %A Samfass, Philipp %A Niethammer, Christoph %A Gracia, José %A Bosilca, George %K MPI %K MPI Continuations %K OmpSs %K OpenMP %K PaRSEC %K TAMPI %K Task-based programming models %X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and nonblocking operations, it falls short of adequately supporting APMs, as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space. %B Parallel Computing %V 106 %P 102793 %8 2021-05 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000466?via%3Dihub %! Parallel Computing %R 10.1016/j.parco.2021.102793 %0 Journal Article %J IEEE Access %D 2021 %T Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems %A Iqbal, Zafar %A Nooshabadi, Saeid %A Yamazaki, Ichitaro %A Tomov, Stanimire %A Dongarra, Jack %B IEEE Access %G eng %R 10.1109/ACCESS.2021.3106054 %0 Generic %D 2021 %T Ginkgo: A Sparse Linear Algebra Library for HPC %A Hartwig Anzt %A Natalie Beams %A Terry Cojean %A Fritz Göbel %A Thomas Grützmacher %A Aditya Kashi %A Pratik Nayak %A Tobias Ribizel %A Yuhsiang M. Tsai %I 2021 ECP Annual Meeting %8 2021-04 %G eng %0 Journal Article %J Computer Physics Communications %D 2021 %T Materials fingerprinting classification %A Spannaus, Adam %A Law, Kody J.H. %A Luszczek, Piotr %A Nasrin, Farzana %A Micucci, Cassie Putman %A Liaw, Peter K. %A Santodonato, Louis J. %A Keffer, David J. %A Maroulas, Vasileios %K Atom probe tomography %K High entropy alloy %K Machine Learning %K Materials discovery %K Topological data analysis %X Significant progress in many classes of materials could be made with the availability of experimentally-derived large datasets composed of atomic identities and three-dimensional coordinates. Methods for visualizing the local atomic structure, such as atom probe tomography (APT), which routinely generate datasets comprised of millions of atoms, are an important step in realizing this goal.
However, state-of-the-art APT instruments generate noisy and sparse datasets that provide information about elemental type, but obscure atomic structures, thus limiting their subsequent value for materials discovery. The application of a materials fingerprinting process, a machine learning algorithm coupled with topological data analysis, provides an avenue by which here-to-fore unprecedented structural information can be extracted from an APT dataset. As a proof of concept, the material fingerprint is applied to high-entropy alloy APT datasets containing body-centered cubic (BCC) and face-centered cubic (FCC) crystal structures. A local atomic configuration centered on an arbitrary atom is assigned a topological descriptor, with which it can be characterized as a BCC or FCC lattice with near perfect accuracy, despite the inherent noise in the dataset. This successful identification of a fingerprint is a crucial first step in the development of algorithms which can extract more nuanced information, such as chemical ordering, from existing datasets of complex materials. %B Computer Physics Communications %P 108019 %8 Jan-05-2021 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0010465521001314 %! Computer Physics Communications %R 10.1016/j.cpc.2021.108019 %0 Conference Paper %B EuroMPI'21 %D 2021 %T Quo Vadis MPI RMA? Towards a More Efficient Use of MPI One-Sided Communication %A Schuchart, Joseph %A Niethammer, Christoph %A Gracia, José %A George Bosilca %K Memory Handles %K MPI %K MPI-RMA %K RDMA %X The MPI standard has long included one-sided communication abstractions through the MPI Remote Memory Access (RMA) interface. Unfortunately, the MPI RMA chapter in the 4.0 version of the MPI standard still contains both well-known and lesser known short-comings for both implementations and users, which lead to potentially non-optimal usage patterns. In this paper, we identify a set of issues and propose ways for applications to better express anticipated usage of RMA routines, allowing the MPI implementation to better adapt to the application's needs. In order to increase the flexibility of the RMA interface, we add the capability to duplicate windows, allowing access to the same resources encapsulated by a window using different configurations. In the same vein, we introduce the concept of MPI memory handles, meant to provide life-time guarantees on memory attached to dynamic windows, removing the overhead currently present in using dynamically exposed memory. We will show that our extensions provide improved accumulate latencies, reduced overheads for multi-threaded flushes, and allow for zero overhead dynamic memory window usage. %B EuroMPI'21 %C Garching, Munich Germany %G eng %U https://arxiv.org/abs/2111.08142 %0 Conference Paper %B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) %D 2020 %T DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models %A Bogdan Nicolae %A Jiali Li %A Justin M. Wozniak %A George Bosilca %A Matthieu Dorier %A Franck Cappello %X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. 
One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers due to the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address the aforementioned limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O, and distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead. %B 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) %I IEEE %C Melbourne, VIC, Australia %8 2020-05 %G eng %R https://doi.org/10.1109/CCGrid49817.2020.00-76 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2020 %T Evaluating Asynchronous Schwarz Solvers on GPUs %A Pratik Nayak %A Terry Cojean %A Hartwig Anzt %K abstract Schwarz methods %K Asynchronous solvers %K exascale %K GPUs %K multicore processors %K parallel numerical linear algebra %X With the commencement of the exascale computing era, we realize that the majority of the leadership supercomputers are heterogeneous and massively parallel. Even a single node can contain multiple co-processors such as GPUs and multiple CPU cores. For example, ORNL’s Summit accumulates six NVIDIA Tesla V100 GPUs and 42 IBM Power9 cores on each node. Synchronizing across compute resources of multiple nodes can be prohibitively expensive. Hence, it is necessary to develop and study asynchronous algorithms that circumvent this issue of bulk-synchronous computing. In this study, we examine the asynchronous version of the abstract Restricted Additive Schwarz method as a solver. We do not explicitly synchronize, but allow the communication between the sub-domains to be completely asynchronous, thereby removing the bulk synchronous nature of the algorithm. We accomplish this by using the one-sided Remote Memory Access (RMA) functions of the MPI standard. We study the benefits of using such an asynchronous solver over its synchronous counterpart. We also study the communication patterns governed by the partitioning and the overlap between the sub-domains on the global solver. Finally, we show that this concept can render attractive performance benefits over the synchronous counterparts even for a well-balanced problem.
%B International Journal of High Performance Computing Applications %8 2020-08 %G eng %R https://doi.org/10.1177/1094342020946814 %0 Journal Article %J Journal of Open Source Software %D 2020 %T Ginkgo: A High Performance Numerical Linear Algebra Library %A Hartwig Anzt %A Terry Cojean %A Yen-Chen Chen %A Fritz Goebel %A Thomas Gruetzmacher %A Pratik Nayak %A Tobias Ribizel %A Yu-Hsiang Tsai %X Ginkgo is a production-ready sparse linear algebra library for high performance computing on GPU-centric architectures with a high level of performance portability and focuses on software sustainability. The library focuses on solving sparse linear systems and accommodates a large variety of matrix formats, state-of-the-art iterative (Krylov) solvers and preconditioners, which make the library suitable for a variety of scientific applications. Ginkgo supports many architectures such as multi-threaded CPU, NVIDIA GPUs, and AMD GPUs. The heavy use of modern C++ features simplifies the addition of new executor paradigms and algorithmic functionality without introducing significant performance overhead. Solving linear systems is usually one of the most computationally and memory intensive aspects of any application. Hence there has been a significant amount of effort in this direction with software libraries such as UMFPACK (Davis, 2004) and CHOLMOD (Chen, Davis, Hager, & Rajamanickam, 2008) for solving linear systems with direct methods and PETSc (Balay et al., 2020), Trilinos (“The Trilinos Project Website,” 2020), Eigen (Guennebaud, Jacob, & others, 2010) and many more to solve linear systems with iterative methods. With Ginkgo, we aim to ensure high performance while not compromising portability. Hence, we provide very efficient low level kernels optimized for different architectures and separate these kernels from the algorithms thereby ensuring extensibility and ease of use. Ginkgo is also a part of the xSDK effort (Bartlett et al., 2017) and available as a Spack (Gamblin et al., 2015) package. xSDK aims to provide infrastructure for and interoperability between a collection of related and complementary software elements to foster rapid and efficient development of scientific applications using High Performance Computing. Within this effort, we provide interoperability with application libraries such as deal.ii (Arndt et al., 2019) and mfem (Anderson et al., 2020). Ginkgo provides wrappers within these two libraries so that they can take advantage of the features of Ginkgo. 
%B Journal of Open Source Software %V 5 %8 2020-08 %G eng %N 52 %R https://doi.org/10.21105/joss.02260 %0 Generic %D 2020 %T Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster) %A Hartwig Anzt %A Terry Cojean %A Yen-Chen Chen %A Fritz Goebel %A Thomas Gruetzmacher %A Pratik Nayak %A Tobias Ribizel %A Yu-Hsiang Tsai %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Generic %D 2020 %T How to Build Your Own Deep Neural Network %A Kwai Wong %A Stanimire Tomov %A Daniel Nichols %A Rocco Febbo %A Florent Lopez %A Julian Halloy %A Xianfeng Ma %K AI %K Deep Neural Networks %K dense linear algebra %K HPC %K ML %I PEARC20 %8 2020-07 %G eng %0 Generic %D 2020 %T Integrating Deep Learning in Domain Science at Exascale (MagmaDNN) %A Stanimire Tomov %A Kwai Wong %A Jack Dongarra %A Rick Archibald %A Edmond Chow %A Eduardo D'Azevedo %A Markus Eisenbach %A Rocco Febbo %A Florent Lopez %A Daniel Nichols %A Junqi Yin %X We will present some of the current challenges in the design and integration of deep learning AI with traditional HPC simulations. We evaluate existing packages for readiness to run deep learning models and applications efficiently on large-scale HPC systems, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted towards data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how these can be provided, e.g., as in MagmaDNN, through a deep integration with existing HPC libraries such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision and asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated on materials science, imaging, and climate applications. %I DOD HPCMP seminar %C virtual %8 2020-12 %G eng %0 Generic %D 2020 %T Integrating Deep Learning in Domain Sciences at Exascale %A Rick Archibald %A Edmond Chow %A Eduardo D'Azevedo %A Jack Dongarra %A Markus Eisenbach %A Rocco Febbo %A Florent Lopez %A Daniel Nichols %A Stanimire Tomov %A Kwai Wong %A Junqi Yin %X This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows.
This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2020-08 %G eng %0 Conference Paper %B 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020) %D 2020 %T Integrating Deep Learning in Domain Sciences at Exascale %A Rick Archibald %A Edmond Chow %A Eduardo D'Azevedo %A Jack Dongarra %A Markus Eisenbach %A Rocco Febbo %A Florent Lopez %A Daniel Nichols %A Stanimire Tomov %A Kwai Wong %A Junqi Yin %X This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications. %B 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020) %8 2020-08 %G eng %0 Journal Article %J ACM Transactions on Parallel Computing %D 2020 %T Load-Balancing Sparse Matrix Vector Product Kernels on GPUs %A Hartwig Anzt %A Terry Cojean %A Chen Yen-Chen %A Jack Dongarra %A Goran Flegar %A Pratik Nayak %A Stanimire Tomov %A Yuhsiang M. Tsai %A Weichung Wang %X Efficient processing of Irregular Matrices on Single Instruction, Multiple Data (SIMD)-type architectures is a persistent challenge. Resolving it requires innovations in the development of data formats, computational techniques, and implementations that strike a balance between thread divergence, which is inherent for Irregular Matrices, and padding, which alleviates the performance-detrimental thread divergence but introduces artificial overheads.
To this end, in this article, we address the challenge of designing high-performance sparse matrix-vector product (SpMV) kernels for NVIDIA Graphics Processing Units (GPUs). We present a compressed sparse row (CSR) format suitable for unbalanced matrices. We also provide a load-balancing kernel for the coordinate (COO) matrix format and extend it to a hybrid algorithm that stores part of the matrix in the SIMD-friendly Ellpack (ELL) format. The ratio between the ELL- and the COO-part is determined using a theoretical analysis of the nonzeros-per-row distribution. For the over 2,800 test matrices available in the SuiteSparse matrix collection, we compare the performance against SpMV kernels provided by NVIDIA’s cuSPARSE library and a heavily-tuned sliced ELL (SELL-P) kernel that prevents unnecessary padding by considering the irregular matrices as a combination of matrix blocks stored in ELL format. %B ACM Transactions on Parallel Computing %V 7 %8 2020-03 %G eng %N 1 %R https://doi.org/10.1145/3380930 %0 Generic %D 2020 %T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic %A Ahmad Abdelfattah %A Hartwig Anzt %A Erik Boman %A Erin Carson %A Terry Cojean %A Jack Dongarra %A Mark Gates %A Thomas Gruetzmacher %A Nicholas J. Higham %A Sherry Li %A Neil Lindquist %A Yang Liu %A Jennifer Loe %A Piotr Luszczek %A Pratik Nayak %A Sri Pranesh %A Siva Rajamanickam %A Tobias Ribizel %A Barry Smith %A Kasia Swirydowicz %A Stephen Thomas %A Stanimire Tomov %A Yaohung Tsai %A Ichitaro Yamazaki %A Ulrike Meier Yang %B SLATE Working Notes %I University of Tennessee %8 2020-07 %G eng %9 SLATE Working Notes %0 Conference Paper %B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) %D 2020 %T The Template Task Graph (TTG) - An Emerging Practical Dataflow Programming Paradigm for Scientific Simulation at Extreme Scale %A George Bosilca %A Robert Harrison %A Thomas Herault %A Mohammad Mahdi Javanmard %A Poornima Nookala %A Edward Valeev %K dag %K dataflow %K exascale %K graph %K High-performance computing %K workflow %X We describe TESSE, an emerging general-purpose, open-source software ecosystem that attacks the twin challenges of programmer productivity and portable performance for advanced scientific applications on modern high-performance computers. TESSE builds upon and extends the PaRSEC DAG/dataflow runtime with a new Domain Specific Language (DSL) and new integration capabilities. Motivating this work is our belief that such a dataflow model, perhaps with applications composed in domain specific languages, can overcome many of the challenges faced by a wide variety of irregular applications that are poorly served by current programming and execution models. Two such applications from many-body physics and applied mathematics are briefly explored. This paper focuses upon the Template Task Graph (TTG), which is TESSE's main C++ API that provides a powerful work/data-flow programming model. Algorithms on spatial trees, block-sparse tensors, and wave fronts are used to illustrate the API and associated concepts, as well as to compare with related approaches. %B 2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) %I IEEE %8 2020-11 %G eng %R https://doi.org/10.1109/ESPM251964.2020.00011 %0 Generic %D 2019 %T A Collection of Presentations from the BDEC2 Workshop in Kobe, Japan %A Rosa M.
Badia %A Micah Beck %A François Bodin %A Taisuke Boku %A Franck Cappello %A Alok Choudhary %A Carlos Costa %A Ewa Deelman %A Nicola Ferrier %A Katsuki Fujisawa %A Kohei Fujita %A Maria Girone %A Geoffrey Fox %A Shantenu Jha %A Yoshinari Kameda %A Christian Kniep %A William Kramer %A James Lin %A Kengo Nakajima %A Yiwei Qiu %A Kishore Ramachandran %A Glenn Ricart %A Kim Serradell %A Dan Stanzione %A Lin Gan %A Martin Swany %A Christine Sweeney %A Alex Szalay %A Christine Kirkpatrick %A Kenton McHenry %A Alainna White %A Steve Tuecke %A Ian Foster %A Joe Mambretti %A William M. Tang %A Michela Taufer %A Miguel Vázquez %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2019-02 %G eng %0 Generic %D 2019 %T MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs %A Lucien Ng %A Sihan Chen %A Alex Gessinger %A Daniel Nichols %A Sophia Cheng %A Anu Meenasorna %A Kwai Wong %A Stanimire Tomov %A Azzam Haidar %A Eduardo D'Azevedo %A Jack Dongarra %I University of Tennessee %8 2019-01 %G eng %R 10.13140/RG.2.2.14906.64961 %0 Conference Paper %B Practice and Experience in Advanced Research Computing (PEARC ’19) %D 2019 %T MagmaDNN: Accelerated Deep Learning Using MAGMA %A Daniel Nichols %A Kwai Wong %A Stanimire Tomov %A Lucien Ng %A Sihan Chen %A Alex Gessinger %B Practice and Experience in Advanced Research Computing (PEARC ’19) %I ACM %C Chicago, IL %8 2019-07 %G eng %0 Conference Paper %B ISC High Performance %D 2019 %T MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing %A Daniel Nichols %A Natalie-Sofia Tomov %A Frank Betancourt %A Stanimire Tomov %A Kwai Wong %A Jack Dongarra %X In this paper, we present work towards the development of a new data analytics and machine learning (ML) framework, called MagmaDNN. Our main goal is to provide scalable, high-performance data analytics and ML solutions for scientific applications running on current and upcoming heterogeneous many-core GPU-accelerated architectures. To this end, since many of the functionalities needed are based on standard linear algebra (LA) routines, we designed MagmaDNN to derive its performance power from the MAGMA library. The close integration provides the fundamental (scalable high-performance) LA routines available in MAGMA as a backend to MagmaDNN. We present some design issues for performance and scalability that are specific to ML using Deep Neural Networks (DNN), as well as the MagmaDNN designs towards overcoming them. In particular, MagmaDNN uses well-established HPC techniques from the area of dense LA, including task-based parallelization, DAG representations, scheduling, mixed-precision algorithms, asynchronous solvers, and autotuned hyperparameter optimization. We illustrate these techniques and their incorporation and use to outperform other frameworks currently available.
%B ISC High Performance %I Springer International Publishing %C Frankfurt, Germany %8 2019-06 %G eng %R https://doi.org/10.1007/978-3-030-34356-9_37 %0 Conference Paper %B Practice and Experience in Advanced Research Computing (PEARC ’19) %D 2019 %T OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework %A Frank Betancourt %A Kwai Wong %A Efosa Asemota %A Quindell Marshall %A Daniel Nichols %A Stanimire Tomov %B Practice and Experience in Advanced Research Computing (PEARC ’19) %I ACM %C Chicago, IL %8 2019-07 %G eng %0 Conference Paper %B Platform for Advanced Scientific Computing Conference (PASC 2019) %D 2019 %T Towards Continuous Benchmarking %A Hartwig Anzt %A Yen-Chen Chen %A Terry Cojean %A Jack Dongarra %A Goran Flegar %A Pratik Nayak %A Enrique S. Quintana-Orti %A Yuhsiang M. Tsai %A Weichung Wang %X We present an automated performance evaluation framework that enables an automated workflow for testing and performance evaluation of software libraries. Integrating this component into an ecosystem enables sustainable software development, as a community effort, via a web application for interactively evaluating the performance of individual software components. The performance evaluation tool is based exclusively on web technologies, which removes the burden of downloading performance data or installing additional software. We employ this framework for the Ginkgo software ecosystem, but the framework can be used with essentially any software project, including the comparison between different software libraries. The Continuous Integration (CI) framework of Ginkgo is also extended to automatically run a benchmark suite on predetermined HPC systems, store the state of the machine and the environment along with the compiled binaries, and collect results in a publicly accessible performance data repository based on Git. The Ginkgo performance explorer (GPE) can be used to retrieve the performance data from the repository and visualize it in a web browser. GPE also implements an interface that allows users to write scripts, archived in a Git repository, to extract particular data, compute particular metrics, and visualize them in many different formats (as specified by the script). The combination of these approaches creates a workflow which enables performance reproducibility and software sustainability of scientific software. In this paper, we present example scripts that extract and visualize performance data for Ginkgo’s SpMV kernels that allow users to identify the optimal kernel for specific problem characteristics. %B Platform for Advanced Scientific Computing Conference (PASC 2019) %I ACM Press %C Zurich, Switzerland %8 2019-06 %@ 9781450367707 %G eng %R https://doi.org/10.1145/3324989.3325719 %0 Conference Paper %B 2019 European Conference on Parallel Processing (Euro-Par 2019) %D 2019 %T Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring %A Shu-Mei Tseng %A Bogdan Nicolae %A George Bosilca %A Emmanuel Jeannot %A Aparna Chandramowlishwaran %A Franck Cappello %X Stealing network bandwidth helps a variety of HPC runtimes and services to run additional operations in the background without negatively affecting the applications. A key ingredient to make this possible is an accurate prediction of the future network utilization, enabling the runtime to plan the background operations in advance, so as to avoid competing with the application for network bandwidth.
In this paper, we propose a portable deep learning predictor that only uses the information available through MPI introspection to construct a recurrent sequence-to-sequence neural network capable of forecasting network utilization. We leverage the fact that most HPC applications exhibit periodic behaviors to enable predictions far into the future (at least the length of a period). Our online approach does not have an initial training phase; it continuously improves itself during application execution without incurring significant computational overhead. Experimental results show better accuracy and lower computational overhead compared with the state-of-the-art on two representative applications. %B 2019 European Conference on Parallel Processing (Euro-Par 2019) %I Springer %C Göttingen, Germany %8 2019-08 %G eng %R https://doi.org/10.1007/978-3-030-29400-7_4 %0 Conference Paper %B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) %D 2019 %T Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training %A Jiali Li %A Bogdan Nicolae %A Justin M. Wozniak %A George Bosilca %X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as the number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction, and weight updates mitigates the effects of stragglers during all-reduce only partially. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations that are coupled with the training. %B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) %I IEEE %C Denver, CO %8 2019-11 %G eng %R https://doi.org/10.1109/MLHPC49564.2019.00006 %0 Journal Article %J Proceedings of the IEEE %D 2018 %T Autotuning in High-Performance Computing Applications %A Prasanna Balaprakash %A Jack Dongarra %A Todd Gamblin %A Mary Hall %A Jeffrey Hollingsworth %A Boyana Norris %A Richard Vuduc %K High-performance computing %K performance tuning programming systems %X Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications.
To date, autotuning has been used primarily in high-performance applications through tunable libraries or previously tuned application code that is integrated directly into the application. This paper draws on the authors' extensive experience applying autotuning to high-performance applications, describing both successes and future challenges. If autotuning is to be widely used in the HPC community, researchers must address the software engineering challenges, manage configuration overheads, and continue to demonstrate significant performance gains and portability across architectures. In particular, tools that configure the application must be integrated into the application build process so that tuning can be reapplied as the application and target architectures evolve. %B Proceedings of the IEEE %V 106 %P 2068–2083 %8 2018-11 %G eng %N 11 %R 10.1109/JPROC.2018.2841200 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2018 %T A Survey of MPI Usage in the US Exascale Computing Project %A David E. Bernholdt %A Swen Boehm %A George Bosilca %A Manjunath Gorentla Venkata %A Ryan E. Grant %A Thomas Naughton %A Howard P. Pritchard %A Martin Schulz %A Geoffroy R. Vallee %K exascale %K MPI %X The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use, the Message Passing Interface (MPI), and to help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. This paper reports the results of that survey for the benefit of the broader community of MPI developers. %B Concurrency and Computation: Practice and Experience %8 2018-09 %G eng %9 Special Issue %R https://doi.org/10.1002/cpe.4851 %0 Generic %D 2017 %T MagmaDNN – High-Performance Data Analytics for Manycore GPUs and CPUs %A Lucien Ng %A Kwai Wong %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %I 2017 Summer Research Experiences for Undergraduate (REU), Presentation %C Knoxville, TN %8 2017-12 %G eng %0 Journal Article %J IEEE Embedded Systems Letters %D 2017 %T Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems %A Ichitaro Yamazaki %A Saeid Nooshabadi %A Stanimire Tomov %A Jack Dongarra %K Karush Kuhn Tucker (KKT) %K Realtime embedded convex optimization solver %X With the increasing sophistication in the use of optimization algorithms such as deep learning on embedded systems, convex optimization solvers on embedded systems have found widespread use. This letter presents a novel linear solver technique to reduce the run-time of a convex optimization solver by using the property that some parameters are fixed during the solution iterations of a solve instance. Our experimental results show that the run-time can be reduced by two orders of magnitude. %B IEEE Embedded Systems Letters %V 9 %P 61–64 %8 2017-05 %G eng %U http://ieeexplore.ieee.org/document/7917357/ %N 3 %R 10.1109/LES.2017.2700401 %0 Conference Proceedings %B Software for Exascale Computing - SPPEXA %D 2016 %T Domain Overlap for Iterative Sparse Triangular Solves on GPUs %A Hartwig Anzt %A Edmond Chow %A Daniel Szyld %A Jack Dongarra %E Hans-Joachim Bungartz %E Philipp Neumann %E Wolfgang E.
Nagel %X Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., more subdomains and threads than cores, there is a preference in processing or scheduling the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution. %B Software for Exascale Computing - SPPEXA %S Lecture Notes in Computer Science and Engineering %I Springer International Publishing %V 113 %P 527–545 %8 2016-09 %G eng %R 10.1007/978-3-319-40528-5_24 %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %K plasma %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Generic %D 2016 %T High Performance Realtime Convex Solver for Embedded Systems %A Ichitaro Yamazaki %A Saeid Nooshabadi %A Stanimire Tomov %A Jack Dongarra %K KKT %K Realtime embedded convex optimization solver %X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique to reduce the run-time of decomposition of KKT matrix for the convex optimization solver for an embedded system, by two orders of magnitude. 
We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances. %B University of Tennessee Computer Science Technical Report %8 2016-10 %G eng %0 Journal Article %J VMWare Technical Journal %D 2014 %T Analyzing PAPI Performance on Virtual Machines %A John Nelson %X Performance Application Programming Interface (PAPI) aims to provide a consistent interface for measuring performance events using the performance counter hardware available on the CPU as well as available software performance events and off-chip hardware. Without PAPI, a user may be forced to search through specific processor documentation to discover the name of processor performance events. These names can change from model to model and vendor to vendor. PAPI simplifies this process by providing a consistent interface and a set of processor-agnostic preset events. Software engineers can use data collected through source-code instrumentation using the PAPI interface to examine the relation between software performance and performance events. PAPI can also be used within many high-level performance-monitoring utilities such as TAU, Vampir, and Score-P. VMware® ESXiTM and KVM have both added support within the last year for virtualizing performance counters. This article compares results measuring the performance of five real-world applications included in the Mantevo Benchmarking Suite in a VMware virtual machine, a KVM virtual machine, and on bare metal. By examining these results, it will be shown that PAPI provides accurate performance counts in a virtual machine environment. %B VMWare Technical Journal %V Winter 2013 %8 2014-01 %G eng %U https://labs.vmware.com/vmtj/analyzing-papi-performance-on-virtual-machines %0 Generic %D 2013 %T Analyzing PAPI Performance on Virtual Machines %A John Nelson %X Over the last ten years, virtualization techniques have become much more widely popular as a result of fast and cheap processors. Virtualization provides many benefits making it appealing for testing environments. Encapsulating configurations is a huge motivator for wanting to do performance testing on virtual machines. Provisioning, a technique that is used by FutureGrid, is also simplified using virtual machines. Virtual machines enable portability among heterogeneous systems while providing an identical configuration within the guest operating system. My work in ICL has focused on using PAPI inside of virtual machines. There were two main areas of focus throughout my research. The first originated because of anomalous results of the HPC Challenge Benchmark reported in a paper submitted by ICL [3] in which the order of input sizes tested impacted run time on virtual machines but not on bare metal. A discussion of this anomaly will be given in section II along with a discussion of timers used in virtual machines. The second area of focus was exploring the recently implemented support by KVM (Kernel-based Virtual Machine) and VMware for guest OS level performance counters. A discussion of application tests run to observe the behavior of event counts measured in a virtual machine as well as a discussion of information learned pertinent to event measurement will be given in section III. 
%B ICL Technical Report %8 2013-08 %G eng %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Diagnosis and Optimization of Application Prefetching Performance %A Gabriel Marin %A Colin McCurdy %A Jeffrey Vetter %E Malony, Allen D. %E Nemirovsky, Mario %E Midkiff, Sam %X Hardware prefetchers are effective at recognizing streaming memory access patterns and at moving data closer to the processing units to hide memory latency. However, hardware prefetchers can track only a limited number of data streams due to finite hardware resources. In this paper, we introduce the term streaming concurrency to characterize the number of parallel, logical data streams in an application. We present a simulation algorithm for understanding the streaming concurrency at any point in an application, and we show that this metric is a good predictor of the number of memory requests initiated by streaming prefetchers. Next, we try to understand the causes behind poor prefetching performance. We identified four prefetch-unfriendly conditions and we show how to classify an application's memory references based on these conditions. We evaluated our analysis using the SPEC CPU2006 benchmark suite. We selected two benchmarks with unfavorable access patterns and transformed them to improve their prefetching effectiveness. Results show that making applications more prefetcher friendly can yield meaningful performance gains. %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465014 %R 10.1145/2464996.2465014 %0 Generic %D 2013 %T PAPI 5: Measuring Power, Energy, and the Cloud %A Vincent Weaver %A Dan Terpstra %A Heike McCraw %A Matt Johnson %A Kiran Kasichayanula %A James Ralph %A John Nelson %A Phil Mucci %A Tushar Mohan %A Shirley Moore %I 2013 IEEE International Symposium on Performance Analysis of Systems and Software %C Austin, TX %8 2013-04 %G eng %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication %A Azzam Haidar %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %E Malony, Allen D. %E Nemirovsky, Mario %E Midkiff, Sam %K eigenvalue %K gpu communication %K gpu computation %K heterogeneous programming model %K performance %K reduction to tridiagonal %K singular value decomposition %K task parallelism %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology for addressing these challenges, from our algorithm design, kernel optimization, and tuning to our programming model, in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs.
This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores. %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465438 %R 10.1145/2464996.2465438 %0 Journal Article %J Applied Parallel and Scientific Computing %D 2012 %T An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs %A Jakub Kurzak %A Rajib Nath %A Peng Du %A Jack Dongarra %E Kristján Jónasson %B Applied Parallel and Scientific Computing %V 7133 %P 248-257 %8 2012-00 %G eng %0 Journal Article %J CloudTech-HPC 2012 %D 2012 %T PAPI-V: Performance Monitoring for Virtual Machines %A Matt Johnson %A Heike McCraw %A Shirley Moore %A Phil Mucci %A John Nelson %A Dan Terpstra %A Vincent M Weaver %A Tushar Mohan %K papi %X This paper describes extensions to the PAPI hardware counter library for virtual environments, called PAPI-V. The extensions support timing routines, I/O measurements, and processor counters. The PAPI-V extensions will allow application and tool developers to use a familiar interface to obtain relevant hardware performance monitoring information in virtual environments. %B CloudTech-HPC 2012 %C Pittsburgh, PA %8 2012-09 %G eng %R 10.1109/ICPPW.2012.29 %0 Conference Proceedings %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %D 2011 %T Correlated Set Coordination in Fault Tolerant Message Logging Protocols %A Aurelien Bouteiller %A Thomas Herault %A George Bosilca %A Jack Dongarra %E Emmanuel Jeannot %E Raymond Namyst %E Jean Roman %K ftmpi %B Proceedings of 17th International Conference, Euro-Par 2011, Part II %I Springer %C Bordeaux, France %V 6853 %P 51-64 %8 2011-08 %G eng %0 Journal Article %J in GPU Computing Gems, Jade Edition %D 2011 %T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %E Wen-mei W. Hwu %K magma %K morse %B in GPU Computing Gems, Jade Edition %I Elsevier %V 2 %P 473-484 %8 2011-00 %G eng %0 Journal Article %J 18th EuroMPI %D 2011 %T Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW %A Teng Ma %A Aurelien Bouteiller %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. 
Nikolopoulos %E Jack Dongarra %K dague %B 18th EuroMPI %I Springer %C Santorini, Greece %P 247-254 %8 2011-09 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2011 %T The International Exascale Software Project Roadmap %A Jack Dongarra %A Pete Beckman %A Terry Moore %A Patrick Aerts %A Giovanni Aloisio %A Jean-Claude Andre %A David Barkai %A Jean-Yves Berthou %A Taisuke Boku %A Bertrand Braunschweig %A Franck Cappello %A Barbara Chapman %A Xuebin Chi %A Alok Choudhary %A Sudip Dosanjh %A Thom Dunning %A Sandro Fiore %A Al Geist %A Bill Gropp %A Robert Harrison %A Mark Hereld %A Michael Heroux %A Adolfy Hoisie %A Koh Hotta %A Zhong Jin %A Yutaka Ishikawa %A Fred Johnson %A Sanjay Kale %A Richard Kenway %A David Keyes %A Bill Kramer %A Jesus Labarta %A Alain Lichnewsky %A Thomas Lippert %A Bob Lucas %A Barney MacCabe %A Satoshi Matsuoka %A Paul Messina %A Peter Michielse %A Bernd Mohr %A Matthias S. Mueller %A Wolfgang E. Nagel %A Hiroshi Nakashima %A Michael E. Papka %A Dan Reed %A Mitsuhisa Sato %A Ed Seidel %A John Shalf %A David Skinner %A Marc Snir %A Thomas Sterling %A Rick Stevens %A Fred Streitz %A Bob Sugar %A Shinji Sumimoto %A William Tang %A John Taylor %A Rajeev Thakur %A Anne Trefethen %A Mateo Valero %A Aad van der Steen %A Jeffrey Vetter %A Peg Williams %A Robert Wisniewski %A Kathy Yelick %X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project. %B International Journal of High Performance Computing %V 25 %P 3-60 %8 2011-01 %G eng %R 10.1177/1094342010391989 %0 Journal Article %J 18th EuroMPI %D 2011 %T OMPIO: A Modular Software Architecture for MPI I/O %A Mohamad Chaarawi %A Edgar Gabriel %A Rainer Keller %A Richard L. Graham %A George Bosilca %A Jack Dongarra %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S.
Nikolopoulos %E Jack Dongarra %B 18th EuroMPI %I Springer %C Santorini, Greece %P 81-89 %8 2011-09 %G eng %0 Conference Proceedings %B ACM/IEEE Conference on Supercomputing (SC’11) %D 2011 %T Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs %A Rajib Nath %A Stanimire Tomov %A Tingxing Dong %A Jack Dongarra %K magma %B ACM/IEEE Conference on Supercomputing (SC’11) %C Seattle, WA %8 2011-11 %G eng %0 Conference Proceedings %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %D 2011 %T Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure %A George Bosilca %A Thomas Herault %A Pierre Lemarinier %A Jack Dongarra %A A. Rezmerita %E Yiannis Cotronis %E Anthony Danalis %E Dimitrios S. Nikolopoulos %E Jack Dongarra %K ftmpi %B Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011 %I Springer %C Santorini, Greece %V 6960 %P 342-344 %8 2011-09 %G eng %0 Journal Article %J Proc. of VECPAR'10 %D 2010 %T Accelerating GPU Kernels for Dense Linear Algebra %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B Proc. of VECPAR'10 %C Berkeley, CA %8 2010-06 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %B Parallel Computing %V 36 %P 645-654 %8 2010-00 %G eng %0 Generic %D 2010 %T Autotuning Dense Linear Algebra Libraries on GPUs %A Rajib Nath %A Stanimire Tomov %A Emmanuel Agullo %A Jack Dongarra %I Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010) %C Basel, Switzerland %8 2010-06 %G eng %0 Book Section %B Scientific Computing with Multicore and Accelerators %D 2010 %T BLAS for GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %B Scientific Computing with Multicore and Accelerators %S Chapman & Hall/CRC Computational Science %I CRC Press %C Boca Raton, Florida %@ 9781439825365 %G eng %& 4 %0 Conference Proceedings %B Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on %D 2010 %T Dense Linear Algebra Solvers for Multicore with GPU Accelerators %A Stanimire Tomov %A Rajib Nath %A Hatem Ltaeif %A Jack Dongarra %X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers.
The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library. %B Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on %C Atlanta, GA %P 1-8 %G eng %R 10.1109/IPDPSW.2010.5470941 %0 Generic %D 2010 %T EZTrace: a generic framework for performance analysis %A Jack Dongarra %A Mathieu Faverge %A Yutaka Ishikawa %A Raymond Namyst %A François Rue %A François Trahay %B ICL Technical Report %8 2010-12 %G eng %0 Generic %D 2010 %T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %B LAPACK Working Note %8 2010-00 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems (submitted) %D 2010 %T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %K plasma %B IEEE Transactions on Parallel and Distributed Systems (submitted) %8 2010-03 %G eng %0 Generic %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report %8 2010-07 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B International Journal of High Performance Computing %V 24 %P 511-515 %8 2010-00 %G eng %0 Journal Article %J Proc. of VECPAR'10 (to appear) %D 2010 %T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Peng Du %A Jack Dongarra %K magma %K plasma %B Proc. of VECPAR'10 (to appear) %C Berkeley, CA %8 2010-06 %G eng %0 Generic %D 2010 %T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Rajib Nath %A Jean Roman %A Samuel Thibault %A Stanimire Tomov %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster %C Knoxville, TN %8 2010-07 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2010 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %B International Journal of High Performance Computing Applications (to appear) %8 2010-00 %G eng %0 Journal Article %J Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %D 2009 %T Computational Science – ICCS 2009, Proceedings of the 9th International Conference %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M.
Sloot %B Lecture Notes in Computer Science: Theoretical Computer Science and General Issues %C Baton Rouge, LA %V - %8 2009-05 %G eng %0 Journal Article %J ISC'09 %D 2009 %T I/O Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Shirley Moore %A Dan Terpstra %A Jack Dongarra %A Andreas Knuepfer %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %K test %B ISC'09 %C Hamburg, Germany %8 2009-06 %G eng %0 Conference Proceedings %B SciDAC 2009, Journal of Physics: Conference Series %D 2009 %T Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team %A Bronis R. de Supinski %A Sadaf Alam %A David Bailey %A Laura Carrington %A Chris Daley %A Anshu Dubey %A Todd Gamblin %A Dan Gunter %A Paul D. Hovland %A Heike Jagode %A Karen Karavanic %A Gabriel Marin %A John Mellor-Crummey %A Shirley Moore %A Boyana Norris %A Leonid Oliker %A Catherine Olschanowsky %A Philip C. Roth %A Martin Schulz %A Sameer Shende %A Allan Snavely %K test %B SciDAC 2009, Journal of Physics: Conference Series %I IOP Publishing %C San Diego, California %V 180(2009)012039 %8 2009-07 %G eng %0 Conference Proceedings %B 9th International Conference on Computational Science (ICCS 2009) %D 2009 %T A Note on Auto-tuning GEMM for GPUs %A Yinan Li %A Jack Dongarra %A Stanimire Tomov %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. Sloot %B 9th International Conference on Computational Science (ICCS 2009) %C Baton Rouge, LA %P 884-892 %8 2009-05 %G eng %R 10.1007/978-3-642-01970-8_89 %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 2009-11 %G eng %0 Generic %D 2009 %T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project %A Rajib Nath %A Jack Dongarra %A Stanimire Tomov %A Hatem Ltaeif %A Peng Du %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, Oregon %8 2009-11 %G eng %0 Generic %D 2009 %T Trace-based Performance Analysis for the Petascale Simulation Code FLASH %A Heike Jagode %A Andreas Knuepfer %A Jack Dongarra %A Matthias Jurenz %A Matthias S. Mueller %A Wolfgang E. Nagel %K test %B Innovative Computing Laboratory Technical Report %8 2009-04 %G eng %0 Conference Proceedings %B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear) %D 2009 %T VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance %A Lavanya Ramakrishan %A Daniel Nurmi %A Anirban Mandal %A Charles Koelbel %A Dennis Gannon %A Mark Huang %A Yang-Suk Kee %A Graziano Obertelli %A Kiran Thyagaraja %A Rich Wolski %A Asim YarKhan %A Dmitrii Zagorodnov %K grads %B SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear) %C Portland, OR %8 2009-00 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPCMP User Group Conference %D 2008 %T Exploring New Architectures in Accelerating CFD for Air Force Applications %A Jack Dongarra %A Shirley Moore %A Gregory D. 
Peterson %A Stanimire Tomov %A Jeff Allred %A Vincent Natoli %A David Richie %K magma %B Proceedings of the DoD HPCMP User Group Conference %C Seattle, Washington %8 2008-01 %G eng %0 Journal Article %J Recent developments in Grid Technology and Applications %D 2008 %T High Performance GridRPC Middleware %A Yves Caniou %A Eddy Caron %A Frederic Desprez %A Hidemoto Nakada %A Yoshio Tanaka %A Keith Seymour %E George A. Gravvanis %E John P. Morrison %E Hamid R. Arabnia %E D. A. Power %K netsolve %B Recent developments in Grid Technology and Applications %I Nova Science Publishers %8 2008-00 %G eng %0 Conference Paper %B Proceedings of DoD HPCMP UGC 2005 %D 2005 %T Performance Profiling and Analysis of DoD Applications using PAPI and TAU %A Shirley Moore %A David Cronk %A Felix Wolf %A Avi Purkayastha %A Patricia J. Teller %A Robert Araiza %A Gabriela Aguilera %A Jamie Nava %K papi %B Proceedings of DoD HPCMP UGC 2005 %I IEEE %C Nashville, TN %8 2005-06 %G eng %0 Journal Article %J Oak Ridge National Laboratory Report %D 2004 %T Cray X1 Evaluation Status Report %A Pratul Agarwal %A R. A. Alexander %A E. Apra %A Satish Balay %A Arthur S. Bland %A James Colgan %A Eduardo D'Azevedo %A Jack Dongarra %A Tom Dunigan %A Mark Fahey %A Al Geist %A M. Gordon %A Robert Harrison %A Dinesh Kaushik %A M. Krishnakumar %A Piotr Luszczek %A Tony Mezzacapa %A Jeff Nichols %A Jarek Nieplocha %A Leonid Oliker %A T. Packwood %A M. Pindzola %A Thomas C. Schulthess %A Jeffrey Vetter %A James B White %A T. Windus %A Patrick H. Worley %A Thomas Zacharia %B Oak Ridge National Laboratory Report %V /-2004/13 %8 2004-01 %G eng %0 Journal Article %J Journal of Digital Information special issue on Interactivity in Digital Libraries %D 2002 %T Active Netlib: An Active Mathematical Software Collection for Inquiry-based Computational Science and Engineering Education %A Shirley Moore %A A.J. Baker %A Jack Dongarra %A Christian Halloy %A Chung Ng %K activenetlib %K rib %B Journal of Digital Information special issue on Interactivity in Digital Libraries %V 2 %8 2002-00 %G eng %0 Generic %D 2002 %T GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %B ICL Technical Report %8 2002-11 %G eng %0 Conference Proceedings %B Proceedings of the Third International Workshop on Grid Computing %D 2002 %T Overview of GridRPC: A Remote Procedure Call API for Grid Computing %A Keith Seymour %A Hidemoto Nakada %A Satoshi Matsuoka %A Jack Dongarra %A Craig Lee %A Henri Casanova %E Manish Parashar %B Proceedings of the Third International Workshop on Grid Computing %P 274-278 %8 2002-01 %G eng