%0 Conference Paper %B 52nd International Conference on Parallel Processing (ICPP 2023) %D 2023 %T Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine %A Omri Mor %A George Bosilca %A Marc Snir %K asynchronous many-task %K dynamic runtime %K lightweight communication %K low-rank Cholesky %K message-passing %K MPI %K strong scaling %X There is a growing interest in Asynchronous Many-Task (AMT) runtimes as an efficient way to map irregular and dynamic parallel applications onto heterogeneous computing resources. In this work, we show that AMTs nonetheless struggle with communication bottlenecks when scaling computations strongly and that the design of commonly-used communication libraries such as MPI contributes to these bottlenecks. We replace MPI with LCI, a Lightweight Communication Interface that is designed for dynamic, asynchronous frameworks, as the communication layer for the PaRSEC runtime. The result is a significant reduction of end-to-end latency in communication microbenchmarks and a reduction of overall time-to-solution by up to 12% in HiCMA, a tile-based low-rank Cholesky factorization package. %B 52nd International Conference on Parallel Processing (ICPP 2023) %I ACM %C Salt Lake City, Utah %8 2023-09 %G eng %U http://snir.cs.illinois.edu/listed/icpp2023-69.pdf %R 10.1145/3605573.3605642 %0 Conference Paper %B 2023 IEEE International Conference on Cluster Computing (CLUSTER) %D 2023 %T Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion %A Cao, Qinglei %A Abdulah, Sameh %A Ltaief, Hatem %A Genton, Marc G. %A Keyes, David %A Bosilca, George %X The burgeoning interest in large-scale geospatial modeling, particularly within the domains of climate and weather prediction, underscores the concomitant critical importance of accuracy, scalability, and computational speed. Harnessing these complex simulations’ potential, however, necessitates innovative computational strategies, especially considering the increasing volume of data involved. Recent advancements in Graphics Processing Units (GPUs) have opened up new avenues for accelerating these modeling processes. In particular, their efficient utilization necessitates new strategies, such as mixed-precision arithmetic, that can balance the trade-off between computational speed and model accuracy. This paper leverages the PaRSEC runtime system and delves into the opportunities provided by mixed-precision arithmetic to expedite large-scale geospatial modeling in heterogeneous environments. By using an automated conversion strategy, our mixed-precision approach significantly improves computational performance (up to 3X) on the Summit supercomputer and reduces the associated energy consumption on various Nvidia GPU generations. Importantly, this implementation ensures the requisite accuracy in environmental applications, a critical factor in their operational viability. The findings of this study bear significant implications for future research and development in high-performance computing, underscoring the transformative potential of mixed-precision arithmetic on GPUs in addressing the computational demands of large-scale geospatial modeling and making a stride toward a more sustainable, efficient, and accurate future in large-scale environmental applications.
%B 2023 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Santa Fe, NM, USA %8 2023-11 %G eng %U https://ieeexplore.ieee.org/document/10319946/ %R 10.1109/CLUSTER52292.2023.00035 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2022 %T Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach With PaRSEC %A Abdulah, Sameh %A Qinglei Cao %A Pei, Yu %A George Bosilca %A Jack Dongarra %A Genton, Marc G. %A Keyes, David E. %A Ltaief, Hatem %A Sun, Ying %K Computational modeling %K Covariance matrices %K Data models %K Maximum likelihood estimation %K Predictive models %K runtime %K Task analysis %X Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity via a kernel fitted to a covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive definite covariance matrix of the dimension of the number of correlated observations. Two essential operations in MLE are the application of the inverse and evaluation of the determinant of the covariance matrix. These can be rendered through the Cholesky decomposition and triangular solution. In this contribution, we reduce the precision of weakly correlated locations to single- or half-precision based on distance. We thus exploit mathematical structure to migrate MLE to a three-precision approximation that takes advantage of contemporary architectures offering BLAS3-like operations in a single instruction that are extremely fast for reduced precision. We illustrate application-expected accuracy worthy of double-precision from a majority half-precision computation, in a context where uniform single-precision is by itself insufficient. In tackling the complexity and imbalance caused by the mixing of three precisions, we deploy the PaRSEC runtime system. PaRSEC delivers on-demand casting of precisions while orchestrating tasks and data movement in a multi-GPU distributed-memory environment within a tile-based Cholesky factorization. Application-expected accuracy is maintained while achieving up to 1.59X by mixing FP64/FP32 operations on 1536 nodes of HAWK or 4096 nodes of Shaheen II, and up to 2.64X by mixing FP64/FP32/FP16 operations on 128 nodes of Summit, relative to FP64-only operations. This translates into up to 4.5, 4.7, ... %B IEEE Transactions on Parallel and Distributed Systems %V 33 %P 964-976 %8 2022-04 %G eng %U https://ieeexplore.ieee.org/document/9442267/ %N 4 %! IEEE Trans. Parallel Distrib. Syst. %R 10.1109/TPDS.2021.3084071 %0 Journal Article %J International Journal of Networking and Computing %D 2022 %T Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms %A George Bosilca %A Bouteiller, Aurélien %A Herault, Thomas %A Le Fèvre, Valentin %A Robert, Yves %A Dongarra, Jack %X This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications.
We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. We then compare the implementation of these algorithms over a task-based runtime system, PaRSEC, and show the advantages and limitations of each approach in a real implementation. %B International Journal of Networking and Computing %V 12 %P 26-46 %8 2022-01 %G eng %U https://www.jstage.jst.go.jp/article/ijnc/12/1/12_26/_article %N 1 %! IJNC %R 10.15803/ijnc.12.1_26 %0 Conference Paper %B 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM) %D 2022 %T Composition of Algorithmic Building Blocks in Template Task Graphs %A Herault, Thomas %A Schuchart, Joseph %A Valeev, Edward F. %A George Bosilca %B 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM) %I IEEE %C Dallas, TX, USA %8 2023-01 %G eng %U https://ieeexplore.ieee.org/document/10024647/ %R 10.1109/PAW-ATM56565.2022.00008 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2022 %T Evaluating Data Redistribution in PaRSEC %A Qinglei Cao %A George Bosilca %A Losada, Nuria %A Wu, Wei %A Zhong, Dong %A Jack Dongarra %B IEEE Transactions on Parallel and Distributed Systems %V 33 %P 1856-1872 %8 2022-08 %G eng %R 10.1109/TPDS.2021.3131657 %0 Conference Paper %B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %D 2022 %T Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment %A Schuchart, Joseph %A Nookala, Poornima %A Javanmard, Mohammad Mahdi %A Herault, Thomas %A Valeev, Edward F. %A George Bosilca %A Harrison, Robert J. %X We present and evaluate TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control and data flowgraph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed memory execution over two different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to the state-of-the-art implementations.
%B 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) %I IEEE %C Lyon, France %8 2022-07 %G eng %U https://ieeexplore.ieee.org/abstract/document/9820613 %R 10.1109/IPDPS53621.2022.00086 %0 Conference Paper %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %D 2022 %T Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG %A Schuchart, Joseph %A Nookala, Poornima %A Herault, Thomas %A Valeev, Edward F. %A George Bosilca %K Dataflow graph %K Hardware %K Instruction sets %K Memory management %K PaRSEC %K parallel programming %K runtime %K scalability %K Task analysis %K task-based programming %K Template Task Graph %K TTG %X Shared memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well-suited to cope with the ubiquitous multi- and many-core systems since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. There is a general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss in efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we will discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aiming at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and which is able to compete with OpenMP in shared memory. %B 2022 IEEE International Conference on Cluster Computing (CLUSTER) %I IEEE %C Heidelberg, Germany %8 2022-09 %G eng %U https://ieeexplore.ieee.org/document/9912704/ %R 10.1109/CLUSTER51413.2022.00026 %0 Conference Proceedings %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %D 2022 %T Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications %A Cao, Qinglei %A Abdulah, Sameh %A Rabab Alomairy %A Pei, Yu %A Pratik Nag %A George Bosilca %A Dongarra, Jack %A Genton, Marc G. %A Keyes, David %A Ltaief, Hatem %A Sun, Ying %K climate/weather prediction %K dynamic runtime systems %K high performance computing %K low-rank matrix approximations %K mixed-precision computations %K space-time geospatial statistics %K Task-based programming models %X We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available.
We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation. %B 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22) %I IEEE Press %C Dallas, TX %8 2022-11 %@ 9784665454445 %G eng %U https://dl.acm.org/doi/abs/10.5555/3571885.3571888 %0 Journal Article %J Parallel Computing %D 2021 %T Callback-based completion notification using MPI Continuations %A Schuchart, Joseph %A Samfass, Philipp %A Niethammer, Christoph %A Gracia, José %A George Bosilca %K MPI %K MPI Continuations %K OmpSs %K OpenMP %K parsec %K TAMPI %K Task-based programming models %X Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and nonblocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space. %B Parallel Computing %V 21238566 %P 102793 %8 Jan-05-2021 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0167819121000466?via%3Dihub %N 0225 %! Parallel Computing %R 10.1016/j.parco.2021.102793 %0 Conference Paper %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %D 2021 %T Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure %A Thomas Herault %A Yves Robert %A George Bosilca %A Robert Harrison %A Cannada Lewis %A Edward Valeev %A Jack Dongarra %K block-sparse matrix multiplication %K distributed-memory %K Electronic structure %K multi-GPU node %K parsec %K tensor contraction %X Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators.
Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size. %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %I IEEE %C Portland, OR %8 2021-05 %G eng %U https://hal.inria.fr/hal-02970659/document %0 Generic %D 2021 %T DTE: PaRSEC Enabled Libraries and Applications %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2021 Exascale Computing Project Annual Meeting %8 2021-04 %G eng %0 Conference Paper %B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %D 2021 %T Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems %A Qinglei Cao %A Yu Pei %A Kadir Akbudak %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %K asynchronous executions and load balancing %K dynamic runtime system %K environmental applications %K High-performance computing %K low-rank matrix computations %K task-based programming model %K user productivity %X The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, and supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while minimizing the memory footprint up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021) %I IEEE %C Portland, OR %8 2021-05 %G eng %0 Conference Paper %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2020 %T Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime %A Yu Pei %A Qinglei Cao %A George Bosilca %A Piotr Luszczek %A Victor Eijkhout %A Jack Dongarra %X Stencil computation and the general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance-limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing the communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five-point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC, which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2, indicate that we can achieve 2× speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over the base-PaRSEC implementation on NaCL and Stampede2, respectively. %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C New Orleans, LA %8 2020-05 %G eng %R https://doi.org/10.1109/IPDPSW50202.2020.00127 %0 Generic %D 2020 %T DTE: PaRSEC Enabled Libraries and Applications (Poster) %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Generic %D 2020 %T DTE: PaRSEC Systems and Interfaces (Poster) %A George Bosilca %A Thomas Herault %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Conference Paper %B Platform for Advanced Scientific Computing Conference (PASC20) %D 2020 %T Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications %A Qinglei Cao %A Yu Pei %A Kadir Akbudak %A Aleksandr Mikhalev %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %X Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems by performing a Cholesky factorization on a symmetric positive-definite covariance matrix, a demanding dense factorization in terms of memory footprint and computation.
We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime, to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets of up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications. %B Platform for Advanced Scientific Computing Conference (PASC20) %I ACM %C Geneva, Switzerland %8 2020-06 %G eng %R https://doi.org/10.1145/3394277.3401846 %0 Conference Paper %B IEEE International Conference on Cluster Computing (Cluster 2020) %D 2020 %T Flexible Data Redistribution in a Task-Based Runtime System %A Qinglei Cao %A George Bosilca %A Wei Wu %A Dong Zhong %A Aurelien Bouteiller %A Jack Dongarra %X Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal to increase the efficiency and therefore decrease the time-to-solution for the algorithm. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a potential candidate to address the programming complexity on the way to exascale. In addition to an increase in portability against complex hardware and software systems, task-based runtime systems have the potential to be able to more easily cope with less-regular data distribution, providing a more balanced computational load during the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems, which could support all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm, capable of dealing with redistribution problems without constraints of data distribution and data size, and implement it in a task-based runtime system, PaRSEC. Performance results show great capability compared to ScaLAPACK, and applications highlight an increased efficiency with little overhead in terms of data distribution and data size. %B IEEE International Conference on Cluster Computing (Cluster 2020) %I IEEE %C Kobe, Japan %8 2020-09 %G eng %R https://doi.org/10.1109/CLUSTER49012.2020.00032 %0 Conference Paper %B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20) %D 2020 %T Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance %A Elliott Slaughter %A Wei Wu %A Yuankun Fu %A Legend Brandenburg %A Nicolai Garcia %A Wilhem Kautz %A Emily Marx %A Kaleb S. Morris %A Qinglei Cao %A George Bosilca %A Seema Mirchandaney %A Wonchan Lee %A Sean Treichler %A Patrick McCormick %A Alex Aiken %X We present Task Bench, a parameterized benchmark designed to explore the performance of distributed programming systems under a variety of application scenarios.
Task Bench dramatically lowers the barrier to benchmarking and comparing multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. To assess the effectiveness and overheads of the tested systems, we introduce a novel metric, minimum effective task granularity (METG). We conduct a comprehensive study with 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. Running at scale, 100μs-long tasks are the finest granularity that any system runs efficiently with current technologies. We also study each system's scalability, ability to hide communication and mitigate load imbalance. %B International Conference for High Performance Computing Networking, Storage, and Analysis (SC20) %I ACM %8 2020-11 %G eng %U https://dl.acm.org/doi/10.5555/3433701.3433783 %0 Conference Paper %B PAW-ATM Workshop at SC19 %D 2019 %T Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization %A Yu Pei %A George Bosilca %A Ichitaro Yamazaki %A Akihiro Ida %A Jack Dongarra %B PAW-ATM Workshop at SC19 %I ACM %C Denver, CO %8 2019-11 %G eng %0 Conference Paper %B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2019 %T Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC %A Thomas Herault %A Yves Robert %A George Bosilca %A Jack Dongarra %B ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I IEEE %C Denver, CO %8 2019-11 %G eng %0 Generic %D 2019 %T PAPI's new Software-Defined Events for in-depth Performance Analysis %A Anthony Danalis %A Heike Jagode %A Jack Dongarra %X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for the past two decades. This talk presents our effort to extend this role to encompass performance critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs. 
%I 13th Parallel Tools Workshop %C Dresden, Germany %8 2019-09 %G eng %0 Conference Paper %B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19 %D 2019 %T Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools %A Qinglei Cao %A Yu Pei %A Thomas Herault %A Kadir Akbudak %A Aleksandr Mikhalev %A George Bosilca %A Hatem Ltaief %A David Keyes %A Jack Dongarra %B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19 %I ACM %C Denver, CO %8 2019-11 %G eng %0 Conference Paper %B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2019 %T Software-Defined Events through PAPI %A Anthony Danalis %A Heike Jagode %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %X PAPI has been used for almost two decades as an abstraction and standardization layer for profiling hardware-specific performance metrics. However, application developers (and profiling software packages) are quite often interested in information beyond hardware counters, such as the behavior of libraries used by the software that is being profiled. So far, accessing this information has required interfacing directly with the libraries on a case-by-case basis, or low-level binary instrumentation. In this paper, we introduce the new Software-Defined Event (SDE) component of PAPI, which aims to enable PAPI to serve as an abstraction and standardization layer for events that originate in software layers as well. Extending PAPI to include SDEs enables monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through the same consistent PAPI interface. Furthermore, implementing SDE as a PAPI component means that the new API is aimed only at the library developers who wish to export events from within their libraries. The API for reading PAPI events, both hardware and software, remains the same, so all legacy codes and tools that use PAPI will not only continue to work, but they will automatically be able to read SDEs wherever those are available. The goal of this paper is threefold. First, we outline our design decisions regarding the functionality we offer through the new SDE interface, and offer simple examples of usage. Second, we illustrate how those events can be utilized by different software packages, specifically, by showcasing their use in the task-based runtime PaRSEC, and the HPCG supercomputing benchmark. Third, we provide a thorough performance analysis of the overhead that results from monitoring different types of SDEs, and showcase the negligible overhead of using PAPI SDE even in cases of extremely heavy use. %B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C Rio de Janeiro, Brazil %8 2019-05 %G eng %R https://doi.org/10.1109/IPDPSW.2019.00069 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWCHEM, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion.
In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks. In this paper, we present our effort of converting NWCHEM's CC code into a dataflow-based form that is capable of utilizing the task scheduling system PARSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PARSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWCHEM codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWCHEM); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWCHEM, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B The International Journal of High Performance Computing Applications %V 32 %P 540-551 %8 2018-07 %G eng %U http://journals.sagepub.com/doi/10.1177/1094342016672543 %N 4 %9 Journal Article %& 540 %R 10.1177/1094342016672543 %0 Generic %D 2018 %T Data Movement Interfaces to Support Dataflow Runtimes %A Aurelien Bouteiller %A George Bosilca %A Thomas Herault %A Jack Dongarra %X This document presents the design study and reports on the implementation of portable hosted accelerator device support in the PaRSEC Dataflow Tasking at Exascale runtime, undertaken as part of the ECP contract 17-SC-20-SC. The document discusses different technological approaches to transfer data to/from hosted accelerators, issues recommendations for technology providers, and presents the design of an OpenMP-based accelerator support in PaRSEC. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-05 %G eng %0 Generic %D 2018 %T Distributed Termination Detection for HPC Task-Based Environments %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Valentin Le Fèvre %A Yves Robert %A Jack Dongarra %X This paper revisits distributed termination detection algorithms in the context of high-performance computing applications in task systems. We first outline the need to efficiently detect termination in workflows for which the total number of tasks is data dependent and therefore not known statically but only revealed dynamically during execution. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). On the theoretical side, we analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages. On the practical side, we provide a highly tuned implementation of each termination detection algorithm within PaRSEC and compare their performance for a variety of benchmarks, extracted from scientific applications that exhibit dynamic behaviors.
%B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-06 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms %D 2018 %T Evaluation of Dataflow Programming Models for Electronic Structure Theory %A Heike Jagode %A Anthony Danalis %A Reazul Hoque %A Mathieu Faverge %A Jack Dongarra %K CCSD %K coupled cluster methods %K dataflow %K NWChem %K OpenMP %K parsec %K StarPU %K task-based runtime %X Dataflow programming models have been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. In this paper, we evaluate different dataflow programming models for electronic structure methods and compare them in terms of programmability, resource utilization, and scalability. In particular, we evaluate two programming paradigms for expressing scientific applications in a dataflow form: (1) explicit dataflow, where the dataflow is specified explicitly by the developer, and (2) implicit dataflow, where a task scheduling runtime derives the dataflow using per-task data-access information embedded in a serial program. We discuss our findings and present a thorough experimental analysis using methods from the NWChem quantum chemistry application as our case study, and OpenMP, StarPU, and PaRSEC as the task-based runtimes that enable the different forms of dataflow execution. Furthermore, we derive an abstract model to explore the limits of the different dataflow programming paradigms. %B Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms %V 2018 %P 1–20 %8 2018-05 %G eng %N e4490 %R https://doi.org/10.1002/cpe.4490 %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2018 %T A Failure Detector for HPC Platforms %A George Bosilca %A Aurelien Bouteiller %A Amina Guermouche %A Thomas Herault %A Yves Robert %A Pierre Sens %A Jack Dongarra %K failure detection %K Fault tolerance %K MPI %X Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by another single node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees a logarithmic fault propagation. Extensive simulations, together with experiments on the Titan Oak Ridge National Laboratory supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm. %B The International Journal of High Performance Computing Applications %V 32 %P 139–158 %8 2018-01 %G eng %N 1 %R https://doi.org/10.1177/1094342017711505 %0 Generic %D 2018 %T PAPI's New Software-Defined Events for In-Depth Performance Analysis %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %X One of the most recent developments of the Performance API (PAPI) is the addition of Software-Defined Events (SDE). 
PAPI has successfully served the role of the abstraction and unification layer for hardware performance counters for over a decade. This talk presents our effort to extend this role to encompass performance critical information that does not originate in hardware, but rather in critical software layers, such as libraries and runtime systems. Our overall objective is to enable monitoring of both types of performance events, hardware- and software-related events, in a uniform way, through one consistent PAPI interface. Performance analysts will be able to form a complete picture of the entire application performance without learning new instrumentation primitives. In this talk, we outline PAPI's new SDE API and showcase the usefulness of SDE through its employment in software layers as diverse as the math library MAGMA, the dataflow runtime PaRSEC, and the state-of-the-art chemistry application NWChem. We outline the process of instrumenting these software packages and highlight the performance information that can be acquired with SDEs. %I CCDSC 2018: Workshop on Clusters, Clouds, and Data for Scientific Computing %C Lyon, France %8 2018-09 %G eng %0 Generic %D 2018 %T Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System %A George Bosilca %A Damien Genet %A Robert Harrison %A Thomas Herault %A Mohammad Mahdi Javanmard %A Chong Peng %A Edward Valeev %X The need for predictive simulation of electronic structure in chemistry and materials science calls for fast/reduced-scaling formulations of quantum n-body methods that replace the traditional dense tensors with element-, block-, rank-, and block-rank-sparse (data-sparse) tensors. The resulting, highly irregular data structures are a poor match to the imperative, bulk-synchronous parallel programming style due to the dynamic nature of the problem and to the lack of clear domain decomposition to guarantee a fair load-balance. The TESSE runtime and the associated programming model aim to support performance-portable composition of applications involving irregular and dynamically changing data. In this paper we report an implementation of irregular dense tensor contraction in a paradigmatic electronic structure application based on the TESSE extension of PaRSEC, a distributed hybrid task runtime system, and analyze the resulting performance on a distributed memory cluster of multi-GPU nodes. Unprecedented strong scaling and promising efficiency indicate a viable future for task-based programming of complete production-quality reduced scaling models of electronic structure. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2018-12 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2017 %T Accelerating NWChem Coupled Cluster through Dataflow-Based Execution %A Heike Jagode %A Anthony Danalis %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelized in coarse chunks.
In this paper, we present our effort of converting NWChem's CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller): a software package designed to enable high-performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer-grained tasks (compared with the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B The International Journal of High Performance Computing Applications %P 1–13 %8 2017-01 %G eng %U http://journals.sagepub.com/doi/10.1177/1094342016672543 %R 10.1177/1094342016672543 %0 Conference Proceedings %B ScalA17 %D 2017 %T Dynamic Task Discovery in PaRSEC: A Data-Flow Task-Based Runtime %A Reazul Hoque %A Thomas Herault %A George Bosilca %A Jack Dongarra %K data-flow %K dynamic task-graph %K parsec %K task-based runtime %X Successfully exploiting distributed collections of heterogeneous many-core architectures with complex memory hierarchies through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intranode features (MPI+X). A different, more revolutionary, solution explores data-flow task-based runtime systems as a substitute to both local and distributed data dependencies management. The solution explored in this paper, PaRSEC, is based on such a programming paradigm, supported by a highly efficient task-based runtime. This paper compares two programming paradigms present in PaRSEC, Parameterized Task Graph (PTG) and Dynamic Task Discovery (DTD), in terms of capabilities, overhead, and potential benefits. %B ScalA17 %I ACM %C Denver %8 2017-09 %@ 978-1-4503-5125-6 %G eng %U https://dl.acm.org/citation.cfm?doid=3148226.3148233 %R 10.1145/3148226.3148233 %0 Journal Article %J Parallel Computing %D 2016 %T Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results %A Julien Herrmann %A George Bosilca %A Thomas Herault %A Loris Marchal %A Yves Robert %A Jack Dongarra %K Data partition %K linear algebra %K parsec %K QR factorization %K Redistribution %K Stencil %X The classical redistribution problem aims at optimally scheduling communications when reshuffling from an initial data distribution to a target data distribution. This target data distribution is usually chosen to optimize some objective for the algorithmic kernel under study (good computational balance or low communication volume or cost), and therefore to provide high efficiency for that kernel. However, the choice of a distribution minimizing the target objective is not unique. This leads to generalizing the redistribution problem as follows: find a re-mapping of data items onto processors such that the data redistribution cost is minimal, and the operation remains as efficient. This paper studies the complexity of this generalized problem.
We compute optimal solutions and evaluate, through simulations, their gain over classical redistribution. We also show the NP-hardness of the problem to find the optimal data partition and processor permutation (defined by new subsets) that minimize the cost of redistribution followed by a simple computational kernel. Finally, experimental validation of the new redistribution algorithms is conducted on a multicore cluster, for both a 1D-stencil kernel and a more compute-intensive dense linear algebra routine. %B Parallel Computing %V 52 %P 22-41 %8 2016-02 %G eng %R doi:10.1016/j.parco.2015.09.005 %0 Conference Paper %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %D 2015 %T Accelerating NWChem Coupled Cluster through dataflow-based Execution %A Heike Jagode %A Anthony Danalis %A George Bosilca %A Jack Dongarra %K CCSD %K dag %K dataflow %K NWChem %K parsec %K ptg %K tasks %X Numerical techniques used for describing many-body systems, such as the Coupled Cluster methods (CC) of the quantum chemistry package NWChem, are of extreme interest to the computational chemistry community in fields such as catalytic reactions, solar energy, and bio-mass conversion. In spite of their importance, many of these computationally intensive algorithms have traditionally been thought of in a fairly linear fashion, or are parallelised in coarse chunks. In this paper, we present our effort of converting NWChem's CC code into a dataflow-based form that is capable of utilizing the task scheduling system PaRSEC (Parallel Runtime Scheduling and Execution Controller) – a software package designed to enable high performance computing at scale. We discuss the modularity of our approach and explain how the PaRSEC-enabled dataflow version of the subroutines seamlessly integrates into the NWChem codebase. Furthermore, we argue how the CC algorithms can be easily decomposed into finer grained tasks (compared to the original version of NWChem); and how data distribution and load balancing are decoupled and can be tuned independently. We demonstrate performance acceleration by more than a factor of two in the execution of the entire CC component of NWChem, concluding that the utilization of dataflow-based execution for CC methods enables more efficient and scalable computation. %B 11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015) %I Springer International Publishing %C Krakow, Poland %8 2015-09 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A George Bosilca %A Thomas Herault %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework.
Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Conference Paper %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2015 %T Hierarchical DAG scheduling for Hybrid Distributed Systems %A Wei Wu %A Aurelien Bouteiller %A George Bosilca %A Mathieu Faverge %A Jack Dongarra %K dense linear algebra %K gpu %K heterogeneous architecture %K PaRSEC runtime %X Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle with mapping the resulting multi-dimensional heterogeneity and the expression of algorithm parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms have the capability to alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modification of the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime to build a framework where the task granularity is dynamically adjusted to adapt the degree of available parallelism and kernel efficiency according to runtime conditions. Based on an extensive set of results we show that, with one-sided factorizations, i.e. Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching a degree of performance and scalability never achieved before in distributed hybrid environments. %B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Hyderabad, India %8 2015-05 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2015 %T Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %K lu factorization %K Numerical algorithms %K QR factorization %K Stability %K Performance %X This paper introduces hybrid LU–QR algorithms for solving dense linear systems of the form Ax=b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are twice as cheap, in terms of floating-point operations, as QR steps. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. The choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria.
In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms. A comprehensive set of experiments shows that hybrid LU–QR algorithms provide a continuous range of trade-offs between stability and performances. %B Journal of Parallel and Distributed Computing %V 85 %P 32-46 %8 2015-11 %G eng %R doi:10.1016/j.jpdc.2015.06.007 %0 Conference Paper %B 2015 IEEE International Conference on Cluster Computing %D 2015 %T PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution %A Anthony Danalis %A Heike Jagode %A George Bosilca %A Jack Dongarra %K dag %K parsec %K ptg %K tasks %X Task-based execution has been growing in popularity as a means to deliver a good balance between performance and portability in the post-petascale era. The Parallel Runtime Scheduling and Execution Control (PARSEC) framework is a task-based runtime system that we designed to achieve high performance computing at scale. PARSEC offers a programming paradigm that is different than what has been traditionally used to develop large scale parallel scientific applications. In this paper, we discuss the use of PARSEC to convert a part of the Coupled Cluster (CC) component of the Quantum Chemistry package NWCHEM into a task-based form. We explain how we organized the computation of the CC methods in individual tasks with explicitly defined data dependencies between them and re-integrated the modified code into NWCHEM. We present a thorough performance evaluation and demonstrate that the modified code outperforms the original by more than a factor of two. We also compare the performance of different variants of the modified code and explain the different behaviors that lead to the differences in performance. %B 2015 IEEE International Conference on Cluster Computing %I IEEE %C Chicago, IL %8 2015-09 %G eng %0 Conference Paper %B 2nd Workshop on Visual Performance Analysis (VPA '15) %D 2015 %T Visualizing Execution Traces with Task Dependencies %A Blake Haugen %A Stephen Richmond %A Jakub Kurzak %A Chad A. Steed %A Jack Dongarra %X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule the task in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. 
It then describes an interactive extension to the current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler. %B 2nd Workshop on Visual Performance Analysis (VPA '15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Generic %D 2014 %T Design for a Soft Error Resilient Dynamic Task-based Runtime %A Chongxiao Cao %A Thomas Herault %A George Bosilca %A Jack Dongarra %X As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desirable attribute of any programming paradigm. In this paper, we explore three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG; the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution; and the last takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate our approach and quantify the overhead introduced by such mechanisms. %B ICL Technical Report %I University of Tennessee %8 2014-11 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Designing LU-QR Hybrid Solvers for Performance and Stability %A Mathieu Faverge %A Julien Herrmann %A Julien Langou %A Bradley Lowery %A Yves Robert %A Jack Dongarra %K plasma %X This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps, based upon some robustness criterion. LU elimination steps can be very efficiently parallelized and are twice as cheap as QR steps in terms of operations. However, LU steps are not necessarily stable, while QR steps are always stable. The hybrid algorithms execute a QR step when a robustness criterion detects some risk for instability, and they execute an LU step otherwise. Ideally, the choice between LU and QR steps must have a small computational overhead and must provide a satisfactory level of stability with as few QR steps as possible. In this paper, we introduce several robustness criteria and we establish upper bounds on the growth factor of the norm of the updated matrix incurred by each of these criteria. In addition, we describe the implementation of the hybrid algorithms through an extension of the PaRSEC software to allow for dynamic choices during execution. Finally, we analyze both stability and performance results compared to state-of-the-art linear solvers on parallel distributed multicore platforms.
%B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %@ 978-1-4799-3800-1 %G eng %R 10.1109/IPDPS.2014.108 %0 Journal Article %J Parallel Computing %D 2014 %T An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems %A Marc Baboulin %A Dulceneia Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K Distributed linear algebra solvers %K LDLT factorization %K PaRSEC runtime %K plasma %K Randomized algorithms %K Symmetric indefinite systems %X Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. Our solver delivers performance similar to that obtained for symmetric positive definite systems, while requiring only half the execution time and half the data storage of a general dense solver. %B Parallel Computing %V 40 %P 213-223 %8 2014-07 %G eng %N 7 %R 10.1016/j.parco.2013.12.003 %0 Conference Paper %B 2014 IEEE International Conference on Cluster Computing %D 2014 %T Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models %A Heike McCraw %A James Ralph %A Anthony Danalis %A Jack Dongarra %X For more than a decade, the PAPI performance-monitoring library has provided a clear, portable interface to the hardware performance counters available on all modern CPUs and other components of interest (e.g., GPUs, network, and I/O systems). Most major end-user tools that application developers use to analyze the performance of their applications rely on PAPI to gain access to these performance counters. Energy efficiency constraints have been widely identified as one of the critical roadblocks on the way to larger, more complex high-performance systems. With modern extreme-scale machines having hundreds of thousands of cores, the ability to reduce power consumption for each CPU at the software level becomes critically important, both for economic and environmental reasons. In order for PAPI to continue playing its well-established role in HPC, it is imperative to provide valuable performance data that not only originates from within the processing cores but also delivers insight into the power consumption of the system as a whole. An extensive effort has been made to extend the Performance API to support power monitoring capabilities for various platforms. This paper provides detailed information about three components that allow power monitoring on the Intel Xeon Phi and Blue Gene/Q.
Furthermore, we discuss the integration of PAPI in PARSEC – a task-based dataflow-driven execution engine – enabling hardware performance counter and power monitoring at true task granularity. %B 2014 IEEE International Conference on Cluster Computing %I IEEE %C Madrid, Spain %8 2014-09 %G eng %R 10.1109/CLUSTER.2014.6968672 %0 Conference Paper %B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %D 2014 %T PTG: An Abstraction for Unhindered Parallelism %A Anthony Danalis %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Jack Dongarra %K dte %K parsec %K plasma %X

Increased parallelism and the use of heterogeneous computing resources are now an established trend in High Performance Computing (HPC), one that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice has invariably been derived from Coarse Grain Parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever-increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers.

Data-flow-based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model contrast with the execution model of explicit message passing, as well as with that of alternative task-based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently onto challenging hardware, while maintaining portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that on today’s large-scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, an advantage that will grow with future architectural evolution.

%B International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC) %I IEEE Press %C New Orleans, LA %8 2014-11 %G eng %0 Conference Paper %B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %D 2014 %T Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes %A Xavier Lacoste %A Mathieu Faverge %A Pierre Ramet %A Samuel Thibault %A George Bosilca %K DAG based runtime %K gpu %K Multicore %K Sparse linear solver %X The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of the computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver based on a dynamic scheduler for modern hierarchical architectures. In this paper, we study the replacement of the highly specialized internal scheduler in PaStiX by two generic runtime frameworks: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them with the opportunity to optimize it in order to maximize the algorithm efficiency for a predefined execution environment. A comparative study of the performance of the PaStiX solver with the three schedulers (native PaStiX, StarPU, and PaRSEC) across different execution contexts is performed. The analysis highlights the performance similarities between the different execution supports. These results demonstrate that these generic DAG-based runtimes provide a uniform and portable programming interface across heterogeneous environments and are, therefore, a sustainable solution for hybrid environments. %B 23rd International Heterogeneity in Computing Workshop, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T Task-Based Programming for Seismic Imaging: Preliminary Results %A Lionel Boillot %A George Bosilca %A Emmanuel Agullo %A Henri Calandra %K plasma %X The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes—relying on the classical Message Passing Interface (MPI) standard and accumulating dozens of years of expertise (and countless lines of code)—may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache coherent Non Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator.
%B 2014 IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 2014-08 %G eng %0 Conference Paper %B 2014 IEEE International Conference on Cluster Computing %D 2014 %T Utilizing Dataflow-based Execution for Coupled Cluster Methods %A Heike McCraw %A Anthony Danalis %A George Bosilca %A Jack Dongarra %A Karol Kowalski %A Theresa Windus %X Computational chemistry comprises one of the driving forces of High Performance Computing. Many-body methods, such as the Coupled Cluster (CC) methods of the quantum chemistry package NWCHEM, are of particular interest for the applied chemistry community. Harnessing large fractions of the processing power of modern large-scale computing platforms has become increasingly difficult. With the increase in scale, complexity, and heterogeneity of modern platforms, traditional programming models fail to deliver the expected performance scalability. On our way to Exascale and with these extremely hybrid platforms, dataflow-based programming models may be the only viable way for achieving and maintaining computation at scale. In this paper, we discuss a dataflow-based programming model and its applicability to NWCHEM’s CC methods. Our dataflow version of the CC kernels breaks down the algorithm into fine-grained tasks with explicitly defined data dependencies. As a result, many of the traditional synchronization points can be eliminated, allowing for a dynamic reshaping of the execution based on the ongoing availability of computational resources. We build this experiment using PARSEC, a task-based dataflow-driven execution engine that enables efficient task scheduling on distributed systems, providing a desirable portability layer for application developers. %B 2014 IEEE International Conference on Cluster Computing %I IEEE %C Madrid, Spain %8 2014-09 %G eng %0 Journal Article %J Scalable Computing and Communications: Theory and Practice %D 2013 %T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %E Samee Khan %E Lin-Wang Wang %E Albert Zomaya %B Scalable Computing and Communications: Theory and Practice %I John Wiley & Sons %P 699-735 %8 2013-03 %G eng %0 Generic %D 2013 %T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC %A Guillaume Aupy %A Mathieu Faverge %A Yves Robert %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X This article introduces a new systolic algorithm for QR factorization and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph derived from the algorithm and only requires the user to decide, at a high level, the allocation of tasks to nodes.
We show that the new algorithm exhibits performance competitive with state-of-the-art QR routines on the Kraken supercomputer, demonstrating that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures. %B Lawn 277 %8 2013-05 %G eng %0 Journal Article %J IEEE Computing in Science and Engineering %D 2013 %T PaRSEC: Exploiting Heterogeneity to Enhance Scalability %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Thomas Herault %A Jack Dongarra %X New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution environment with sufficient intelligence and flexibility to rearrange the execution to improve resource utilization. %B IEEE Computing in Science and Engineering %V 15 %P 36-45 %8 2013-11 %G eng %N 6 %R 10.1109/MCSE.2013.98 %0 Journal Article %J Parallel Computing %D 2012 %T DAGuE: A Generic Distributed DAG Engine for High Performance Computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemarinier %A Jack Dongarra %K dague %K parsec %B Parallel Computing %I Elsevier %V 38 %P 27-51 %8 2012-00 %G eng %0 Generic %D 2012 %T An efficient distributed randomized solver with application to large dense linear systems %A Marc Baboulin %A Dulceneia Becker %A George Bosilca %A Anthony Danalis %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B ICL Technical Report %8 2012-07 %G eng %0 Conference Paper %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %D 2012 %T From Serial Loops to Parallel Execution on Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jack Dongarra %B International European Conference on Parallel and Distributed Computing (Euro-Par '12) %C Rhodes, Greece %8 2012-08 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T DAGuE: A Generic Distributed DAG Engine for High Performance Computing %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Pierre Lemarinier %A Jack Dongarra %K dague %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1151-1158 %8 2011-00 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemarinier %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Journal Article %J IEEE Cluster:
Workshop on Parallel Programming on Accelerator Clusters (PPAC) %D 2011 %T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemarinier %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %K magma %K parsec %B IEEE Cluster: Workshop on Parallel Programming on Accelerator Clusters (PPAC) %8 2011-06 %G eng %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemarinier %A Hatem Ltaief %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng