%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2022
%T A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization
%A Qinglei Cao
%A Rabab Alomairy
%A Yu Pei
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%G eng
%0 Conference Paper
%B 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)
%D 2021
%T Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems
%A Qinglei Cao
%A Yu Pei
%A Kadir Akbudak
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%K asynchronous executions and load balancing
%K dynamic runtime system
%K environmental applications
%K High-performance computing
%K low-rank matrix computations
%K task-based programming model
%K user productivity
%X The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems because of workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, where the main idea consists of exploiting data sparsity typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be taken at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide a dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorization by up to 7-fold on a large-scale distributed-memory system, while reducing the memory footprint by up to 44-fold. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling for servicing next-generation low-rank matrix algebra libraries.
%I IEEE
%C Portland, OR
%8 2021-05
%G eng
%0 Conference Paper
%B Platform for Advanced Scientific Computing Conference (PASC20)
%D 2020
%T Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications
%A Qinglei Cao
%A Yu Pei
%A Kadir Akbudak
%A Aleksandr Mikhalev
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%X Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems by performing a Cholesky factorization on a symmetric positive-definite covariance matrix, a demanding dense factorization in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational requirement by exploiting the data sparsity structure of the matrix off-diagonal tiles by means of low-rank approximations; and, at the programming-paradigm level, we integrate PaRSEC, a dynamic, task-based runtime to reach unparalleled levels of efficiency for solving extreme-scale linear algebra matrix operations. The resulting solution leverages fine-grained computations to facilitate asynchronous execution while providing a flexible data distribution to mitigate load imbalance. Performance results are reported using 3D synthetic datasets up to 42M geospatial locations on 130,000 cores, which represent a cornerstone toward fast and accurate predictions of environmental applications.
%I ACM
%C Geneva, Switzerland
%8 2020-06
%G eng
%R https://doi.org/10.1145/3394277.3401846
%0 Conference Paper
%B Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19
%D 2019
%T Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools
%A Qinglei Cao
%A Yu Pei
%A Thomas Herault
%A Kadir Akbudak
%A Aleksandr Mikhalev
%A George Bosilca
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2018
%T Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry
%A Mark Asch
%A Terry Moore
%A Rosa M. Badia
%A Micah Beck
%A Pete Beckman
%A Thierry Bidot
%A François Bodin
%A Franck Cappello
%A Alok Choudhary
%A Bronis R. de Supinski
%A Ewa Deelman
%A Jack Dongarra
%A Anshu Dubey
%A Geoffrey Fox
%A Haohuan Fu
%A Sergi Girona
%A Michael Heroux
%A Yutaka Ishikawa
%A Kate Keahey
%A David Keyes
%A William T. Kramer
%A Jean-François Lavignon
%A Yutong Lu
%A Satoshi Matsuoka
%A Bernd Mohr
%A Stéphane Requena
%A Joel Saltz
%A Thomas Schulthess
%A Rick Stevens
%A Martin Swany
%A Alexander Szalay
%A William Tang
%A Gaël Varoquaux
%A Jean-Pierre Vilotte
%A Robert W. Wisniewski
%A Zhiwei Xu
%A Igor Zacharov
%X Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of the high-performance computing (HPC) community. Based on those meetings, we argue that the rapid proliferation of digital data generators, the unprecedented growth in the volume and diversity of the data they generate, and the intense evolution of the methods for analyzing and using that data are radically reshaping the landscape of scientific computing. The most critical problems involve the logistics of wide-area, multistage workflows that will move back and forth across the computing continuum, between the multitude of distributed sensors, instruments and other devices at the network's edge, and the centralized resources of commercial clouds and HPC centers. We suggest that the prospects for the future integration of technological infrastructures and research ecosystems need to be considered at three different levels. First, we discuss the convergence of research applications and workflows that establish a research paradigm that combines both HPC and HDA, where ongoing progress is already motivating efforts at the other two levels. Second, we offer an account of some of the problems involved with creating a converged infrastructure for peripheral environments, that is, a shared infrastructure that can be deployed throughout the network in a scalable manner to meet the highly diverse requirements for processing, communication, and buffering/storage of massive data workflows of many different scientific domains. Third, we focus on some opportunities for software ecosystem convergence in big, logically centralized facilities that execute large-scale simulations and models and/or perform large-scale data analytics. We close by offering some conclusions and recommendations for future investment and policy review.
%B The International Journal of High Performance Computing Applications
%V 32
%P 435–479
%8 2018-07
%G eng
%N 4
%R https://doi.org/10.1177/1094342018778123
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
%A Ahmad Abdelfattah
%A Hatem Ltaief
%A David Keyes
%A Jack Dongarra
%X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications.
%B Concurrency and Computation: Practice and Experience
%V 28
%P 3447–3465
%8 2016-05
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full
%N 12
%! Concurrency Computat.: Pract. Exper.
%R https://doi.org/10.1002/cpe.3874
%0 Journal Article
%J VECPAR 2012
%D 2012
%T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
%A Ahmad Abdelfattah
%A Jack Dongarra
%A David Keyes
%A Hatem Ltaief
%B VECPAR 2012
%C Kobe, Japan
%8 2012-07
%G eng
%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2011
%T The International Exascale Software Project Roadmap
%A Jack Dongarra
%A Pete Beckman
%A Terry Moore
%A Patrick Aerts
%A Giovanni Aloisio
%A Jean-Claude Andre
%A David Barkai
%A Jean-Yves Berthou
%A Taisuke Boku
%A Bertrand Braunschweig
%A Franck Cappello
%A Barbara Chapman
%A Xuebin Chi
%A Alok Choudhary
%A Sudip Dosanjh
%A Thom Dunning
%A Sandro Fiore
%A Al Geist
%A Bill Gropp
%A Robert Harrison
%A Mark Hereld
%A Michael Heroux
%A Adolfy Hoisie
%A Koh Hotta
%A Zhong Jin
%A Yutaka Ishikawa
%A Fred Johnson
%A Sanjay Kale
%A Richard Kenway
%A David Keyes
%A Bill Kramer
%A Jesus Labarta
%A Alain Lichnewsky
%A Thomas Lippert
%A Bob Lucas
%A Barney MacCabe
%A Satoshi Matsuoka
%A Paul Messina
%A Peter Michielse
%A Bernd Mohr
%A Matthias S. Mueller
%A Wolfgang E. Nagel
%A Hiroshi Nakashima
%A Michael E. Papka
%A Dan Reed
%A Mitsuhisa Sato
%A Ed Seidel
%A John Shalf
%A David Skinner
%A Marc Snir
%A Thomas Sterling
%A Rick Stevens
%A Fred Streitz
%A Bob Sugar
%A Shinji Sumimoto
%A William Tang
%A John Taylor
%A Rajeev Thakur
%A Anne Trefethen
%A Mateo Valero
%A Aad van der Steen
%A Jeffrey Vetter
%A Peg Williams
%A Robert Wisniewski
%A Kathy Yelick
%X Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
%B The International Journal of High Performance Computing Applications
%V 25
%P 3-60
%8 2011-01
%G eng
%R https://doi.org/10.1177/1094342010391989
%0 Journal Article
%J International Journal of High Performance Computing Applications (submitted)
%D 2006
%T Application of Machine Learning to the Selection of Sparse Linear Solvers
%A Sanjukta Bhowmick
%A Victor Eijkhout
%A Yoav Freund
%A Erika Fuentes
%A David Keyes
%K salsa
%K sans
%B International Journal of High Performance Computing Applications (submitted)
%8 2006
%G eng