%0 Generic %D 2024 %T CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT) %A Maksim Melnichenko %A Oleg Balabanov %A Riley Murray %A James Demmel %A Michael W. Mahoney %A Piotr Luszczek %X This paper develops and analyzes a new algorithm for QR decomposition with column pivoting (QRCP) of rectangular matrices with large row counts. The algorithm combines methods from randomized numerical linear algebra in a particularly careful way in order to accelerate both pivot decisions for the input matrix and the process of decomposing the pivoted matrix into the QR form. The source of the latter acceleration is a use of randomized preconditioning and CholeskyQR. Comprehensive analysis is provided in both exact and finite-precision arithmetic to characterize the algorithm's rank-revealing properties and its numerical stability granted probabilistic assumptions of the sketching operator. An implementation of the proposed algorithm is described and made available inside the open-source RandLAPACK library, which itself relies on RandBLAS - also available in open-source format. Experiments with this implementation on an Intel Xeon Gold 6248R CPU demonstrate order-of-magnitude speedups relative to LAPACK's standard function for QRCP, and comparable performance to a specialized algorithm for unpivoted QR of tall matrices, which lacks the strong rank-revealing properties of the proposed method. %I arXiv %8 2024-02 %G eng %U https://arxiv.org/abs/2311.08316 %0 Generic %D 2023 %T Generalizing Random Butterfly Transforms to Arbitrary Matrix Sizes %A Neil Lindquist %A Piotr Luszczek %A Jack Dongarra %X Parker and Lê introduced random butterfly transforms (RBTs) as a preprocessing technique to replace pivoting in dense LU factorization. Unfortunately, their FFT-like recursive structure restricts the dimensions of the matrix. Furthermore, on multi-node systems, efficient management of the communication overheads restricts the matrix's distribution even more. To remove these limitations, we have generalized the RBT to arbitrary matrix sizes by truncating the dimensions of each layer in the transform. We expanded Parker's theoretical analysis to generalized RBT, specifically that in exact arithmetic, Gaussian elimination with no pivoting will succeed with probability 1 after transforming a matrix with full-depth RBTs. Furthermore, we experimentally show that these generalized transforms improve performance over Parker's formulation by up to 62\% while retaining the ability to replace pivoting. This generalized RBT is available in the SLATE numerical software library. %I arXiv %8 2023-12 %G eng %U https://arxiv.org/abs/2312.09376 %0 Conference Paper %B 37th ACM International Conference on Supercomputing (ICS'23) %D 2023 %T Using Additive Modifications in LU Factorization Instead of Pivoting %A Neil Lindquist %A Piotr Luszczek %A Jack Dongarra %B 37th ACM International Conference on Supercomputing (ICS'23) %I ACM %C Orlando, FL %8 2023-06 %G eng %R 10.1145/3577193.3593731 %0 Generic %D 2022 %T Analysis of the Communication and Computation Cost of FFT Libraries towards Exascale %A Alan Ayala %A Stanimire Tomov %A Piotr Luszczek %A Sebastien Cayrols %A Gerald Ragghianti %A Jack Dongarra %B ICL Technical Report %I Innovative Computing Laboratory %8 2022-07 %G eng %0 Generic %D 2022 %T FFT Benchmark Performance Experiments on Systems Targeting Exascale %A Alan Ayala %A Stanimire Tomov %A Piotr Luszczek %A Sebastien Cayrols %A Gerald Ragghianti %A Jack Dongarra %B ICL Technical Report %8 2022-03 %G eng %0 Generic %D 2022 %T PAQR: Pivoting Avoiding QR factorization %A Wissam M. Sid-Lakhdar %A Sebastien Cayrols %A Daniel Bielich %A Ahmad Abdelfattah %A Piotr Luszczek %A Mark Gates %A Stanimire Tomov %A Hans Johansen %A David Williams-Young %A Timothy A. Davis %A Jack Dongarra %X The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient. On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR. The aim of this paper is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems. After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems. %B ICL Technical Report %8 2022-06 %G eng %0 Generic %D 2022 %T Randomized Numerical Linear Algebra: A Perspective on the Field with an Eye to Software %A Riley Murray %A James Demmel %A Michael W. Mahoney %A N. Benjamin Erichson %A Maksim Melnichenko %A Osman Asif Malik %A Laura Grigori %A Piotr Luszczek %A Michał Dereziński %A Miles E. Lopes %A Tianyu Liang %A Hengrui Luo %A Jack Dongarra %K Randomized algorithms %X Randomized numerical linear algebra – RandNLA, for short – concerns the use of randomization as a resource to develop improved algorithms for large-scale linear algebra computations. The origins of contemporary RandNLA lay in theoretical computer science, where it blossomed from a simple idea: randomization provides an avenue for computing approximate solutions to linear algebra problems more efficiently than deterministic algorithms. This idea proved fruitful in and was largely driven by the development of scalable algorithms for machine learning and statistical data analysis applications. However, the true potential of RandNLA only came into focus once it began to integrate with the fields of numerical analysis and “classical” numerical linear algebra. Through the efforts of many individuals, randomized algorithms have been developed that provide full control over the accuracy of their solutions and that are every bit as reliable as algorithms that might be found in libraries such as LAPACK. The spectrum of possibilities offered by RandNLA has created a virtuous cycle of contributions by numerical analysts, statisticians, theoretical computer scientists, and the machine learning community. Recent years have even seen the incorporation of certain RandNLA methods into MATLAB, the NAG Library, and NVIDIA’s cuSOLVER. In view of these developments, we believe the time is ripe to accelerate the adoption of RandNLA in the scientific community. In particular, we believe the community stands to benefit significantly from a suitably defined “RandBLAS” and “RandLAPACK,” to serve as standard libraries for RandNLA, in much the same way that BLAS and LAPACK serve as standards for deterministic linear algebra. This monograph surveys the field of RandNLA as a step toward building mean- ingful RandBLAS and RandLAPACK libraries. Section 1 begins by setting scope and design principles for RandLAPACK and summarizing subsequent sections of the monograph. Section 2 focuses on RandBLAS, which is to be responsible for sketching. Details of functionality suitable for RandLAPACK are covered in the five sections that follow. Specifically, Sections 3 to 5 cover least squares and optimization, low- rank approximation, and other select problems that are well-understood in how they benefit from randomized algorithms. The remaining sections – on statistical leverage scores (Section 6) and tensor computations (Section 7) – read more like traditional surveys. The different flavor of these latter sections reflects how, in our assessment, the literature on these topics is still maturing. We provide a substantial amount of pseudo-code and supplementary material over the course of five appendices. Much of the pseudo-code has been tested via publicly available Matlab and Python implementations. %B University of California, Berkeley EECS Technical Report %I University of California, Berkeley %8 2022-11 %G eng %U https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-258.html %R 10.48550/arXiv.2302.1147 %0 Conference Paper %B 2022 IEEE High Performance Extreme Computing Conference (HPEC) %D 2022 %T Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance %A Piotr Luszczek %A Cade Brown %K Analytical models %K Benchmark testing %K Cloud computing %K Computational modeling %K Data models %K Measurement %K Satellites %X We present benchmarking platform for surrogate ML/AI models that enables the essential properties for open science and allow them to be findable, accessible, interoperable, and reusable. We also present a use case of cloud cover modeling, analysis, and experimental testing based on a large dataset of multi-spectral satellite sensor data. We use this particular evaluation to highlight the plethora of choices that need resolution for the life cycle of supporting the scientific workflows with data-driven models that need to be first trained to satisfactory accuracy and later monitored during field usage for proper feedback into both computational results and future data model improvements. Unlike traditional testing, performance, or analysis efforts, we focus exclusively on science-oriented metrics as the relevant figures of merit. %B 2022 IEEE High Performance Extreme Computing Conference (HPEC) %I IEEE %8 2022-09 %G eng %U https://ieeexplore.ieee.org/document/9926401/ %R 10.1109/HPEC55821.2022.9926401 %0 Conference Paper %B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems %D 2022 %T Threshold Pivoting for Dense LU Factorization %A Neil Lindquist %A Mark Gates %A Piotr Luszczek %A Jack Dongarra %X LU factorization is a key approach for solving large, dense systems of linear equations. Partial row pivoting is commonly used to ensure numerical stability; however, the data movement needed for the row interchanges can reduce performance. To improve this, we propose using threshold pivoting to find pivots almost as good as those selected by partial pivoting but that result in less data movement. Our theoretical analysis bounds the element growth similarly to partial pivoting; however, it also shows that the growth of threshold pivoting for a given matrix cannot be bounded by that of partial pivoting and vice versa. Additionally, we experimentally tested the approach on the Summit supercomputer. Threshold pivoting improved performance by up to 32% without a significant effect on accuracy. For a more aggressive configuration with up to one digit of accuracy lost, the improvement was as high as 44%. %B ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems %I IEEE %C Dallas, Texas %8 2022-11 %G eng %R 10.1109/ScalAH56622.2022.00010 %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2021 %T Accelerating Restarted GMRES with Mixed Precision Arithmetic %A Neil Lindquist %A Piotr Luszczek %A Jack Dongarra %K Convergence %K Error correction %K iterative methods %K Kernel %K linear systems %K Stability analysis %X The generalized minimum residual method (GMRES) is a commonly used iterative Krylov solver for sparse, non-symmetric systems of linear equations. Like other iterative solvers, data movement dominates its run time. To improve this performance, we propose running GMRES in reduced precision with key operations remaining in full precision. Additionally, we provide theoretical results linking the convergence of finite precision GMRES with classical Gram-Schmidt with reorthogonalization (CGSR) and its infinite precision counterpart which helps justify the convergence of this method to double-precision accuracy. We tested the mixed-precision approach with a variety of matrices and preconditioners on a GPU-accelerated node. Excluding the incomplete LU factorization without fill in (ILU(0)) preconditioner, we achieved average speedups ranging from 8 to 61 percent relative to comparable double-precision implementations, with the simpler preconditioners achieving the higher speedups. %B IEEE Transactions on Parallel and Distributed Systems %8 2021-06 %G eng %R 10.1109/TPDS.2021.3090757 %0 Generic %D 2021 %T Interim Report on Benchmarking FFT Libraries on High Performance Systems %A Alan Ayala %A Stanimire Tomov %A Piotr Luszczek %A Cayrols, Sebastien %A Ragghianti, Gerald %A Jack Dongarra %X The Fast Fourier Transform (FFT) is used in many applications such as molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and many wireless multimedia applications. FFTs are also heavily used in ECP applications, such as EXAALT, Copa, ExaSky-HACC, ExaWind, WarpX, and many others. As these applications’ accuracy and speed depend on the performance of the FFTs, we designed an FFT benchmark to mea- sure performance and scalability of currently available FFT packages and present the results from a pre-Exascale platform. Our benchmarking also stresses the overall capacity of system interconnect; thus, it may be considered as an indicator of the bisection bandwidth, communication contention noise, and the software overheads in MPI collectives that are of interest to many other ECP applications and libraries. This FFT benchmarking project aims to show the strengths and weaknesses of multiple FFT libraries and to indicate what can be done to improve their performance. In particular, we believe that the benchmarking results could help design and implement a fast and robust FFT library for 2D and 3D inputs, while targeting large-scale heterogeneous systems with multicore processors and hardware accelerators that are a co-designed in tandem with ECP applications. Our work involves studying and analyzing state-of-the-art FFT software both from vendors and available as open-source codes to better understand their performance. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2021-07 %G eng %9 ICL Tech Report %0 Book Section %B Rare Earth Elements and Actinides: Progress in Computational Science Applications %D 2021 %T An Introduction to High Performance Computing and Its Intersection with Advances in Modeling Rare Earth Elements and Actinides %A Deborah A. Penchoff %A Edward Valeev %A Heike Jagode %A Piotr Luszczek %A Anthony Danalis %A George Bosilca %A Robert J. Harrison %A Jack Dongarra %A Theresa L. Windus %K actinide %K Computational modeling %K HPC %K REE %X Computationally driven solutions in nuclear and radiochemistry heavily depend on efficient modeling of Rare Earth Elements (REEs) and actinides. Accurate modeling of REEs and actinides faces challenges stemming from limitations from an imbalanced hardware-software ecosystem and its implications on inefficient use of High Performance Computing (HPC). This chapter provides a historical perspective on the evolution of HPC hardware, its intersectionality with domain sciences, the importance of benchmarks for performance, and an overview of challenges and advances in modeling REEs and actinides. This chapter intends to provide an introduction for researchers at the intersection of scientific computing, software development for HPC, and applied computational modeling of REEs and actinides. The chapter is structured in five sections. First, the Introduction includes subsections focusing on the Importance of REEs and Actinides (1.1), Hardware, Software, and the HPC Ecosystem (1.2), and Electronic Structure Modeling of REEs and Actinides (1.3). Second, a section in High Performance Computing focuses on the TOP500 (2.1), HPC Performance (2.2), HPC Benchmarks: Processing, Bandwidth, and Latency (2.3), and HPC Benchmarks and their Relationship to Chemical Modeling (2.4). Third, the Software Challenges and Advances focus on NWChem/NWChemEx (3.1), MADNESS (3.2), and MPQC (3.3). The fourth section provides a short overview of Artificial Intelligence in HPC applications relevant to nuclear and radiochemistry. The fifth section illustrates A Protocol to Evaluate Complexation Preferences in Separations of REEs and Actinides through Computational Modeling. %B Rare Earth Elements and Actinides: Progress in Computational Science Applications %I American Chemical Society %C Washington, DC %V 1388 %P 3-53 %8 2021-10 %@ ISBN13: 9780841298255 eISBN: 9780841298248 %G eng %U https://pubs.acs.org/doi/10.1021/bk-2021-1388.ch001 %& 1 %R 10.1021/bk-2021-1388.ch001 %0 Book %D 2021 %T Lecture Notes in Computer Science: High Performance Computing %A Heike Jagode %A Anzt, Hartwig %A Ltaief, Hatem %A Piotr Luszczek %X This book constitutes the refereed post-conference proceedings of 9 workshops held at the 35th International ISC High Performance 2021 Conference, in Frankfurt, Germany, in June-July 2021: Second International Workshop on the Application of Machine Learning Techniques to Computational Fluid Dynamics and Solid Mechanics Simulations and Analysis; HPC-IODC: HPC I/O in the Data Center Workshop; Compiler-assisted Correctness Checking and Performance Optimization for HPC; Machine Learning on HPC Systems; 4th International Workshop on Interoperability of Supercomputing and Cloud Technologies; 2nd International Workshop on Monitoring and Operational Data Analytics; 16th Workshop on Virtualization in High­-Performance Cloud Computing; Deep Learning on Supercomputers; 5th International Workshop on In Situ Visualization. The 35 papers included in this volume were carefully reviewed and selected. They cover all aspects of research, development, and application of large-scale, high performance experimental and commercial systems. Topics include high-performance computing (HPC), computer architecture and hardware, programming models, system software, performance analysis and modeling, compiler analysis and optimization techniques, software sustainability, scientific applications, deep learning. %I Springer International Publishing %V 12761 %@ 978-3-030-90538-5 %G eng %R 10.1007/978-3-030-90539-2 %0 Journal Article %J Computer Physics Communications %D 2021 %T Materials fingerprinting classification %A Spannaus, Adam %A Law, Kody J.H. %A Piotr Luszczek %A Nasrin, Farzana %A Micucci, Cassie Putman %A Liaw, Peter K. %A Santodonato, Louis J. %A Keffer, David J. %A Maroulas, Vasileios %K Atom probe tomography %K High entropy alloy %K Machine Learning %K Materials discovery %K Topological data analysis %X Significant progress in many classes of materials could be made with the availability of experimentally-derived large datasets composed of atomic identities and three-dimensional coordinates. Methods for visualizing the local atomic structure, such as atom probe tomography (APT), which routinely generate datasets comprised of millions of atoms, are an important step in realizing this goal. However, state-of-the-art APT instruments generate noisy and sparse datasets that provide information about elemental type, but obscure atomic structures, thus limiting their subsequent value for materials discovery. The application of a materials fingerprinting process, a machine learning algorithm coupled with topological data analysis, provides an avenue by which here-to-fore unprecedented structural information can be extracted from an APT dataset. As a proof of concept, the material fingerprint is applied to high-entropy alloy APT datasets containing body-centered cubic (BCC) and face-centered cubic (FCC) crystal structures. A local atomic configuration centered on an arbitrary atom is assigned a topological descriptor, with which it can be characterized as a BCC or FCC lattice with near perfect accuracy, despite the inherent noise in the dataset. This successful identification of a fingerprint is a crucial first step in the development of algorithms which can extract more nuanced information, such as chemical ordering, from existing datasets of complex materials. %B Computer Physics Communications %P 108019 %8 Jan-05-2021 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0010465521001314 %! Computer Physics Communications %R 10.1016/j.cpc.2021.108019 %0 Generic %D 2021 %T Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices %A Yaohung M. Tsai %A Piotr Luszczek %A Jack Dongarra %K eigenvalue solver %K hardware accelerators %K mixed-precision algorithms %X As the new hardware is being equipped with powerful low-precision capabilities driven primarily by the needs of the burgeoning field of Artificial Intelligence (AI), mixed-precision algorithms are now showing far greater potential and renewed interest in scientific computing community. The multi-precision methods commonly follow approximate-iterate scheme by first obtaining the approximate solution from a low-precision factorization and solve. Then, they iteratively refine the solution to the desired accuracy that is often as high as what is possible with traditional approaches. While targeting symmetric and Hermitian eigenvalue problems of the form Ax=λx, we revisit the SICE algorithm proposed by Dongarra et al. By applying the Sherman-Morrison formula on the diagonally-shifted tridiagonal systems, we propose an updated SICE-SM algorithm. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA software libraries for numerical linear algebra, we achieved up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement when compared with full double complex precision solvers for the cases with a portion of eigenvalues and eigenvectors requested. %B ICL Technical Report %8 2021-08 %G eng %0 Generic %D 2021 %T P1673R3: A Free Function Linear algebra Interface Based on the BLAS %A Mark Hoemmen %A Daisy Hollman %A Christian Trott %A Daniel Sunderland %A Nevin Liber %A Li-Ta Lo %A Damien Lebrun-Grandie %A Graham Lopez %A Peter Caday %A Sarah Knepper %A Piotr Luszczek %A Timothy Costa %K C++ %K linear algebra %X We believe this proposal is complementary to P1385, a proposal for a C++ Standard linear algebra library that introduces matrix and vector classes and overloaded arithmetic operators. In fact, we think that our proposal would make a natural foundation for a library like what P1385 proposes. However, a free function interface -- which clearly separates algorithms from data structures -- more naturally allows for a richer set of operations such as what the BLAS provides. A natural extension of the present proposal would include accepting P1385's matrix and vector objects as input for the algorithms proposed here. A straightforward way to do that would be for P1385's matrix and vector objects to make views of their data available as basic_mdspan. %B ISO JTC1 SC22 WG22 %I ISO %8 2021-04 %G eng %U http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1673r3.pdf %9 standard %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2021 %T A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines %A Abdelfattah, Ahmad %A Costa, Timothy %A Jack Dongarra %A Mark Gates %A Haidar, Azzam %A Hammarling, Sven %A Higham, Nicholas J %A Kurzak, Jakub %A Piotr Luszczek %A Stanimire Tomov %A others %K Computations on matrices %K Mathematical analysis %K Mathematics of computing %K Numerical analysis %X This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning. %B ACM Transactions on Mathematical Software (TOMS) %V 47 %P 1–23 %G eng %R 10.1145/3431921 %0 Conference Proceedings %B Proceedings of the ACM International Conference on Supercomputing %D 2021 %T Task-graph scheduling extensions for efficient synchronization and communication %A Bak, Seonmyeong %A Hernandez, Oscar %A Mark Gates %A Piotr Luszczek %A Sarkar, Vivek %K Compilers %K Computing methodologies %K Parallel computing methodologies %K Parallel programming languages %K Runtime environments %K Software and its engineering %K Software notations and tools %X Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs. %B Proceedings of the ACM International Conference on Supercomputing %P 88–101 %G eng %R 10.1145/3447818.3461616 %0 Journal Article %J Journal of Computational Science %D 2021 %T Translational process: Mathematical software perspective %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %K communication avoiding algorithms %K DATAFLOW scheduling runtimes %K hardware accelerators %X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack. %B Journal of Computational Science %V 52 %P 101216 %G eng %R 10.1016/j.jocs.2020.101216 %0 Generic %D 2020 %T Clover: Computational Libraries Optimized via Exascale Research %A Mark Gates %A Stanimire Tomov %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Conference Paper %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2020 %T Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime %A Yu Pei %A Qinglei Cao %A George Bosilca %A Piotr Luszczek %A Victor Eijkhout %A Jack Dongarra %X Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2× speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively. %B 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C New Orleans, LA %8 2020-05 %G eng %R https://doi.org/10.1109/IPDPSW50202.2020.00127 %0 Conference Paper %B Computer Modeling and Intelligent Systems CMIS-2020 %D 2020 %T Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK %A Dmitry Zaitsev %A Piotr Luszczek %K docker containers %K software containers %X Platform as a Service (PaaS) cloud computing model becomes wide- spread implemented within Docker Containers. Docker uses operating system level virtualization to deliver software in packages called containers. Containers are isolated from one another and comprise all the required software, including operating system API, libraries and configuration files. With such advantageous integrity one can doubt on Docker performance. The present paper applies packet LAPACK, which is widely used for performance benchmarks of super- computers, to collect and compare benchmarks of Docker on Linux Ubuntu and MS Windows platforms. After a brief overview of Docker and LAPACK, a se- ries of Docker images containing LAPACK is created and run, abundant benchmarks obtained and represented in tabular and graphical form. From the final discussion, we conclude that Docker runs with nearly the same perfor- mance on both Linux and Windows platforms, the slowdown does not exceed some ten percent. Though Docker performance in Windows is essentially lim- ited by the amount of RAM allocated to Docker Engine. %B Computer Modeling and Intelligent Systems CMIS-2020 %C Zaporizhzhoa %8 2020-03 %G eng %0 Conference Paper %B Smoky Mountains Computational Sciences & Engineering Conference (SMC2020) %D 2020 %T Improving the Performance of the GMRES Method using Mixed-Precision Techniques %A Neil Lindquist %A Piotr Luszczek %A Jack Dongarra %K Kokkos %K Krylov subspace methods %K linear algebra %K mixed precision %X The GMRES method is used to solve sparse, non-symmetric systems of linear equations arising from many scientific applications. The solver performance within a single node is memory bound, due to the low arithmetic intensity of its computational kernels. To reduce the amount of data movement, and thus, to improve performance, we investigated the effect of using a mix of single and double precision while retaining double-precision accuracy. Previous efforts have explored reduced precision in the preconditioner, but the use of reduced precision in the solver itself has received limited attention. We found that GMRES only needs double precision in computing the residual and updating the approximate solution to achieve double-precision accuracy, although it must restart after each improvement of single-precision accuracy. This finding holds for the tested orthogonalization schemes: Modified Gram-Schmidt (MGS) and Classical Gram-Schmidt with Re-orthogonalization (CGSR). Furthermore, our mixed-precision GMRES, when restarted at least once, performed 19% and 24% faster on average than double-precision GMRES for MGS and CGSR, respectively. Our implementation uses generic programming techniques to ease the burden of coding implementations for different data types. Our use of the Kokkos library allowed us to exploit parallelism and optimize data management. Additionally, KokkosKernels was used when producing performance results. In conclusion, using a mix of single and double precision in GMRES can improve performance while retaining double-precision accuracy. %B Smoky Mountains Computational Sciences & Engineering Conference (SMC2020) %8 2020-08 %G eng %0 Book Section %B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC) %D 2020 %T Interoperable Convergence of Storage, Networking, and Computation %A Micah Beck %A Terry Moore %A Piotr Luszczek %A Anthony Danalis %E Kohei Arai %E Rahul Bhatia %K active networks %K distributed cloud %K distributed processing %K distributed storage %K edge computing %K network convergence %K network layering %K scalability %X In every form of digital store-and-forward communication, intermediate forwarding nodes are computers, with attendant memory and processing resources. This has inevitably stimulated efforts to create a wide-area infrastructure that goes beyond simple store-and-forward to create a platform that makes more general and varied use of the potential of this collection of increasingly powerful nodes. Historically, these efforts predate the advent of globally routed packet networking. The desire for a converged infrastructure of this kind has only intensified over the last 30 years, as memory, storage, and processing resources have increased in both density and speed while simultaneously decreasing in cost. Although there is a general consensus that it should be possible to define and deploy such a dramatically more capable wide-area platform, a great deal of investment in research prototypes has yet to produce a credible candidate architecture. Drawing on technical analysis, historical examples, and case studies, we present an argument for the hypothesis that in order to realize a distributed system with the kind of convergent generality and deployment scalability that might qualify as "future-defining," we must build it from a small set of simple, generic, and limited abstractions of the low level resources (processing, storage and network) of its intermediate nodes. %B Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC) %I Springer International Publishing %P 667-690 %@ 978-3-030-12385-7 %G eng %0 Generic %D 2020 %T The PLASMA Library on CORAL Systems and Beyond (Poster) %A Piotr Luszczek %A Jack Dongarra %I 2020 Exascale Computing Project Annual Meeting %C Houston, TX %8 2020-02 %G eng %0 Generic %D 2020 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) %A James Demmel %A Jack Dongarra %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Michael Mahoney %X The convergence of several unprecedented changes, including formidable new system design constraints and revolutionary levels of heterogeneity, has made it clear that much of the essential software infrastructure of computational science and engineering is, or will soon be, obsolete. Math libraries have historically been in the vanguard of software that must be adapted first to such changes, both because these low-level workhorses are so critical to the accuracy and performance of so many different types of applications, and because they have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Under the Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC) project, the principal designers of the Linear Algebra PACKage (LAPACK) and the Scalable Linear Algebra PACKage (ScaLAPACK), the combination of which is abbreviated Sca/LAPACK, aim to enhance and update these libraries for the ongoing revolution in processor architecture, system design, and application requirements by incorporating them into a layered package of software components—the BALLISTIC ecosystem—that provides users seamless access to state-of-the-art solver implementations through familiar and improved Sca/LAPACK interfaces. %B LAPACK Working Notes %I University of Tennessee %8 2020/07 %G eng %0 Conference Paper %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %D 2020 %T Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques %A Neil Lindquist %A Piotr Luszczek %A Jack Dongarra %K linear systems %K Randomized algorithms %X Gaussian elimination is a key technique for solving dense, non-symmetric systems of linear equations. Pivoting is used to ensure numerical stability but can introduce significant overheads. We propose replacing pivoting with recursive butterfly transforms (RBTs) and iterative refinement. RBTs use an FFT-like structure and randomized elements to provide an efficient, two-sided preconditioner for factoring. This approach was implemented and tested using Software for Linear Algebra Targeting Exascale (SLATE). In numerical experiments, our implementation was more robust than Gaussian elimination with no pivoting (GENP) but failed to solve all the problems solvable with Gaussian elimination with partial pivoting (GEPP). Furthermore, the proposed solver was able to outperform GEPP when distributed on GPU-accelerated nodes. %B 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) %I IEEE %C Atlanta, GA %8 2020-11 %G eng %0 Conference Paper %B 2020 IEEE High Performance Extreme Computing Conference (HPEC) %D 2020 %T Scalable Data Generation for Evaluating Mixed-Precision Solvers %A Piotr Luszczek %A Yaohung Tsai %A Neil Lindquist %A Hartwig Anzt %A Jack Dongarra %X We present techniques of generating data for mixed precision solvers that allows to test those solvers in a scalable manner. Our techniques focus on mixed precision hardware and software where both the solver and the hardware can take advantage of mixing multiple floating precision formats. This allows taking advantage of recently released generation of hardware platforms that focus on ML and DNN workloads but can also be utilized for HPC applications if a new breed of algorithms is combined with the custom floating-point formats to deliver performance levels beyond the standard IEEE data types while delivering a comparable accuracy of the results. %B 2020 IEEE High Performance Extreme Computing Conference (HPEC) %I IEEE %C Waltham, MA, USA %8 2020-09 %G eng %R https://doi.org/10.1109/HPEC43674.2020.9286145 %0 Journal Article %J ACM Transactions on Mathematical Software %D 2020 %T A Set of Batched Basic Linear Algebra Subprograms %A Ahmad Abdelfattah %A Timothy Costa %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Sven Hammarling %A Nicholas J. Higham %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Mawussi Zounon %X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning. %B ACM Transactions on Mathematical Software %8 2020-10 %G eng %0 Generic %D 2020 %T A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic %A Ahmad Abdelfattah %A Hartwig Anzt %A Erik Boman %A Erin Carson %A Terry Cojean %A Jack Dongarra %A Mark Gates %A Thomas Gruetzmacher %A Nicholas J. Higham %A Sherry Li %A Neil Lindquist %A Yang Liu %A Jennifer Loe %A Piotr Luszczek %A Pratik Nayak %A Sri Pranesh %A Siva Rajamanickam %A Tobias Ribizel %A Barry Smith %A Kasia Swirydowicz %A Stephen Thomas %A Stanimire Tomov %A Yaohung Tsai %A Ichitaro Yamazaki %A Urike Meier Yang %B SLATE Working Notes %I University of Tennessee %8 2020-07 %G eng %9 SLATE Working Notes %0 Journal Article %J Journal of Computational Science %D 2020 %T Translational Process: Mathematical Software Perspective %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %K communication avoiding algorithms %K DATAFLOW scheduling runtimes %K hardware accelerators %X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to world-class high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack. %B Journal of Computational Science %8 2020-09 %G eng %R https://doi.org/10.1016/j.jocs.2020.101216 %0 Generic %D 2020 %T Translational Process: Mathematical Software Perspective %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %K communication avoiding algorithms %K data flow scheduling runtimes %K hardware accelerators %X Each successive generation of computer architecture has brought new challenges to achieving high performance mathematical solvers, necessitating development and analysis of new algorithms, which are then embodied in software libraries. These libraries hide architectural details from applications, allowing them to achieve a level of portability across platforms from desktops to worldclass high performance computing (HPC) systems. Thus there has been an informal translational computer science process of developing algorithms and distributing them in open source software libraries for adoption by applications and vendors. With the move to exascale, increasing intentionality about this process will benefit the long-term sustainability of the scientific software stack. %B Innovative Computing Laboratory Technical Report %8 2020-08 %G eng %0 Generic %D 2020 %T Using Quantized Integer in LU Factorization with Partial Pivoting (Poster) %A Yaohung Tsai %A Piotr Luszczek %A Jack Dongarra %X Quantization is a common technique to speed the deep learning inference. It is using integers with a shared scalar to represent a set of equally spaced numbers. The quantized integer method has shown great success in compressing the deep learning models, reducing the computation cost without losing too much accuracy. New application specific hardware and specialized CPU extension instructions like Intel AVX-512 VNNI are providing capabilities for us to do integer MADD (multiply and add) efficiently. In this poster, we would like to show our preliminary results of using quantization integers for LU factorization with partial pivoting. Using Int32, the backward error can outperform single precision. However, quantized integer has the similar issue of limited range as FP16 that it would not work directly for large matrices because of big numbers would occur in factored U. We will show some possible solutions to it and how we would like to apply this quantized integer technique to other numerical linear algebra applications. %I SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20) %C Seattle, WA %8 2020-02 %G eng %0 Generic %D 2019 %T A Collection of White Papers from the BDEC2 Workshop in Poznan, Poland %A Gabriel Antoniu %A Alexandru Costan %A Ovidiu Marcu %A Maria S. Pérez %A Nenad Stojanovic %A Rosa M. Badia %A Miguel Vázquez %A Sergi Girona %A Micah Beck %A Terry Moore %A Piotr Luszczek %A Ezra Kissel %A Martin Swany %A Geoffrey Fox %A Vibhatha Abeykoon %A Selahattin Akkas %A Kannan Govindarajan %A Gurhan Gunduz %A Supun Kamburugamuve %A Niranda Perera %A Ahmet Uyar %A Pulasthi Wickramasinghe %A Chathura Widanage %A Maria Girone %A Toshihiro Hanawa %A Richard Moreno %A Ariel Oleksiak %A Martin Swany %A Ryousei Takano %A M.P. van Haarlem %A J. van Leeuwen %A J.B.R. Oonk %A T. Shimwell %A L.V.E. Koopmans %B Innovative Computing Laboratory Technical Report %I University of Tennessee, Knoxville %8 2019-05 %G eng %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist %D 2019 %T Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators %A Piotr Luszczek %A Ichitaro Yamazaki %A Jack Dongarra %X The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms makes it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. In order to address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include the explanation of a broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU which produces more numerically balanced L and U factors which directly address the problems of limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of the accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion will also explore the numerical stability issues that are important for robust linear solvers on these new hardware platforms. %B IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist %I IEEE %C Waltham, MA %8 2019-09 %G eng %1 Best Paper Finalist %0 Journal Article %J ACM Transactions on Mathematical Software %D 2019 %T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Maksims Abalenkovs %A Negin Bagherpour %A Sven Hammarling %A Jakub Sistek %B ACM Transactions on Mathematical Software %V 45 %8 2019-06 %G eng %N 2 %R https://doi.org/10.1145/3264491 %0 Conference Paper %B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2019 %T Software-Defined Events through PAPI %A Anthony Danalis %A Heike Jagode %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %X PAPI has been used for almost two decades as an abstraction and standardization layer for profiling hardware-specific performance metrics. However, application developers-and profiling software packages-are quite often interested in information beyond hardware counters, such as the behavior of libraries used by the software that is being profiled. So far, accessing this information has required interfacing directly with the libraries on a case-by-case basis, or low-level binary instrumentation. In this paper, we introduce the new Software-Defined Event (SDE) component of PAPI which aims to enable PAPI to serve as an abstraction and standardization layer for events that originate in software layers as well. Extending PAPI to include SDEs enables monitoring of both types of performance events-hardware-and software-related events-in a uniform way, through the same consistent PAPI interface. Furthermore, implementing SDE as a PAPI component means that the new API is aimed only at the library developers who wish to export events from within their libraries. The API for reading PAPI events-both hardware and software-remains the same, so all legacy codes and tools that use PAPI will not only continue to work, but they will automatically be able to read SDEs wherever those are available. The goal of this paper is threefold. First, we outline our design decisions regarding the functionality we offer through the new SDE interface, and offer simple examples of usage. Second, we illustrate how those events can be utilized by different software packages, specifically, by showcasing their use in the task-based runtime PaRSEC, and the HPCG supercomputing benchmark. Third, we provide a thorough performance analysis of the overhead that results from monitoring different types of SDEs, and showcase the negligible overhead of using PAPI SDE even in cases of extremely heavy use. %B 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C Rio de Janeiro, Brazil %8 2019-05 %G eng %R https://doi.org/10.1109/IPDPSW.2019.00069 %0 Journal Article %J Proceedings of the IEEE %D 2018 %T Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators %A Jack Dongarra %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Yaohung Tsai %K Dense numerical linear algebra %K performance autotuning %X Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, which are called batched solves. In this paper, we focus on the important variants of both batch Cholesky factorization and subsequent substitution. The former requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of these kernels that implement the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for solvers of linear systems of equations. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among the available processors. However, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of exactly the same size. The primary focus of our techniques is on the individual matrices in the batch that have dimensions ranging from 5-by-5 up to 100-by-100. We compare our autotuned solvers against the state-of-the-art solvers such as those provided through NVIDIA channels and publicly available in the optimized MAGMA library. The observed performance is competitive and many times superior for many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architecture platforms. %B Proceedings of the IEEE %V 106 %P 2040–2055 %8 2018-11 %G eng %N 11 %R 10.1109/JPROC.2018.2868961 %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2018 %T Autotuning Techniques for Performance-Portable Point Set Registration in 3D %A Piotr Luszczek %A Jakub Kurzak %A Ichitaro Yamazaki %A David Keffer %A Vasileios Maroulas %A Jack Dongarra %X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware. %B Supercomputing Frontiers and Innovations %V 5 %8 2018-12 %G eng %& 42 %R 10.14529/jsfi180404 %0 Report %D 2018 %T Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification %A Jack Dongarra %A Iain Duff %A Mark Gates %A Azzam Haidar %A Sven Hammarling %A Nicholas J. Higham %A Jonathan Hogg %A Pedro Valero Lara %A Piotr Luszczek %A Mawussi Zounon %A Samuel D. Relton %A Stanimire Tomov %A Timothy Costa %A Sarah Knepper %X This document describes an API for Batch Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). We focus on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The extensions beyond the original BLAS standard are considered that specify a programming interface not only for routines with uniformly-sized matrices and/or vectors but also for the situation where the sizes vary. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance manycore platforms. These include multicore and many-core CPU processors; GPUs and coprocessors; as well as other hardware accelerators with floating-point compute facility. %8 2018-07 %G eng %0 Generic %D 2018 %T Implementation of the C++ API for Batch BLAS %A Ahmad Abdelfattah %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2018-06 %G eng %1 07 %0 Generic %D 2018 %T Linear Systems Performance Report %A Jakub Kurzak %A Mark Gates %A Ichitaro Yamazaki %A Ali Charara %A Asim YarKhan %A Jamie Finney %A Gerald Ragghianti %A Piotr Luszczek %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2018-09 %G eng %9 SLATE Working Notes %1 08 %0 Generic %D 2018 %T Parallel BLAS Performance Report %A Jakub Kurzak %A Mark Gates %A Asim YarKhan %A Ichitaro Yamazaki %A Panruo Wu %A Piotr Luszczek %A Jamie Finney %A Jack Dongarra %B SLATE Working Notes %I University of Tennessee %8 2018-04 %G eng %1 05 %0 Generic %D 2018 %T Parallel Norms Performance Report %A Jakub Kurzak %A Mark Gates %A Asim YarKhan %A Ichitaro Yamazaki %A Piotr Luszczek %A Jamie Finney %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2018-06 %G eng %1 06 %0 Journal Article %J SIAM Review %D 2018 %T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %K bidiagonal matrix %K bisection %K Divide and conquer %K Hestenes method %K Jacobi method %K Kogbetliantz method %K MRRR %K QR iteration %K Singular value decomposition %K SVD %X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy. %B SIAM Review %V 60 %P 808–865 %8 2018-11 %G eng %U https://epubs.siam.org/doi/10.1137/17M1117732 %N 4 %! SIAM Rev. %R 10.1137/17M1117732 %0 Journal Article %J International Journal of Computational Science and Engineering (IJCSE) %D 2018 %T Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP %A Joseph Dorris %A Asim YarKhan %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X The increasing number of computational cores in modern many-core processors, as represented by the Intel Xeon Phi architectures, has created the need for an open-source, high performance and scalable task-based dense linear algebra package that can efficiently use this type of many-core hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, run effectively on two generations of Intel's Xeon Phi architecture, known as knights corner (KNC) and knights landing (KNL). First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP tasks for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the OpenMP tiled Cholesky decomposition on knights corner and knights landing processors. We detail the optimisations required to obtain performance on these platforms and compare with the highly tuned Intel MKL math library. %B International Journal of Computational Science and Engineering (IJCSE) %V 17 %8 2018-10 %G eng %R http://dx.doi.org/10.1504/IJCSE.2018.095851 %0 Conference Paper %B Parallel and Distributed Processing Symposium Workshops (IPDPSW) %D 2017 %T Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Yu Pei %A Jack Dongarra %K batch computation %K Cholesky Factorization %K data layout %K GPU computing %K numerical linear algebra %X Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices. %B Parallel and Distributed Processing Symposium Workshops (IPDPSW) %I IEEE %C Orlando, FL %8 2017-06 %G eng %R 10.1109/IPDPSW.2017.18 %0 Book Section %B Handbook of Big Data Technologies %D 2017 %T Bringing High Performance Computing to Big Data Algorithms %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %B Handbook of Big Data Technologies %I Springer %@ 978-3-319-49339-8 %G eng %R 10.1007/978-3-319-49340-4 %0 Generic %D 2017 %T C++ API for Batch BLAS %A Ahmad Abdelfattah %A Konstantin Arturov %A Cris Cecka %A Jack Dongarra %A Chip Freitag %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Panruo Wu %B SLATE Working Notes %I University of Tennessee %8 2017-12 %G eng %1 04 %0 Generic %D 2017 %T C++ API for BLAS and LAPACK %A Mark Gates %A Piotr Luszczek %A Ahmad Abdelfattah %A Jakub Kurzak %A Jack Dongarra %A Konstantin Arturov %A Cris Cecka %A Chip Freitag %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %1 02 %0 Generic %D 2017 %T The Case for Directive Programming for Accelerator Autotuner Optimization %A Diana Fayad %A Jakub Kurzak %A Piotr Luszczek %A Panruo Wu %A Jack Dongarra %X In this work, we present the use of compiler pragma directives for parallelizing autotuning of specialized compute kernels for hardware accelerators. A set of constructs, that include prallelizing a source code that prune a generated search space with a large number of constraints for an autotunning infrastructure. For a better performance we studied optimization aimed at minimization of the run time.We also studied the behavior of the parallel load balance and the speedup on four different machines: x86, Xeon Phi, ARMv8, and POWER8. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-10 %G eng %0 Generic %D 2017 %T Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs %A Ichitaro Yamazaki %A Mark Hoemmen %A Piotr Luszczek %A Jack Dongarra %I SIAM Annual Meeting %C Pittsburgh, Pennsylvania %8 2017-07 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2017 %T Design and Implementation of the PULSAR Programming System for Large Scale Computing %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %A Yves Robert %A Jack Dongarra %X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators. %B Supercomputing Frontiers and Innovations %V 4 %G eng %U http://superfri.org/superfri/article/view/121/210 %N 1 %R 10.14529/jsfi170101 %0 Generic %D 2017 %T Designing SLATE: Software for Linear Algebra Targeting Exascale %A Jakub Kurzak %A Panruo Wu %A Mark Gates %A Ichitaro Yamazaki %A Piotr Luszczek %A Gerald Ragghianti %A Jack Dongarra %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-10 %G eng %9 SLATE Working Notes %1 03 %0 Conference Proceedings %B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award %D 2017 %T Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives %A Ichitaro Yamazaki %A Mark Hoemmen %A Piotr Luszczek %A Jack Dongarra %X We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of two algorithms, we propose another variant, (l, t)-GMRES, that not only does fewer global all-reduces than standard GMRES, but also overlaps those all-reduces with other work. We implemented the thread-parallelism and communication-overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, though we only used 50 nodes, when the latency cost became significant, our variant performed up to 1.22× better than s-GMRES by hiding all-reduces. %B Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award %C Orlando, FL %8 2017-06 %G eng %R https://doi.org/10.1109/IPDPSW.2017.65 %0 Generic %D 2017 %T MAGMA-sparse Interface Design Whitepaper %A Hartwig Anzt %A Erik Boman %A Jack Dongarra %A Goran Flegar %A Mark Gates %A Mike Heroux %A Mark Hoemmen %A Jakub Kurzak %A Piotr Luszczek %A Sivasankaran Rajamanickam %A Stanimire Tomov %A Stephen Wood %A Ichitaro Yamazaki %X In this report we describe the logic and interface we develop for the MAGMA-sparse library to allow for easy integration as third-party library into a top-level software ecosystem. The design choices are based on extensive consultation with other software library developers, in particular the Trilinos software development team. The interface documentation is at this point not exhaustive, but a first proposal for setting a standard. Although the interface description targets the MAGMA-sparse software module, we hope that the design choices carry beyond this specific library, and are attractive for adoption in other packages. This report is not intended as static document, but will be updated over time to reflect the agile software development in the ECP 1.3.3.11 STMS11-PEEKS project. %B Innovative Computing Laboratory Technical Report %8 2017-09 %G eng %9 Technical Report %0 Generic %D 2017 %T PLASMA 17 Performance Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Generic %D 2017 %T PLASMA 17.1 Functionality Report %A Maksims Abalenkovs %A Negin Bagherpour %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Samuel Relton %A Jakub Sistek %A David Stevens %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %A Mawussi Zounon %X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2017-06 %G eng %0 Generic %D 2017 %T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale %A Ahmad Abdelfattah %A Hartwig Anzt %A Aurelien Bouteiller %A Anthony Danalis %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Stephen Wood %A Panruo Wu %A Ichitaro Yamazaki %A Asim YarKhan %B SLATE Working Notes %I Innovative Computing Laboratory, University of Tennessee %8 2017-06 %G eng %9 SLATE Working Notes %1 01 %0 Conference Paper %B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017) %D 2017 %T Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds %A Piotr Luszczek %A Jakub Kurzak %A Ichitaro Yamazaki %A David Keffer %A Jack Dongarra %X In this article, we present an autotuning approach applied to systematic performance engineering of the EM-ICP (Expectation-Maximization Iterative Closest Point) algorithm for the point set registration problem. We show how we were able to exceed the performance achieved by the reference code through multiple dependence transformations and automated procedure of generating and evaluating numerous implementation variants. Furthermore, we also managed to exploit code transformations that are not that common during manual optimization but yielded better performance in our tests for the EM-ICP algorithm. Finally, we maintained high levels of performance rate in a portable fashion across a wide range of HPC hardware platforms including multicore, many-core, and GPU-based accelerators. More importantly, the results indicate consistently high performance level and ability to move the task of data analysis through point-set registration to any modern compute platform without the concern of inferior asymptotic efficiency. %B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017) %I IEEE %C Boston, MA %8 2017-12 %G eng %R https://doi.org/10.1109/BigData.2017.8258258 %0 Conference Paper %B 2017 IEEE High Performance Extreme Computing Conference (HPEC) %D 2017 %T Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic %A Piotr Luszczek %A Jakub Kurzak %A Ichitaro Yamazaki %A Jack Dongarra %X With NVIDA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms. %B 2017 IEEE High Performance Extreme Computing Conference (HPEC) %I IEEE %C Waltham, MA %8 2017-09 %G eng %R https://doi.org/10.1109/HPEC.2017.8091031 %0 Journal Article %J Computing in Science & Engineering %D 2017 %T With Extreme Computing, the Rules Have Changed %A Jack Dongarra %A Stanimire Tomov %A Piotr Luszczek %A Jakub Kurzak %A Mark Gates %A Ichitaro Yamazaki %A Hartwig Anzt %A Azzam Haidar %A Ahmad Abdelfattah %X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future. %B Computing in Science & Engineering %V 19 %P 52-62 %8 2017-05 %G eng %N 3 %R https://doi.org/10.1109/MCSE.2017.48 %0 Conference Paper %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2016 %T Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures %A Yulu Jia %A Piotr Luszczek %A Jack Dongarra %X Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction. %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %K plasma %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams1 and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2016 %T High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %B International Journal of High Performance Computing Applications %V 30 %P 3 - 10 %8 2016-02 %G eng %U http://hpc.sagepub.com/cgi/doi/10.1177/1094342015593158 %N 1 %! International Journal of High Performance Computing Applications %R 10.1177/1094342015593158 %0 Journal Article %J Acta Numerica %D 2016 %T Linear Algebra Software for Large-Scale Accelerated Multicore Computing %A Ahmad Abdelfattah %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A undefined %A Asim YarKhan %X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems. %B Acta Numerica %V 25 %P 1-160 %8 2016-05 %G eng %R 10.1017/S0962492916000015 %0 Generic %D 2016 %T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Ahmad Abdelfattah %A Jack Dongarra %X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM to be able to assist batched advance factorization (e.g. bi-diagonalization) and other BLAS routines (e.g. triangular solve) to achieve optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world Hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on Titan supercomputer to 4096 nodes. %B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2016-08 %G eng %0 Journal Article %J National Science Review %D 2016 %T A New Metric for Ranking High-Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %B National Science Review %V 3 %P 30-35 %8 2016-01 %G eng %N 1 %R https://doi.org/10.1093/nsr/nwv084 %0 Journal Article %J International Journal of Parallel Programming %D 2016 %T Porting the PLASMA Numerical Library to the OpenMP Standard %A Asim YarKhan %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows to transition PLASMA from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such transition. %B International Journal of Parallel Programming %8 2016-06 %G eng %U http://link.springer.com/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6.pdfhttp://link.springer.com/article/10.1007/s10766-016-0441-6/fulltext.html %! Int J Parallel Prog %R 10.1007/s10766-016-0441-6 %0 Conference Paper %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %D 2016 %T Search Space Generation and Pruning System for Autotuners %A Piotr Luszczek %A Mark Gates %A Jakub Kurzak %A Anthony Danalis %A Jack Dongarra %X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to a standard C code for fast and multithreaded, as necessary, evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constrains. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages. %B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS) %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction %A Hartwig Anzt %A William Sawyer %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B International Journal of High Performance Computing Applications %G eng %0 Conference Paper %B EuroMPI/Asia 2015 Workshop %D 2015 %T Batched Matrix Computations on Hardware Accelerators %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for effective approach to develop energy efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybridMAGMAfactorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient for in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared to a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU. %B EuroMPI/Asia 2015 Workshop %C Bordeaux, France %8 2015-09 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Batched matrix computations on hardware accelerators based on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared with a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU. %B International Journal of High Performance Computing Applications %8 2015-02 %G eng %R 10.1177/1094342014567546 %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %D 2015 %T Cholesky Across Accelerators %A Asim YarKhan %A Azzam Haidar %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %I IEEE %C Elizabeth, NJ %8 2015-08 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Efficient Eigensolver Algorithms on Accelerator Based Architectures %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges -starting from our algorithm design, kernel optimization and tuning, to our programming model- in the development of a scalable high-performance symmetric eigenvalue and singular value solver. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K Autotuning %K energy efficiency %K hardware accelerators %K matrix multiplication %K power %X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use. %B Concurrency and Computation: Practice and Experience %V 27 %P 5096 - 5113 %8 12-Oct %G eng %U http://doi.wiley.com/10.1002/cpe.3516https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Fcpe.3516 %N 17 %! Concurrency Computat.: Pract. Exper. %R 10.1002/cpe.3516 %0 Journal Article %J Concurrency in Computation: Practice and Experience %D 2015 %T Experiences in autotuning matrix multiplication for energy minimization on GPUs %A Hartwig Anzt %A Blake Haugen %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B Concurrency in Computation: Practice and Experience %V 27 %P 5096-5113 %8 2015-12 %G eng %N 17 %R 10.1002/cpe.3516 %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications %D 2015 %T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization %A Azzam Haidar %A Asim YarKhan %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with degree of parallelism between GPUs and multicore-CPUs. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization. %B 17th IEEE International Conference on High Performance Computing and Communications %C Newark, NJ %8 2015-08 %G eng %0 Conference Paper %B ISC High Performance %D 2015 %T Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations %A Azzam Haidar %A Tingxing Dong %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B ISC High Performance %I Springer %C Frankfurt, Germany %8 2015-07 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2015 %T High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %K Additive Schwarz %K HPC Benchmarking %K Multigrid smoothing %K Preconditioned Conjugate Gradient %K Validation and Verification %X We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement. %B The International Journal of High Performance Computing Applications %G eng %R 10.1177/1094342015593158 %0 Journal Article %J Scientific Programming %D 2015 %T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Azzam Haidar %A Jack Dongarra %A Khairul Kabir %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %A Yulu Jia %K communication and computation overlap %K dynamic runtime scheduling using dataflow dependences %K hardware accelerators and coprocessors %K Intel Xeon Phi processor %K Many Integrated Cores %K numerical linear algebra %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. %B Scientific Programming %V 23 %8 2015-01 %G eng %N 1 %R 10.3233/SPR-140404 %0 Generic %D 2015 %T HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems %A Jack Dongarra %A Michael A. Heroux %A Piotr Luszczek %K Additive Schwarz %K HPC Benchmarking %K Multigrid smoothing %K Preconditioned Conjugate Gradient %K Validation and Verification %X We describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and be representative of their performance. HPCG is meant to help drive the computer system design and implementation in directions that will better impact future performance improvement. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2015-01 %G eng %U http://www.eecs.utk.edu/resources/library/file/1047/ut-eecs-15-736.pdf %0 Conference Paper %B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %D 2015 %T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing %A Azzam Haidar %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constrains for low power consumption and small sizes, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded system aware algorithms, that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries. %B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %I IEEE %C Waltham, MA %8 2015-09 %G eng %0 Generic %D 2015 %T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %I ISC High Performance (ISC15), Intel Booth Presentation %C Frankfurt, Germany %8 2015-06 %G eng %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %D 2015 %T Optimization for Performance and Energy for Batched Matrix Computations on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5 speedup on the K40 GPU. %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %I ACM %C San Francisco, CA %8 2015-02 %G eng %R 10.1145/2716282.2716288 %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K plasma %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs %A Theo Mary %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2015 %T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination %A Simplice Donfack %A Jack Dongarra %A Mathieu Faverge %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %K Gaussian elimination %K lu factorization %K Multicore %K parallel %K plasma %K shared memory %X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. %B Concurrency and Computation: Practice and Experience %V 27 %P 1292-1309 %8 2015-04 %G eng %N 5 %R 10.1002/cpe.3306 %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %D 2015 %T Towards Batched Linear Solvers on Accelerated Hardware Platforms %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers, is to design them to work on many small and independent problems. Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations: LU, QR, and Cholesky; that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of GPU’s symmetric multiprocessors factorizes a single problem at a time.We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-sockets, Intel Sandy Bridge server. Compared to a batched LU factorization featured in the NVIDIA’s CUBLAS library for GPUs, we achieves up to 2.5-fold speedup on the K40 GPU. %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %I ACM %C San Francisco, CA %8 2015-02 %G eng %0 Conference Proceedings %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %D 2015 %T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators %A Azzam Haidar %A Yulu Jia %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %A Jack Dongarra %K dataflow scheduling %K hardware accelerators %K multi-grain parallelism %X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications. %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %I ACM %C Austin, TX %V No. 5 %8 2015-11 %G eng %0 Book Section %B Numerical Computations with GPUs %D 2014 %T Accelerating Numerical Dense Linear Algebra Calculations with GPUs %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %B Numerical Computations with GPUs %I Springer International Publishing %P 3-28 %@ 978-3-319-06547-2 %G eng %& 1 %R 10.1007/978-3-319-06548-9_1 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2014 %T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %K factorization %K parallel linear algebra %K plasma %K recursion %K shared memory synchronization %K threaded parallelism %X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. %B Concurrency and Computation: Practice and Experience %V 26 %P 1408-1431 %8 2014-05 %G eng %U http://doi.wiley.com/10.1002/cpe.3110 %N 7 %! Concurrency Computat.: Pract. Exper. %& 1408 %R 10.1002/cpe.3110 %0 Conference Paper %B International Workshop on OpenCL %D 2014 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B International Workshop on OpenCL %C Bristol University, England %8 2014-05 %G eng %0 Conference Paper %B Workshop on Large-Scale Parallel Processing, IPDPS 2014 %D 2014 %T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K dataflow %K message-passing %K multithreading %K QR decomposition %K runtime %K systolic array %X A systolic array provides an alternative computing paradigm to the von Neuman architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems. %B Workshop on Large-Scale Parallel Processing, IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B VECPAR 2014 %D 2014 %T Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K Computer science %K factorization %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems’ algorithms – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. The tasks execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components. %B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2014 %T Looking Back at Dense Linear Algebra Software %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %K decompositional approach %K dense linear algebra %K parallel algorithms %X Over the years, computational physics and chemistry served as an ongoing source of problems that demanded the ever increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide then the future theme will feature changes as well–changes aimed at harnessing the incredible advances of the evolving hardware infrastructure. %B Journal of Parallel and Distributed Computing %V 74 %P 2548–2560 %8 2014-07 %G eng %N 7 %& 2548 %R 10.1016/j.jpdc.2013.10.005 %0 Conference Paper %B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A James Harris %A Stanimire Tomov %A Jack Dongarra %X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2:5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3:6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation. %B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 2014-08 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2014 %T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems %A Jack Dongarra %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %K dense linear algebra %K hardware accelerators %K task superscalar scheduling %X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead have become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment have been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware. %B Supercomputing Frontiers and Innovations %V 1 %G eng %N 1 %R http://dx.doi.org/10.14529/jsfi1401 %0 Conference Paper %B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper) %D 2014 %T New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem %A Azzam Haidar %A Piotr Luszczek %A Jack Dongarra %X We describe a design and implementation of a multi-stage algorithm for computing eigenvectors of a dense symmetric matrix. We show that reformulating the existing algorithms is beneficial in terms of performance even if that doubles the computational complexity. Through detailed analysis, we show that the effect of the increase in the asymptotic operation count may be compensated by a much improved performance rate. Our performance results indicate that using our approach achieves very good speedup and scalability even when directly compared with the existing state-of-the-art software. %B Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper) %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %R 10.1109/IPDPSW.2014.130 %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Optimizing Krylov Subspace Solvers on Graphics Processing Units %A Stanimire Tomov %A Piotr Luszczek %A Ichitaro Yamazaki %A Jack Dongarra %A Hartwig Anzt %A William Sawyer %X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPUhost communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of highlevel sparse linear algebra libraries. %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %D 2014 %T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors %A Azzam Haidar %A Chongxiao Cao %A Ichitaro Yamazaki %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly. %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %R 10.1109/ScalA.2014.8 %0 Generic %D 2014 %T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson. %B University of Tennessee EECS Technical Report %I University of Tennessee %8 2014-11 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment %A Azzam Haidar %A Chongxiao Cao %A Jack Dongarra %A Piotr Luszczek %A Stanimire Tomov %K algorithms %K Computer science %K CUDA %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %K nVidia %K Tesla K20 %K Tesla M2090 %X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resourcespecific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Journal Article %J The Computer Journal %D 2013 %T BlackjackBench: Portable Hardware Characterization with Automated Results Analysis %A Anthony Danalis %A Piotr Luszczek %A Gabriel Marin %A Jeffrey Vetter %A Jack Dongarra %K hardware characterization %K micro-benchmarks %K statistical analysis %X DARPA's AACE project aimed to develop Architecture Aware Compiler Environments. Such a compiler automatically characterizes the targeted hardware and optimizes the application codes accordingly. We present the BlackjackBench suite, a collection of portable micro-benchmarks that automate system characterization, plus statistical analysis techniques for interpreting the results. The BlackjackBench benchmarks discover the effective sizes and speeds of the hardware environment rather than the often unattainable peak values. We aim at hardware characteristics that can be observed by running executables generated by existing compilers from standard C codes. We characterize the memory hierarchy, including cache sharing and non-uniform memory access characteristics of the system, properties of the processing cores affecting the instruction execution speed and the length of the operating system scheduler time slot. We show how these features of modern multicores can be discovered programmatically. We also show how the features could potentially interfere with each other resulting in incorrect interpretation of the results, and how established classification and statistical analysis techniques can reduce experimental noise and aid automatic interpretation of results. We show how effective hardware metrics from our probes allow guided tuning of computational kernels that outperform an autotuning library further tuned by the hardware vendor. %B The Computer Journal %8 2013-03 %G eng %R 10.1093/comjnl/bxt057 %0 Generic %D 2013 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of sev- eral fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments pre- sented, and in general provides to heterogeneous architec- tures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is ob- tained through use of the high-performance OpenCL BLAS, hardware and OpenCL-speci c tuning, and a hybridization methodology where we split the algorithm into computa- tional tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B University of Tennessee Technical Report (Lawn 275) %I University of Tennessee %8 2013-03 %G eng %0 Conference Proceedings %B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2013 %T CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience %A Yulu Jia %A Piotr Luszczek %A George Bosilca %A Jack Dongarra %X Soft errors pose a real challenge to applications running on modern hardware as the feature size becomes smaller and the integration density increases for both the modern processors and the memory chips. Soft errors manifest themselves as bit-flips that alter the user value, and numerical software is a category of software that is sensitive to such data changes. In this paper, we present a design of a bidiagonal reduction algorithm that is resilient to soft errors, and we also describe its implementation on hybrid CPU-GPU architectures. Our fault-tolerant algorithm employs Algorithm Based Fault Tolerance, combined with reverse computation, to detect, locate, and correct soft errors. The tests were performed on a Sandy Bridge CPU coupled with an NVIDIA Kepler GPU. The included experiments show that our resilient bidiagonal reduction algorithm adds very little overhead compared to the error-prone code. At matrix size 10110 x 10110, our algorithm only has a performance overhead of 1.085% when one error occurs, and 0.354% when no errors occur. %B ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %C Montpellier, France %8 2013-11 %G eng %0 Journal Article %J Scalable Computing and Communications: Theory and Practice %D 2013 %T Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Piotr Luszczek %A Jack Dongarra %E Samee Khan %E Lin-Wang Wang %E Albert Zomaya %B Scalable Computing and Communications: Theory and Practice %I John Wiley & Sons %P 699-735 %8 2013-03 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2013 %T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %K algorithms %K bidiagional reduction %K bulge chasing %K data translation layer %K dynamic scheduling %K high performance kernels %K performance %K tile algorithms %K two-stage approach %X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows to cast most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000× 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2. %B ACM Transactions on Mathematical Software (TOMS) %V 39 %G eng %N 3 %R 10.1145/2450153.2450154 %0 Book Section %B Contemporary High Performance Computing: From Petascale Toward Exascale %D 2013 %T HPC Challenge: Design, History, and Implementation Highlights %A Jack Dongarra %A Piotr Luszczek %K exascale %K hpc challenge %K hpcc %B Contemporary High Performance Computing: From Petascale Toward Exascale %I Taylor and Francis %C Boca Raton, FL %@ 978-1-4665-6834-1 %G eng %& 2 %0 Generic %D 2013 %T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC %A Guillaume Aupy %A Mathieu Faverge %A Yves Robert %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures %B Lawn 277 %8 2013-05 %G eng %0 Generic %D 2013 %T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware %A Azzam Haidar %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %K lapack %K plasma %K scalapack %B University of Tennessee Computer Science Technical Report (also LAWN 283) %I University of Tennessee %8 2013-10 %G eng %0 Conference Paper %B Supercomputing 2013 %D 2013 %T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware %A Azzam Haidar %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %B Supercomputing 2013 %C Denver, CO %8 2013-11 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Computing %D 2013 %T LU Factorization with Partial Pivoting for a Multicore System with Accelerators %A Jakub Kurzak %A Piotr Luszczek %A Jack Dongarra %K accelerator %K Gaussian elimination %K gpu %K lu factorization %K manycore %K Multicore %K partial pivoting %K plasma %X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs. %B IEEE Transactions on Parallel and Distributed Computing %V 24 %P 1613-1621 %8 2013-08 %G eng %N 8 %& 1613 %R http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242 %0 Journal Article %J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %D 2013 %T Multithreading in the PLASMA Library %A Jakub Kurzak %A Piotr Luszczek %A Asim YarKhan %A Mathieu Faverge %A Julien Langou %A Henricus Bouwmeester %A Jack Dongarra %E Mohamed Ahmed %E Reda Ammar %E Sanguthevar Rajasekaran %K plasma %B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications %I Taylor & Francis %8 2013-00 %G eng %0 Conference Paper %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %D 2013 %T Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance %A Yulu Jia %A George Bosilca %A Piotr Luszczek %A Jack Dongarra %X This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases. %B International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013 %C Denver, CO %8 2013-11 %G eng %0 Conference Paper %B PPAM 2013 %D 2013 %T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %K magma %K mic %K xeon phi %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. %B PPAM 2013 %C Warsaw, Poland %8 2013-09 %G eng %0 Book Section %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %D 2013 %T Scalable Dense Linear Algebra on Heterogeneous Hardware %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Abstract. Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s, forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. And it is reasonable to expect, that this situation will not change in the foreseeable future. This chapter presents a methodology of dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand, when the well-known linear algebra algorithms are recast in terms of Direct Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA), takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Direct Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores. %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %G eng %0 Journal Article %J Journal of Computational Science %D 2013 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K gpgpu %K gpu %K magma %X The general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. What followed is the fact that fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs. %B Journal of Computational Science %V 4 %P 457–464 %8 2013-11 %G eng %N 6 %R http://dx.doi.org/10.1016/j.jocs.2013.01.004 %0 Generic %D 2013 %T Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures %A Yulu Jia %A Piotr Luszczek %A Jack Dongarra %X Graphics Processing Units (GPUs) are gaining wide spread usage in the field of scientific computing owing to the performance boost GPUs bring to computation intensive applications. The typical configuration is to integrate GPUs and CPUs in the same system where the CPUs handle the control flow and part of the computation workload, and the GPUs serve as accelerators carry out the bulk of the data parallel compute workload. In this paper we design and implement a soft error resilient Hessenberg reduction algorithm on GPU based hybrid platforms. Our design employs algorithm based fault tolerance technique, diskless checkpointing and reverse computation. We detect and correct soft errors on-line without delaying the detection and correction to the end of the factorization. By utilizing idle time of the CPUs and overlapping both host side and GPU side workloads we minimize the observed overhead. Experiment results validated our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault tolerant hybrid Hessenberg reduction algorithm. %B UT-CS-13-712 %I University of Tennessee Computer Science Technical Report %8 2013-06 %G eng %0 Conference Paper %B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013) %D 2013 %T Virtual Systolic Array for QR Decomposition %A Jakub Kurzak %A Piotr Luszczek %A Mark Gates %A Ichitaro Yamazaki %A Jack Dongarra %K dataflow programming %K message passing %K multi-core %K QR decomposition %K roofline model %K systolic array %X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions. %B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013) %I IEEE %C Boston, MA %8 2013-05 %G eng %R 10.1109/IPDPS.2013.119 %0 Generic %D 2012 %T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties %A Simplice Donfack %A Jack Dongarra %A Mathieu Faverge %A Mark Gates %A Jakub Kurzak %A Piotr Luszczek %A Ichitaro Yamazaki %X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed. %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Conference Proceedings %B 2012 IEEE High Performance Extreme Computing Conference %D 2012 %T Anatomy of a Globally Recursive Embedded LINPACK Benchmark %A Piotr Luszczek %A Jack Dongarra %X We present a complete bottom-up implementation of an embedded LINPACK benchmark on iPad 2. We use a novel formulation of a recursive LU factorization that is recursive and parallel at the global scope. We be believe our new algorithm presents an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches. We show a assembly API that allows us a much higher level of abstraction and provides rapid code development within the confines of mobile device SDK. We use performance modeling to help with the limitation of the device and the limited access to device from the development environment not geared for HPC application tuning. %B 2012 IEEE High Performance Extreme Computing Conference %C Waltham, MA %P 1-6 %8 2012-09 %@ 978-1-4673-1577-7 %G eng %R 10.1109/HPEC.2012.6408679 %0 Journal Article %J IPDPS 2012 %D 2012 %T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction %A Azzam Haidar %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %B IPDPS 2012 %C Shanghai, China %8 2012-05 %G eng %0 Journal Article %J High Performance Scientific Computing: Algorithms and Applications %D 2012 %T Dense Linear Algebra on Accelerated Multicore Hardware %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %E Michael Berry %E et al., %B High Performance Scientific Computing: Algorithms and Applications %I Springer-Verlag %C London, UK %8 2012-00 %G eng %0 Conference Proceedings %B The 2nd International Conference on Cloud and Green Computing (submitted) %D 2012 %T Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture %A Jack Dongarra %A Hatem Ltaeif %A Piotr Luszczek %A Vincent M Weaver %B The 2nd International Conference on Cloud and Green Computing (submitted) %C Xiangtan, Hunan, China %8 2012-11 %G eng %0 Journal Article %J Lecture Notes in Computer Science %D 2012 %T Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %B Lecture Notes in Computer Science %V 7203 %P 661-670 %8 2012-09 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming %A Peng Du %A Rick Weber %A Piotr Luszczek %A Stanimire Tomov %A Gregory D. Peterson %A Jack Dongarra %B Parallel Computing %V 38 %P 391-407 %8 2012-08 %G eng %0 Journal Article %J EuroPar 2012 (also LAWN 260) %D 2012 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %B EuroPar 2012 (also LAWN 260) %C Rhodes Island, Greece %8 2012-08 %G eng %0 Journal Article %J ICCS 2012 %D 2012 %T High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors %A Peng Du %A Piotr Luszczek %A Jack Dongarra %B ICCS 2012 %C Omaha, NE %8 2012-06 %G eng %0 Journal Article %J On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear) %D 2012 %T HPC Challenge: Design, History, and Implementation Highlights %A Jack Dongarra %A Piotr Luszczek %E Jeffrey Vetter %B On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear) %I Chapman & Hall/CRC Press %8 2012-00 %G eng %0 Journal Article %J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear) %D 2012 %T Looking Back at Dense Linear Algebra Software %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %E Viktor K. Prasanna %E Yves Robert %E Per Stenström %B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear) %8 2012-00 %G eng %0 Generic %D 2012 %T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors %A Jack Dongarra %A Mark Gates %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12) %C Salt Lake City, UT %8 2012-11 %G eng %0 Conference Proceedings %B International Workshop on Power-Aware Systems and Architectures %D 2012 %T Measuring Energy and Power with PAPI %A Vincent M Weaver %A Matt Johnson %A Kiran Kasichayanula %A James Ralph %A Piotr Luszczek %A Dan Terpstra %A Shirley Moore %K papi %X Energy and power consumption are becoming critical metrics in the design and usage of high performance systems. We have extended the Performance API (PAPI) analysis library to measure and report energy and power values. These values are reported using the existing PAPI API, allowing code previously instrumented for performance counters to also measure power and energy. Higher level tools that build on PAPI will automatically gain support for power and energy readings when used with the newest version of PAPI. We describe in detail the types of energy and power readings available through PAPI. We support external power meters, as well as values provided internally by recent CPUs and GPUs. Measurements are provided directly to the instrumented process, allowing immediate code analysis in real time. We provide examples showing results that can be obtained with our infrastructure. %B International Workshop on Power-Aware Systems and Architectures %C Pittsburgh, PA %8 2012-09 %G eng %R 10.1109/ICPPW.2012.39 %0 Journal Article %J SAAHPC '12 (Best Paper Award) %D 2012 %T Power Aware Computing on GPUs %A Kiran Kasichayanula %A Dan Terpstra %A Piotr Luszczek %A Stanimire Tomov %A Shirley Moore %A Gregory D. Peterson %K magma %B SAAHPC '12 (Best Paper Award) %C Argonne, IL %8 2012-07 %G eng %0 Journal Article %J LAWN 267 %D 2012 %T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B LAWN 267 %8 2012-00 %G eng %0 Conference Proceedings %B Proceedings of VECPAR’12 %D 2012 %T Programming the LU Factorization for a Multicore System with Accelerators %A Jakub Kurzak %A Piotr Luszczek %A Mathieu Faverge %A Jack Dongarra %K plasma %K quark %B Proceedings of VECPAR’12 %C Kobe, Japan %8 2012-04 %G eng %0 Generic %D 2011 %T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %K plasma %K quark %B University of Tennessee Computer Science Technical Report (also as a LAWN) %8 2011-09 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results %A Anthony Danalis %A Piotr Luszczek %A Gabriel Marin %A Jeffrey Vetter %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Journal Article %J in Solving the Schrodinger Equation: Has everything been tried? (to appear) %D 2011 %T Changes in Dense Linear Algebra Kernels - Decades Long Perspective %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %E P. Popular %B in Solving the Schrodinger Equation: Has everything been tried? (to appear) %I Imperial College Press %8 2011-00 %G eng %0 Conference Proceedings %B 6th Workshop on Virtualization in High-Performance Cloud Computing %D 2011 %T Evaluation of the HPC Challenge Benchmarks in Virtualized Environments %A Piotr Luszczek %A Eric Meek %A Shirley Moore %A Dan Terpstra %A Vincent M Weaver %A Jack Dongarra %K hpcc %B 6th Workshop on Virtualization in High-Performance Cloud Computing %C Bordeaux, France %8 2011-08 %G eng %0 Conference Proceedings %B Proceedings of PARCO'11 %D 2011 %T Exploiting Fine-Grain Parallelism in Recursive LU Factorization %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %K plasma %B Proceedings of PARCO'11 %C Gent, Belgium %8 2011-04 %G eng %0 Conference Proceedings %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %D 2011 %T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops) %I IEEE %C Anchorage, Alaska, USA %P 1432-1441 %8 2011-05 %G eng %0 Generic %D 2011 %T GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement %A Hartwig Anzt %A Piotr Luszczek %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260) %8 2011-12 %G eng %0 Generic %D 2011 %T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247) %8 2011-05 %G eng %0 Journal Article %J IEEE Cluster 2011 %D 2011 %T High Performance Dense Linear System Solver with Soft Error Resilience %A Peng Du %A Piotr Luszczek %A Jack Dongarra %K ft-la %B IEEE Cluster 2011 %C Austin, TX %8 2011-09 %G eng %0 Conference Proceedings %B Proceedings of MTAGS11 %D 2011 %T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Piotr Luszczek %B Proceedings of MTAGS11 %C Seattle, WA %8 2011-11 %G eng %0 Conference Proceedings %B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011) %D 2011 %T Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency %A Hatem Ltaeif %A Piotr Luszczek %A Jack Dongarra %K mumi %B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011) %C Hamburg, Germany %8 2011-09 %G eng %0 Generic %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-07 %G eng %0 Journal Article %J UT-CS-11-675 (also LAPACK Working Note #252) %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B UT-CS-11-675 (also LAPACK Working Note #252) %8 2011-07 %G eng %0 Journal Article %J Journal of Computational Science %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K ft-la %B Journal of Computational Science %I Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11 %C Seattle, WA %8 2011-11 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures %A Piotr Luszczek %A Hatem Ltaeif %A Jack Dongarra %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Generic %D 2010 %T Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess %A Piotr Luszczek %A Jack Dongarra %K hpcc %B Innovative Computing Laboratory (ICL) Technical Report %8 2010-06 %G eng %0 Generic %D 2010 %T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K dplasma %K parsec %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-10-660 %8 2010-09 %G eng %0 Generic %D 2010 %T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Mathieu Faverge %A Azzam Haidar %A Thomas Herault %A Jakub Kurzak %A Julien Langou %A Pierre Lemariner %A Hatem Ltaeif %A Piotr Luszczek %A Asim YarKhan %A Jack Dongarra %K dague %K plasma %B Innovative Computing Laboratory Technical Report %8 2010-00 %G eng %0 Conference Proceedings %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %D 2010 %T Mixed-Tool Performance Analysis on Hybrid Multicore Architectures %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %C San Diego, CA %8 2010-09 %G eng %0 Conference Proceedings %B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10) %D 2010 %T OpenCL Evaluation for Numerical Linear Algebra Library Development %A Peng Du %A Piotr Luszczek %A Jack Dongarra %K magma %B Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10) %C Knoxville, TN %8 2010-07 %G eng %0 Generic %D 2010 %T Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling %A Jack Dongarra %A Piotr Luszczek %K hpcc %B University of Tennessee Computer Science Technical Report %8 2010-10 %G eng %0 Journal Article %J PGI Insider %D 2010 %T Using MAGMA with PGI Fortran %A Stanimire Tomov %A Mathieu Faverge %A Piotr Luszczek %A Jack Dongarra %K magma %B PGI Insider %8 2010-11 %G eng %0 Journal Article %J Computer Physics Communications %D 2009 %T Accelerating Scientific Computations with Mixed Precision Algorithms %A Marc Baboulin %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %X On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. %B Computer Physics Communications %V 180 %P 2526-2533 %8 2009-12 %G eng %N 12 %R https://doi.org/10.1016/j.cpc.2008.11.005 %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 2009-11 %G eng %0 Conference Proceedings %B Journal of Physics: Conference Series %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Stanimire Tomov %K magma %K plasma %B Journal of Physics: Conference Series %V 180 %8 2009-00 %G eng %0 Journal Article %J The International Journal of High Performance Computing Applications %D 2009 %T Parallel Programming in MATLAB %A Piotr Luszczek %K lfc %K plasma %B The International Journal of High Performance Computing Applications %V 23 %P 277-283 %8 2009-07 %G eng %0 Journal Article %J in Advances in Computers %D 2008 %T DARPA's HPCS Program: History, Models, Tools, Languages %A Jack Dongarra %A Robert Graybill %A William Harrod %A Robert Lucas %A Ewing Lusk %A Piotr Luszczek %A Janice McMahon %A Allan Snavely %A Jeffrey Vetter %A Katherine Yelick %A Sadaf Alam %A Roy Campbell %A Laura Carrington %A Tzu-Yi Chen %A Omid Khalili %A Jeremy Meredith %A Mustafa Tikir %E M. Zelkowitz %B in Advances in Computers %I Elsevier %V 72 %8 2008-01 %G eng %0 Journal Article %J in High Performance Computing and Grids in Action %D 2008 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B in High Performance Computing and Grids in Action %I IOS Press %C Amsterdam %8 2008-01 %G eng %0 Journal Article %J in Beautiful Code Leading Programmers Explain How They Think (Chapter 14) %D 2008 %T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination %A Jack Dongarra %A Piotr Luszczek %E Andy Oram %E G. Wilson %B in Beautiful Code Leading Programmers Explain How They Think (Chapter 14) %P 243-282 %8 2008-01 %G eng %0 Generic %D 2008 %T HPCS Library Study Effort %A Jack Dongarra %A James Demmel %A Parry Husbands %A Piotr Luszczek %B University of Tennessee Computer Science Technical Report, UT-CS-08-617 %8 2008-01 %G eng %0 Journal Article %J Concurrency: Practice and Experience %D 2008 %T The LINPACK Benchmark: Past, Present, and Future %A Jack Dongarra %A Piotr Luszczek %A Antoine Petitet %K hpl %B Concurrency: Practice and Experience %V 15 %P 803-820 %8 2008-00 %G eng %0 Journal Article %J Computing in Science and Engineering %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B Computing in Science and Engineering %P 80-83 %8 2008-01 %G eng %0 Generic %D 2008 %T The PlayStation 3 for High Performance Scientific Computing %A Jakub Kurzak %A Alfredo Buttari %A Piotr Luszczek %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2008-01 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software %D 2008 %T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %K plasma %B ACM Transactions on Mathematical Software %V 34 %P 17-22 %8 2008-00 %G eng %0 Journal Article %J In High Performance Computing and Grids in Action (to appear) %D 2007 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B In High Performance Computing and Grids in Action (to appear) %I IOS Press %C Amsterdam %8 2007-00 %G eng %0 Journal Article %J International Journal for High Performance Computer Applications %D 2007 %T High Performance Development for High End Computing with Python Language Wrapper (PLW) %A Jack Dongarra %A Piotr Luszczek %B International Journal for High Performance Computer Applications %V 21 %P 360-369 %8 2007-00 %G eng %0 Journal Article %J in Beautiful Code Leading Programmers Explain How They Think %D 2007 %T How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination %A Jack Dongarra %A Piotr Luszczek %E Andy Oram %E Greg Wilson %B in Beautiful Code Leading Programmers Explain How They Think %I O'Reilly Media, Inc. %8 2007-06 %G eng %0 Journal Article %J International Journal of High Performance Computer Applications (to appear) %D 2007 %T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems %A Alfredo Buttari %A Jack Dongarra %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Jakub Kurzak %B International Journal of High Performance Computer Applications (to appear) %8 2007-08 %G eng %0 Generic %D 2007 %T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3 %A Alfredo Buttari %A Piotr Luszczek %A Jakub Kurzak %A Jack Dongarra %A George Bosilca %K multi-core %B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595 %8 2007-00 %G eng %0 Journal Article %J University of Tennessee Computer Science Tech Report %D 2006 %T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy %A Julien Langou %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Jack Dongarra %K iter-ref %B University of Tennessee Computer Science Tech Report %8 2006-04 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications (to appear) %D 2006 %T High Performance Development for High End Computing with Python Language Wrapper (PLW) %A Piotr Luszczek %K hpcc %K lfc %B International Journal of High Performance Computing Applications (to appear) %8 2006-00 %G eng %0 Conference Proceedings %B SC06 Conference Tutorial %D 2006 %T The HPC Challenge (HPCC) Benchmark Suite %A Piotr Luszczek %A David Bailey %A Jack Dongarra %A Jeremy Kepner %A Robert Lucas %A Rolf Rabenseifner %A Daisuke Takahashi %K hpcc %K hpcchallenge %B SC06 Conference Tutorial %I IEEE %C Tampa, Florida %8 2006-11 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T The Impact of Multicore on Math Software %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %K plasma %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries %A James Demmel %A Jack Dongarra %A B. Parlett %A William Kahan %A Ming Gu %A David Bindel %A Yozo Hida %A Xiaoye Li %A Osni Marques %A Jason E. Riedy %A Christof Voemel %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Julien Langou %A Stanimire Tomov %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Journal Article %J IBM Journal of Research and Development %D 2006 %T Self Adapting Numerical Software SANS Effort %A George Bosilca %A Zizhong Chen %A Jack Dongarra %A Victor Eijkhout %A Graham Fagg %A Erika Fuentes %A Julien Langou %A Piotr Luszczek %A Jelena Pjesivac–Grbovic %A Keith Seymour %A Haihang You %A Sathish Vadhiyar %K gco %B IBM Journal of Research and Development %V 50 %P 223-238 %8 2006-01 %G eng %0 Journal Article %J SC|05 Tutorial - S13 %D 2005 %T HPC Challenge v1.x Benchmark Suite %A Piotr Luszczek %A David Koester %K hpcc %B SC|05 Tutorial - S13 %C Seattle, Washington %8 2005-01 %G eng %0 Journal Article %D 2005 %T Introduction to the HPC Challenge Benchmark Suite %A Piotr Luszczek %A Jack Dongarra %A David Koester %A Rolf Rabenseifner %A Bob Lucas %A Jeremy Kepner %A John McCalpin %A David Bailey %A Daisuke Takahashi %K hpcc %K hpcchallenge %8 2005-03 %G eng %0 Generic %D 2005 %T Introduction to the HPCChallenge Benchmark Suite %A Jack Dongarra %A Piotr Luszczek %K hpcc %K hpcchallenge %B ICL Technical Report %8 2005-01 %G eng %0 Journal Article %J Oak Ridge National Laboratory Report %D 2004 %T Cray X1 Evaluation Status Report %A Pratul Agarwal %A R. A. Alexander %A E. Apra %A Satish Balay %A Arthur S. Bland %A James Colgan %A Eduardo D'Azevedo %A Jack Dongarra %A Tom Dunigan %A Mark Fahey %A Al Geist %A M. Gordon %A Robert Harrison %A Dinesh Kaushik %A M. Krishnakumar %A Piotr Luszczek %A Tony Mezzacapa %A Jeff Nichols %A Jarek Nieplocha %A Leonid Oliker %A T. Packwood %A M. Pindzola %A Thomas C. Schulthess %A Jeffrey Vetter %A James B White %A T. Windus %A Patrick H. Worley %A Thomas Zacharia %B Oak Ridge National Laboratory Report %V /-2004/13 %8 2004-01 %G eng %0 Conference Proceedings %B International Conference on Computational Science %D 2004 %T Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations %A Piotr Luszczek %A Jack Dongarra %E Marian Bubak %E Geert Dick van Albada %E Peter M. Sloot %E Jack Dongarra %K lacsi %K lfc %B International Conference on Computational Science %I Springer Verlag %C Poland %8 2004-06 %G eng %R 10.1007/978-3-540-25944-2_35 %0 Conference Proceedings %B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 04') %D 2004 %T LAPACK for Clusters Project: An Example of Self Adapting Numerical Software %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %K lfc %B Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS 04') %C Big Island, Hawaii %V 9 %P 90282 %8 2004-01 %G eng %0 Journal Article %J Parallel Computing %D 2003 %T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %K lfc %K sans %B Parallel Computing %V 29 %P 1723-1743 %8 2003-11 %G eng %0 Generic %D 2003 %T Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160) %A Zizhong Chen %A Jack Dongarra %A Piotr Luszczek %A Kenneth Roche %K lacsi %B University of Tennessee Computer Science Technical Report, UT-CS-03-499 %8 2003-01 %G eng %0 Journal Article %J Scientific Programming %D 2001 %T Recursive Approach in Sparse Matrix LU Factorization %A Jack Dongarra %A Victor Eijkhout %A Piotr Luszczek %B Scientific Programming %V 9 %P 51-60 %8 2001-00 %G eng %0 Conference Proceedings %B Proceedings of 1st SGI Users Conference %D 2000 %T Recursive approach in sparse matrix LU factorization %A Jack Dongarra %A Victor Eijkhout %A Piotr Luszczek %B Proceedings of 1st SGI Users Conference %C Cracow, Poland (ACC Cyfronet UMM, 2000) %P 409-418 %8 2000-01 %G eng