%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2020
%T A Set of Batched Basic Linear Algebra Subprograms
%A Ahmad Abdelfattah
%A Timothy Costa
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Sven Hammarling
%A Nicholas J. Higham
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Mawussi Zounon
%X This paper describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular half precision is used in many very large scale applications, such as those associated with machine learning.
%B ACM Transactions on Mathematical Software
%8 2020-10
%G eng

%0 Generic
%D 2020
%T SLATE: Software for Linear Algebra Targeting Exascale (POSTER)
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%I 2020 Exascale Computing Project Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T SLATE Tutorial
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Ali Charara
%A Jamie Finney
%A Dalal Sukkari
%A Mohammed Al Farhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Jack Dongarra
%I 2020 ECP Annual Meeting
%C Houston, TX
%8 2020-02
%G eng

%0 Generic
%D 2020
%T SLATE Users' Guide
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Mohammed Al Farhan
%A Dalal Sukkari
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2020-07
%G eng
%9 SLATE Working Notes

%0 Generic
%D 2019
%T An Empirical View of SLATE Algorithms on Scalable Hybrid System
%A Asim YarKhan
%A Jakub Kurzak
%A Ahmad Abdelfattah
%A Jack Dongarra
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee, Knoxville
%8 2019-09
%G eng

%0 Conference Proceedings
%B ACM International Conference on Supercomputing (ICS '19)
%D 2019
%T Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%Y Rudolf Eigenmann
%Y Chen Ding
%Y Sally A. McKee
%B ACM International Conference on Supercomputing (ICS '19)
%I ACM
%C Phoenix, Arizona
%P 117–126
%8 2019-06
%@ 9781450360791
%G eng
%R https://dl.acm.org/doi/abs/10.1145/3330345.3330356

%0 Conference Paper
%B 48th International Conference on Parallel Processing (ICPP 2019)
%D 2019
%T Massively Parallel Automated Software Tuning
%A Jakub Kurzak
%A Yaohung Tsai
%A Mark Gates
%A Ahmad Abdelfattah
%A Jack Dongarra
%X This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpenN Software Autotuning Infrastructure project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling to distributed-memory resources and takes advantage of multithreading for parallel compilation and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and the future directions are outlined.
%B 48th International Conference on Parallel Processing (ICPP 2019)
%I ACM Press
%C Kyoto, Japan
%8 2019-08
%G eng
%R https://doi.org/10.1145/3337821.3337908

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2019
%T PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Sven Hammarling
%A Jakub Sistek
%B ACM Transactions on Mathematical Software
%V 45
%8 2019-06
%G eng
%N 2
%R https://doi.org/10.1145/3264491

%0 Generic
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%I International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%C Denver, CO
%8 2019-11
%G eng

%0 Conference Paper
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%D 2019
%T SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
%A Mark Gates
%A Jakub Kurzak
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%X The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades of operation, cannot adequately be retrofitted for modern accelerated architectures. SLATE uses modern techniques such as communication-avoiding algorithms, lookahead panels to overlap communication and computation, and task-based scheduling, along with a modern C++ framework. Here we present the design of SLATE and initial reports of several of its components.
%B International Conference for High Performance Computing, Networking, Storage and Analysis (SC19)
%I ACM
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1145/3295500.3356223

%0 Generic
%D 2019
%T SLATE Developers' Guide
%A Ali Charara
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-12
%G eng
%9 SLATE Working Notes

%0 Generic
%D 2019
%T SLATE Mixed Precision Performance Report
%A Ali Charara
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Asim YarKhan
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2019-04
%G eng

%0 Generic
%D 2019
%T SLATE Working Note 12: Implementing Matrix Inversions
%A Jakub Kurzak
%A Mark Gates
%A Ali Charara
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-06
%G eng

%0 Generic
%D 2019
%T SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers
%A Mark Gates
%A Mohammed Al Farhan
%A Ali Charara
%A Jakub Kurzak
%A Dalal Sukkari
%A Asim YarKhan
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2019-09
%G eng
%9 SLATE Working Notes

%0 Journal Article
%J Proceedings of the IEEE
%D 2018
%T Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yaohung Tsai
%K Dense numerical linear algebra
%K performance autotuning
%X Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, which are called batched solves. In this paper, we focus on the important variants of both batch Cholesky factorization and subsequent substitution. The former requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of these kernels that implement the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for solvers of linear systems of equations. Due to their throughput-oriented design, GPUs exhibit the highest processing rates among the available processors. However, without careful design and coding, this speed is mostly restricted to large matrix sizes. We show an automated exploration of the implementation space as well as a new data layout for the batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of exactly the same size. The primary focus of our techniques is on the individual matrices in the batch that have dimensions ranging from 5-by-5 up to 100-by-100. We compare our autotuned solvers against the state-of-the-art solvers such as those provided through NVIDIA channels and publicly available in the optimized MAGMA library. The observed performance is competitive and many times superior for many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architecture platforms.
%B Proceedings of the IEEE
%V 106
%P 2040–2055
%8 2018-11
%G eng
%N 11
%R 10.1109/JPROC.2018.2868961

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2018
%T Autotuning Techniques for Performance-Portable Point Set Registration in 3D
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Vasileios Maroulas
%A Jack Dongarra
%X We present an autotuning approach applied to exhaustive performance engineering of the EM-ICP algorithm for the point set registration problem with a known reference. We were able to achieve progressively higher performance levels through a variety of code transformations and an automated procedure of generating a large number of implementation variants. Furthermore, we managed to exploit code patterns that are not common when only attempting manual optimization but which yielded in our tests better performance for the chosen registration algorithm. Finally, we also show how we maintained high levels of the performance rate in a portable fashion across a wide range of hardware platforms including multicore, manycore coprocessors, and accelerators. Each of these hardware classes is much different from the others and, consequently, cannot reliably be mastered by a single developer in a short time required to deliver a close-to-optimal implementation. We assert in our concluding remarks that our methodology as well as the presented tools provide a valid automation system for software optimization tasks on modern HPC hardware.
%B Supercomputing Frontiers and Innovations
%V 5
%8 2018-12
%G eng
%& 42
%R 10.14529/jsfi180404

%0 Generic
%D 2018
%T Implementation of the C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 07

%0 Generic
%D 2018
%T Least Squares Performance Report
%A Mark Gates
%A Ali Charara
%A Jakub Kurzak
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-12
%G eng
%9 SLATE Working Notes
%1 09

%0 Generic
%D 2018
%T Linear Systems Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Ali Charara
%A Asim YarKhan
%A Jamie Finney
%A Gerald Ragghianti
%A Piotr Luszczek
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-09
%G eng
%9 SLATE Working Notes
%1 08

%0 Generic
%D 2018
%T Parallel BLAS Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Panruo Wu
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I University of Tennessee
%8 2018-04
%G eng
%1 05

%0 Generic
%D 2018
%T Parallel Norms Performance Report
%A Jakub Kurzak
%A Mark Gates
%A Asim YarKhan
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Jamie Finney
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2018-06
%G eng
%1 06

%0 Journal Article
%J SIAM Review
%D 2018
%T The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%K bidiagonal matrix
%K bisection
%K Divide and conquer
%K Hestenes method
%K Jacobi method
%K Kogbetliantz method
%K MRRR
%K QR iteration
%K Singular value decomposition
%K SVD
%X The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
%B SIAM Review
%V 60
%P 808–865
%8 2018-11
%G eng
%U https://epubs.siam.org/doi/10.1137/17M1117732
%N 4
%! SIAM Rev.
%R 10.1137/17M1117732

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2018
%T Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Panruo Wu
%A Mawussi Zounon
%A Jack Dongarra
%K linear algebra
%K multithreading
%K runtime
%K symmetric indefinite matrices
%X Recently, the Open Multi-Processing (OpenMP) standard has incorporated task-based programming, where a function call with input and output data is treated as a task. At run time, OpenMP's superscalar scheduler tracks the data dependencies among the tasks and executes the tasks as their dependencies are resolved. On a shared-memory architecture with multiple cores, the independent tasks are executed on different cores in parallel, thereby enabling parallel execution of a seemingly sequential code. With the emergence of many-core architectures, this type of programming paradigm is gaining attention-not only because of its simplicity, but also because it breaks the artificial synchronization points of the program and improves its thread-level parallelization. In this paper, we use these new OpenMP features to develop a portable high-performance implementation of a dense symmetric indefinite linear solver. Obtaining high performance from this kind of solver is a challenge because the symmetric pivoting, which is required to maintain numerical stability, leads to data dependencies that prevent us from using some common performance-improving techniques. To fully utilize a large number of cores through tasking, while conforming to the OpenMP standard, we describe several techniques. Our performance results on current many-core architectures-including Intel's Broadwell, Intel's Knights Landing, IBM's Power8, and Arm's ARMv8-demonstrate the portable and superior performance of our implementation compared with the Linear Algebra PACKage (LAPACK). The resulting solver is now available as a part of the PLASMA software package.
%B IEEE Transactions on Parallel and Distributed Systems
%V 29
%P 1879–1892
%8 2018-08
%G eng
%N 8
%R 10.1109/TPDS.2018.2808964

%0 Journal Article
%J International Journal of Computational Science and Engineering (IJCSE)
%D 2018
%T Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP
%A Joseph Dorris
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X The increasing number of computational cores in modern many-core processors, as represented by the Intel Xeon Phi architectures, has created the need for an open-source, high performance and scalable task-based dense linear algebra package that can efficiently use this type of many-core hardware. In this paper, we examined the design modifications necessary when porting PLASMA, a task-based dense linear algebra library, run effectively on two generations of Intel's Xeon Phi architecture, known as knights corner (KNC) and knights landing (KNL). First, we modified PLASMA's tiled Cholesky decomposition to use OpenMP tasks for its scheduling mechanism to enable Xeon Phi compatibility. We then compared the performance of our modified code to that of the original dynamic scheduler running on an Intel Xeon Sandy Bridge CPU. Finally, we looked at the performance of the OpenMP tiled Cholesky decomposition on knights corner and knights landing processors. We detail the optimisations required to obtain performance on these platforms and compare with the highly tuned Intel MKL math library.
%B International Journal of Computational Science and Engineering (IJCSE)
%V 17
%8 2018-10
%G eng
%R http://dx.doi.org/10.1504/IJCSE.2018.095851

%0 Conference Paper
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%D 2017
%T Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Yu Pei
%A Jack Dongarra
%K batch computation
%K Cholesky Factorization
%K data layout
%K GPU computing
%K numerical linear algebra
%X Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
%B Parallel and Distributed Processing Symposium Workshops (IPDPSW)
%I IEEE
%C Orlando, FL
%8 2017-06
%G eng
%R 10.1109/IPDPSW.2017.18

%0 Book Section
%B Handbook of Big Data Technologies
%D 2017
%T Bringing High Performance Computing to Big Data Algorithms
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Handbook of Big Data Technologies
%I Springer
%@ 978-3-319-49339-8
%G eng
%R 10.1007/978-3-319-49340-4

%0 Generic
%D 2017
%T C++ API for Batch BLAS
%A Ahmad Abdelfattah
%A Konstantin Arturov
%A Cris Cecka
%A Jack Dongarra
%A Chip Freitag
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Panruo Wu
%B SLATE Working Notes
%I University of Tennessee
%8 2017-12
%G eng
%1 04

%0 Generic
%D 2017
%T C++ API for BLAS and LAPACK
%A Mark Gates
%A Piotr Luszczek
%A Ahmad Abdelfattah
%A Jakub Kurzak
%A Jack Dongarra
%A Konstantin Arturov
%A Cris Cecka
%A Chip Freitag
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%1 02

%0 Generic
%D 2017
%T The Case for Directive Programming for Accelerator Autotuner Optimization
%A Diana Fayad
%A Jakub Kurzak
%A Piotr Luszczek
%A Panruo Wu
%A Jack Dongarra
%X In this work, we present the use of compiler pragma directives for parallelizing autotuning of specialized compute kernels for hardware accelerators. A set of constructs, that include prallelizing a source code that prune a generated search space with a large number of constraints for an autotunning infrastructure. For a better performance we studied optimization aimed at minimization of the run time.We also studied the behavior of the parallel load balance and the speedup on four different machines: x86, Xeon Phi, ARMv8, and POWER8.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-10
%G eng

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2017
%T Design and Implementation of the PULSAR Programming System for Large Scale Computing
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%A Yves Robert
%A Jack Dongarra
%X The objective of the PULSAR project was to design a programming model suitable for large scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.
%B Supercomputing Frontiers and Innovations
%V 4
%G eng
%U http://superfri.org/superfri/article/view/121/210
%N 1
%R 10.14529/jsfi170101

%0 Generic
%D 2017
%T Designing SLATE: Software for Linear Algebra Targeting Exascale
%A Jakub Kurzak
%A Panruo Wu
%A Mark Gates
%A Ichitaro Yamazaki
%A Piotr Luszczek
%A Gerald Ragghianti
%A Jack Dongarra
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-10
%G eng
%9 SLATE Working Notes
%1 03

%0 Generic
%D 2017
%T MAGMA-sparse Interface Design Whitepaper
%A Hartwig Anzt
%A Erik Boman
%A Jack Dongarra
%A Goran Flegar
%A Mark Gates
%A Mike Heroux
%A Mark Hoemmen
%A Jakub Kurzak
%A Piotr Luszczek
%A Sivasankaran Rajamanickam
%A Stanimire Tomov
%A Stephen Wood
%A Ichitaro Yamazaki
%X In this report we describe the logic and interface we develop for the MAGMA-sparse library  to allow for easy integration as third-party library into a top-level software ecosystem. The  design choices are based on extensive consultation with other software library developers, in  particular the Trilinos software development team. The interface documentation is at this point  not exhaustive, but a first proposal for setting a standard. Although the interface description  targets the MAGMA-sparse software module, we hope that the design choices carry beyond this  specific library, and are attractive for adoption in other packages.  This report is not intended as static document, but will be updated over time to reflect the agile  software development in the ECP 1.3.3.11 STMS11-PEEKS project.
%B Innovative Computing Laboratory Technical Report
%8 2017-09
%G eng
%9 Technical Report

%0 Generic
%D 2017
%T PLASMA 17 Performance Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Generic
%D 2017
%T PLASMA 17.1 Functionality Report
%A Maksims Abalenkovs
%A Negin Bagherpour
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Samuel Relton
%A Jakub Sistek
%A David Stevens
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%A Mawussi Zounon
%X PLASMA (Parallel Linear Algebra for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing. PLASMA is designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state of the art solutions in parallel algorithms, scheduling, and software engineering. PLASMA currently offers a collection of routines for solving linear systems of equations and least square problems.
%B Innovative Computing Laboratory Technical Report
%I University of Tennessee
%8 2017-06
%G eng

%0 Generic
%D 2017
%T Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Aurelien Bouteiller
%A Anthony Danalis
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Stephen Wood
%A Panruo Wu
%A Ichitaro Yamazaki
%A Asim YarKhan
%B SLATE Working Notes
%I Innovative Computing Laboratory, University of Tennessee
%8 2017-06
%G eng
%9 SLATE Working Notes
%1 01

%0 Conference Paper
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%D 2017
%T Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A David Keffer
%A Jack Dongarra
%X In this article, we present an autotuning approach applied to systematic performance engineering of the EM-ICP (Expectation-Maximization Iterative Closest Point) algorithm for the point set registration problem. We show how we were able to exceed the performance achieved by the reference code through multiple dependence transformations and automated procedure of generating and evaluating numerous implementation variants. Furthermore, we also managed to exploit code transformations that are not that common during manual optimization but yielded better performance in our tests for the EM-ICP algorithm. Finally, we maintained high levels of performance rate in a portable fashion across a wide range of HPC hardware platforms including multicore, many-core, and GPU-based accelerators. More importantly, the results indicate consistently high performance level and ability to move the task of data analysis through point-set registration to any modern compute platform without the concern of inferior asymptotic efficiency.
%B IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017)
%I IEEE
%C Boston, MA
%8 2017-12
%G eng
%R https://doi.org/10.1109/BigData.2017.8258258

%0 Conference Paper
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%D 2017
%T Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic
%A Piotr Luszczek
%A Jakub Kurzak
%A Ichitaro Yamazaki
%A Jack Dongarra
%X With NVIDA Tegra Jetson X1 and Pascal P100 GPUs, NVIDIA introduced hardware-based computation on FP16 numbers also called half-precision arithmetic. In this talk, we will introduce the steps required to build a viable benchmark for this new arithmetic format. This will include the connections to established IEEE floating point standards and existing HPC benchmarks. The discussion will focus on performance and numerical stability issues that are important for this kind of benchmarking and how they relate to NVIDIA platforms.
%B 2017 IEEE High Performance Extreme Computing Conference (HPEC)
%I IEEE
%C Waltham, MA
%8 2017-09
%G eng
%R https://doi.org/10.1109/HPEC.2017.8091031

%0 Journal Article
%J Computing in Science & Engineering
%D 2017
%T With Extreme Computing, the Rules Have Changed
%A Jack Dongarra
%A Stanimire Tomov
%A Piotr Luszczek
%A Jakub Kurzak
%A Mark Gates
%A Ichitaro Yamazaki
%A Hartwig Anzt
%A Azzam Haidar
%A Ahmad Abdelfattah
%X On the eve of exascale computing, traditional wisdom no longer applies. High-performance computing is gone as we know it. This article discusses a range of new algorithmic techniques emerging in the context of exascale computing, many of which defy the common wisdom of high-performance computing and are considered unorthodox, but could turn out to be a necessity in near future.
%B Computing in Science & Engineering
%V 19
%P 52-62
%8 2017-05
%G eng
%N 3
%R https://doi.org/10.1109/MCSE.2017.48

%0 Journal Article
%J Acta Numerica
%D 2016
%T Linear Algebra Software for Large-Scale Accelerated Multicore Computing
%A Ahmad Abdelfattah
%A Hartwig Anzt
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A undefined
%A Asim YarKhan
%X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.
%B Acta Numerica
%V 25
%P 1-160
%8 2016-05
%G eng
%R 10.1017/S0962492916000015

%0 Journal Article
%J International Journal of Parallel Programming
%D 2016
%T Porting the PLASMA Numerical Library to the OpenMP Standard
%A Asim YarKhan
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X PLASMA is a numerical library intended as a successor to LAPACK for solving problems in dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for efficient multithreading of algorithms expressed in a serial fashion. QUARK is a superscalar scheduler and implements automatic parallelization by tracking data dependencies and resolving data hazards at runtime. Recently, this type of scheduling has been incorporated in the OpenMP standard, which allows to transition PLASMA from the proprietary solution offered by QUARK to the standard solution offered by OpenMP. This article studies the feasibility of such transition.
%B International Journal of Parallel Programming
%8 2016-06
%G eng
%U http://link.springer.com/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6http://link.springer.com/content/pdf/10.1007/s10766-016-0441-6.pdfhttp://link.springer.com/article/10.1007/s10766-016-0441-6/fulltext.html
%! Int J Parallel Prog
%R 10.1007/s10766-016-0441-6

%0 Conference Paper
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2016
%T Search Space Generation and Pruning System for Autotuners
%A Piotr Luszczek
%A Mark Gates
%A Jakub Kurzak
%A Anthony Danalis
%A Jack Dongarra
%X This work tackles two simultaneous challenges faced by autotuners: the ease of describing a complex, multidimensional search space, and the speed of evaluating that space, while applying a multitude of pruning constraints. This article presents a declarative notation for describing a search space and a translation system for conversion to a standard C code for fast and multithreaded, as necessary, evaluation. The notation is Python-based and thus simple in syntax and easy to assimilate by the user interested in tuning rather than learning a new programming language. A large number of dimensions and a large number of pruning constraints may be expressed with little effort. The system is discussed in the context of autotuning the canonical matrix multiplication kernel for NVIDIA GPUs, where the search space has 15 dimensions and involves application of 10 complex pruning constrains. The speed of evaluation is compared against generators created using imperative programming style in various scripting and compiled languages.
%B 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Chicago, IL
%8 2016-05
%G eng

%0 Conference Paper
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%D 2015
%T Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs
%A Mark Gates
%A Hartwig Anzt
%A Jakub Kurzak
%A Jack Dongarra
%X In this paper we accelerate the Alternating Least Squares (ALS) algorithm used for generating product recommendations on the basis of implicit feedback datasets. We approach the algorithm with concepts proven to be successful in High Performance Computing. This includes the formulation of the algorithm as a mix of cache-optimized algorithm-specific kernels and standard BLAS routines, acceleration via graphics processing units (GPUs), use of parallel batched kernels, and autotuning to identify performance winners. For benchmark datasets, the multi-threaded CPU implementation we propose achieves more than a 10 times speedup over the implementations available in the GraphLab and Spark MLlib software packages. For the GPU implementation, the parameters of an algorithm-specific kernel were optimized using a comprehensive autotuning sweep. This results in an additional 2 times speedup over our CPU implementation.
%B 2015 IEEE International Conference on Big Data (IEEE BigData 2015)
%I IEEE
%C Santa Clara, CA
%8 2015-11
%G eng

%0 Conference Paper
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%D 2015
%T A Data Flow Divide and Conquer Algorithm for Multicore Architecture
%A Azzam Haidar
%A Jakub Kurzak
%A Gregoire Pichon
%A Mathieu Faverge
%K Eigensolver
%K lapack
%K Multicore
%K plasma
%K task-based programming
%X Computing eigenpairs of a symmetric matrix is a problem arising in many industrial applications, including quantum physics and finite-elements computation for automobiles. A classical approach is to reduce the matrix to tridiagonal form before computing eigenpairs of the tridiagonal matrix. Then, a back-transformation allows one to obtain the final solution. Parallelism issues of the reduction stage have already been tackled in different shared-memory libraries. In this article, we focus on solving the tridiagonal eigenproblem, and we describe a novel implementation of the Divide and Conquer algorithm. The algorithm is expressed as a sequential task-flow, scheduled in an out-of-order fashion by a dynamic runtime which allows the programmer to play with tasks granularity. The resulting implementation is between two and five times faster than the equivalent routine from the Intel MKL library, and outperforms the best MRRR implementation for many matrices.
%B 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS)
%I IEEE
%C Hyderabad, India
%8 2015-05
%G eng

%0 Journal Article
%J Concurrency in Computation: Practice and Experience
%D 2015
%T Experiences in autotuning matrix multiplication for energy minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B Concurrency in Computation: Practice and Experience
%V 27
%P 5096-5113
%8 2015-12
%G eng
%N 17
%R 10.1002/cpe.3516

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs
%A Hartwig Anzt
%A Blake Haugen
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K Autotuning
%K energy efficiency
%K hardware accelerators
%K matrix multiplication
%K power
%X In this paper, we report extensive results and analysis of autotuning the computationally intensive graphics processing units kernel for dense matrix–matrix multiplication in double precision. In contrast to traditional autotuning and/or optimization for runtime performance only, we also take the energy efficiency into account. For kernels achieving equal performance, we show significant differences in their energy balance. We also identify the memory throughput as the most influential metric that trades off performance and energy efficiency. As a result, the performance optimal case ends up not being the most efficient kernel in overall resource use.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 5096 - 5113
%8 12-Oct
%G eng
%U http://doi.wiley.com/10.1002/cpe.3516https://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2Fcpe.3516
%N 17
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3516

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2015
%T Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs
%A Jakub Kurzak
%A Hartwig Anzt
%A Mark Gates
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%8 2015-11
%G eng

%0 Conference Paper
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%D 2015
%T Mixed-precision Block Gram Schmidt Orthogonalization
%A Ichitaro Yamazaki
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%A Jesse Barlow
%X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly to the condition number of the input matrix. However, when the desired higher-precision is not supported by the hardware, the software-emulated arithmetics are needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant  impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetics, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of the numerical errors.
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B 2015 SIAM Conference on Applied Linear Algebra
%D 2015
%T Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs
%A Ichitaro Yamazaki
%A Jesse Barlow
%A Stanimire Tomov
%A Jakub Kurzak
%A Jack Dongarra
%X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches.
%B 2015 SIAM Conference on Applied Linear Algebra
%I SIAM
%C Atlanta, GA
%8 2015-10
%G eng

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2015
%T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems
%A Maksims Abalenkovs
%A Ahmad Abdelfattah
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%A Asim YarKhan
%K dense linear algebra
%K gpu
%K HPC
%K Multicore
%K plasma
%K Programming models
%K runtime
%X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.
%B Supercomputing Frontiers and Innovations
%V 2
%8 2015-10
%G eng
%R 10.14529/jsfi1504

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs
%A Theo Mary
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Conference Paper
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%D 2015
%T Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2015
%T A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%K Gaussian elimination
%K lu factorization
%K Multicore
%K parallel
%K plasma
%K shared memory
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B Concurrency and Computation: Practice and Experience
%V 27
%P 1292-1309
%8 2015-04
%G eng
%N 5
%R 10.1002/cpe.3306

%0 Conference Paper
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%D 2015
%T Visualizing Execution Traces with Task Dependencies
%A Blake Haugen
%A Stephen Richmond
%A Jakub Kurzak
%A Chad A. Steed
%A Jack Dongarra
%X Task-based scheduling has emerged as one method to reduce the complexity of parallel computing. When using task-based schedulers, developers must frame their computation as a series of tasks with various data dependencies. The scheduler can take these tasks, along with their input and output dependencies, and schedule the task in parallel across a node or cluster. While these schedulers simplify the process of parallel software development, they can obfuscate the performance characteristics of the execution of an algorithm. The execution trace has been used for many years to give developers a visual representation of how their computations are performed. These methods can be employed to visualize when and where each of the tasks in a task-based algorithm is scheduled. In addition, the task dependencies can be used to create a directed acyclic graph (DAG) that can also be visualized to demonstrate the dependencies of the various tasks that make up a workload. The work presented here aims to combine these two data sets and extend execution trace visualization to better suit task-based workloads. This paper presents a brief description of task-based schedulers and the performance data they produce. It will then describe an interactive extension to the current trace visualization methods that combines the trace and DAG data sets. This new tool allows users to gain a greater understanding of how their tasks are scheduled. It also provides a simplified way for developers to evaluate and debug the performance of their scheduler.
%B 2nd Workshop on Visual Performance Analysis (VPA '15)
%I ACM
%C Austin, TX
%8 2015-11
%G eng

%0 Book Section
%B Numerical Computations with GPUs
%D 2014
%T Accelerating Numerical Dense Linear Algebra Calculations with GPUs
%A Jack Dongarra
%A Mark Gates
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Ichitaro Yamazaki
%B Numerical Computations with GPUs
%I Springer International Publishing
%P 3-28
%@ 978-3-319-06547-2
%G eng
%& 1
%R 10.1007/978-3-319-06548-9_1

%0 Conference Paper
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%D 2014
%T Access-averse Framework for Computing Low-rank Matrix Approximations
%A Ichitaro Yamazaki
%A Theo Mary
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
%C Washington, DC
%8 2014-10
%G eng

%0 Conference Paper
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%D 2014
%T Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime
%A Ichitaro Yamazaki
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K dataflow
%K message-passing
%K multithreading
%K QR decomposition
%K runtime
%K systolic array
%X A systolic array provides an alternative computing paradigm to the von Neuman architecture. Though its hardware implementation has failed as a paradigm to design integrated circuits in the past, we are now discovering that the systolic array as a software virtualization layer can lead to an extremely scalable execution paradigm. To demonstrate this scalability, in this paper, we design and implement a 3D virtual systolic array to compute a tile QR decomposition of a tall-and-skinny dense matrix. Our implementation is based on a state-of-the-art algorithm that factorizes a panel based on a tree-reduction. Using a runtime developed as a part of the Parallel Ultra Light Systolic Array Runtime (PULSAR) project, we demonstrate on a Cray-XT5 machine how our virtual systolic array can be mapped to a large-scale machine and obtain excellent parallel performance. This is an important contribution since such a QR decomposition is used, for example, to compute a least squares solution of an overdetermined system, which arises in many scientific and engineering problems.
%B Workshop on Large-Scale Parallel Processing, IPDPS 2014
%I IEEE
%C Phoenix, AZ
%8 2014-05
%G eng

%0 Journal Article
%J Journal of Parallel and Distributed Computing
%D 2014
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K decompositional approach
%K dense linear algebra
%K parallel algorithms
%X Over the years, computational physics and chemistry served as an ongoing source of problems that demanded the ever increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide then the future theme will feature changes as well–changes aimed at harnessing the incredible advances of the evolving hardware infrastructure.
%B Journal of Parallel and Distributed Computing
%V 74
%P 2548–2560
%8 2014-07
%G eng
%N 7
%& 2548
%R 10.1016/j.jpdc.2013.10.005

%0 Journal Article
%J Supercomputing Frontiers and Innovations
%D 2014
%T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems
%A Jack Dongarra
%A Azzam Haidar
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Asim YarKhan
%K dense linear algebra
%K hardware accelerators
%K task superscalar scheduling
%X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead have become the most viable way forward towards Exascale.  In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design.  Commonly in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs).  Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns.  This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads.  In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems.  Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution.  This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment have been completed.  Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles.  We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware.
%B Supercomputing Frontiers and Innovations
%V 1
%G eng
%N 1
%R http://dx.doi.org/10.14529/jsfi1401

%0 Generic
%D 2014
%T PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X PULSAR version 2.0, released in November 2014, is a complete programming platform for large-scale distributed memory systems with multicore processors and hardware accelerators. PULSAR provides a simple abstraction layer over multithreading, message passing, and multi-GPU, multi-stream programming. PULSAR offers a general-purpose programming model, suitable for a wide range of scientific and engineering applications. PULSAR was inspired by systolic arrays, popularized by Hsiang-Tsung Kung and Charles E. Leiserson.
%B University of Tennessee EECS Technical Report
%I University of Tennessee
%8 2014-11
%G eng

%0 Conference Paper
%B VISSOFT'14: 2nd IEEE Working Conference on Software Visualization
%D 2014
%T Search Space Pruning Constraints Visualization
%A Blake Haugen
%A Jakub Kurzak
%X The field of software optimization, among others, is interested in finding an optimal solution in a large search space. These search spaces are often large, complex, non-linear and even non-continuous at times. The size of the search space makes a brute force solution intractable. As a result, one or more search space pruning constraints are often used to reduce the number of candidate configurations that must be evaluated in order to solve the optimization problem.    If more than one pruning constraint is employed, it can be challenging to understand how the pruning constraints interact and overlap. This work presents a visualization technique based on a radial, space-filling technique that allows the user to gain a better understanding of how the pruning constraints remove candidates from the search space. The technique is then demonstrated using a search space pruning data set derived from the optimization of a matrix multiplication code for NVIDIA CUDA accelerators.
%B VISSOFT'14: 2nd IEEE Working Conference on Software Visualization
%I IEEE
%C Victoria, BC, Canada
%8 2014-09
%G eng

%0 Generic
%D 2013
%T Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC
%A Guillaume Aupy
%A Mathieu Faverge
%A Yves Robert
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%X This article introduces a new systolic algorithm for QR factorization, and its implementation on a supercomputing cluster of multicore nodes. The algorithm targets a virtual 3D-array and requires only local communications. The implementation of the algorithm uses threads at the node level, and MPI for inter-node communications. The complexity of the implementation is addressed with the PaRSEC software, which takes as input a parametrized dependence graph, which is derived from the algorithm, and only requires the user to decide, at the high-level, the allocation of tasks to nodes. We show that the new algorithm exhibits competitive performance with state-of-the-art QR routines on a supercomputer called Kraken, which shows that high-level programming environments, such as PaRSEC, provide a viable alternative to enhance the production of quality software on complex and hierarchical architectures
%B Lawn 277
%8 2013-05
%G eng

%0 Conference Paper
%B Supercomputing 2013
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%B Supercomputing 2013
%C Denver, CO
%8 2013-11
%G eng

%0 Generic
%D 2013
%T An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
%A Azzam Haidar
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%K lapack
%K plasma
%K scalapack
%B University of Tennessee Computer Science Technical Report (also LAWN 283)
%I University of Tennessee
%8 2013-10
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Computing
%D 2013
%T LU Factorization with Partial Pivoting for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Jack Dongarra
%K accelerator
%K Gaussian elimination
%K gpu
%K lu factorization
%K manycore
%K Multicore
%K partial pivoting
%K plasma
%X LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs, compared to the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the complexity of the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
%B IEEE Transactions on Parallel and Distributed Computing
%V 24
%P 1613-1621
%8 2013-08
%G eng
%N 8
%& 1613
%R http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242

%0 Journal Article
%J Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%D 2013
%T Multithreading in the PLASMA Library
%A Jakub Kurzak
%A Piotr Luszczek
%A Asim YarKhan
%A Mathieu Faverge
%A Julien Langou
%A Henricus Bouwmeester
%A Jack Dongarra
%E Mohamed Ahmed
%E Reda Ammar
%E Sanguthevar Rajasekaran
%K plasma
%B Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications
%I Taylor & Francis
%8 2013-00
%G eng

%0 Book Section
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%D 2013
%T Scalable Dense Linear Algebra on Heterogeneous Hardware
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Thomas Herault
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%X Abstract. Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s, forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. And it is reasonable to expect, that this situation will not change in the foreseeable future. This chapter presents a methodology of dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand, when the well-known linear algebra algorithms are recast in terms of Direct Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA), takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Direct Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing
%G eng

%0 Conference Paper
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%D 2013
%T Virtual Systolic Array for QR Decomposition
%A Jakub Kurzak
%A Piotr Luszczek
%A Mark Gates
%A Ichitaro Yamazaki
%A Jack Dongarra
%K dataflow programming
%K message passing
%K multi-core
%K QR decomposition
%K roofline model
%K systolic array
%X Systolic arrays offer a very attractive, data-centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for the QR decomposition is developed and a virtualization layer is used for mapping of the algorithm to a large distributed memory system. Strong scaling properties are discovered, superior to existing solutions.
%B 15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)
%I IEEE
%C Boston, MA
%8 2013-05
%G eng
%R 10.1109/IPDPS.2013.119

%0 Generic
%D 2012
%T On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties
%A Simplice Donfack
%A Jack Dongarra
%A Mathieu Faverge
%A Mark Gates
%A Jakub Kurzak
%A Piotr Luszczek
%A Ichitaro Yamazaki
%X Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of the Gaussian elimination. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multi-socket multicore systems are presented. Performance and numerical accuracy is analyzed.
%B University of Tennessee Computer Science Technical Report
%8 2013-07
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2012
%T Autotuning GEMM Kernels for the Fermi GPU
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%X Abstract—In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.
%B IEEE Transactions on Parallel and Distributed Systems
%V 23
%8 2012-11
%G eng
%R https://doi.org/10.1109/TPDS.2011.311

%0 Journal Article
%J High Performance Scientific Computing: Algorithms and Applications
%D 2012
%T Dense Linear Algebra on Accelerated Multicore Hardware
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%E Michael Berry
%E et al.,
%B High Performance Scientific Computing: Algorithms and Applications
%I Springer-Verlag
%C London, UK
%8 2012-00
%G eng

%0 Journal Article
%J Applied Parallel and Scientific Computing
%D 2012
%T An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs
%A Jakub Kurzak
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%E Kristján Jónasson
%B Applied Parallel and Scientific Computing
%V 7133
%P 248-257
%8 2012-00
%G eng

%0 Journal Article
%J Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%D 2012
%T Looking Back at Dense Linear Algebra Software
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E Viktor K. Prasanna
%E Yves Robert
%E Per Stenström
%B Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear)
%8 2012-00
%G eng

%0 Journal Article
%J LAWN 267
%D 2012
%T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%A Jack Dongarra
%B LAWN 267
%8 2012-00
%G eng

%0 Conference Proceedings
%B Proceedings of VECPAR’12
%D 2012
%T Programming the LU Factorization for a Multicore System with Accelerators
%A Jakub Kurzak
%A Piotr Luszczek
%A Mathieu Faverge
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of VECPAR’12
%C Kobe, Japan
%8 2012-04
%G eng

%0 Generic
%D 2011
%T Autotuning GEMMs for Fermi
%A Jakub Kurzak
%A Stanimire Tomov
%A Jack Dongarra
%K magma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245)
%8 2011-04
%G eng

%0 Journal Article
%J in Solving the Schrodinger Equation: Has everything been tried? (to appear)
%D 2011
%T Changes in Dense Linear Algebra Kernels - Decades Long Perspective
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%E P. Popular
%B in Solving the Schrodinger Equation: Has everything been tried? (to appear)
%I Imperial College Press
%8 2011-00
%G eng

%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 2011-05
%G eng

%0 Generic
%D 2011
%T QUARK Users' Guide: QUeueing And Runtime for Kernels
%A Asim YarKhan
%A Jakub Kurzak
%A Jack Dongarra
%K magma
%K plasma
%K quark
%B University of Tennessee Innovative Computing Laboratory Technical Report
%8 2011-00
%G eng

%0 Journal Article
%J Parallel Computing (to appear)
%D 2010
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Parallel Computing (to appear)
%8 2010-00
%G eng

%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng

%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemariner
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2010
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%P 417-423
%8 2010-04
%G eng

%0 Journal Article
%J Scientific Programming
%D 2010
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%B Scientific Programming
%V 17
%P 31-42
%8 2010-00
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2010
%T Scheduling Dense Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K gridpac
%K plasma
%B Concurrency and Computation: Practice and Experience
%V 22
%P 15-44
%8 2010-01
%G eng

%0 Journal Article
%J Journal of Scientific Computing
%D 2010
%T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Journal of Scientific Computing
%V 18
%P 33-50
%8 2010-00
%G eng

%0 Journal Article
%J Computer Physics Communications
%D 2009
%T Accelerating Scientific Computations with Mixed Precision Algorithms
%A Marc Baboulin
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julie Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%X On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
%B Computer Physics Communications
%V 180
%P 2526-2533
%8 2009-12
%G eng
%N 12
%R https://doi.org/10.1016/j.cpc.2008.11.005

%0 Journal Article
%J Parallel Computing
%D 2009
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B Parallel Computing
%V 35
%P 38-53
%8 2009-00
%G eng

%0 Journal Article
%J PPAM 2009
%D 2009
%T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B PPAM 2009
%C Poland
%8 2009-09
%G eng

%0 Generic
%D 2009
%T Fully Dynamic Scheduler for Numerical Computing on Multicore Processors
%A Jakub Kurzak
%A Jack Dongarra
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-643 (Also LAPACK Working Note 220)
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Rajib Nath
%A Stanimire Tomov
%A Asim YarKhan
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, OR
%8 2009-11
%G eng

%0 Conference Proceedings
%B Journal of Physics: Conference Series
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K plasma
%B Journal of Physics: Conference Series
%V 180
%8 2009-00
%G eng

%0 Journal Article
%J Parallel Computing
%D 2009
%T Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor
%A Wesley Alvaro
%A Jakub Kurzak
%A Jack Dongarra
%B Parallel Computing
%V 35
%P 138-150
%8 2009-00
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (to appear)
%D 2009
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems (to appear)
%8 2009-05
%G eng

%0 Journal Article
%J in Cyberinfrastructure Technologies and Applications
%D 2009
%T Parallel Dense Linear Algebra Software in the Multicore Era
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%E Junwei Cao
%K plasma
%B in Cyberinfrastructure Technologies and Applications
%I Nova Science Publishers, Inc.
%P 9-24
%8 2009-00
%G eng

%0 Journal Article
%J Scientific Programming (to appear)
%D 2009
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B Scientific Programming (to appear)
%8 2009-00
%G eng

%0 Journal Article
%J Concurrency Practice and Experience (to appear)
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Concurrency Practice and Experience (to appear)
%8 2009-00
%G eng

%0 Generic
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213)
%8 2009-00
%G eng

%0 Journal Article
%J in High Performance Computing and Grids in Action
%D 2008
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B in High Performance Computing and Grids in Action
%I IOS Press
%C Amsterdam
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor
%A Wesley Alvaro
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%8 2008-01
%G eng

%0 Generic
%D 2008
%T Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208)
%8 2008-08
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2008
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 20
%P 1573-1590
%8 2008-01
%G eng

%0 Journal Article
%J Computing in Science and Engineering
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B Computing in Science and Engineering
%P 80-83
%8 2008-01
%G eng

%0 Generic
%D 2008
%T The PlayStation 3 for High Performance Scientific Computing
%A Jakub Kurzak
%A Alfredo Buttari
%A Piotr Luszczek
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report
%8 2008-01
%G eng

%0 Generic
%D 2008
%T QR Factorization for the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-616 (also LAPACK Working Note 201)
%8 2008-05
%G eng

%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2008
%T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%V 19
%P 1-11
%8 2008-01
%G eng

%0 Journal Article
%J ACM Transactions on Mathematical Software
%D 2008
%T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B ACM Transactions on Mathematical Software
%V 34
%P 17-22
%8 2008-00
%G eng

%0 Generic
%D 2007
%T A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%8 2007-01
%G eng

%0 Journal Article
%J In High Performance Computing and Grids in Action (to appear)
%D 2007
%T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Stanimire Tomov
%E Lucio Grandinetti
%B In High Performance Computing and Grids in Action (to appear)
%I IOS Press
%C Amsterdam
%8 2007-00
%G eng

%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2007
%T Implementation of Mixed Precision in Solving Systems of Linear Equations on the Cell Processor
%A Jakub Kurzak
%A Jack Dongarra
%B Concurrency and Computation: Practice and Experience
%V 19
%P 1371-1385
%8 2007-07
%G eng

%0 Generic
%D 2007
%T Limitations of the Playstation 3 for High Performance Cluster Computing
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%B University of Tennessee Computer Science Technical Report, UT-CS-07-597 (Also LAPACK Working Note 185)
%8 2007-00
%G eng

%0 Journal Article
%J International Journal of High Performance Computer Applications (to appear)
%D 2007
%T Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
%A Alfredo Buttari
%A Jack Dongarra
%A Julien Langou
%A Julie Langou
%A Piotr Luszczek
%A Jakub Kurzak
%B International Journal of High Performance Computer Applications (to appear)
%8 2007-08
%G eng

%0 Conference Proceedings
%B Journal of Physics: Conference Series, SciDAC 2007
%D 2007
%T Multithreading for synchronization tolerance in matrix factorization
%A Alfredo Buttari
%A Jack Dongarra
%A Parry Husbands
%A Jakub Kurzak
%A Katherine Yelick
%B Journal of Physics: Conference Series, SciDAC 2007
%V 78
%8 2007-01
%G eng

%0 Generic
%D 2007
%T Parallel Tiled QR Factorization for Multicore Architectures
%A Alfredo Buttari
%A Julien Langou
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-598 (also LAPACK Working Note 190)
%8 2007-00
%G eng

%0 Generic
%D 2007
%T SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3
%A Alfredo Buttari
%A Piotr Luszczek
%A Jakub Kurzak
%A Jack Dongarra
%A George Bosilca
%K multi-core
%B University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595
%8 2007-00
%G eng

%0 Generic
%D 2007
%T Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%K lapack
%B UT Computer Science Technical Report (Also LAPACK Working Note 184)
%8 2007-01
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report
%D 2006
%T Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy
%A Julien Langou
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Jack Dongarra
%K iter-ref
%B University of Tennessee Computer Science Tech Report
%8 2006-04
%G eng

%0 Journal Article
%J PARA 2006
%D 2006
%T The Impact of Multicore on Math Software
%A Alfredo Buttari
%A Jack Dongarra
%A Jakub Kurzak
%A Julien Langou
%A Piotr Luszczek
%A Stanimire Tomov
%K plasma
%B PARA 2006
%C Umea, Sweden
%8 2006-06
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report
%D 2006
%T Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor
%A Jakub Kurzak
%A Jack Dongarra
%K iter-ref
%B University of Tennessee Computer Science Tech Report
%8 2006-09
%G eng

%0 Journal Article
%J University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178
%D 2006
%T Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead
%A Jakub Kurzak
%A Jack Dongarra
%B University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178
%8 2006-01
%G eng

%0 Journal Article
%J PARA 2006
%D 2006
%T Prospectus for the Next LAPACK and ScaLAPACK Libraries
%A James Demmel
%A Jack Dongarra
%A B. Parlett
%A William Kahan
%A Ming Gu
%A David Bindel
%A Yozo Hida
%A Xiaoye Li
%A Osni Marques
%A Jason E. Riedy
%A Christof Voemel
%A Julien Langou
%A Piotr Luszczek
%A Jakub Kurzak
%A Alfredo Buttari
%A Julien Langou
%A Stanimire Tomov
%B PARA 2006
%C Umea, Sweden
%8 2006-06
%G eng