Many scientific applications, ranging from national security to medical advances, require solving a number of relatively small-size independent problems. As the size of each individual problem does not provide sufficient parallelism for the underlying hardware, especially accelerators, these problems must be solved concurrently as a batch in order to saturate the hardware with enough work, hence the name batched computation. A possible simplification is to assume a uniform size for all problems. However, real applications do not necessarily satisfy such an assumption. Consequently, an efficient solution for variable-size batched computations is required.

This paper proposes a foundation for high performance variable-size batched matrix computation based on Graphics Processing Units (GPUs). Being throughput-oriented processors, GPUs favor regular computation and less divergence among threads, in order to achieve high performance. Therefore, the development of high performance numerical software for this kind of problem is challenging. As a case study, we developed efficient batched Cholesky factorization algorithms for relatively small matrices of different sizes. However, most of the strategies and the software developed, and in particular a set of variable size batched BLAS kernels, can be used in many other dense matrix factorizations, large scale sparse direct multifrontal solvers, and applications. We propose new interfaces and mechanisms to handle the irregular computation pattern on the GPU. To the authors’ knowledge, this is the first attempt to develop high performance software for this class of problems. Using a K40c GPU, our performance tests show speedups of up to 2.5× against two Sandy Bridge CPUs (8-core each) running the Intel MKL library.

%B The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Conference Paper %B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %D 2016 %T Heterogeneous Streaming %A Chris J. Newburn %A Gaurav Bansal %A Michael Wood %A Luis Crivelli %A Judit Planas %A Alejandro Duran %A Paulo Souza %A Leonardo Borges %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %A Hartwig Anzt %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Ichitaro Yamazaki %A Jesus Labarta %K plasma %X This paper introduces a new heterogeneous streaming library called hetero Streams (hStreams). We show how a simple FIFO streaming model can be applied to heterogeneous systems that include manycore coprocessors and multicore CPUs. This model supports concurrency across nodes, among tasks within a node, and between data transfers and computation. We give examples for different approaches, show how the implementation can be layered, analyze overheads among layers, and apply those models to parallelize applications using simple, intuitive interfaces. We compare the features and versatility of hStreams, OpenMP, CUDA Streams and OmpSs. We show how the use of hStreams makes it easier for scientists to identify tasks and easily expose concurrency among them, and how it enables tuning experts and runtime systems to tailor execution for different heterogeneous targets. Practical application examples are taken from the field of numerical linear algebra, commercial structural simulation software, and a seismic processing application. 
%B The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016 %I IEEE %C Chicago, IL %8 2016-05 %G eng %0 Generic %D 2016 %T High Performance Realtime Convex Solver for Embedded Systems %A Ichitaro Yamazaki %A Saeid Nooshabadi %A Stanimire Tomov %A Jack Dongarra %K KKT %K Realtime embedded convex optimization solver %X Convex optimization solvers for embedded systems find widespread use. This letter presents a novel technique that reduces the run-time of the KKT matrix decomposition in a convex optimization solver for an embedded system by two orders of magnitude. We use the property that although the KKT matrix changes, some of its block sub-matrices are fixed during the solution iterations and the associated solving instances. %B University of Tennessee Computer Science Technical Report %8 2016-10 %G eng %0 Conference Paper %B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %D 2016 %T High-performance Matrix-matrix Multiplications of Very Small Matrices %A Ian Masliah %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Joël Falcou %A Jack Dongarra %X The use of the general dense matrix-matrix multiplication (GEMM) is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices (of sizes less than 32), however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries. 
%B 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16) %I Springer International Publishing %C Grenoble, France %8 2016-08 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %K Applications %K Batched linear algebra %K FEM %K gpu %K Tensor contractions %K Tensor HPC %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Generic %D 2016 %T High-Performance Tensor Contractions for GPUs %A Ahmad Abdelfattah %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %X We present a computational framework for high-performance tensor contractions on GPUs. High-performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions plus application-specifics, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8× faster than CUBLAS, and 8.5× faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface. 
%B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-01 %G eng %0 Journal Article %J Acta Numerica %D 2016 %T Linear Algebra Software for Large-Scale Accelerated Multicore Computing %A Ahmad Abdelfattah %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A undefined %A Asim YarKhan %X Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems. To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. 
Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems. %B Acta Numerica %V 25 %P 1-160 %8 2016-05 %G eng %R 10.1017/S0962492916000015 %0 Conference Paper %B IEEE High Performance Extreme Computing Conference (HPEC'16) %D 2016 %T LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi %A Azzam Haidar %A Stanimire Tomov %A Konstantin Arturov %A Murat Guney %A Shane Story %A Jack Dongarra %X A wide variety of heterogeneous compute resources, ranging from multicore CPUs to GPUs and coprocessors, are available to modern computers, making it challenging to design unified numerical libraries that efficiently and productively use all these varied resources. For example, in order to efficiently use Intel’s Knights Landing (KNL) processor, the next generation of Xeon Phi architectures, one must design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance. We propose a productive and portable programming model that allows us to write a serial-looking code, which, however, achieves parallelism and scalability by using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and the parallel execution. This is done through multiple techniques ranging from multi-level data partitioning to adaptive task grain sizes, and dynamic task scheduling. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. 
Finally, we outline the strengths and the effectiveness of this approach – especially with regard to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate current work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. %B IEEE High Performance Extreme Computing Conference (HPEC'16) %I IEEE %C Waltham, MA %8 2016-09 %G eng %0 Generic %D 2016 %T MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Ahmad Abdelfattah %A Jack Dongarra %X A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how batched GEMV and GEMM can assist batched advanced factorizations (e.g., bi-diagonalization) and other BLAS routines (e.g., triangular solve) in achieving optimal performance on GPUs. Our solutions achieved up to 2.8-3× speedups compared to CUBLAS and MKL solutions, wherever possible. We illustrated the batched methodology on a real-world Hydrodynamic application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5× speedup and a 1.4× greenup are obtained by changing 10% of the code. We accelerated and scaled it on the Titan supercomputer to 4096 nodes. 
%B Innovative Computing Laboratory Technical Report %I University of Tennessee %8 2016-08 %G eng %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2016 %T Non-GPU-resident Dense Symmetric Indefinite Factorization %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X We study various algorithms to factorize a symmetric indefinite matrix that does not fit in the core memory of a computer. There are two sources of the data movement into the memory: one needed for selecting and applying pivots and the other needed to update each column of the matrix for the factorization. It is a challenge to obtain high performance of such an algorithm when the pivoting is required to ensure the numerical stability of the factorization. For example, when factorizing each column of the matrix, a diagonal entry, which ensures the stability, may need to be selected as a pivot among the remaining diagonals, and moved to the leading diagonal by swapping both the corresponding rows and columns of the matrix. If the pivot is not in the core memory, then it must be loaded into the core memory. For updating the matrix, the data locality may be improved by partitioning the matrix. For example, a right-looking partitioned algorithm first factorizes the leading columns, called panel, and then uses the factorized panel to update the trailing submatrix. This algorithm only accesses the trailing submatrix after each panel factorization (instead of after each column factorization) and performs most of its floating-point operations (flops) using BLAS-3, which can take advantage of the memory hierarchy. However, because the pivots cannot be predetermined, the whole trailing submatrix must be updated before the next panel factorization can start. When the whole submatrix does not fit in the core memory all at once, loading the block columns into the memory can become the performance bottleneck. 
Similarly, the left-looking variant of the algorithm would require updating each panel with all of the previously factorized columns. This makes it a much greater challenge to implement an efficient out-of-core symmetric indefinite factorization compared with an out-of-core nonsymmetric LU factorization with partial pivoting, which only requires swapping the rows of the matrix and accesses the trailing submatrix after each in-core factorization (instead of after each panel factorization by the symmetric factorization). To reduce the amount of the data transfer, in this paper we use the recently proposed left-looking communication-avoiding variant of the symmetric factorization algorithm to factorize the columns in the core memory, and then perform the partitioned right-looking out-of-core trailing submatrix updates. This combination may still require loading the pivots into the core memory, but it only updates the trailing submatrix after each in-core factorization, while the previous algorithm updates it after each panel factorization. Although these in-core and out-of-core algorithms can be applied at any level of the memory hierarchy, we apply our designs to the GPU and CPU memory, respectively. We call this specific implementation of the algorithm a non–GPU-resident implementation. Our performance results on the current hybrid CPU/GPU architecture demonstrate that when the matrix is much larger than the GPU memory, the proposed algorithm can obtain significant speedups over the communication-hiding implementations of the previous algorithms. 
%B Concurrency and Computation: Practice and Experience %8 2016-11 %G eng %R 10.1002/cpe.4012 %0 Conference Paper %B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %D 2016 %T Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations %A Azzam Haidar %A Benjamin Brock %A Stanimire Tomov %A Michael Guidry %A Jay Jay Billings %A Daniel Shyles %A Jack Dongarra %X We demonstrate the systematic implementation of recently-developed fast explicit kinetic integration algorithms that efficiently solve N coupled ordinary differential equations (subject to initial conditions) on modern GPUs. We take representative test cases (Type Ia supernova explosions) and demonstrate two or more orders of magnitude increase in efficiency for solving such systems (of realistic thermonuclear networks coupled to fluid dynamics). This implies that important coupled, multiphysics problems in various scientific and technical disciplines that were intractable, or could be simulated only with highly schematic kinetic networks, are now computationally feasible. As examples of such applications we present the computational techniques developed for our ongoing deployment of these new methods on modern GPU accelerators. We show that, similarly to many other scientific applications, ranging from national security to medical advances, the computation can be split into many independent computational tasks, each of relatively small size. As the size of each individual task does not provide sufficient parallelism for the underlying hardware, especially for accelerators, these tasks must be computed concurrently as a single routine, which we call a batched routine, in order to saturate the hardware with enough work. 
%B 2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) %I IEEE %C Waltham, MA %8 2016-09 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2016 %T On the performance and energy efficiency of sparse linear algebra on GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some performance and energy efficiency frontiers for sparse computations on GPU-based supercomputers. We compare the resource efficiency of different sparse matrix–vector products (SpMV) taken from libraries such as cuSPARSE and MAGMA for GPU and Intel’s MKL for multicore CPUs, and develop a GPU sparse matrix–matrix product (SpMM) implementation that handles the simultaneous multiplication of a sparse matrix with a set of vectors in block-wise fashion. While a typical sparse computation such as the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM succeeds in exceeding the memory-bound limitations of the SpMV. We integrate this kernel into a GPU-accelerated Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) eigensolver. LOBPCG is chosen as a benchmark algorithm for this study as it combines an interesting mix of sparse and dense linear algebra operations that is typical for complex simulation applications, and allows for hardware-aware optimizations. In a detailed analysis we compare the performance and energy efficiency against a multi-threaded CPU counterpart. The reported performance and energy efficiency results are indicative of sparse computations on supercomputers. 
%B International Journal of High Performance Computing Applications %8 2016-10 %G eng %U http://hpc.sagepub.com/content/early/2016/10/05/1094342016672081.abstract %R 10.1177/1094342016672081 %0 Book Section %B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %E Julian M. Kunkel %E Pavan Balaji %E Jack Dongarra %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. 
%B High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings %I Springer International Publishing %P 21–38 %@ 978-3-319-41321-1 %G eng %U http://dx.doi.org/10.1007/978-3-319-41321-1_2 %R 10.1007/978-3-319-41321-1_2 %0 Conference Paper %B The International Supercomputing Conference (ISC High Performance 2016) %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B The International Supercomputing Conference (ISC High Performance 2016) %C Frankfurt, Germany %8 2016-06 %G eng %0 Generic %D 2016 %T Performance, Design, and Autotuning of Batched GEMM for GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Autotuning %K Batched GEMM %K GEMM %K GPU computing %K HPC %X 
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra. It is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for a batch of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2016-02 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS'16) %D 2016 %T Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs %A Ahmad Abdelfattah %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K batched computation %K Cholesky Factorization %K GPUs %K Tuning %X Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architecture require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. 
In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually.

This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.

%B International Conference on Computational Science (ICCS'16) %C San Diego, CA %8 2016-06 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (TOMS) %D 2016 %T Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X To orthonormalize a set of dense vectors, Singular Value QR (SVQR) requires only one global reduction between the parallel processing units, and uses BLAS-3 kernels to perform most of its local computation. As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers. In this paper, we study the stability and performance of various SVQR implementations on multicore CPUs with a GPU, focusing on the dense triangular solve, which performs half of the total floating-point operations in SVQR. As a part of this study, we examine its adaptive mixed-precision variant that decides if a lower-precision arithmetic can be used for the triangular solution at runtime without increasing the order of its orthogonality error. Since the backward error of this adaptive mixed-precision variant is significantly greater than that of the standard SVQR, we study its effects on the solution convergence of several subspace projection methods for solving a linear system of equations and for computing singular values or eigenvalues of a sparse matrix. Our experimental results indicate that in some cases, the convergence rate of the solver may not be affected by the larger backward errors, while reducing the time to solution. %B ACM Transactions on Mathematical Software (TOMS) %V 43 %8 2016-10 %G eng %N 2 %0 Generic %D 2016 %T A Standard for Batched BLAS Routines %A Pedro Valero-Lara %A Jack Dongarra %A Azzam Haidar %A Samuel D. 
Relton %A Stanimire Tomov %A Mawussi Zounon %I 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16) %C Paris, France %8 2016-04 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD) %D 2016 %T Towards Achieving Performance Portability Using Directives for Accelerators %A M. Graham Lopez %A Larrea, V %A Joubert, W %A Hernandez, O %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X In this paper we explore the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, how much tuning might be required and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both X86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, as well as an X86_64 host system with Intel Xeon Phi coprocessors. In this paper, we explain what factors affected the performance portability such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize and target multiple platforms. 
%B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD) %I Innovative Computing Laboratory, University of Tennessee %C Salt Lake City, Utah %8 2016-11 %G eng %0 Conference Paper %B Spring Simulation Multi-Conference 2015 (SpringSim'15) %D 2015 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU implementation for a sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG). For the key routine generating the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously accelerates the computation of a set of consecutive SpMVs significantly. Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library we identify appealing performance improvements. We integrate it into the highly optimized LOBPCG implementation. Compared to the BLOBEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude. 
%B Spring Simulation Multi-Conference 2015 (SpringSim'15) %I SCS %C Alexandria, VA %8 2015-04 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Acceleration of GPU-based Krylov solvers via Data Transfer Reduction %A Hartwig Anzt %A William Sawyer %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B International Journal of High Performance Computing Applications %G eng %0 Conference Paper %B EuroMPI/Asia 2015 Workshop %D 2015 %T Batched Matrix Computations on Hardware Accelerators %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. 
Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU. %B EuroMPI/Asia 2015 Workshop %C Bordeaux, France %8 2015-09 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Batched Matrix Computations on Hardware Accelerators Based on GPUs %A Azzam Haidar %A Ahmad Abdelfattah %A Stanimire Tomov %A Jack Dongarra %X We will present techniques for small matrix computations on GPUs and their use for energy efficient, high-performance solvers. Work on small problems delivers high performance through improved data reuse. Many numerical libraries and applications need this functionality further developed. We describe the main factorizations LU, QR, and Cholesky for a set of small dense matrices in parallel. We achieve significant acceleration and reduced energy consumption against other solutions. Our techniques are of interest to GPU application developers in general. 
%B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J International Journal of High Performance Computing Applications %D 2015 %T Batched matrix computations on hardware accelerators based on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problem sizes of interest of the application use cases. 
The paradigm whereby a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications’ context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with a batched LU factorization featured in the CUBLAS library for GPUs, we achieve as high as 2.5× speedup on the NVIDIA K40 GPU. %B International Journal of High Performance Computing Applications %8 2015-02 %G eng %R 10.1177/1094342014567546 %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %D 2015 %T Cholesky Across Accelerators %A Asim YarKhan %A Azzam Haidar %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B 17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015) %I IEEE %C Elizabeth, NJ %8 2015-08 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra %D 2015 %T Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra %A Mark Gates %A Stanimire Tomov %A Azzam Haidar %X Accelerating dense linear algebra using GPUs admits two models: hybrid CPU-GPU and GPU-only. The hybrid model factors the panel on the CPU while updating the trailing matrix on the GPU, concentrating the GPU on high-performance matrix multiplies. The GPU-only model performs the entire computation on the GPU, avoiding costly data transfers to the CPU. We compare these two approaches for three QR-based algorithms: QR factorization, rank revealing QR, and reduction to Hessenberg. 
%B 2015 SIAM Conference on Applied Linear Algebra %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Journal Article %J Scientific Programming %D 2015 %T Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X Low-rank matrices arise in many scientific and engineering computations. Both computational and storage costs of manipulating such matrices may be reduced by taking advantage of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into a recently-developed software StruMF which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%-50% using the GPU. 
%B Scientific Programming %G eng %0 Conference Paper %B ISC High Performance 2015 %D 2015 %T On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X The dramatic change in computer architecture due to the manycore paradigm shift has made the development of optimal numerical routines extremely challenging. In this work, we target the development of numerical algorithms and implementations for Xeon Phi coprocessor architecture designs. In particular, we examine and optimize the general and symmetric matrix-vector multiplication routines (gemv/symv), which are some of the most heavily used linear algebra kernels in many important engineering and physics applications. We describe a successful approach on how to address the challenges for this problem, starting from our algorithm design, performance analysis and programming model, to kernel optimization. Our goal, by targeting low-level, easy to understand fundamental kernels, is to develop new optimization strategies that can be effective elsewhere for the use on manycore coprocessors, and to show significant performance improvements compared to the existing state-of-the-art implementations. Therefore, in addition to the new optimization strategies, analysis, and optimal performance results, we finally present the significance of using these routines/strategies to accelerate higher-level numerical algorithms for the eigenvalue problem (EVP) and the singular value decomposition (SVD) that by themselves are foundational for many important applications. 
%B ISC High Performance 2015 %C Frankfurt, Germany %8 2015-07 %G eng %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %D 2015 %T Efficient Eigensolver Algorithms on Accelerator Based Architectures %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges -starting from our algorithm design, kernel optimization and tuning, to our programming model- in the development of a scalable high-performance symmetric eigenvalue and singular value solver. %B 2015 SIAM Conference on Applied Linear Algebra (SIAM LA) %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems %A Raffaele Solcà %A Anton Kozhevnikov %A Azzam Haidar %A Stanimire Tomov %A Thomas C. Schulthess %A Jack Dongarra %X We present a scalable implementation of the Linearized Augmented Plane Wave method for distributed memory systems, which relies on an efficient distributed, block-cyclic setup of the Hamiltonian and overlap matrices and allows us to turn around highly accurate 1000+ atom all-electron quantum materials simulations on clusters with a few hundred nodes. The implementation runs efficiently on standard multicore CPU nodes, as well as hybrid CPU-GPU nodes. The key for the latter is a novel algorithm to solve the generalized eigenvalue problem for dense, complex Hermitian matrices on distributed hybrid CPU-GPU systems. 
Performance tests for Li-intercalated CoO2 supercells containing 1501 atoms demonstrate that high-accuracy, transferable quantum simulations can now be used in throughput materials search problems. While our application can benefit and get scalable performance through CPU-only libraries like ScaLAPACK or ELPA2, our new hybrid solver enables the efficient use of GPUs and shows that a hybrid CPU-GPU architecture scales to a desired performance using substantially fewer cluster nodes, and notably, is considerably more energy efficient than the traditional multicore CPU only systems for such complex applications. %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Conference Paper %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %D 2015 %T Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X In this paper we unveil some energy efficiency and performance frontiers for sparse computations on GPU-based supercomputers. To do this, we consider state-of-the-art implementations of the sparse matrix-vector (SpMV) product in libraries like cuSPARSE, MKL, and MAGMA, and their use in the LOBPCG eigen-solver. LOBPCG is chosen as a benchmark for this study as it combines an interesting mix of sparse and dense linear algebra operations with potential for hardware-aware optimizations. Most notably, LOBPCG includes a blocking technique that is a common performance optimization for many applications. In particular, multiple memory-bound SpMV operations are blocked into a SpM-matrix product (SpMM), that achieves significantly higher performance than a sequence of SpMVs. 
We provide details about the GPU kernels we use for the SpMV, SpMM, and the LOBPCG implementation design, and study performance and energy consumption compared to CPU solutions. While a typical sparse computation like the SpMV reaches only a fraction of the peak of current GPUs, we show that the SpMM achieves up to a 6x performance improvement over the GPU's SpMV, and the GPU-accelerated LOBPCG based on this kernel is 3 to 5x faster than multicore CPUs with the same power draw, e.g., a K40 GPU vs. two Sandy Bridge CPUs (16 cores). In practice though, we show that currently available CPU implementations are much slower due to missed optimization opportunities. These performance results translate to similar improvements in energy consumption, and are indicative of today's frontiers in energy efficiency and performance for sparse computations on supercomputers. %B Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15) %I ACM %C San Francisco, CA %8 2015-02 %@ 978-1-4503-3404-4 %G eng %R 10.1145/2712386.2712387 %0 Conference Paper %B 17th IEEE International Conference on High Performance Computing and Communications %D 2015 %T Flexible Linear Algebra Development and Scheduling with Cholesky Factorization %A Azzam Haidar %A Asim YarKhan %A Chongxiao Cao %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore-CPUs, GPUs, and coprocessors. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. On the other hand, a Xeon Phi coprocessor will work most effectively with degree of parallelism between GPUs and multicore-CPUs. 
Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization. %B 17th IEEE International Conference on High Performance Computing and Communications %C Newark, NJ %8 2015-08 %G eng %0 Conference Paper %B ISC High Performance %D 2015 %T Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations %A Azzam Haidar %A Tingxing Dong %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %B ISC High Performance %I Springer %C Frankfurt, Germany %8 2015-07 %G eng %0 Journal Article %J Scientific Programming %D 2015 %T HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Azzam Haidar %A Jack Dongarra %A Khairul Kabir %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %A Yulu Jia %K communication and computation overlap %K dynamic runtime scheduling using dataflow dependences %K hardware accelerators and coprocessors %K Intel Xeon Phi processor %K Many Integrated Cores %K numerical linear algebra %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. 
In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. 
%B Scientific Programming %V 23 %8 2015-01 %G eng %N 1 %R 10.3233/SPR-140404 %0 Generic %D 2015 %T Linear Algebra Software for High-Performance Computing (Part 2: Software for Hardware Accelerators and Coprocessors) %A Stanimire Tomov %I ISC High Performance (ISC18), Tutorial Presentation %C Frankfurt, Germany %8 2015-06 %G eng %0 Conference Paper %B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %D 2015 %T MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing %A Azzam Haidar %A Stanimire Tomov %A Piotr Luszczek %A Jack Dongarra %X Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, gets more extreme to meet ever increasing demands for extended and improved functionalities. This, combined with the typical constraints for low power consumption and small sizes, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded-system-aware algorithms that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, e.g., featuring 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries. 
%B 2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award) %I IEEE %C Waltham, MA %8 2015-09 %G eng %0 Generic %D 2015 %T MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi %A Hartwig Anzt %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %I ISC High Performance (ISC15), Intel Booth Presentation %C Frankfurt, Germany %8 2015-06 %G eng %0 Conference Paper %B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2015 %T Mixed-precision Block Gram Schmidt Orthogonalization %A Ichitaro Yamazaki %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %A Jesse Barlow %X The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which could significantly increase its computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic, while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as a hybrid CPU/GPU cluster, demonstrate that compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7.1× while maintaining about the same order of the numerical errors. 
%B 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %I ACM %C Austin, TX %8 2015-11 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2015 %T Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %X To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingly expensive compared to the arithmetic operations. This is especially true when the input matrix is tall-skinny. Unfortunately, the orthogonality error of CholQR depends quadratically on the condition number of the input matrix, and it is numerically unstable when the matrix is ill-conditioned. To enhance the stability of CholQR, we recently used mixed-precision arithmetic; the input and output matrices are in the working precision, but some of its intermediate results are accumulated in the doubled precision. In this paper, we analyze the numerical properties of this mixed-precision CholQR. Our analysis shows that by selectively using the doubled precision, the orthogonality error of the mixed-precision CholQR only depends linearly on the condition number of the input matrix. We provide numerical results to demonstrate the improved numerical stability of the mixed-precision CholQR in practice. We then study its performance. When the target hardware does not support the desired higher precision, software emulation is needed. 
For example, using software-emulated double-double precision for the working 64-bit double precision, the mixed-precision CholQR requires about 8.5x more floating-point instructions than that required by the standard CholQR. On the other hand, the increase in the communication cost using the double-double precision is less significant, and our performance results on multicore CPU with a different graphics processing unit (GPU) demonstrate that the overhead of using the double-double arithmetic is decreasing on a newer architecture, where the computation is becoming less expensive compared to the communication. As a result, with the latest NVIDIA GPU, the mixed-precision CholQR was only 1.4x slower than the standard CholQR. Finally, we present case studies of using the mixed-precision CholQR within communication-avoiding variants of Krylov subspace projection methods for solving a nonsymmetric linear system of equations and for solving a symmetric eigenvalue problem, on a multicore CPU with multiple GPUs. These case studies demonstrate that by using the higher precision for this small but critical segment of the Krylov methods, we can improve not only the overall numerical stability of the solvers but also, in some cases, their performance. %B SIAM Journal on Scientific Computing %V 37 %P C203-C330 %8 2015-05 %G eng %R 10.1137/14M0973773 %0 Conference Paper %B 2015 SIAM Conference on Applied Linear Algebra %D 2015 %T Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs %A Ichitaro Yamazaki %A Jesse Barlow %A Stanimire Tomov %A Jakub Kurzak %A Jack Dongarra %X Orthogonalizing a set of dense vectors is an important computational kernel in subspace projection methods for solving large-scale problems. In this talk, we discuss our efforts to improve the performance of the kernel, while maintaining its numerical accuracy. Our experimental results demonstrate the effectiveness of our approaches. 
%B 2015 SIAM Conference on Applied Linear Algebra %I SIAM %C Atlanta, GA %8 2015-10 %G eng %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %D 2015 %T Optimization for Performance and Energy for Batched Matrix Computations on GPUs %A Azzam Haidar %A Tingxing Dong %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size independent problems. Many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the LU and Cholesky factorizations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The goal of avoiding multicore CPU use, e.g., as in the hybrid CPU-GPU algorithms, is to exclusively benefit from the GPU’s significantly higher energy efficiency, as well as from the removal of the costly CPU-to-GPU communications. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). 
Compared to a batched LU factorization featured in the CUBLAS library for GPUs, we achieved up to 2.5× speedup on the K40 GPU. %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) %I ACM %C San Francisco, CA %8 2015-02 %G eng %R 10.1145/2716282.2716288 %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2015 %T Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems %A Maksims Abalenkovs %A Ahmad Abdelfattah %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %A Asim YarKhan %K dense linear algebra %K gpu %K HPC %K Multicore %K plasma %K Programming models %K runtime %X We present a review of the current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand alone manycore coprocessors, GPUs, and combinations of these. Of interest is the evolution of the programming models for DLA libraries – in particular, the evolution from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts PLASMA (for multicore CPUs) and MAGMA (for heterogeneous architectures), as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models – especially in regards to hardware trends and ease of programming high-performance numerical software that current applications need – in order to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems. 
%B Supercomputing Frontiers and Innovations %V 2 %8 2015-10 %G eng %R 10.14529/jsfi1504 %0 Conference Paper %B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award %D 2015 %T Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %K Eigenvalues problem %K Hessenberg reduction %K Multi/Many-core %K Stabilized Elementary Transformations %X The solution of nonsymmetric eigenvalue problems, Ax = λx, can be accelerated substantially by first reducing A to an upper Hessenberg matrix H that has the same eigenvalues as A. This can be done using Householder orthogonal transformations, which is a well established standard, or stabilized elementary transformations. The latter approach, although having half the flops of the former, has been used less in practice, e.g., on computer architectures with well developed hierarchical memories, because of its memory-bound operations and the complexity in stabilizing it. In this paper we revisit the stabilized elementary transformations approach in the context of new architectures – both multicore CPUs and Xeon Phi coprocessors. We derive for the first time a blocked version of the algorithm. The blocked version reduces the memory-bound operations and we analyze its performance. A performance model is developed that shows the limitations of both approaches. The competitiveness of using stabilized elementary transformations has been quantified, highlighting that it can be 20 to 30% faster on current high-end multicore CPUs and Xeon Phi coprocessors. 
%B The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award %C Alexandria, VA %8 2015-04 %G eng %0 Conference Paper %B International Conference on Computational Science (ICCS 2015) %D 2015 %T Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform %A Khairul Kabir %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %B International Conference on Computational Science (ICCS 2015) %C Reykjavík, Iceland %8 2015-06 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %D 2015 %T Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs %A Theo Mary %A Ichitaro Yamazaki %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15) %I ACM %C Austin, TX %8 2015-11 %G eng %0 Generic %D 2015 %T Towards a High-Performance Tensor Algebra Package for Accelerators %A Marc Baboulin %A Veselin Dobrev %A Jack Dongarra %A Christopher Earl %A Joël Falcou %A Azzam Haidar %A Ian Karlin %A Tzanio Kolev %A Ian Masliah %A Stanimire Tomov %I Smoky Mountains Computational Sciences and Engineering Conference (SMC15) %C Gatlinburg, TN %8 2015-09 %G eng %0 Conference Paper %B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %D 2015 %T Towards Batched Linear Solvers on Accelerated Hardware Platforms %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K batched factorization %K hardware accelerators %K numerical linear algebra %K numerical software libraries %K one-sided factorization algorithms %X As hardware evolves, an increasingly effective approach to develop energy efficient, high-performance solvers, is to design them to work on many small and independent problems. 
Indeed, many applications already need this functionality, especially for GPUs, which are known to be currently about four to five times more energy efficient than multicore CPUs for every floating-point operation. In this paper, we describe the development of the main one-sided factorizations (LU, QR, and Cholesky) that are needed for a set of small dense matrices to work in parallel. We refer to such algorithms as batched factorizations. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-contained execution. Note that this is similar in functionality to the LAPACK and the hybrid MAGMA algorithms for large-matrix factorizations. But it is different from a straightforward approach, whereby each of GPU’s symmetric multiprocessors factorizes a single problem at a time. We illustrate how our performance analysis together with the profiling and tracing tools guided the development of batched factorizations to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library on a two-socket Intel Sandy Bridge server. Compared to a batched LU factorization featured in NVIDIA’s CUBLAS library for GPUs, we achieve up to 2.5-fold speedup on the K40 GPU. 
%B 8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015 %I ACM %C San Francisco, CA %8 2015-02 %G eng %0 Conference Proceedings %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %D 2015 %T Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators %A Azzam Haidar %A Yulu Jia %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %A Jack Dongarra %K dataflow scheduling %K hardware accelerators %K multi-grain parallelism %X A wide variety of heterogeneous compute resources are available to modern computers, including multiple sockets containing multicore CPUs, one-or-more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge faced by domain scientists is how to efficiently and productively use these varied resources. For example, in order to use GPUs effectively, the workload must have a greater degree of parallelism than a workload designed for a multicore-CPU. The domain scientist would have to design and schedule an application in multiple degrees of parallelism and task grain sizes in order to obtain efficient performance from the resources. We propose a productive programming model starting from serial code, which achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation is done at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. The effectiveness of this approach for utilizing multi-way heterogeneous hardware resources is demonstrated by implementing dense linear algebra applications. %B Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15) %I ACM %C Austin, TX %V No. 
5 %8 2015-11 %G eng %0 Book Section %B Numerical Computations with GPUs %D 2014 %T Accelerating Numerical Dense Linear Algebra Calculations with GPUs %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Ichitaro Yamazaki %B Numerical Computations with GPUs %I Springer International Publishing %P 3-28 %@ 978-3-319-06547-2 %G eng %& 1 %R 10.1007/978-3-319-06548-9_1 %0 Generic %D 2014 %T Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X This paper presents a heterogeneous CPU-GPU algorithm design and optimized implementation for an entire sparse iterative eigensolver – the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) – starting from low-level GPU data structures and kernels to the higher-level algorithmic choices and overall heterogeneous design. Most notably, the eigensolver leverages the high-performance of a new GPU kernel developed for the simultaneous multiplication of a sparse matrix and a set of vectors (SpMM). This is a building block that serves as a backbone for not only block-Krylov, but also for other methods relying on blocking for acceleration in general. The heterogeneous LOBPCG developed here reveals the potential of this type of eigensolver by highly optimizing all of its components, and can be viewed as a benchmark for other SpMM-dependent applications. Compared to non-blocked algorithms, we show that the performance speedup factor of SpMM vs. SpMV-based algorithms is up to six on GPUs like NVIDIA’s K40. In particular, a typical SpMV performance range in double precision is 20 to 25 GFlop/s, while the SpMM is in the range of 100 to 120 GFlop/s. Compared to highly-optimized CPU implementations, e.g., the SpMM from MKL on two eight-core Intel Xeon E5-2690s, our kernel is 3 to 5x faster on a K40 GPU. 
For comparison to other computational loads, the same GPU to CPU performance acceleration is observed for the SpMV product, as well as dense linear algebra, e.g., matrix-matrix multiplication and factorizations like LU, QR, and Cholesky. Thus, the modeled GPU (vs. CPU) acceleration for the entire solver is also 3 to 5x. In practice though, currently available CPU implementations are much slower due to missed optimization opportunities, as we show. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2014-10 %G eng %0 Conference Paper %B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining %D 2014 %T Access-averse Framework for Computing Low-rank Matrix Approximations %A Ichitaro Yamazaki %A Theo Mary %A Jakub Kurzak %A Stanimire Tomov %A Jack Dongarra %B First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining %C Washington, DC %8 2014-10 %G eng %0 Conference Paper %B International Workshop on OpenCL %D 2014 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. 
High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B International Workshop on OpenCL %C Bristol University, England %8 2014-05 %G eng %0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %D 2014 %T Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems %C New Orleans, LA %8 2014-11 %G eng %0 Conference Paper %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14) %D 2014 %T Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster %A Ichitaro Yamazaki %A Sivasankaran Rajamanickam %A Eric G. Boman %A Mark Hoemmen %A Michael A. Heroux %A Stanimire Tomov %B The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. 
We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer’s characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4x compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements. %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %8 2014-05 %G eng %0 Conference Paper %B International Conference on Parallel Processing (ICPP-2014) %D 2014 %T A Fast Batched Cholesky Factorization on a GPU %A Tingxing Dong %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %X Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we proposed a batched Cholesky factorization on a GPU. Three algorithms – nonblocked, blocked, and recursive blocked – were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. 
Our batched Cholesky achieves up to 1.8x speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMA by a factor of 1.5 in performance-per-watt for large matrices. %B International Conference on Parallel Processing (ICPP-2014) %C Minneapolis, MN %8 2014-09 %G eng %0 Conference Paper %B VECPAR 2014 %D 2014 %T Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments %A Azzam Haidar %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K Computer science %K factorization %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %X We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given as the basis for solving linear systems’ algorithms – the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. 
Task execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development efforts and allows the exploration of the unique strengths of the various hardware components. %B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Conference Paper %B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %D 2014 %T Hybrid Multi-Elimination ILU Preconditioners on GPUs %A Dimitar Lukarski %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Iterative solvers for sparse linear systems often benefit from using preconditioners. While there are implementations for many iterative methods that leverage the computing power of accelerators, porting the latest developments in preconditioners to accelerators has been challenging. In this paper we develop a self-adaptive multi-elimination preconditioner for graphics processing units (GPUs). The preconditioner is based on a multi-level incomplete LU factorization and uses a direct dense solver for the bottom-level system. For test matrices from the University of Florida matrix collection, we investigate the influence of handling the triangular solvers in the distinct iteration steps in either single or double precision arithmetic. Integrated into a Conjugate Gradient method, we show that our multi-elimination algorithm is highly competitive against popular preconditioners, including multi-colored symmetric Gauss-Seidel relaxation preconditioners, and (multi-colored symmetric) ILU for numerous problems. 
%B International Heterogeneity in Computing Workshop (HCW), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Generic %D 2014 %T Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %X Numerical methods in sparse linear algebra typically rely on a fast and efficient matrix vector product, as this usually is the backbone of iterative algorithms for solving eigenvalue problems or linear systems. Against the background of a large diversity in the characteristics of high performance computer architectures, it is a challenge to derive a cross-platform efficient storage format along with fast matrix vector kernels. Recently, attention focused on the SELL-C-σ format, a sliced ELLPACK format enhanced by row-sorting to reduce the fill-in when padding rows with zeros. In this paper we propose an additional modification resulting in the padded sliced ELLPACK (SELLP) format, for which we develop a sparse matrix vector CUDA kernel that is able to efficiently exploit the computing power of NVIDIA GPUs. We show that the kernel we developed outperforms straight-forward implementations for the widespread CSR and ELLPACK formats, and is highly competitive to the implementations in the highly optimized CUSPARSE library. %B University of Tennessee Computer Science Technical Report %I University of Tennessee %8 2014-04 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Improving the performance of CA-GMRES on multicores with multiple GPUs %A Ichitaro Yamazaki %A Hartwig Anzt %A Stanimire Tomov %A Mark Hoemmen %A Jack Dongarra %X The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. 
Since graphics processing units (GPUs) are now becoming a crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NVIDIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES. %B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %D 2014 %T LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU %A Tingxing Dong %A Azzam Haidar %A Piotr Luszczek %A James Harris %A Stanimire Tomov %A Jack Dongarra %X Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of a few large linear systems. The size of each of these small linear systems depends, for example, on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently exploit the computing power of modern accelerator hardware, these linear systems are processed in batches. 
To improve the numerical stability of the Gaussian Elimination, at least partial pivoting is required, most often accomplished with row pivoting. However, row pivoting can result in a severe performance penalty on GPUs because it brings in thread divergence and non-coalesced memory accesses. The state-of-the-art libraries for linear algebra that target GPUs, such as MAGMA, focus on large matrix sizes. They change the data layout by transposing the matrix to avoid these divergence and non-coalescing penalties. However, the data movement associated with transposition is very expensive for small matrices. In this paper, we propose a batched LU factorization for GPUs by using a multi-level blocked right looking algorithm that preserves the data layout but minimizes the penalty of partial pivoting. Our batched LU achieves up to 2.5-fold speedup when compared to the alternative CUBLAS solutions on a K40c GPU and 3.6-fold speedup over MKL on a node of the Titan supercomputer at ORNL in a nuclear reaction network simulation. %B 16th IEEE International Conference on High Performance Computing and Communications (HPCC) %I IEEE %C Paris, France %8 2014-08 %G eng %0 Conference Paper %B VECPAR 2014 (Best Paper) %D 2014 %T Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs %A Ichitaro Yamazaki %A Stanimire Tomov %A Tingxing Dong %A Jack Dongarra %X We propose a mixed-precision orthogonalization scheme that takes the input matrix in a standard 32 or 64-bit floating-point precision, but uses higher-precision arithmetics to accumulate its intermediate results. For the 64-bit precision, our scheme uses software emulation for the higher-precision arithmetics, and requires about 20x more computation but about the same amount of communication as the standard orthogonalization scheme. 
Since the computation is becoming less expensive compared to the communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed-precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance. %B VECPAR 2014 (Best Paper) %C Eugene, OR %8 2014-06 %G eng %0 Journal Article %J Supercomputing Frontiers and Innovations %D 2014 %T Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems %A Jack Dongarra %A Azzam Haidar %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Asim YarKhan %K dense linear algebra %K hardware accelerators %K task superscalar scheduling %X Hardware heterogeneity of the HPC platforms is no longer considered unusual but instead has become the most viable way forward towards Exascale. In fact, the multitude of the heterogeneous resources available to modern computers are designed for different workloads and their efficient use is closely aligned with the specialized role envisaged by their design. Commonly in order to efficiently use such GPU resources, the workload in question must have a much greater degree of parallelism than workloads often associated with multicore processors (CPUs). Available GPU variants differ in their internal architecture and, as a result, are capable of handling workloads of varying degrees of complexity and a range of computational patterns. This vast array of applicable workloads will likely lead to an ever accelerated mixing of multicore-CPUs and GPUs in multi-user environments with the ultimate goal of offering adequate computing facilities for a wide range of scientific and technical workloads. 
In the following paper, we present a research prototype that uses a lightweight runtime environment to manage the resource-specific workloads, and to control the dataflow and parallel execution in hybrid systems. Our lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. This concept is reminiscent of dataflow and systolic architectures in its conceptualization of a workload as a set of side-effect-free tasks that pass data items whenever the associated work assignment has been completed. Additionally, our task abstractions and their parametrization enable uniformity in the algorithmic development across all the heterogeneous resources without sacrificing precious compute cycles. We include performance results for dense linear algebra functions which demonstrate the practicality and effectiveness of our approach that is aptly capable of full utilization of a wide range of accelerator hardware. %B Supercomputing Frontiers and Innovations %V 1 %G eng %N 1 %R 10.14529/jsfi1401 %0 Journal Article %J International Journal of High Performance Computing Applications %D 2014 %T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks %A Azzam Haidar %A Raffaele Solcà %A Mark Gates %A Stanimire Tomov %A Thomas C. Schulthess %A Jack Dongarra %K Eigensolver %K electronic structure calculations %K generalized eigensolver %K gpu %K high performance %K hybrid %K Multicore %K two-stage %X The adoption of hybrid CPU–GPU nodes in traditional supercomputing platforms such as the Cray-XK6 opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized generalized eigenvalue problems must be solved many times. 
These eigenvalue problems are too small to effectively solve on distributed systems, but can benefit from the massive computing power concentrated on a single-node, hybrid CPU–GPU system. However, hybrid systems call for the development of new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multicore/manycore CPUs as well. Addressing these demands, we developed a generalized eigensolver featuring novel algorithms of increased computational intensity (compared with the standard algorithms), decomposition of the computation into fine-grained memory aware tasks, and their hybrid execution. The resulting eigensolvers are state-of-the-art in high-performance computing, significantly outperforming existing libraries. We describe the algorithm and analyze its performance impact on applications of interest when different fractions of eigenvectors are needed by the host electronic structure code. %B International Journal of High Performance Computing Applications %V 28 %P 196-209 %8 2014-05 %G eng %N 2 %& 196 %R 10.1177/1094342013502097 %0 Conference Paper %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %D 2014 %T Optimizing Krylov Subspace Solvers on Graphics Processing Units %A Stanimire Tomov %A Piotr Luszczek %A Ichitaro Yamazaki %A Jack Dongarra %A Hartwig Anzt %A William Sawyer %X Krylov subspace solvers are often the method of choice when solving sparse linear systems iteratively. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to leverage the full potential of the accelerator. 
In this paper we target the acceleration of the BiCGSTAB solver for GPUs, showing that significant improvement can be achieved by reformulating the method and developing application-specific kernels instead of using the generic CUBLAS library provided by NVIDIA. We propose an implementation that benefits from a significantly reduced number of kernel launches and GPU-host communication events, by means of increased data locality and a simultaneous reduction of multiple scalar products. Using experimental data, we show that, depending on the dominance of the untouched sparse matrix vector products, significant performance improvements can be achieved compared to a reference implementation based on the CUBLAS library. We feel that such optimizations are crucial for the subsequent development of high-level sparse linear algebra libraries. %B Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %D 2014 %T Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors %A Azzam Haidar %A Chongxiao Cao %A Ichitaro Yamazaki %A Jack Dongarra %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. 
Our findings indicate that OpenCL portability has improved since our previous publication and many new and surprising usage scenarios are possible that rival those available after decades of software development on the CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process and we show how a large portion of the available efficiency is lost if the tuning is not done correctly. %B 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14) %I IEEE %C New Orleans, LA %8 2014-11 %G eng %R 10.1109/ScalA.2014.8 %0 Conference Paper %B VECPAR 2014 %D 2014 %T Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures %A Hartwig Anzt %A Dimitar Lukarski %A Stanimire Tomov %A Jack Dongarra %X Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this two-fold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-coloured Gauss-Seidel, ILU(0) and multi-colored ILU(0), shows that the twofold goal of a numerically stable cross-platform performant algorithm is achieved. 
%B VECPAR 2014 %C Eugene, OR %8 2014-06 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU %A Tingxing Dong %A Veselin Dobrev %A Tzanio Kolev %A Robert Rieben %A Stanimire Tomov %A Jack Dongarra %K Computer science %K CUDA %K FEM %K Finite element method %K linear algebra %K nVidia %K Tesla K20 %X Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs [10, 1]. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor’s library routines substantially in performance. We proposed an autotuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5x overall speedup and 1.42x energy efficiency (greenup) using 4th order (Q4) finite elements, and a 1.9x speedup and 1.27x greenup using 2nd order (Q2) finite elements. 
%B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Conference Paper %B IPDPS 2014 %D 2014 %T Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment %A Azzam Haidar %A Chongxiao Cao %A Jack Dongarra %A Piotr Luszczek %A Stanimire Tomov %K algorithms %K Computer science %K CUDA %K Heterogeneous systems %K Intel Xeon Phi %K linear algebra %K nVidia %K Tesla K20 %K Tesla M2090 %X Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware. 
%B IPDPS 2014 %I IEEE %C Phoenix, AZ %8 2014-05 %G eng %0 Journal Article %J ACM Transactions on Mathematical Software (also LAWN 246) %D 2013 %T Accelerating Linear System Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K algorithms %K dense linear algebra %K experimentation %K graphics processing units %K linear systems %K lu factorization %K multiplicative preconditioning %K numerical linear algebra %K performance %K plasma %K randomization %X We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at a very affordable computational price while providing us with a satisfying accuracy when compared to partial pivoting. This random transformation called Partial Random Butterfly Transformation (PRBT) is optimized in terms of data storage and flops count. We propose a solver where PRBT and the LU factorization with no pivoting take advantage of the current hybrid multicore/GPU machines and we compare its Gflop/s performance with a solver implemented in a current parallel library. %B ACM Transactions on Mathematical Software (also LAWN 246) %V 39 %8 2013-02 %G eng %U http://dl.acm.org/citation.cfm?id=2427025 %N 2 %R 10.1145/2427023.2427025 %0 Journal Article %J Journal of Parallel and Distributed Computing %D 2013 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %X In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- or GPU-based systems. 
For a set of test matrices from UFMC we investigate convergence behavior, performance and tolerance to hardware failure. We observe that even for our most basic asynchronous relaxation scheme, the method can efficiently leverage the GPU's computing power and is, despite its lower convergence rate compared to the Gauss–Seidel relaxation, still able to provide solution approximations of certain accuracy in considerably shorter time than Gauss–Seidel running on CPUs or GPU-based Jacobi. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of the asynchronous schemes for the highly parallel GPU architectures. Further, enhancing the most basic asynchronous approach with hybrid schemes, using multiple iterations within the “subdomain” handled by a GPU thread block, we manage to not only recover the loss of global convergence but often accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. The combination with the advantageous properties of asynchronous iteration methods with respect to hardware failure identifies the high potential of the asynchronous methods for Exascale computing. %B Journal of Parallel and Distributed Computing %V 73 %P 1613–1626 %8 2013-12 %G eng %N 12 %R http://dx.doi.org/10.1016/j.jpdc.2013.05.008 %0 Generic %D 2013 %T clMAGMA: High Performance Dense Linear Algebra with OpenCL %A Chongxiao Cao %A Jack Dongarra %A Peng Du %A Mark Gates %A Piotr Luszczek %A Stanimire Tomov %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms in OpenCL. In particular, these are linear system solvers and eigenvalue problem solvers. Further, we give an overview of the clMAGMA library, an open source, high performance OpenCL library that incorporates the developments presented, and in general provides to heterogeneous architectures the DLA functionality of the popular LAPACK library. 
The LAPACK-compliance and use of OpenCL simplify the use of clMAGMA in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance OpenCL BLAS, hardware and OpenCL-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. %B University of Tennessee Technical Report (Lawn 275) %I University of Tennessee %8 2013-03 %G eng %0 Generic %D 2013 %T Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %X Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU computing approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on high-end hybrid CPU/GPU systems show that our dynamically balanced synchronization-avoiding LU is both multicore and GPU scalable. 
Comparisons with state-of-the-art libraries like MKL (for multicore) and MAGMA (for hybrid systems) are provided, demonstrating significant performance improvements. The approach is applicable to other linear algebra algorithms. The scheduling mechanisms and tuning models can be incorporated into respectively dynamic runtime systems/schedulers and autotuning frameworks for hybrid CPU/MIC/GPU architectures. %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Generic %D 2013 %T Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters %A Tingxing Dong %A Veselin Dobrev %A Tzanio Kolev %A Robert Rieben %A Stanimire Tomov %A Jack Dongarra %X The explosion of parallelism and heterogeneity in today's computer architectures has created opportunities as well as challenges for redesigning legacy numerical software to harness the power of new hardware. In this paper we address the main challenges in redesigning BLAST, a numerical library that solves the equations of compressible hydrodynamics using high order finite element methods (FEM) in a moving Lagrangian frame, to support CPU-GPU clusters. We use a hybrid MPI + OpenMP + CUDA programming model that includes two layers: domain decomposed MPI parallelization and OpenMP + CUDA acceleration in a given domain. To optimize the code, we implemented custom linear algebra kernels and introduced an auto-tuning technique to deal with heterogeneity and load balancing at runtime. Our tests show that 12 Intel Xeon cores and two M2050 GPUs deliver a 24x speedup compared to a single core, and a 2.5x speedup compared to 12 MPI tasks in one node. Further, we achieve perfect weak scaling, demonstrated on a cluster with up to 64 GPUs in 32 nodes. Our choice of programming model and proposed solutions, as related to parallelism and load balancing, specifically target high order FEM discretizations, and can be used equally successfully for applications beyond hydrodynamics. 
A major accomplishment is that we further establish the appeal of high order FEMs, which despite their better approximation properties, are often avoided due to their high computational cost. GPUs, as we show, have the potential to make them the method of choice, as the increased computational cost is also localized, e.g., cast as Level 3 BLAS, and thus can be done very efficiently (close to “free” relative to the usual overheads inherent in sparse computations). %B University of Tennessee Computer Science Technical Report %8 2013-07 %G eng %0 Book Section %B Contemporary High Performance Computing: From Petascale Toward Exascale %D 2013 %T Keeneland: Computational Science Using Heterogeneous GPU Computing %A Jeffrey Vetter %A Richard Glassbrook %A Karsten Schwan %A Sudha Yalamanchili %A Mitch Horton %A Ada Gavrilovska %A Magda Slawinska %A Jack Dongarra %A Jeremy Meredith %A Philip Roth %A Kyle Spafford %A Stanimire Tomov %A John Wynkoop %X The Keeneland Project is a five year Track 2D grant awarded by the National Science Foundation (NSF) under solicitation NSF 08-573 in August 2009 for the development and deployment of an innovative high performance computing system. The Keeneland project is led by the Georgia Institute of Technology (Georgia Tech) in collaboration with the University of Tennessee at Knoxville, National Institute of Computational Sciences, and Oak Ridge National Laboratory. %B Contemporary High Performance Computing: From Petascale Toward Exascale %S CRC Computational Science Series %I Taylor and Francis %C Boca Raton, FL %G eng %& 7 %0 Conference Proceedings %B International Supercomputing Conference (ISC) %D 2013 %T Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %A Raffaele Solcà %A Thomas C. 
Schulthess %X Today’s high computational demands from engineering fields and complex hardware development make it necessary to develop and optimize new algorithms toward achieving high performance and good scalability on the next generation of computers. The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe and analyze a successful methodology to address the challenges—starting from our algorithm design, kernel optimization and tuning, to our programming model—in the development of a scalable high-performance generalized eigenvalue solver in the context of electronic structure calculations in materials science applications. We developed a set of leading edge dense linear algebra algorithms, as part of a generalized eigensolver, featuring fine grained memory aware kernels, a task based approach and hybrid execution/scheduling. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. We report the performance impact on the generalized eigensolver when different fractions of eigenvectors are needed. The algorithm described provides an enormous performance boost compared to current GPU-based solutions, and performance comparable to state-of-the-art distributed solutions, using a single node with multiple GPUs. 
%B International Supercomputing Conference (ISC) %7 Lecture Notes in Computer Science %I Springer Berlin Heidelberg %C Leipzig, Germany %V 7905 %P 67-80 %8 2013-06 %@ 978-3-642-38750-0 %G eng %R 10.1007/978-3-642-38750-0_6 %0 Conference Paper %B PPAM 2013 %D 2013 %T Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi %A Jack Dongarra %A Mark Gates %A Azzam Haidar %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %K magma %K mic %K xeon phi %X This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore with Intel Xeon Phi Coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented, and in general provides to heterogeneous architectures of multicore with coprocessors the DLA functionality of the popular LAPACK library. The LAPACK-compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through use of the high-performance BLAS, hardware-specific tuning, and a hybridization methodology where we split the algorithm into computational tasks of various granularities. Execution of those tasks is properly scheduled over the heterogeneous hardware components by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various heterogeneous hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA. 
%B PPAM 2013 %C Warsaw, Poland %8 2013-09 %G eng %0 Book Section %B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %D 2013 %T Scalable Dense Linear Algebra on Heterogeneous Hardware %A George Bosilca %A Aurelien Bouteiller %A Anthony Danalis %A Thomas Herault %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %X Design of systems exceeding 1 Pflop/s and the push toward 1 Eflop/s forced a dramatic shift in hardware design. Various physical and engineering constraints resulted in the introduction of massive parallelism and functional hybridization with the use of accelerator units. This paradigm change brings about a serious challenge for application developers, as the management of multicore proliferation and heterogeneity rests on software. And it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology of dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when the well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores. 
%B HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing %G eng %0 Journal Article %J Journal of Computational Science %D 2013 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K gpgpu %K gpu %K magma %X The general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. What followed is the fact that fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors, for example, in the form of bit flips. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include: (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs. %B Journal of Computational Science %V 4 %P 457–464 %8 2013-11 %G eng %N 6 %R http://dx.doi.org/10.1016/j.jocs.2013.01.004 %0 Conference Paper %B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %D 2013 %T Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication %A Azzam Haidar %A Mark Gates %A Stanimire Tomov %A Jack Dongarra %E Allen D. 
Malony %E Nemirovsky, Mario %E Midkiff, Sam %K eigenvalue %K gpu communication %K gpu computation %K heterogeneous programming model %K performance %K reduction to tridiagonal %K singular value decomposition %K task parallelism %X The enormous gap between the high-performance capabilities of GPUs and the slow interconnect between them has made the development of numerical software that is scalable across multiple GPUs extremely challenging. We describe a successful methodology on how to address the challenges, starting from our algorithm design, kernel optimization and tuning, to our programming model, in the development of a scalable high-performance tridiagonal reduction algorithm for the symmetric eigenvalue problem. This is a fundamental linear algebra problem with many engineering and physics applications. We use a combination of a task-based approach to parallelism and a new algorithmic design to achieve high performance. The goal of the new design is to increase the computational intensity of the major compute kernels and to reduce synchronization and data transfers between GPUs. This may increase the number of flops, but the increase is offset by the more efficient execution and reduced data transfers. Our performance results are the best available, providing an enormous performance boost compared to current state-of-the-art solutions. In particular, our software scales up to 1070 Gflop/s using 16 Intel E5-2670 cores and eight M2090 GPUs, compared to 45 Gflop/s achieved by the optimized Intel Math Kernel Library (MKL) using only the 16 CPU cores. 
%B Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13) %I ACM Press %C Eugene, Oregon, USA %8 2013-06 %@ 9781450321303 %G eng %U http://dl.acm.org/citation.cfm?doid=2464996.2465438 %R 10.1145/2464996.2465438 %0 Journal Article %J Concurrency and Computation: Practice and Experience %D 2013 %T Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems %A Ichitaro Yamazaki %A Tingxing Dong %A Raffaele Solcà %A Stanimire Tomov %A Jack Dongarra %A Thomas C. Schulthess %X For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. 
Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques. %B Concurrency and Computation: Practice and Experience %8 2013-10 %G eng %0 Conference Paper %B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %D 2013 %T Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster %A Ichitaro Yamazaki %A Tingxing Dong %A Stanimire Tomov %A Jack Dongarra %B The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) %8 2013-05 %G eng %0 Generic %D 2012 %T Acceleration of the BLAST Hydro Code on GPU %A Tingxing Dong %A Tzanio Kolev %A Robert Rieben %A Veselin Dobrev %A Stanimire Tomov %A Jack Dongarra %B Supercomputing '12 (poster) %I SC12 %C Salt Lake City, Utah %8 2012-11 %G eng %0 Journal Article %J IEEE Transactions on Parallel and Distributed Systems %D 2012 %T Autotuning GEMM Kernels for the Fermi GPU %A Jakub Kurzak %A Stanimire Tomov %A Jack Dongarra %X In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. 
This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations. %B IEEE Transactions on Parallel and Distributed Systems %V 23 %8 2012-11 %G eng %R https://doi.org/10.1109/TPDS.2011.311 %0 Journal Article %J ICCS 2012 %D 2012 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %B ICCS 2012 %C Omaha, NE %8 2012-06 %G eng %0 Conference Proceedings %B Proc. of the International Conference on Computational Science (ICCS) %D 2012 %T A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines %A Marc Baboulin %A Simplice Donfack %A Jack Dongarra %A Laura Grigori %A Adrien Remi %A Stanimire Tomov %K magma %B Proc. 
of the International Conference on Computational Science (ICCS) %V 9 %P 17-26 %8 2012-06 %G eng %0 Journal Article %J High Performance Scientific Computing: Algorithms and Applications %D 2012 %T Dense Linear Algebra on Accelerated Multicore Hardware %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %E Michael Berry %E et al., %B High Performance Scientific Computing: Algorithms and Applications %I Springer-Verlag %C London, UK %8 2012-00 %G eng %0 Journal Article %J SIAM Journal on Scientific Computing %D 2012 %T Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems %A Christof Voemel %A Stanimire Tomov %A Jack Dongarra %K magma %B SIAM Journal on Scientific Computing %V 34(2) %P C70-C82 %8 2012-04 %G eng %0 Conference Proceedings %B 26th ACM International Conference on Supercomputing (ICS 2012) %D 2012 %T Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems %A Fengguang Song %A Stanimire Tomov %A Jack Dongarra %K magma %B 26th ACM International Conference on Supercomputing (ICS 2012) %I ACM %C San Servolo Island, Venice, Italy %8 2012-06 %G eng %0 Journal Article %J Parallel Computing %D 2012 %T From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming %A Peng Du %A Rick Weber %A Piotr Luszczek %A Stanimire Tomov %A Gregory D. 
Peterson %A Jack Dongarra %B Parallel Computing %V 38 %P 391-407 %8 2012-08 %G eng %0 Generic %D 2012 %T The Future of Computing: Software Libraries %A Stanimire Tomov %A Jack Dongarra %I DOD CREATE Developers' Review, Keynote Presentation %C Savannah, GA %8 2012-02 %G eng %0 Generic %D 2012 %T MAGMA: A Breakthrough in Solvers for Eigenvalue Problems %A Stanimire Tomov %A Jack Dongarra %A Azzam Haidar %A Ichitaro Yamazaki %A Tingxing Dong %A Thomas Schulthess %A Raffaele Solcà %I GPU Technology Conference (GTC12), Presentation %C San Jose, CA %8 2012-05 %G eng %0 Generic %D 2012 %T MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures %A Jack Dongarra %A Tingxing Dong %A Mark Gates %A Azzam Haidar %A Stanimire Tomov %A Ichitaro Yamazaki %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation %C Salt Lake City, UT %8 2012-11 %G eng %0 Generic %D 2012 %T MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors %A Jack Dongarra %A Mark Gates %A Yulu Jia %A Khairul Kabir %A Piotr Luszczek %A Stanimire Tomov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12) %C Salt Lake City, UT %8 2012-11 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T Matrices Over Runtime Systems at Exascale %A Emmanuel Agullo %A George Bosilca %A Cedric Castagnède %A Jack Dongarra %A Hatem Ltaeif %A Stanimire Tomov %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 2012-11 %G eng %0 Journal Article %J Supercomputing '12 (poster) %D 2012 %T A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks %A Raffaele Solcà %A Azzam Haidar %A Stanimire Tomov %A Jack Dongarra %A Thomas C. 
Schulthess %B Supercomputing '12 (poster) %C Salt Lake City, Utah %8 2012-11 %G eng %0 Conference Proceedings %B The International Conference on Computational Science (ICCS) %D 2012 %T One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators %A Ichitaro Yamazaki %A Stanimire Tomov %A Jack Dongarra %K magma %B The International Conference on Computational Science (ICCS) %8 2012-06 %G eng %0 Generic %D 2012 %T Performance evaluation of LU factorization through hardware counter measurements %A Simplice Donfack %A Stanimire Tomov %A Jack Dongarra %B University of Tennessee Computer Science Technical Report %8 2012-10 %G eng %0 Journal Article %J SAAHPC '12 (Best Paper Award) %D 2012 %T Power Aware Computing on GPUs %A Kiran Kasichayanula %A Dan Terpstra %A Piotr Luszczek %A Stanimire Tomov %A Shirley Moore %A Gregory D. Peterson %K magma %B SAAHPC '12 (Best Paper Award) %C Argonne, IL %8 2012-07 %G eng %0 Journal Article %J LAWN 267 %D 2012 %T Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %B LAWN 267 %8 2012-00 %G eng %0 Generic %D 2012 %T Providing GPU Capability to LU and QR within the ScaLAPACK Framework %A Peng Du %A Stanimire Tomov %A Jack Dongarra %B University of Tennessee Computer Science Technical Report (also LAWN 272) %8 2012-09 %G eng %0 Conference Proceedings %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %D 2012 %T Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %B Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper) %C Rhodes Island, Greece %8 2012-08 %G eng %0 Journal Article %J INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %D 2011 %T Accelerating Linear System 
Solutions Using Randomization Techniques %A Marc Baboulin %A Jack Dongarra %A Julien Herrmann %A Stanimire Tomov %K magma %B INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11) %C Waterloo, Ontario, Canada %8 2011-07 %G eng %0 Generic %D 2011 %T Autotuning GEMMs for Fermi %A Jakub Kurzak %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245) %8 2011-04 %G eng %0 Journal Article %D 2011 %T Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems %A Hartwig Anzt %A Stanimire Tomov %A Mark Gates %A Jack Dongarra %A Vincent Heuveline %K magma %8 2011-12 %G eng %0 Generic %D 2011 %T A Block-Asynchronous Relaxation Method for Graphics Processing Units %A Hartwig Anzt %A Stanimire Tomov %A Jack Dongarra %A Vincent Heuveline %K magma %B University of Tennessee Computer Science Technical Report %8 2011-11 %G eng %0 Conference Proceedings %B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11) %D 2011 %T A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures %A Mitch Horton %A Stanimire Tomov %A Jack Dongarra %K magma %K quark %B Symposium for Application Accelerators in High Performance Computing (SAAHPC'11) %C Knoxville, TN %8 2011-07 %G eng %0 Generic %D 2011 %T Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures %A Fengguang Song %A Stanimire Tomov %A Jack Dongarra %K magma %K plasma %B University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250) %8 2011-06 %G eng %0 Journal Article %J in GPU Computing Gems, Jade Edition %D 2011 %T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %E Wen-mei W. 
Hwu %K magma %K morse %B in GPU Computing Gems, Jade Edition %I Elsevier %V 2 %P 473-484 %8 2011-00 %G eng %0 Journal Article %J IEEE/ACS AICCSA 2011 %D 2011 %T LU Factorization for Accelerator-Based Systems %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Julien Langou %A Hatem Ltaeif %A Stanimire Tomov %K magma %K morse %B IEEE/ACS AICCSA 2011 %C Sharm-El-Sheikh, Egypt %8 2011-12 %G eng %0 Generic %D 2011 %T MAGMA - LAPACK for GPUs %A Stanimire Tomov %I Keeneland GPU Tutorial %C Atlanta, GA %8 2011-04 %G eng %0 Generic %D 2011 %T MAGMA - LAPACK for HPC on Heterogeneous Architectures %A Stanimire Tomov %A Jack Dongarra %I Titan Summit at Oak Ridge National Laboratory, Presentation %C Oak Ridge, TN %8 2011-08 %G eng %0 Generic %D 2011 %T Matrix Algebra on GPU and Multicore Architectures %A Stanimire Tomov %I Workshop on GPU-enabled Numerical Libraries, Presentation %C Basel, Switzerland %8 2011-05 %G eng %0 Conference Proceedings %B ACM/IEEE Conference on Supercomputing (SC’11) %D 2011 %T Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs %A Rajib Nath %A Stanimire Tomov %A Tingxing Dong %A Jack Dongarra %K magma %B ACM/IEEE Conference on Supercomputing (SC’11) %C Seattle, WA %8 2011-11 %G eng %0 Conference Paper %B International Conference on Parallel Processing (ICPP'11) %D 2011 %T Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs %A Allen D. Malony %A Scott Biersdorff %A Sameer Shende %A Heike Jagode %A Stanimire Tomov %A Guido Juckeland %A Robert Dietrich %A Duncan Poole %A Christopher Lamb %K magma %K mumi %K papi %X The power of GPUs is giving rise to heterogeneous parallel computing, with new demands on programming environments, runtime systems, and tools to deliver high-performing applications. This paper studies the problems associated with performance measurement of heterogeneous machines with GPUs. 
A heterogeneous computation model and alternative host-GPU measurement approaches are discussed to set the stage for reporting new capabilities for heterogeneous parallel performance measurement in three leading HPC tools: PAPI, Vampir, and the TAU Performance System. Our work leverages the new CUPTI tool support in NVIDIA's CUDA device library. Heterogeneous benchmarks from the SHOC suite are used to demonstrate the measurement methods and tool support. %B International Conference on Parallel Processing (ICPP'11) %I ACM %C Taipei, Taiwan %8 2011-09 %@ 978-0-7695-4510-3 %G eng %R 10.1109/ICPP.2011.71 %0 Journal Article %J IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %D 2011 %T Performance Portability of a GPU Enabled Factorization with the DAGuE Framework %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %K magma %K parsec %B IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC) %8 2011-06 %G eng %0 Generic %D 2011 %T Power-aware Computing on GPGPUs %A Kiran Kasichayanula %A Haihang You %A Shirley Moore %A Stanimire Tomov %A Heike Jagode %A Matt Johnson %I Fall Creek Falls Conference, Poster %C Gatlinburg, TN %8 2011-09 %G eng %0 Generic %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K ft-la %B University of Tennessee Computer Science Technical Report %C Knoxville, TN %8 2011-07 %G eng %0 Journal Article %J UT-CS-11-675 (also LAPACK Working Note #252) %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System %A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B UT-CS-11-675 (also LAPACK Working Note #252) %8 2011-07 %G eng %0 Journal Article %J Journal of Computational Science %D 2011 %T Soft Error Resilient QR Factorization for Hybrid System with GPGPU %A Peng Du %A Piotr Luszczek %A Stanimire 
Tomov %A Jack Dongarra %K ft-la %B Journal of Computational Science %I Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11 %C Seattle, WA %8 2011-11 %G eng %0 Conference Proceedings %B IEEE International Parallel and Distributed Processing Symposium (submitted) %D 2011 %T A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems %A George Bosilca %A Aurelien Bouteiller %A Thomas Herault %A Pierre Lemariner %A Narapat Ohm Saengpatsa %A Stanimire Tomov %A Jack Dongarra %K dague %B IEEE International Parallel and Distributed Processing Symposium (submitted) %C Anchorage, AK %8 2011-05 %G eng %0 Journal Article %J Proc. of VECPAR'10 %D 2010 %T Accelerating GPU Kernels for Dense Linear Algebra %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B Proc. of VECPAR'10 %C Berkeley, CA %8 2010-06 %G eng %0 Generic %D 2010 %T Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers %A Stanimire Tomov %A George Bosilca %A Cedric Augonnet %I 2010 Symposium on Application Accelerators in
High-Performance Computing (SAAHPC'10), Tutorial %8 2010-07 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %B Parallel Computing %V 36 %P 645-654 %8 2010-00 %G eng %0 Generic %D 2010 %T Autotuning Dense Linear Algebra Libraries on GPUs %A Rajib Nath %A Stanimire Tomov %A Emmanuel Agullo %A Jack Dongarra %I Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010) %C Basel, Switzerland %8 2010-06 %G eng %0 Book Section %B Scientific Computing with Multicore and Accelerators %D 2010 %T Blas for GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %B Scientific Computing with Multicore and Accelerators %S Chapman & Hall/CRC Computational Science %I CRC Press %C Boca Raton, Florida %@ 9781439825365 %G eng %& 4 %0 Book Section %B Scientific Computing with Multicore and Accelerators %D 2010 %T Dense Linear Algebra for Hybrid GPU-based Systems %A Stanimire Tomov %A Jack Dongarra %B Scientific Computing with Multicore and Accelerators %S Chapman & Hall/CRC Computational Science %I CRC Press %C Boca Raton, Florida %@ 9781439825365 %G eng %& 3 %0 Generic %D 2010 %T Dense Linear Algebra Solvers for Multicore with GPU Accelerators %A Stanimire Tomov %I International Parallel and Distributed Processing Symposium (IPDPS 2010) %C Atlanta, GA %8 2010-04 %G eng %0 Conference Proceedings %B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on %D 2010 %T Dense Linear Algebra Solvers for Multicore with GPU Accelerators %A Stanimire Tomov %A Rajib Nath %A Hatem Ltaeif %A Jack Dongarra %X Solving dense linear systems of equations is a fundamental problem in scientific computing. 
Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library. 
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on %C Atlanta, GA %P 1-8 %G eng %R 10.1109/IPDPSW.2010.5470941 %0 Journal Article %J SIAM Journal on Scientific Computing (submitted) %D 2010 %T Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems %A Christof Voemel %A Stanimire Tomov %A Jack Dongarra %K magma %B SIAM Journal on Scientific Computing (submitted) %8 2010-08 %G eng %0 Generic %D 2010 %T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %B LAPACK Working Note %8 2010-00 %G eng %0 Journal Article %J IEEE Transaction on Parallel and Distributed Systems (submitted) %D 2010 %T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Jack Dongarra %K magma %K plasma %B IEEE Transaction on Parallel and Distributed Systems (submitted) %8 2010-03 %G eng %0 Journal Article %J International Journal of High Performance Computing %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B International Journal of High Performance Computing %V 24 %P 511-515 %8 2010-00 %G eng %0 Generic %D 2010 %T An Improved MAGMA GEMM for Fermi GPUs %A Rajib Nath %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report %8 2010-07 %G eng %0 Generic %D 2010 %T An Introduction to the MAGMA project - Acceleration of Dense Linear Algebra %A Jack Dongarra %A Stanimire Tomov %I NVIDIA Webinar %8 2010-06 %G eng %U http://developer.download.nvidia.com/CUDA/training/introtomagma.mp4 %0 Conference Proceedings %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %D 2010 %T Mixed-Tool Performance Analysis on Hybrid Multicore Architectures 
%A Peng Du %A Piotr Luszczek %A Stanimire Tomov %A Jack Dongarra %K magma %B First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010) %C San Diego, CA %8 2010-09 %G eng %0 Conference Proceedings %B Proceedings of the Cray Users' Group Meeting %D 2010 %T Performance Evaluation for Petascale Quantum Simulation Tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %B Proceedings of the Cray Users' Group Meeting %C Atlanta, GA %8 2010-05 %G eng %0 Conference Proceedings %B Proceedings of IPDPS 2011 %D 2010 %T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Mathieu Faverge %A Hatem Ltaeif %A Samuel Thibault %A Stanimire Tomov %K magma %K morse %K plasma %B Proceedings of IPDPS 2011 %C Anchorage, AK %8 2010-10 %G eng %0 Journal Article %J PARA 2010 %D 2010 %T Scalability Study of a Quantum Simulation Code %A Jerzy Bernholc %A Miroslav Hodak %A Wenchang Lu %A Shirley Moore %A Stanimire Tomov %B PARA 2010 %C Reykjavik, Iceland %8 2010-06 %G eng %0 Journal Article %J Proc. of VECPAR'10 (to appear) %D 2010 %T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators %A Hatem Ltaeif %A Stanimire Tomov %A Rajib Nath %A Peng Du %A Jack Dongarra %K magma %K plasma %B Proc.
of VECPAR'10 (to appear) %C Berkeley, CA %8 2010-06 %G eng %0 Generic %D 2010 %T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators %A Emmanuel Agullo %A Cedric Augonnet %A Jack Dongarra %A Hatem Ltaeif %A Raymond Namyst %A Rajib Nath %A Jean Roman %A Samuel Thibault %A Stanimire Tomov %I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster %C Knoxville, TN %8 2010-07 %G eng %0 Journal Article %J Parallel Computing %D 2010 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B Parallel Computing %V 36 %P 232-240 %8 2010-00 %G eng %0 Journal Article %J PGI Insider %D 2010 %T Using MAGMA with PGI Fortran %A Stanimire Tomov %A Mathieu Faverge %A Piotr Luszczek %A Jack Dongarra %K magma %B PGI Insider %8 2010-11 %G eng %0 Journal Article %J Computer Physics Communications %D 2009 %T Accelerating Scientific Computations with Mixed Precision Algorithms %A Marc Baboulin %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julie Langou %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %X On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. 
%B Computer Physics Communications %V 180 %P 2526-2533 %8 2009-12 %G eng %N 12 %R https://doi.org/10.1016/j.cpc.2008.11.005 %0 Generic %D 2009 %T Accelerating the Reduction to Upper Hessenberg Form through Hybrid GPU-Based Computing %A Stanimire Tomov %A Jack Dongarra %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-09-642 (also LAPACK Working Note 219) %8 2009-05 %G eng %0 Conference Proceedings %B 9th International Conference on Computational Science (ICCS 2009) %D 2009 %T A Note on Auto-tuning GEMM for GPUs %A Yinan Li %A Jack Dongarra %A Stanimire Tomov %E Gabrielle Allen %E Jarosław Nabrzyski %E E. Seidel %E Geert Dick van Albada %E Jack Dongarra %E Peter M. Sloot %B 9th International Conference on Computational Science (ICCS 2009) %C Baton Rouge, LA %P 884-892 %8 2009-05 %G eng %R 10.1007/978-3-642-01970-8_89 %0 Conference Proceedings %B Journal of Physics: Conference Series %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Stanimire Tomov %K magma %K plasma %B Journal of Physics: Conference Series %V 180 %8 2009-00 %G eng %0 Generic %D 2009 %T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects %A Emmanuel Agullo %A James Demmel %A Jack Dongarra %A Bilel Hadri %A Jakub Kurzak %A Julien Langou %A Hatem Ltaeif %A Piotr Luszczek %A Rajib Nath %A Stanimire Tomov %A Asim YarKhan %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C Portland, OR %8 2009-11 %G eng %0 Generic %D 2009 %T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project %A Rajib Nath %A Jack Dongarra %A Stanimire Tomov %A Hatem Ltaeif %A Peng Du %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09) %C 
Portland, Oregon %8 2009-11 %G eng %0 Conference Proceedings %B Proceedings of CUG09 %D 2009 %T Performance evaluation for petascale quantum simulation tools %A Stanimire Tomov %A Wenchang Lu %A Jerzy Bernholc %A Shirley Moore %A Jack Dongarra %K doe-nano %B Proceedings of CUG09 %C Atlanta, GA %8 2009-05 %G eng %0 Generic %D 2008 %T Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project) %A Marc Baboulin %A James Demmel %A Jack Dongarra %A Stanimire Tomov %A Vasily Volkov %I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08) %C Austin, TX %8 2008-11 %G eng %0 Journal Article %J in High Performance Computing and Grids in Action %D 2008 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B in High Performance Computing and Grids in Action %I IOS Press %C Amsterdam %8 2008-01 %G eng %0 Conference Proceedings %B Proceedings of the DoD HPCMP User Group Conference %D 2008 %T Exploring New Architectures in Accelerating CFD for Air Force Applications %A Jack Dongarra %A Shirley Moore %A Gregory D.
Peterson %A Stanimire Tomov %A Jeff Allred %A Vincent Natoli %A David Richie %K magma %B Proceedings of the DoD HPCMP User Group Conference %C Seattle, Washington %8 2008-01 %G eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Interior State Computation of Nano Structures %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim, Norway %8 2008-05 %G eng %0 Generic %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Jack Dongarra %A Stanimire Tomov %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200) %8 2008-01 %G eng %0 Conference Proceedings %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %D 2008 %T Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures %A Marc Baboulin %A Stanimire Tomov %A Jack Dongarra %K magma %B PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing %C Trondheim, Norway %8 2008-05 %G eng %0 Journal Article %J Journal of Computational Physics %D 2008 %T State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems %A Christof Voemel %A Stanimire Tomov %A Osni Marques %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %B Journal of Computational Physics %V 227 %P 7113-7124 %8 2008-01 %G eng %0 Generic %D 2008 %T Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems %A Stanimire Tomov %A Jack Dongarra %A Marc Baboulin %K magma %B University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210) %8 2008-01 %G eng %0 Journal Article %J ACM Transactions on
Mathematical Software %D 2008 %T Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Piotr Luszczek %A Stanimire Tomov %K plasma %B ACM Transactions on Mathematical Software %V 34 %P 17-22 %8 2008-00 %G eng %0 Journal Article %J In High Performance Computing and Grids in Action (to appear) %D 2007 %T Exploiting Mixed Precision Floating Point Hardware in Scientific Computations %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Julie Langou %A Piotr Luszczek %A Stanimire Tomov %E Lucio Grandinetti %B In High Performance Computing and Grids in Action (to appear) %I IOS Press %C Amsterdam %8 2007-00 %G eng %0 Journal Article %J Journal of Computational Physics %D 2007 %T The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot %A Christof Voemel %A Stanimire Tomov %A Lin-Wang Wang %A Osni Marques %A Jack Dongarra %B Journal of Computational Physics %V 223 %P 774-782 %8 2007-00 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering %D 2006 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Jack Dongarra %A Andrew Canning %A Lin-Wang Wang %B International Journal of Computational Science and Engineering %V 2 %P 205-212 %8 2006-00 %G eng %0 Journal Article %J PARA 2006 %D 2006 %T The Impact of Multicore on Math Software %A Alfredo Buttari %A Jack Dongarra %A Jakub Kurzak %A Julien Langou %A Piotr Luszczek %A Stanimire Tomov %K plasma %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Performance evaluation of eigensolvers in nano-structure computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K 
doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Journal Article %J J. Phys.: Conf. Ser. 46 %D 2006 %T Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures %A Alex Zunger %A Alberto Franceschetti %A Gabriel Bester %A Wesley B. Jones %A Kwiseon Kim %A Peter A. Graf %A Lin-Wang Wang %A Andrew Canning %A Osni Marques %A Christof Voemel %A Jack Dongarra %A Julien Langou %A Stanimire Tomov %K DOE_NANO %B J. Phys.: Conf. Ser. 46 %V 46 %P 292-298 %8 2006-01 %G eng %R 10.1088/1742-6596/46/1/040 %0 Journal Article %J PARA 2006 %D 2006 %T Prospectus for the Next LAPACK and ScaLAPACK Libraries %A James Demmel %A Jack Dongarra %A B. Parlett %A William Kahan %A Ming Gu %A David Bindel %A Yozo Hida %A Xiaoye Li %A Osni Marques %A Jason E. Riedy %A Christof Voemel %A Julien Langou %A Piotr Luszczek %A Jakub Kurzak %A Alfredo Buttari %A Julie Langou %A Stanimire Tomov %B PARA 2006 %C Umea, Sweden %8 2006-06 %G eng %0 Conference Proceedings %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %D 2006 %T Towards bulk based preconditioning for quantum dot computations %A Andrew Canning %A Jack Dongarra %A Julien Langou %A Osni Marques %A Stanimire Tomov %A Christof Voemel %A Lin-Wang Wang %K doe-nano %B IEEE/ACM Proceedings of HPCNano SC06 (to appear) %8 2006-01 %G eng %0 Journal Article %J Journal of Computational Physics (submitted) %D 2006 %T The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot %A Christof Voemel %A Stanimire Tomov %A Lin-Wang Wang %A Osni Marques %A Jack Dongarra %K doe-nano %B Journal of Computational Physics (submitted) %8 2006-01 %G eng %0 Conference Proceedings %B Proceedings of 5th International Conference on Computational Science (ICCS) %D 2005 %T Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A
Jack Dongarra %E V. S. Sunderman %E Geert Dick van Albada %E Peter M. Sloot %E Jack Dongarra %K doe-nano %B Proceedings of 5th International Conference on Computational Science (ICCS) %I Springer's Lecture Notes in Computer Science %C Atlanta, GA, USA %P 317-325 %8 2005-01 %G eng %0 Journal Article %J International Journal of Computational Science and Engineering (to appear) %D 2005 %T Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures %A Stanimire Tomov %A Julien Langou %A Andrew Canning %A Lin-Wang Wang %A Jack Dongarra %B International Journal of Computational Science and Engineering (to appear) %8 2005-01 %G eng