%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2016
%T Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
%A Ahmad Abdelfattah
%A Hatem Ltaeif
%A David Keyes
%A Jack Dongarra
%X Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications.
%B Concurrency and Computation: Practice and Experience
%V 28
%P 3447 - 3465
%8 2016-05
%G eng
%U http://onlinelibrary.wiley.com/doi/10.1002/cpe.3874/full
%N 12
%! Concurrency Computat.: Pract. Exper.
%R 10.1002/cpe.3874
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2014
%T Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K factorization
%K parallel linear algebra
%K plasma
%K recursion
%K shared memory synchronization
%K threaded parallelism
%X The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS.
%B Concurrency and Computation: Practice and Experience
%V 26
%P 1408-1431
%8 2014-05
%G eng
%U http://doi.wiley.com/10.1002/cpe.3110
%N 7
%! Concurrency Computat.: Pract. Exper.
%& 1408
%R 10.1002/cpe.3110
%0 Journal Article
%J ACM Transactions on Mathematical Software (TOMS)
%D 2013
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%K algorithms
%K bidiagonal reduction
%K bulge chasing
%K data translation layer
%K dynamic scheduling
%K high performance kernels
%K performance
%K tile algorithms
%K two-stage approach
%X This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows to cast most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.
%B ACM Transactions on Mathematical Software (TOMS)
%V 39
%G eng
%N 3
%R 10.1145/2450153.2450154
%0 Journal Article
%J IPDPS 2012
%D 2012
%T A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
%A Azzam Haidar
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%B IPDPS 2012
%C Shanghai, China
%8 2012-05
%G eng
%0 Conference Proceedings
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%D 2012
%T Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture
%A Jack Dongarra
%A Hatem Ltaeif
%A Piotr Luszczek
%A Vincent M Weaver
%B The 2nd International Conference on Cloud and Green Computing (submitted)
%C Xiangtan, Hunan, China
%8 2012-11
%G eng
%0 Journal Article
%J Lecture Notes in Computer Science
%D 2012
%T Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%B Lecture Notes in Computer Science
%V 7203
%P 661-670
%8 2012-09
%G eng
%0 Journal Article
%J Supercomputing '12 (poster)
%D 2012
%T Matrices Over Runtime Systems at Exascale
%A Emmanuel Agullo
%A George Bosilca
%A Cedric Castagnède
%A Jack Dongarra
%A Hatem Ltaeif
%A Stanimire Tomov
%B Supercomputing '12 (poster)
%C Salt Lake City, Utah
%8 2012-11
%G eng
%0 Journal Article
%J VECPAR 2012
%D 2012
%T Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators
%A Ahmad Abdelfattah
%A Jack Dongarra
%A David Keyes
%A Hatem Ltaeif
%B VECPAR 2012
%C Kobe, Japan
%8 2012-07
%G eng
%0 Conference Proceedings
%B Third International Conference on Energy-Aware High Performance Computing
%D 2012
%T Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
%A George Bosilca
%A Jack Dongarra
%A Hatem Ltaeif
%B Third International Conference on Energy-Aware High Performance Computing
%C Hamburg, Germany
%8 2012-09
%G eng
%0 Journal Article
%J SIAM Journal on Scientific Computing (Accepted)
%D 2012
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B SIAM Journal on Scientific Computing (Accepted)
%8 2012-07
%G eng
%0 Generic
%D 2011
%T Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report (also as a LAWN)
%8 2011-09
%G eng
%0 Generic
%D 2011
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243)
%8 2011-03
%G eng
%0 Conference Proceedings
%B Proceedings of PARCO'11
%D 2011
%T Exploiting Fine-Grain Parallelism in Recursive LU Factorization
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%K plasma
%B Proceedings of PARCO'11
%C Gent, Belgium
%8 2011-04
%G eng
%0 Conference Proceedings
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%D 2011
%T Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%B Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops)
%I IEEE
%C Anchorage, Alaska, USA
%P 1432-1441
%8 2011-05
%G eng
%0 Generic
%D 2011
%T High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247)
%8 2011-05
%G eng
%0 Conference Proceedings
%B Proceedings of MTAGS11
%D 2011
%T High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Piotr Luszczek
%B Proceedings of MTAGS11
%C Seattle, WA
%8 2011-11
%G eng
%0 Journal Article
%J in GPU Computing Gems, Jade Edition
%D 2011
%T A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%E Wen-mei W. Hwu
%K magma
%K morse
%B in GPU Computing Gems, Jade Edition
%I Elsevier
%V 2
%P 473-484
%8 2011-00
%G eng
%0 Journal Article
%J IEEE/ACS AICCSA 2011
%D 2011
%T LU Factorization for Accelerator-Based Systems
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Julien Langou
%A Hatem Ltaeif
%A Stanimire Tomov
%K magma
%K morse
%B IEEE/ACS AICCSA 2011
%C Sharm-El-Sheikh, Egypt
%8 2011-12
%G eng
%0 Conference Proceedings
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%K plasma
%K quark
%B Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
%C Seattle, WA
%8 2011-11
%G eng
%0 Generic
%D 2011
%T Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B University of Tennessee Computer Science Technical Report, UT-CS-11-677, (also Lawn 254)
%8 2011-08
%G eng
%0 Conference Proceedings
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%D 2011
%T Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency
%A Hatem Ltaeif
%A Piotr Luszczek
%A Jack Dongarra
%K mumi
%B International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011)
%C Hamburg, Germany
%8 2011-09
%G eng
%0 Journal Article
%J Submitted to SIAM Journal on Scientific Computing (SISC)
%D 2011
%T Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices
%A Azzam Haidar
%A Hatem Ltaeif
%A Jack Dongarra
%B Submitted to SIAM Journal on Scientific Computing (SISC)
%8 2011-00
%G eng
%0 Conference Proceedings
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%D 2011
%T Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures
%A Piotr Luszczek
%A Hatem Ltaeif
%A Jack Dongarra
%B IEEE International Parallel and Distributed Processing Symposium (submitted)
%C Anchorage, AK
%8 2011-05
%G eng
%0 Journal Article
%J Submitted to Concurrency and Computation: Practice and Experience
%D 2010
%T Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures
%A Azzam Haidar
%A Hatem Ltaeif
%A Asim YarKhan
%A Jack Dongarra
%K plasma
%K quark
%B Submitted to Concurrency and Computation: Practice and Experience
%8 2010-11
%G eng
%0 Conference Proceedings
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%D 2010
%T Dense Linear Algebra Solvers for Multicore with GPU Accelerators
%A Stanimire Tomov
%A Rajib Nath
%A Hatem Ltaeif
%A Jack Dongarra
%X Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.
%B Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on
%C Atlanta, GA
%P 1-8
%G eng
%R 10.1109/IPDPSW.2010.5470941
%0 Generic
%D 2010
%T Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K dplasma
%K parsec
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-10-660
%8 2010-09
%G eng
%0 Generic
%D 2010
%T Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project
%A George Bosilca
%A Aurelien Bouteiller
%A Anthony Danalis
%A Mathieu Faverge
%A Azzam Haidar
%A Thomas Herault
%A Jakub Kurzak
%A Julien Langou
%A Pierre Lemarinier
%A Hatem Ltaeif
%A Piotr Luszczek
%A Asim YarKhan
%A Jack Dongarra
%K dague
%K plasma
%B Innovative Computing Laboratory Technical Report
%8 2010-00
%G eng
%0 Generic
%D 2010
%T Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%B LAPACK Working Note
%8 2010-00
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (submitted)
%D 2010
%T Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators
%A Hatem Ltaeif
%A Stanimire Tomov
%A Rajib Nath
%A Jack Dongarra
%K magma
%K plasma
%B IEEE Transactions on Parallel and Distributed Systems (submitted)
%8 2010-03
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems
%D 2010
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems
%P 417-423
%8 2010-04
%G eng
%0 Conference Proceedings
%B Proceedings of IPDPS 2011
%D 2010
%T QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Mathieu Faverge
%A Hatem Ltaeif
%A Samuel Thibault
%A Stanimire Tomov
%K magma
%K morse
%K plasma
%B Proceedings of IPDPS 2011
%C Anchorage, AK
%8 2010-10
%G eng
%0 Journal Article
%J Proc. of VECPAR'10 (to appear)
%D 2010
%T A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
%A Hatem Ltaeif
%A Stanimire Tomov
%A Rajib Nath
%A Peng Du
%A Jack Dongarra
%K magma
%K plasma
%B Proc. of VECPAR'10 (to appear)
%C Berkeley, CA
%8 2010-06
%G eng
%0 Journal Article
%J SC'10
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaeif
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B SC'10
%I ACM SIGARCH / IEEE Computer Society
%C New Orleans, LA
%8 2010-11
%G eng
%0 Generic
%D 2010
%T Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems
%A Fengguang Song
%A Hatem Ltaeif
%A Bilel Hadri
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report
%V UT-CS-10-653
%8 2010-04
%G eng
%0 Generic
%D 2010
%T Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
%A Emmanuel Agullo
%A Cedric Augonnet
%A Jack Dongarra
%A Hatem Ltaeif
%A Raymond Namyst
%A Rajib Nath
%A Jean Roman
%A Samuel Thibault
%A Stanimire Tomov
%I 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster
%C Knoxville, TN
%8 2010-07
%G eng
%0 Journal Article
%J Concurrency and Computation: Practice and Experience
%D 2010
%T Scheduling Dense Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K gridpac
%K plasma
%B Concurrency and Computation: Practice and Experience
%V 22
%P 15-44
%8 2010-01
%G eng
%0 Journal Article
%J Journal of Scientific Computing
%D 2010
%T Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Journal of Scientific Computing
%V 18
%P 33-50
%8 2010-00
%G eng
%0 Conference Proceedings
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%D 2009
%T Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware
%A Emmanuel Agullo
%A Bilel Hadri
%A Hatem Ltaeif
%A Jack Dongarra
%B 2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear)
%8 2009-00
%G eng
%0 Journal Article
%J PPAM 2009
%D 2009
%T Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B PPAM 2009
%C Poland
%8 2009-09
%G eng
%0 Journal Article
%J Submitted to Transactions on Parallel and Distributed Systems
%D 2009
%T Enhancing Parallelism of Tile QR Factorization for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Submitted to Transactions on Parallel and Distributed Systems
%8 2009-12
%G eng
%0 Generic
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Rajib Nath
%A Stanimire Tomov
%A Asim YarKhan
%A Vasily Volkov
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, OR
%8 2009-11
%G eng
%0 Conference Proceedings
%B Journal of Physics: Conference Series
%D 2009
%T Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
%A Emmanuel Agullo
%A James Demmel
%A Jack Dongarra
%A Bilel Hadri
%A Jakub Kurzak
%A Julien Langou
%A Hatem Ltaeif
%A Piotr Luszczek
%A Stanimire Tomov
%K magma
%K plasma
%B Journal of Physics: Conference Series
%V 180
%8 2009-00
%G eng
%0 Generic
%D 2009
%T Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project
%A Rajib Nath
%A Jack Dongarra
%A Stanimire Tomov
%A Hatem Ltaeif
%A Peng Du
%I The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09)
%C Portland, Oregon
%8 2009-11
%G eng
%0 Journal Article
%J IEEE Transactions on Parallel and Distributed Systems (to appear)
%D 2009
%T Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%B IEEE Transactions on Parallel and Distributed Systems (to appear)
%8 2009-05
%G eng
%0 Generic
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%B University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213)
%8 2009-00
%G eng
%0 Journal Article
%J Concurrency Practice and Experience (to appear)
%D 2009
%T Scheduling Linear Algebra Operations on Multicore Processors
%A Jakub Kurzak
%A Hatem Ltaeif
%A Jack Dongarra
%A Rosa M. Badia
%K plasma
%B Concurrency Practice and Experience (to appear)
%8 2009-00
%G eng
%0 Generic
%D 2009
%T Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645)
%8 2009-09
%G eng
%0 Conference Proceedings
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%D 2009
%T Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
%A Bilel Hadri
%A Hatem Ltaeif
%A Emmanuel Agullo
%A Jack Dongarra
%K plasma
%B accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010)
%C Atlanta, GA
%8 2009-12
%G eng
%0 Generic
%D 2008
%T Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited
%A Hatem Ltaeif
%A Jakub Kurzak
%A Jack Dongarra
%K plasma
%B University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208)
%8 2008-08
%G eng