Publications
“High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems,”
International Journal of High Performance Computing Applications, vol. 30, issue 1, pp. 3 - 10, February 2016.
(277.51 KB)
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 3, no. 16, 2013.
(665.7 KB)
“GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,”
EuroPar 2012 (also LAWN 260), Rhodes Island, Greece, August 2012.
(662.98 KB)
“From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,”
Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.
(1.64 MB)
“Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy,”
University of Tennessee Computer Science Tech Report, no. UT-CS-06-574, LAPACK Working Note #175, April 2006.
(221.39 KB)
“Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,”
in High Performance Computing and Grids in Action, Amsterdam, IOS Press, January 2008.
(92.95 KB)
“Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,”
in High Performance Computing and Grids in Action (to appear), Amsterdam, IOS Press, 2007.
(122.01 KB)
“Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 17, pp. 5096 - 5113, Oct 12, 2015.
(1.99 MB)
“Experiences in autotuning matrix multiplication for energy minimization on GPUs,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 17, pp. 5096-5113, December 2015.
(1.98 MB)
“Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction,”
Lecture Notes in Computer Science, vol. 7203, pp. 661-670, September 2012.
(185.77 KB)
“Design and Implementation of the PULSAR Programming System for Large Scale Computing,”
Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017.
(764.96 KB)
“Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,”
Scalable Computing and Communications: Theory and Practice: John Wiley & Sons, pp. 699-735, March 2013.
(1.01 MB)
“Dense Linear Algebra on Accelerated Multicore Hardware,”
High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 2012.
“DARPA's HPCS Program: History, Models, Tools, Languages,”
in Advances in Computers, vol. 72: Elsevier, January 2008.
(3.61 MB)
“Cray X1 Evaluation Status Report,”
Oak Ridge National Laboratory Report, vol. /-2004/13, January 2004.
(817.33 KB)
“A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction,”
IPDPS 2012, Shanghai, China, May 2012.
(480.43 KB)
“Changes in Dense Linear Algebra Kernels - Decades Long Perspective,”
in Solving the Schrödinger Equation: Has everything been tried? (to appear): Imperial College Press, 2011.
“BlackjackBench: Portable Hardware Characterization with Automated Results Analysis,”
The Computer Journal, March 2013.
(408.45 KB)
“Batched matrix computations on hardware accelerators based on GPUs,”
International Journal of High Performance Computing Applications, February 2015.
(2.16 MB)
“Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
(720.15 KB)
“Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
(2.53 MB)
“Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting,”
Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, May 2014.
(1.96 MB)
“Acceleration of GPU-based Krylov solvers via Data Transfer Reduction,”
International Journal of High Performance Computing Applications, 2015.
“Accelerating Scientific Computations with Mixed Precision Algorithms,”
Computer Physics Communications, vol. 180, issue 12, pp. 2526-2533, December 2009.
(402.69 KB)
“Accelerating Restarted GMRES with Mixed Precision Arithmetic,”
IEEE Transactions on Parallel and Distributed Systems, June 2021.
(572.4 KB)
“Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators,”
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15), no. 5, Austin, TX, ACM, November 2015.
(347.6 KB)
“Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“Task-graph scheduling extensions for efficient synchronization and communication,”
Proceedings of the ACM International Conference on Supercomputing, pp. 88–101, 2021.
“Recursive approach in sparse matrix LU factorization,”
Proceedings of 1st SGI Users Conference, Cracow, Poland (ACC Cyfronet UMM, 2000), pp. 409-418, January 2000.
(176.14 KB)
“Programming the LU Factorization for a Multicore System with Accelerators,”
Proceedings of VECPAR’12, Kobe, Japan, April 2012.
(414.33 KB)
“Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency,”
International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011), Hamburg, Germany, September 2011.
(1.27 MB)
“OpenCL Evaluation for Numerical Linear Algebra Library Development,”
Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10), Knoxville, TN, July 2010.
(2.69 MB)
“Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects,”
Journal of Physics: Conference Series, vol. 180, 2009.
(119.37 KB)
“Mixed-Tool Performance Analysis on Hybrid Multicore Architectures,”
First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010), San Diego, CA, September 2010.
(1.24 MB)
“Measuring Energy and Power with PAPI,”
International Workshop on Power-Aware Systems and Architectures, Pittsburgh, PA, September 2012.
(146.79 KB)
“LAPACK for Clusters Project: An Example of Self Adapting Numerical Software,”
Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04), vol. 9, Big Island, Hawaii, pp. 90282, January 2004.
(80.97 KB)
“Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives,”
Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award, Orlando, FL, June 2017.
(453.66 KB)
“The HPC Challenge (HPCC) Benchmark Suite,”
SC06 Conference Tutorial, Tampa, Florida, IEEE, November 2006.
(1.08 MB)
“High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures,”
Proceedings of MTAGS11, Seattle, WA, November 2011.
(879.49 KB)
“Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1432-1441, May 2011.
(1.26 MB)
“Exploiting Fine-Grain Parallelism in Recursive LU Factorization,”
Proceedings of PARCO'11, no. ICL-UT-11-04, Gent, Belgium, April 2011.
“Evaluation of the HPC Challenge Benchmarks in Virtualized Environments,”
6th Workshop on Virtualization in High-Performance Cloud Computing, Bordeaux, France, August 2011.
(114.73 KB)
“Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture,”
The 2nd International Conference on Cloud and Green Computing (submitted), Xiangtan, Hunan, China, November 2012.
(329.5 KB)
“Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations,”
International Conference on Computational Science, Poland, Springer Verlag, June 2004.
(88.31 KB)
“CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience,”
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Montpellier, France, November 2013.
(238.58 KB)
“BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“Anatomy of a Globally Recursive Embedded LINPACK Benchmark,”
2012 IEEE High Performance Extreme Computing Conference, Waltham, MA, pp. 1-6, September 2012.
(204.74 KB)
“Virtual Systolic Array for QR Decomposition,”
15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), Boston, MA, IEEE, May 2013.
(749.84 KB)
“Using Additive Modifications in LU Factorization Instead of Pivoting,”
37th ACM International Conference on Supercomputing (ICS'23), Orlando, FL, ACM, June 2023.
(624.18 KB)
“Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.51 MB)