Publications
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
Concurrency and Computation: Practice and Experience, November 2013.
“An Updated Set of Basic Linear Algebra Subprograms (BLAS),”
ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135-151, December 2002.
“Updating Incomplete Factorization Preconditioners for Model Order Reduction,”
Numerical Algorithms, vol. 73, no. 3, pp. 611–630, February 2016.
“The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot,”
Journal of Computational Physics, vol. 223, pp. 774-782, 2007.
“The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot,”
Journal of Computational Physics (submitted), January 2006.
“Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning,”
Journal of Parallel and Distributed Computing, vol. 119, pp. 219–230, November 2018.
“Using MAGMA with PGI Fortran,”
PGI Insider, November 2010.
“Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,”
ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 17-22, 2008.
“Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,”
Parallel Computing, vol. 81, pp. 131-146, January 2019.
“The Virtual Instrument: Support for Grid-enabled Scientific Simulations,”
Journal of Parallel and Distributed Computing (submitted), October 2002.
“The Virtual Instrument: Support for Grid-enabled Scientific Simulations,”
International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 3-17, January 2004.
“VisPerf: Monitoring Tool for Grid Computing,”
Lecture Notes in Computer Science, vol. 2659: Springer Verlag, Heidelberg, pp. 233-243, 2003.
“Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems,”
SIAM Journal on Computing (submitted), March 2012.
“With Extreme Computing, the Rules Have Changed,”
Computing in Science & Engineering, vol. 19, issue 3, pp. 52-62, May 2017.
Accelerating FFT towards Exascale Computing
NVIDIA GPU Technology Conference (GTC2021), 2021.
Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster, September 2016.
“Acceleration of the BLAST Hydro Code on GPU,”
Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
San Jose, CA, GPU Technology Conference (GTC16), Poster, April 2016.
Clover: Computational Libraries Optimized via Exascale Research
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
DTE: PaRSEC Enabled Libraries and Applications
2021 Exascale Computing Project Annual Meeting, April 2021.
DTE: PaRSEC Enabled Libraries and Applications (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
DTE: PaRSEC Systems and Interfaces (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project)
Austin, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08), November 2008.
Exa-PAPI: The Exascale Performance API with Modern C++
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
FFT-ECP Fast Fourier Transform
Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
“GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,”
EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
heFFTe: Highly Efficient FFT for Exascale (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster)
NVIDIA GPU Technology Conference (GTC2020), October 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster)
Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
Washington, DC, NSF PI Meeting, Poster, April 2018.
Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project
Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores
San Jose, CA, GPU Technology Conference (GTC), March 2019.
Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster)
Seattle, WA, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, 2020.
The PLASMA Library on CORAL Systems and Beyond (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
SLATE: Software for Linear Algebra Targeting Exascale (Poster)
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
A Standard for Batched BLAS Routines
Paris, France, 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16), April 2016.
Tensor Contractions using Optimized Batch GEMM Routines
San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Towards a High-Performance Tensor Algebra Package for Accelerators
Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC15), September 2015.
Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster)
Austin, TX, EuroMPI/USA '20: 27th European MPI Users' Group Meeting, September 2020.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Using Quantized Integer in LU Factorization with Partial Pivoting (Poster)
Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
“Exascale Computing and Big Data,”
Communications of the ACM, vol. 58, no. 7: ACM, pp. 56-68, July 2015.
Generalizing Random Butterfly Transforms to Arbitrary Matrix Sizes
arXiv, December 2023.
hipMAGMA v1.0
Zenodo, March 2020.
hipMAGMA v2.0
Zenodo, July 2020.