Publications
Performance, Design, and Autotuning of Batched GEMM for GPUs,”
The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany, June 2016.
(1.27 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures,”
Procedia Computer Science, vol. 108, pp. 606–615, June 2017.
(643.44 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
, San Jose, CA, GPU Technology Conference (GTC16), Poster, April 2016.
(480.51 KB)
![application/pdf](/modules/file/icons/application-pdf.png)
High-Performance Tensor Contractions for GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-738: University of Tennessee, January 2016.
(2.36 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Set of Batched Basic Linear Algebra Subprograms,”
ACM Transactions on Mathematical Software, October 2020.
“MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
(2.55 MB)
![application/pdf](/modules/file/icons/application-pdf.png)
GPU algorithms for Efficient Exascale Discretizations,”
Parallel Computing, vol. 108, pp. 102841, 2021.
“Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators,”
VECPAR 2012, Kobe, Japan, July 2012.
(737.28 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Tensor Contractions using Optimized Batch GEMM Routines
, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
(1.64 MB)
![application/pdf](/modules/file/icons/application-pdf.png)
SLATE Port to AMD and Intel Platforms,”
SLATE Working Notes, no. 16, ICL-UT-21-01, April 2021.
(890.75 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(626.21 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,”
Journal of Computational Science, vol. 20, pp. 85–93, May 2017.
(3.6 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Linear Algebra Software for Large-Scale Accelerated Multicore Computing,”
Acta Numerica, vol. 25, pp. 1-160, May 2016.
“Progressive Optimization of Batched LU Factorization on GPUs,”
IEEE High Performance Extreme Computing Conference (HPEC’19), Waltham, MA, IEEE, September 2019.
(299.38 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic,”
SLATE Working Notes, no. 15, ICL-UT-20-08: University of Tennessee, July 2020.
(3.98 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 01, ICL-UT-17-02: Innovative Computing Laboratory, University of Tennessee, June 2017.
(2.8 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(2.36 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices using GPUs,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, Springer, Cham, June 2020.
(702.38 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,”
IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.
(729.87 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores
, San Jose, CA, GPU Technology Conference (GTC), March 2019.
(2.47 MB)
![application/pdf](/modules/file/icons/application-pdf.png)
Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs,”
International Conference on Supercomputing (ICS '17), Chicago, Illinois, ACM, June 2017.
(1.04 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
PLASMA 17.1 Functionality Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-10: University of Tennessee, June 2017.
(1.8 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
PLASMA 17 Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-11: University of Tennessee, June 2017.
(7.57 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,”
Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015.
(3.68 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
The Future of Supercomputing: An Interim Report,”
National Research Council, Washington, D.C., The National Academies Press, January 2003.
“