Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,” Journal of Computational Science, vol. 26, pp. 226–236, May 2018.“
Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,” Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018.“
The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,” International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018.“
Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.“
A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973–984, May 2018.“
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,” The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.“
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100 , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Investigating Power Capping toward Energy-Efficient Scientific Applications,” Concurrency Computation: Practice and Experience, vol. 2018, issue e4485, pp. 1-14, April 2018.“
MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines , Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR) , Washington, DC, NSF PI Meeting, Poster, April 2018.
Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,” IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.“
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018.“
Tensor Contractions using Optimized Batch GEMM Routines , San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption,” ISC High Performance (ISC'18), Best Poster, Frankfurt, Germany, June 2018.“
Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption , Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,” Parallel Computing, vol. 81, pp. 1–21, January 2019.“
Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.“
Evaluation of Directive-Based Performance Portable Programming Models,” International Journal of High Performance Computing and Networking, vol. 14, issue 2, pp. 165-182.“
FFT-ECP Fast Fourier Transform , Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.“
GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.“
Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.“
MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs : University of Tennessee, January 2019.
PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,” ACM Transactions on Mathematical Software, vol. 45, issue 2, June 2019.“
FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs,” ECP Milestone Report, no. FFT-ECP STML13-27: Innovative Computing Laboratory, University of Tennessee, January 2020.“
heFFTe: Highly Efficient FFT for Exascale,” International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.“
heFFTe: Highly Efficient FFT for Exascale (Poster) , Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster) , Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster) : NVIDIA GPU Technology Conference (GTC2020), October 2020.
MAGMA Templates for Scalable Linear Algebra on Emerging Architectures,” The International Journal of High Performance Computing Applications, vol. 34, issue 6, pp. 645-658, November 2020.“
Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems,” Proceedings of the Royal Society A, vol. 476, issue 2243, November 2020.“
Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-05: University of Tennessee, May 2020.“
A Set of Batched Basic Linear Algebra Subprograms,” ACM Transactions on Mathematical Software, October 2020.“