Publications
Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Parallel Computing, vol. 81, pp. 1–21, January 2019.
DOI: 10.1016/j.parco.2018.10.003
(3.27 MB)
“
Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-09: Innovative Computing Laboratory, University of Tennessee, September 2018.
(3.74 MB)
“
Bidiagonal SVD Computation via an Associated Tridiagonal Eigenproblem,”
LAPACK Working Note, no. LAWN 295, ICL-UT-18-02: University of Tennessee, April 2018.
(1.53 MB)
“
Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,”
International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, ACM, September 2011.
DOI: 10.1109/ICPP.2011.71
(1.41 MB)
“
Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators,”
IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist, Waltham, MA, IEEE, September 2019.
(470.21 KB)
“
Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
DOI: 10.14529/jsfi180404
(720.15 KB)
“
Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance,”
2022 IEEE High Performance Extreme Computing Conference (HPEC): IEEE, September 2022.
DOI: 10.1109/HPEC55821.2022.9926401
“The PLASMA Library on CORAL Systems and Beyond (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(550.86 KB)

Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds,”
IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017), Boston, MA, IEEE, December 2017.
DOI: 10.1109/BigData.2017.8258258
(6.71 MB)
“
Combining multitask and transfer learning with deep Gaussian processes for autotuning-based performance engineering,”
The International Journal of High Performance Computing Applications, March 2023.
DOI: 10.1177/10943420231166365
“Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications,”
Future Generation Computer Systems, vol. 160, pp. 359 - 374, November 2024.
DOI: 10.1016/j.future.2024.06.004
“Scalable Data Generation for Evaluating Mixed-Precision Solvers,”
2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, IEEE, September 2020.
DOI: 10.1109/HPEC43674.2020.9286145
(1.3 MB)
“
Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic,”
2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, IEEE, September 2017.
DOI: 10.1109/HPEC.2017.8091031
(1.67 MB)
“
Numerical eigen-spectrum slicing, accurate orthogonal eigen-basis, and mixed-precision eigenvalue refinement using OpenMP data-dependent tasks and accelerator offload,”
The International Journal of High Performance Computing Applications, vol. 303, issue 136, September 2024.
DOI: 10.1177/10943420241281050
“Parallel Programming in MATLAB,”
The International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 277-283, July 2009.
(215.71 KB)
“
ADAPT: An Event-Based Adaptive Collective Communication Framework,”
The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18), Tempe, Arizona, ACM Press, June 2018.
DOI: 10.1145/3208040.3208054
(493.65 KB)
“
HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
(764.05 KB)
“
Evolution of the computational science community: The dynamics of topics and collaborations in 24 years of ICCS and JoCS publications,”
Journal of Computational Science, vol. 89, July 2025.
DOI: 10.1016/j.jocs.2025.102609
“Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD,”
Concurrency and Computation: Practice and Experience, April 2020.
DOI: 10.1002/cpe.5754
(1.43 MB)
“
Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited,”
University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208), August 2008.
(420.31 KB)
“
High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247), May 2011.
(424.93 KB)
“
Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures,”
Journal of Scientific Computing, vol. 18, no. 1, pp. 33-50, 00 2010.
(334.5 KB)
“