Publications
“HPC Forecast: Cloudy and Uncertain,”
Communications of the ACM, vol. 66, issue 2, pp. 82-90, January 2023.
DOI: 10.1145/3552309
“HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
“Harnessing the Computing Continuum for Programming Our World,”
Fog Computing: Theory and Practice: John Wiley & Sons, Inc., 2020.
DOI: 10.1002/9781119551713.ch7
“heFFTe: Highly Efficient FFT for Exascale,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
DOI: 10.1007/978-3-030-50371-0_19
heFFTe: Highly Efficient FFT for Exascale (Poster),
NVIDIA GPU Technology Conference (GTC2020), October 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster),
Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster),
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
“High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA): IEEE, November 2020.
hipMAGMA v1.0,
Zenodo, March 2020.
DOI: 10.5281/zenodo.3908549
hipMAGMA v2.0,
Zenodo, July 2020.
DOI: 10.5281/zenodo.3928667
How to Build Your Own Deep Neural Network,
PEARC20, July 2020.
“Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
DOI: 10.1109/SC.2018.00050
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100,
San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
“High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation,”
8th Workshop on Irregular Applications: Architectures and Algorithms, 2018.
“High-performance Cholesky Factorization for GPU-only Execution,”
Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.
DOI: 10.1145/3038228.3038237
“Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
“High-performance Matrix-matrix Multiplications of Very Small Matrices,”
22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016.
“High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
“Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
“Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems,”
Parallel Computing, vol. 39, issue 4-5, pp. 212-232, May 2013.
“Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,”
University of Tennessee Computer Science Technical Report, no. UT-CS-13-714, July 2013.
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, IEEE Computer Society Press, May 2012.
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
University of Tennessee Computer Science Technical Report (also LAWN 257), no. UT-CS-11-684, October 2011.
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
University of Tennessee Computer Science Technical Report (also LAWN 247), no. UT-CS-11-673, May 2011.
“High-Performance High-Resolution Semi-Lagrangian Tracer Transport on a Sphere,”
Journal of Computational Physics, vol. 230, issue 17, pp. 6778-6799, July 2011.
DOI: 10.1016/j.jcp.2011.05.008
“A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,”
in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 2011.
“Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,”
IEEE Transactions on Parallel and Distributed Systems (submitted), March 2010.
“Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing,”
IEEE Transactions on Computers, vol. 58, issue 11, pp. 1512-1524, November 2009.
DOI: 10.1109/TC.2009.42