Publications
HPC Forecast: Cloudy and Uncertain,”
Communications of the ACM, vol. 66, issue 2, pp. 82 - 90, January 2023.
DOI: 10.1145/3552309
“HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
(764.05 KB)
“Harnessing the Computing Continuum for Programming Our World,”
Fog Computing: Theory and Practice: John Wiley & Sons, Inc., 2020.
DOI: 10.1002/9781119551713.ch7 (1.4 MB)
“heFFTe: Highly Efficient FFT for Exascale,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
DOI: 10.1007/978-3-030-50371-0_19 (2.62 MB)
“heFFTe: Highly Efficient FFT for Exascale (Poster)
: NVIDIA GPU Technology Conference (GTC2020), October 2020.
(866.88 KB)
heFFTe: Highly Efficient FFT for Exascale (Poster)
, Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
(1.54 MB)
heFFTe: Highly Efficient FFT for Exascale (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(6.2 MB)
High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA): IEEE, November 2020.
(1.3 MB)
“hipMAGMA v1.0
: Zenodo, March 2020.
DOI: 10.5281/zenodo.3908549
hipMAGMA v2.0
: Zenodo, July 2020.
DOI: 10.5281/zenodo.3928667
How to Build Your Own Deep Neural Network
: PEARC20, July 2020.
(18.8 MB)
Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
(1016.52 KB)
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
DOI: 10.1109/SC.2018.00050 (642.51 KB)
“Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
(2.96 MB)
High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation,”
8th Workshop on Irregular Applications: Architectures and Algorithms, 2018.
“High-performance Cholesky Factorization for GPU-only Execution,”
Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.
DOI: 10.1145/3038228.3038237 (872.18 KB)
“Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(535.72 KB)
“Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(2.73 MB)
“High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems,”
International Journal of High Performance Computing Applications, vol. 30, issue 1, pp. 3 - 10, February 2016.
DOI: 10.1177/1094342015593158 (277.51 KB)
“High Performance Realtime Convex Solver for Embedded Systems,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-745, October 2016.
(225.43 KB)
“High-performance Matrix-matrix Multiplications of Very Small Matrices,”
22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016.
“High-Performance Tensor Contractions for GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-738: University of Tennessee, January 2016.
(2.36 MB)
“High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(2.36 MB)
“The HPL Benchmark: Past, Present & Future
, ISC High Performance, Frankfurt, Germany, July 2016.
(3.41 MB)
Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(1.11 MB)
“High-Performance Computing,”
The Princeton Companion to Applied Mathematics, Princeton, New Jersey, Princeton University Press, pp. 839-842, 2015.
“High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems,”
The International Journal of High Performance Computing Applications, 2015.
DOI: 10.1177/1094342015593158 (336.19 KB)
“HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
Scientific Programming, vol. 23, issue 1, January 2015.
DOI: 10.3233/SPR-140404 (553.94 KB)
“HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems,”
University of Tennessee Computer Science Technical Report , no. ut-eecs-15-736: University of Tennessee, January 2015.
“Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments,”
VECPAR 2014, Eugene, OR, June 2014.
(276.52 KB)
“Hybrid Multi-Elimination ILU Preconditioners on GPUs,”
International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.67 MB)
“Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems,”
Parallel Computing, vol. 39, issue 4-5, pp. 212-232, May 2013.
(1.43 MB)
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 3, no. 16, 2013.
DOI: 10.1145/2450153.2450154 (665.7 KB)
“HPC Challenge: Design, History, and Implementation Highlights,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
(790.01 KB)
“Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.
(866.68 KB)
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, IEEE Computer Society Press, May 2012.
(405.71 KB)
“HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
(165.9 KB)
“High Performance Computing Systems: Status and Outlook,”
Acta Numerica, vol. 21, Cambridge, UK, Cambridge University Press, pp. 379-474, May 2012.
(1.48 MB)
“High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors,”
ICCS 2012, Omaha, NE, June 2012.
(1.27 MB)
“How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE,”
University of Tennessee Computer Science Technical Report (also LAWN 270), no. UT-CS-12-698, July 2012.
(501.53 KB)
“HPC Challenge: Design, History, and Implementation Highlights,”
On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear): Chapman & Hall/CRC Press, 00 2012.
(469.92 KB)
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
University of Tennessee Computer Science Technical Report (also Lawn 257), no. UT-CS-11-684, October 2011.
(405.71 KB)
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247), May 2011.
(424.93 KB)
“High Performance Dense Linear System Solver with Soft Error Resilience,”
IEEE Cluster 2011, Austin, TX, September 2011.
(1.27 MB)
“High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures,”
Proceedings of MTAGS11, Seattle, WA, November 2011.
(879.49 KB)
“High-Performance High-Resolution Semi-Lagrangian Tracer Transport on a Sphere,”
Journal of Computational Physics, vol. 230, issue 17, pp. 6778-6799, July 2011.
DOI: 10.1016/j.jcp.2011.05.008 (1.68 MB)
“A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,”
in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 00 2011.
“Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,”
IEEE Transaction on Parallel and Distributed Systems (submitted), March 2010.
(3.75 MB)
“Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing,”
IEEE Transactions on Computers, vol. 58, issue 11, pp. 1512-1524, November 2009.
DOI: 10.1109/TC.2009.42 (1.81 MB)
“A Holistic Approach for Performance Measurement and Analysis for Petascale Applications,”
ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing, vol. 2009, Baton Rouge, Louisiana, Springer-Verlag Berlin Heidelberg 2009, pp. 686-695, May 2009.
(3.96 MB)
“