Publications
Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels,”
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, WA, November 2011.
(636.01 KB)
“
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,”
Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015.
DOI: 10.14529/jsfi1504
(3.68 MB)
“
Parallel Programming in MATLAB,”
The International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 277-283, July 2009.
(215.71 KB)
“
Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I,”
Lecture Notes in Computer Science, 1, no. 12043: Springer International Publishing, pp. 581, March 2020.
DOI: 10.1007/978-3-030-43229-4
“Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part II,”
Lecture Notes in Computer Science, no. 12044: Springer International Publishing, pp. 503, March 2020.
DOI: 10.1007/978-3-030-43222-5
“Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,”
International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, ACM, September 2011.
DOI: 10.1109/ICPP.2011.71
(1.41 MB)
“
Parallel Norms Performance Report,”
SLATE Working Notes, no. 06, ICL-UT-18-06: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.13 MB)
“
Parallel Dense Linear Algebra Software in the Multicore Era,”
in Cyberinfrastructure Technologies and Applications: Nova Science Publishers, Inc., pp. 9-24, 00 2009.
“Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited,”
University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208), August 2008.
(420.31 KB)
“
Parallel BLAS Performance Report,”
SLATE Working Notes, no. 05, ICL-UT-18-01: University of Tennessee, April 2018.
(4.39 MB)
“
PAQR: Pivoting Avoiding QR factorization,”
ICL Technical Report, no. ICL-UT-22-06, June 2022.
(364.85 KB)
“
PAQR: Pivoting Avoiding QR factorization,”
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, IEEE, 2023.
DOI: 10.1109/IPDPS54959.2023.00040
“PAPI-V: Performance Monitoring for Virtual Machines,”
CloudTech-HPC 2012, Pittsburgh, PA, September 2012.
DOI: 10.1109/ICPPW.2012.29
(2.69 MB)
“
PAPI's new Software-Defined Events for in-depth Performance Analysis
, Dresden, Germany, 13th Parallel Tools Workshop, September 2019.
(3.14 MB)

PAPI's New Software-Defined Events for In-Depth Performance Analysis
, Lyon, France, CCDSC 2018: Workshop on Clusters, Clouds, and Data for Scientific Computing, September 2018.
PAPI Software-Defined Events for in-Depth Performance Analysis,”
The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1113-1127, November 2019.
(442.39 KB)
“
The PAPI Cross-Platform Interface to Hardware Performance Counters,”
Department of Defense Users' Group Conference Proceedings, Biloxi, Mississippi, June 2001.
(328.56 KB)
“
PAPI: Counting outside the Box
, Barcelona, Spain, 8th JLESC Meeting, April 2018.
PAPI 5: Measuring Power, Energy, and the Cloud
, Austin, TX, 2013 IEEE International Symposium on Performance Analysis of Systems and Software, April 2013.
(78.39 KB)

P1673R3: A Free Function Linear algebra Interface Based on the BLAS,”
ISO JTC1 SC22 WG22, no. P1673R3: ISO, April 2021.
(858.89 KB)
“
Overhead of Using Spare Nodes,”
The International Journal of High Performance Computing Applications, February 2020.
DOI: 10.1177%2F1094342020901885
(2.15 MB)
“
Out of Memory SVD Solver for Big Data,”
2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Waltham, MA, IEEE, September 2017.
(1.33 MB)
“
Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,”
International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017.
DOI: 10.1016/j.procs.2017.05.237
(364.95 KB)
“
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,”
ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.
(630.63 KB)
“
Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,”
IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.
(729.87 KB)
“
Optimizing Batch HGEMM on Small Sizes Using Tensor Cores
, San Jose, CA, GPU Technology Conference (GTC), March 2019.
(2.47 MB)

Optimized Batched Linear Algebra for Modern Architectures,”
Euro-Par 2017, Santiago de Compostela, Spain, Springer, August 2017.
DOI: 10.1007/978-3-319-64203-1_37
(618.33 KB)
“