Publications
“Performance Application Programming Interface,”
Accelerated Computing with HIP: Sun, Baruah and Kaeli, December 2022.
“PAPI's new Software-Defined Events for in-depth Performance Analysis,”
Dresden, Germany, 13th Parallel Tools Workshop, September 2019.
“Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI,”
2019 International Conference on Parallel Computing (ParCo2019), Prague, Czech Republic, September 2019.
“Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC),”
LAPACK Working Notes, no. 297, ICL-UT-20-07: University of Tennessee.
“Prospectus for the Next LAPACK and ScaLAPACK Libraries,”
PARA 2006, Umea, Sweden, June 2006.
“LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers,”
LAPACK Working Note 164, January 2005.
“Accelerating Time-To-Solution for Computational Science and Engineering,”
SciDAC Review, 2009.
“Self Adapting Linear Algebra Algorithms and Software,”
IEEE Proceedings (to appear), 2004.
“Towards An Efficient, Scalable Replication Mechanism for the I2-DSI Project,”
University of North Carolina School of Library and Information Science Technical Report, no. TR-1999-01, January 1999.
“Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors,”
ACM Transactions on Mathematical Software, vol. 49, issue 3, pp. 1 - 29, September 2023.
“O(N) distributed direct factorization of structured dense matrices using runtime systems,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, August 2023.
“FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study,”
Lecture Notes in Computer Science, vol. 4192, no. ICL-UT-06-14: Springer Berlin / Heidelberg, pp. 133-140, 2006.
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, May 2014.
“A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015.
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-713, July 2013.
“On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,”
University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.
“Performance evaluation of LU factorization through hardware counter measurements,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.
“A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-16-02: University of Tennessee, August 2016.
“A Fast Batched Cholesky Factorization on a GPU,”
International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, September 2014.
“Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.
“Acceleration of the BLAST Hydro Code on GPU,”
Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
“Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,”
International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017.
“Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,”
Journal of Computational Science, vol. 26, pp. 237–245, May 2018.
“LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU,”
16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
“Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, no. CS-89-85: University of Tennessee, June 2014.
“HPCS Library Study Effort,”
University of Tennessee Computer Science Technical Report, UT-CS-08-617, January 2008.
“Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster),”
Seattle, WA, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, 2020.
“Numerical Linear Algebra,”
Encyclopedia of Computer Science and Technology, eds. Kent, A., Williams, J., vol. 41, pp. 207-233, August 1999.
“Report on the Oak Ridge National Laboratory's Frontier System,”
ICL Technical Report, no. ICL-UT-22-05, May 2022.
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
University of Tennessee Computer Science Technical Report (also LAWN 257), no. UT-CS-11-684, October 2011.
“An Overview of Heterogeneous High Performance and Grid Computing,”
Engineering the Grid (to appear): Nova Science Publishers, Inc., 2004.
“Using PAPI for Hardware Performance Monitoring on Linux Systems,”
Conference on Linux Clusters: The HPC Revolution, Urbana, Illinois, Linux Clusters Institute, June 2001.
“Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
“HPC Challenge: Design, History, and Implementation Highlights,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
“A Not So Simple Matter of Software,”
NCSA Access Online: NCSA, 2005.
“An Asynchronous Algorithm on NetSolve Global Computing System,”
Future Generation Computer Systems, vol. 22, issue 3, pp. 279-290, February 2006.
“Remembering Ken Kennedy,”
SciDAC Review, vol. 5, no. 2007, 2007.
“High Performance Computing Trends,”
HERMIS, vol. 2, pp. 155-163, November 2001.
“Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification,”
July 2018.
“HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems,”
University of Tennessee Computer Science Technical Report, no. ut-eecs-15-736: University of Tennessee, January 2015.
“High Performance Computing Trends, Supercomputers, Clusters, and Grids,”
Information Processing Society of Japan Symposium Series, vol. 2003, no. 14, pp. 55-58, January 2003.
“How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination,”
in Beautiful Code: Leading Programmers Explain How They Think: O'Reilly Media, Inc., June 2007.
“Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, UT-CS-89-85, 2010.
“Revisiting the Double Checkpointing Algorithm,”
University of Tennessee Computer Science Technical Report (LAWN 274), no. ut-cs-13-705, January 2013.
“Recent Advances in Parallel Virtual Machine and Message Passing Interface,”
Lecture Notes in Computer Science, vol. 1908: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000, Hungary: Springer Verlag, January 2000.
“Empirical Performance Tuning of Dense Linear Algebra Software,”
in Performance Tuning of Scientific Applications (to appear), 2010.
“Finite-choice Algorithm Optimization in Conjugate Gradients (LAPACK Working Note 159),”
University of Tennessee Computer Science Technical Report, UT-CS-03-502, January 2003.
“Numerical Libraries and Tools for Scalable Parallel Cluster Computing,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 2, pp. 175-180, January 2001.
“The evolution of mathematical software,”
Communications of the ACM, vol. 65, issue 12, pp. 66 - 72, December 2022.