Publications
“Resiliency in numerical algorithm design for extreme scale simulations,”
The International Journal of High Performance Computing Applications, vol. 36, issue 2, pp. 251-285, March 2022.
“Matrices Over Runtime Systems at Exascale,”
Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project,”
Sparse Days 2010 Meeting at CERFACS, Toulouse, France, June 2010.
“A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,”
in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 2011.
“Towards a Complexity Analysis of Sparse Hybrid Linear Solvers,”
PARA 2010, Reykjavik, Iceland, June 2010.
“QCG-OMPI: MPI Applications on Grids,”
Future Generation Computer Systems, vol. 27, no. 4, pp. 357-369, March 2010.
“Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs,”
LAPACK Working Note, no. 230, 2010.
“QCG-OMPI: MPI Applications on Grids,”
Future Generation Computer Systems, vol. 27, no. 4, pp. 357-369, January 2011.
“Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware,”
2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear), 2009.
“Parallel algebraic domain decomposition solver for the solution of augmented systems,”
Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April 2011.
“LU Factorization for Accelerator-Based Systems,”
IEEE/ACS AICCSA 2011, Sharm-El-Sheikh, Egypt, December 2011.
“Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method,”
The Twentieth International Conference on Domain Decomposition Methods, La Jolla, California, February 2011.
“QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment,”
24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224), Atlanta, GA, April 2010.
“Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators,”
Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
“QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators,”
Proceedings of IPDPS 2011, no. ICL-UT-10-04, Anchorage, AK, October 2010.
“A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-15: University of Tennessee, Knoxville, November 2018.
“SLATE Performance Improvements: QR and Eigenvalues,”
SLATE Working Notes, no. 17, ICL-UT-21-02, April 2021.
“Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation,”
Euro-Par 2009, Lecture Notes in Computer Science, vol. 5704/2009, Delft, The Netherlands, Springer Berlin / Heidelberg, pp. 334-344, August 2009.
“Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units,”
Concurrency and Computation: Practice and Experience, vol. 34, issue 14, June 2022.
“Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors,”
Concurrency and Computation: Practice and Experience, August 2023.
“Compressed basis GMRES on high-performance graphics processing units,”
The International Journal of High Performance Computing Applications, May 2022.
“Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 4, pp. 885-904, September 2014.
“Computational Science – ICCS 2009, Proceedings of the 9th International Conference,”
Lecture Notes in Computer Science: Theoretical Computer Science and General Issues, no. 5544-5545, Baton Rouge, LA, May 2009.
“Communication Avoiding LU with Tournament Pivoting in SLATE,”
SLATE Working Notes, no. 18, ICL-UT-22-01, January 2022.
“A Collection of White Papers from the BDEC2 Workshop in San Diego, CA,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-13: University of Tennessee, October 2019.
“Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor,”
University of Tennessee Computer Science Technical Report, no. UT-CS-08-609, (also LAPACK Working Note 189), January 2008.
“Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor,”
Parallel Computing, vol. 35, pp. 138-150, 2009.
“LAPACK Users' Guide, 3rd ed.,”
Philadelphia: Society for Industrial and Applied Mathematics, January 1999.
“Analysis and Optimization of Yee_Bench using Hardware Performance Counters,”
Proceedings of Parallel Computing 2005 (ParCo), Malaga, Spain, January 2005.
“Self-Healing Network for Scalable Fault-Tolerant Runtime Environments,”
Future Generation Computer Systems, vol. 26, no. 3, pp. 479-485, March 2010.
“Reliability Analysis of Self-Healing Network using Discrete-Event Simulation,”
Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07): IEEE Computer Society, pp. 437-444, May 2007.
“Optimal Routing in Binomial Graph Networks,”
The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Adelaide, Australia, IEEE Computer Society, December 2007.
“Self-Healing Network for Scalable Fault Tolerant Runtime Environments,”
DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, Innsbruck, Austria, January 2006.
“Self-Healing in Binomial Graph Networks,”
2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007), Vilamoura, Algarve, Portugal, November 2007.
“Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology,”
Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07), Niagara Falls, Canada, Springer, August 2007.
“Scalable Fault Tolerant Protocol for Parallel Runtime Environments,”
2006 Euro PVM/MPI, no. ICL-UT-06-12, Bonn, Germany, 2006.
“A Collection of White Papers from the BDEC2 Workshop in Poznan, Poland,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-10: University of Tennessee, Knoxville, May 2019.
“Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,”
Parallel Computing, vol. 81, pp. 131-146, January 2019.
“Variable-Size Batched Condition Number Calculation on GPUs,”
SBAC-PAD, Lyon, France, September 2018.
“Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures,”
VECPAR 2014, Eugene, OR, June 2014.
“GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,”
University of Tennessee Computer Science Technical Report UT-CS-11-690 (also LAWN 260), December 2011.
“ParILUT – A Parallel Threshold ILU for GPUs,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
“MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi,”
Frankfurt, Germany, ISC High Performance (ISC15), Intel Booth Presentation, June 2015.
“Ginkgo: A High Performance Numerical Linear Algebra Library,”
Journal of Open Source Software, vol. 5, issue 52, August 2020.
“Batched Generation of Incomplete Sparse Approximate Inverses on GPUs,”
Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 49–56, November 2016.
“Tuning Stationary Iterative Solvers for Fault Resilience,”
6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15), Austin, TX, ACM, November 2015.
“Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems,”
Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, July 2014.
“Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,”
ICCS 2012, Omaha, NE, June 2012.
“Ginkgo: A Sparse Linear Algebra Library for HPC,”
2021 ECP Annual Meeting, April 2021.
“Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations,”
2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS): IEEE, November 2020.
“