Publications
“PAPI Software-Defined Events for in-Depth Performance Analysis,”
The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1113-1127, November 2019.
PAPI's new Software-Defined Events for in-depth Performance Analysis,
Dresden, Germany, 13th Parallel Tools Workshop, September 2019.
“ParILUT – A Parallel Threshold ILU for GPUs,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
“Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools,”
Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19, Denver, CO, ACM, November 2019.
“Performance of Asynchronous Optimized Schwarz with One-sided Communication,”
Parallel Computing, vol. 86, pp. 66-81, August 2019.
“PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,”
ACM Transactions on Mathematical Software, vol. 45, issue 2, June 2019.
“Progressive Optimization of Batched LU Factorization on GPUs,”
IEEE High Performance Extreme Computing Conference (HPEC’19), Waltham, MA, IEEE, September 2019.
“Race to Exascale,”
Computing in Science and Engineering, vol. 21, issue 1, pp. 4-5, March 2019.
SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,
Denver, CO, International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), November 2019.
“SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,”
International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM, November 2019.
“SLATE Developers' Guide,”
SLATE Working Notes, no. 11, ICL-UT-19-02: Innovative Computing Laboratory, University of Tennessee, December 2019.
“SLATE Mixed Precision Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-03: University of Tennessee, April 2019.
“SLATE Working Note 12: Implementing Matrix Inversions,”
SLATE Working Notes, no. 12, ICL-UT-19-04: Innovative Computing Laboratory, University of Tennessee, June 2019.
“SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers,”
SLATE Working Notes, no. 13, ICL-UT-19-07: Innovative Computing Laboratory, University of Tennessee, September 2019.
“Software-Defined Events through PAPI,”
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, IEEE, May 2019.
“Solving Linear Diophantine Systems on Parallel Architectures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 30, issue 5, pp. 1158-1169, May 2019.
“Towards Continuous Benchmarking,”
Platform for Advanced Scientific Computing Conference (PASC 2019), Zurich, Switzerland, ACM Press, June 2019.
“Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs,”
ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
Understanding Native Event Semantics,
Knoxville, TN, 9th JLESC Workshop, April 2019.
“Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,”
Parallel Computing, vol. 81, pp. 131-146, January 2019.
“What it Takes to keep PAPI Instrumental for the HPC Community,”
1st Workshop on Sustainable Scientific Software (CW3S19), Collegeville, Minnesota, July 2019.
What it Takes to keep PAPI Instrumental for the HPC Community,
Collegeville, MN, The 2019 Collegeville Workshop on Sustainable Scientific Software (CW3S19), July 2019.
Is your scheduling good? How would you know?,
Bordeaux, France, 14th Scheduling for Large Scale Systems Workshop, June 2019.
“The 30th Anniversary of the Supercomputing Conference: Bringing the Future Closer—Supercomputing History and the Immortality of Now,”
Computer, vol. 51, issue 10, pp. 74–85, November 2018.
“Accelerating NWChem Coupled Cluster through dataflow-based Execution,”
The International Journal of High Performance Computing Applications, vol. 32, issue 4, pp. 540–551, July 2018.
“Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,”
Journal of Computational Science, vol. 26, pp. 237–245, May 2018.
“Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,”
Parallel Computing, vol. 74, pp. 3–18, May 2018.
“ADAPT: An Event-Based Adaptive Collective Communication Framework,”
The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18), Tempe, Arizona, ACM Press, June 2018.
“Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-09: Innovative Computing Laboratory, University of Tennessee, September 2018.
“Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 12, pp. 2700–2712, December 2018.
“Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, IEEE, May 2018.
“Autotuning in High-Performance Computing Applications,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2068–2083, November 2018.
“Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
“Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification,
July 2018.
“Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,”
Journal of Computational Science, vol. 26, pp. 226–236, May 2018.
“Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry,”
The International Journal of High Performance Computing Applications, vol. 32, issue 4, pp. 435–479, July 2018.
“Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,”
Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018.
“Data Movement Interfaces to Support Dataflow Runtimes,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-03: University of Tennessee, May 2018.
“The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,”
International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018.
“Distributed Termination Detection for HPC Task-Based Environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-14: University of Tennessee, June 2018.
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
“Evaluation and Design of FFT for Distributed Accelerated Systems,”
ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.
“Evaluation of Dataflow Programming Models for Electronic Structure Theory,”
Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms, vol. 2018, issue e4490, pp. 1–20, May 2018.
“A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
“A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973–984, May 2018.
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100,
San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
“Implementation of the C++ API for Batch BLAS,”
SLATE Working Notes, no. 07, ICL-UT-18-04: Innovative Computing Laboratory, University of Tennessee, June 2018.
“Incomplete Sparse Approximate Inverses for Parallel Preconditioning,”
Parallel Computing, vol. 71, pp. 1–22, January 2018.