Publications
Programming the LU Factorization for a Multicore System with Accelerators,”
Proceedings of VECPAR’12, Kobe, Japan, April 2012.
(414.33 KB)
“Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC,”
Lawn 277, no. UT-CS-13-709, May 2013.
(298.63 KB)
“An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
University of Tennessee Computer Science Technical Report (also LAWN 283), no. ut-eecs-13-720: University of Tennessee, October 2013.
(1.23 MB)
“LU Factorization with Partial Pivoting for a Multicore System with Accelerators,”
IEEE Transactions on Parallel and Distributed Computing, vol. 24, issue 8, pp. 1613-1621, August 2013.
DOI: http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242 (1.08 MB)
“Multithreading in the PLASMA Library,”
Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications: Taylor & Francis, 00 2013.
(536.28 KB)
“Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,”
Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015.
DOI: 10.14529/jsfi1504 (3.68 MB)
“A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015.
DOI: 10.1002/cpe.3306 (783.45 KB)
“Visualizing Execution Traces with Task Dependencies,”
2nd Workshop on Visual Performance Analysis (VPA '15), Austin, TX, ACM, November 2015.
(927.5 KB)
“Linear Algebra Software for Large-Scale Accelerated Multicore Computing,”
Acta Numerica, vol. 25, pp. 1-160, May 2016.
DOI: 10.1017/S0962492916000015
“Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,”
Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017.
DOI: 10.1109/IPDPSW.2017.18
“Bringing High Performance Computing to Big Data Algorithms,”
Handbook of Big Data Technologies: Springer, 2017.
DOI: 10.1007/978-3-319-49340-4 (1.22 MB)
“C++ API for Batch BLAS,”
SLATE Working Notes, no. 04, ICL-UT-17-12: University of Tennessee, December 2017.
(1.89 MB)
“C++ API for BLAS and LAPACK,”
SLATE Working Notes, no. 02, ICL-UT-17-03: Innovative Computing Laboratory, University of Tennessee, June 2017.
(1.12 MB)
“The Case for Directive Programming for Accelerator Autotuner Optimization,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-07: University of Tennessee, October 2017.
(341.52 KB)
“Design and Implementation of the PULSAR Programming System for Large Scale Computing,”
Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017.
DOI: 10.14529/jsfi170101 (764.96 KB)
“Designing SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 03, ICL-UT-17-06: Innovative Computing Laboratory, University of Tennessee, October 2017.
(2.8 MB)
“PLASMA 17 Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-11: University of Tennessee, June 2017.
(7.57 MB)
“PLASMA 17.1 Functionality Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-10: University of Tennessee, June 2017.
(1.8 MB)
“Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 01, ICL-UT-17-02: Innovative Computing Laboratory, University of Tennessee, June 2017.
(2.8 MB)
“Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds,”
IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017), Boston, MA, IEEE, December 2017.
DOI: 10.1109/BigData.2017.8258258 (6.71 MB)
“Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic,”
2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, IEEE, September 2017.
DOI: 10.1109/HPEC.2017.8091031 (1.67 MB)
“With Extreme Computing, the Rules Have Changed,”
Computing in Science & Engineering, vol. 19, issue 3, pp. 52-62, May 2017.
DOI: 10.1109/MCSE.2017.48 (485.34 KB)
“Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
DOI: 10.1109/JPROC.2018.2868961 (2.53 MB)
“Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
DOI: 10.14529/jsfi180404 (720.15 KB)
“Implementation of the C++ API for Batch BLAS,”
SLATE Working Notes, no. 07, ICL-UT-18-04: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.07 MB)
“Least Squares Performance Report,”
SLATE Working Notes, no. 09, ICL-UT-18-10: Innovative Computing Laboratory, University of Tennessee, December 2018.
(1.76 MB)
“