Publications
“Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(301.01 KB)
“Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives,”
Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award, Orlando, FL, June 2017.
DOI: 10.1109/IPDPSW.2017.65 (453.66 KB)
“Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,”
IEEE Embedded Systems Letters, vol. 9, issue 3, pp. 61–64, May 2017.
DOI: 10.1109/LES.2017.2700401 (339.11 KB)
“Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,”
Concurrency and Computation: Practice and Experience, October 2013.
(1.71 MB)
“Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures,”
IPDPS 2013 (submitted), Boston, MA, 2013.
(1.22 MB)
“Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,”
Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(398.16 KB)
“Cholesky Across Accelerators,”
17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015), Elizabeth, NJ, IEEE, August 2015.
“GridSolve: The Evolution of Network Enabled Solver,”
Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006): Springer, pp. 215-226, 2007.
(377.48 KB)
“Experiments with Scheduling Using Simulated Annealing in a Grid Environment,”
Grid Computing - GRID 2002, Third International Workshop, vol. 2536, Baltimore, MD, Springer, pp. 232-242, November 2002.
(66.91 KB)
“Porting the PLASMA Numerical Library to the OpenMP Standard,”
International Journal of Parallel Programming, June 2016.
DOI: 10.1007/s10766-016-0441-6 (1.66 MB)
“QUARK Users' Guide: QUeueing And Runtime for Kernels,”
University of Tennessee Innovative Computing Laboratory Technical Report, no. ICL-UT-11-02, 2011.
(247.12 KB)
“SLATE Performance Report: Updates to Cholesky and LU Factorizations,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-14: University of Tennessee, October 2020.
(1.64 MB)
“An Empirical View of SLATE Algorithms on Scalable Hybrid System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-08: University of Tennessee, Knoxville, September 2019.
(441.16 KB)
“Recent Developments in GridSolve,”
International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms), vol. 20, no. 1: Sage Science Press, 2006.
(496.69 KB)
“Dynamic Task Execution on Shared and Distributed Memory Architectures,”
2012.
(3.29 MB)
“Initial Integration and Evaluation of SLATE Parallel BLAS in LATTE,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-07: Innovative Computing Laboratory, University of Tennessee, June 2018.
(366.6 KB)
“Biological Sequence Alignment on the Computational Grid Using the GrADS Framework,”
Future Generation Computing Systems, vol. 21, no. 6: Elsevier, pp. 980-986, June 2005.
(147.29 KB)
“Automatic Blocking of QR and LU Factorizations for Locality,”
2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004), Washington, DC, ACM, June 2004.
DOI: 10.1145/1065895.1065898 (212.77 KB)
“The Design of an Auto-tuning I/O Framework on Cray XT5 System,”
Cray Users Group Conference (CUG'11) (Best Paper Finalist), Fairbanks, Alaska, May 2011.
(459.57 KB)
“An Effective Empirical Search Method for Automatic Software Tuning,”
ICL Technical Report, no. ICL-UT-05-02, January 2005.
(74.66 KB)
“Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator,”
ICL Technical Report, no. ICL-UT-07-02, January 2007.
(123.34 KB)
“Autotuned Parallel I/O for Highly Scalable Biosequence Analysis,”
TeraGrid'11, Salt Lake City, Utah, July 2011.
(275.34 KB)
“Automated Empirical Tuning of a Multiresolution Analysis Kernel,”
ICL Technical Report, no. ICL-UT-07-01, pp. 10, January 2007.
(120.7 KB)
“The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software,”
ACM/IEEE International Symposium on High Performance Distributed Computing, Boston, MA, June 2008.
(403.89 KB)
“Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software,”
Cluster Computing Journal: Special Issue on High Performance Distributed Computing, vol. 12, no. 2: Springer Netherlands, pp. 101-122, 2009.
(451.07 KB)
“Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK,”
Computer Modeling and Intelligent Systems CMIS-2020, Zaporizhzhia, March 2020.
(451.33 KB)
“Solving Linear Diophantine Systems on Parallel Architectures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 30, issue 5, pp. 1158-1169, May 2019.
DOI: 10.1109/TPDS.2018.2873354 (802.97 KB)
“Efficient Communications in Training Large Scale Neural Networks,”
ACM MultiMedia Workshop 2017, Mountain View, CA, ACM, October 2017.
(1.41 MB)
“Using long vector extensions for MPI reductions,”
Parallel Computing, vol. 109, pp. 102871, March 2022.
DOI: 10.1016/j.parco.2021.102871
“Using Arm Scalable Vector Extension to Optimize Open MPI,”
20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020), Melbourne, Australia, IEEE/ACM, May 2020.
DOI: 10.1109/CCGrid49817.2020.00-71 (359.95 KB)
“Runtime Level Failure Detection and Propagation in HPC Systems,”
European MPI Users' Group Meeting (EuroMPI '19), Zürich, Switzerland, ACM, September 2019.
DOI: 10.1145/3343211.3343225 (1.11 MB)
“Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster),”
Austin, TX, EuroMPI/USA '20: 27th European MPI Users' Group Meeting, September 2020.
(708.68 KB)
“Using Advanced Vector Extensions AVX-512 for MPI Reduction,”
EuroMPI/USA '20: 27th European MPI Users' Group Meeting, Austin, TX, September 2020.
DOI: 10.1145/3416315.3416316 (634.45 KB)
“Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures,”
J. Phys.: Conf. Ser., vol. 46, pp. 292-298, January 2006.
DOI: 10.1088/1742-6596/46/1/040 (644.1 KB)
“