Publications
“Out of Memory SVD Solver for Big Data,”
2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Waltham, MA, IEEE, September 2017.
(1.33 MB)
“Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,”
International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017.
(364.95 KB)
“Optimizing Krylov Subspace Solvers on Graphics Processing Units,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(536.32 KB)
“Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,”
IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.
(729.87 KB)
“Optimized Batched Linear Algebra for Modern Architectures,”
Euro-Par 2017, Santiago de Compostela, Spain, Springer, August 2017.
(618.33 KB)
“Optimization for Performance and Energy for Batched Matrix Computations on GPUs,”
8th Workshop on General Purpose Processing Using GPUs (GPGPU 8), San Francisco, CA, ACM, February 2015.
(699.5 KB)
“Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(603.58 KB)
“Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award, Vancouver, BC, Canada, IEEE, May 2018.
(899.3 KB)
“Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
(1.02 MB)
“OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
(1.48 MB)
“Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs,”
International Conference on Supercomputing (ICS '17), Chicago, Illinois, ACM, June 2017.
(1.04 MB)
“Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations,”
2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, IEEE, April 2013.
(307.24 KB)
“New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem,”
Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper), Phoenix, AZ, IEEE, May 2014.
(2.33 MB)
“A Multithreaded Communication Substrate for OpenSHMEM,”
8th International Conference on Partitioned Global Address Space Programming Models (PGAS), Eugene, OR, October 2014.
(261.66 KB)
“Multiprecision Block-Jacobi for Iterative Triangular Solves,”
European Conference on Parallel Processing (Euro-Par 2020), Springer, August 2020.
“Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization,”
Euro-Par 2013, Aachen, Germany, Springer, August 2013.
(431.84 KB)
“MPI Continuations And How To Invoke Them,”
Sustained Simulation Performance 2021, Cham, Springer International Publishing, pp. 67-83, February 2023.
“Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs,”
VECPAR 2014 (Best Paper), Eugene, OR, June 2014.
(438.54 KB)
“Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(301.01 KB)
“Mixed-precision Block Gram Schmidt Orthogonalization,”
6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Austin, TX, ACM, November 2015.
(235.69 KB)
“Mixed Precision Algebraic Multigrid on GPUs,”
Parallel Processing and Applied Mathematics (PPAM 2022), vol. 13826, Cham, Springer International Publishing, April 2023.
“MIAMI: A Framework for Application Performance Diagnosis,”
ISPASS-2014, Monterey, CA, IEEE, March 2014.
(1010.75 KB)
“Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation,”
International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
(480.73 KB)
“Massively Parallel Automated Software Tuning,”
48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, ACM Press, August 2019.
(911.88 KB)
“MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
(1.37 MB) (8.72 MB)
“MagmaDNN: Accelerated Deep Learning Using MAGMA,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
(1.09 MB)
“MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing,”
2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award), Waltham, MA, IEEE, September 2015.
(678.86 KB)
“LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi,”
IEEE High Performance Extreme Computing Conference (HPEC'16), Waltham, MA, IEEE, September 2016.
(943.23 KB)
“LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU,”
16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
(684.73 KB)
“Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
(1.08 MB)
“A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs,”
SBAC-PAD, Lyon, France, IEEE, 2018.
(237.68 KB)
“Iterative Sparse Triangular Solves for Preconditioning,”
Euro-Par 2015, Vienna, Austria, Springer Berlin, August 2015.
(322.36 KB)
“Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices using GPUs,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, Springer, Cham, June 2020.
(702.38 KB)
“Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers,”
ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, ACM.
(766.35 KB)
“Integrating Deep Learning in Domain Sciences at Exascale,”
2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020), August 2020.
“Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators,”
IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist, Waltham, MA, IEEE, September 2019.
(470.21 KB)
“Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, September 2023.
“Improving the Performance of the GMRES Method using Mixed-Precision Techniques,”
Smoky Mountains Computational Sciences & Engineering Conference (SMC2020), August 2020.
(600.33 KB)
“Improving the performance of CA-GMRES on multicores with multiple GPUs,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(333.82 KB)
“An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
Supercomputing 2013, Denver, CO, November 2013.
“Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints,”
40th IEEE Real-Time Systems Symposium (RTSS 2019), York, UK, IEEE Press, February 2020.
“Implicit Actions and Non-blocking Failure Recovery with MPI,”
2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, IEEE, January 2023.
“Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,”
Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
(1.6 MB)
“Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
(865.68 KB)
“Hybrid Multi-Elimination ILU Preconditioners on GPUs,”
International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.67 MB)
“High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(2.36 MB)
“High-performance Matrix-matrix Multiplications of Very Small Matrices,”
22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016.
“High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation,”
8th Workshop on Irregular Applications: Architectures and Algorithms, 2018.
“High-performance Cholesky Factorization for GPU-only Execution,”
Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.
(872.18 KB)
“High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), IEEE, November 2020.
(1.3 MB)
“