Publications
Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(1.11 MB)
“Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(2.73 MB)
“Heterogeneous Acceleration for Linear Algebra in Mulit-Coprocessor Environments,”
VECPAR 2014, Eugene, OR, June 2014.
(276.52 KB)
“Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(535.72 KB)
“heFFTe: Highly Efficient FFT for Exascale,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
(2.62 MB)
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
(642.51 KB)
“Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
(1016.52 KB)
“HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
(764.05 KB)
“GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
“GPU-Aware Non-contiguous Data Movement In Open MPI,”
25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16), Kyoto, Japan, ACM, June 2016.
(482.32 KB)
“GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap,”
2nd International Workshop on Hardware-Software Co-Design for High Performance Computing, Austin, TX, ACM, November 2015.
(1.46 MB)
“Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs,”
IEEE Cluster, Albuquerque, NM, IEEE, September 2019.
(220.84 KB)
“Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC,”
ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
(260.69 KB)
“A Generic Approach to Scheduling and Checkpointing Workflows,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
(737.11 KB)
“Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment,”
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, IEEE, July 2022.
“From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
(203.08 KB)
“A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), July 2022.
(1.03 MB)
“Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations,”
ISC High Performance, Frankfurt, Germany, Springer, July 2015.
(778.26 KB)
“Flexible Linear Algebra Development and Scheduling with Cholesky Factorization,”
17th IEEE International Conference on High Performance Computing and Communications, Newark, NJ, August 2015.
(494.31 KB)
“Flexible Data Redistribution in a Task-Based Runtime System,”
IEEE International Conference on Cluster Computing (Cluster 2020), Kobe, Japan, IEEE, September 2020.
(354.8 KB)
“Flexible Batched Sparse Matrix-Vector Product on GPUs,”
8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), Denver, CO, ACM Press, November 2017.
(583.4 KB)
“FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks,”
9th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20), Stockholm, Sweden, ACM, June 2020.
(4.72 MB)
“Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,”
33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
(675.5 KB)
“A Fast Batched Cholesky Factorization on a GPU,”
International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, September 2014.
(1.37 MB)
“Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications,”
Platform for Advanced Scientific Computing Conference (PASC20), Geneva, Switzerland, ACM, June 2020.
(2.71 MB)
“Extending MAGMA Portability with OneAPI,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, November 2022.
(999.19 KB)
“Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters,”
PADTAD Workshop, IPDPS 2003, Nice, France, IEEE, April 2003.
(432.57 KB)
“Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization,”
PAW-ATM Workshop at SC19, Denver, CO, ACM, November 2019.
(4.51 MB)
“Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations,”
2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS): IEEE, November 2020.
(1.9 MB)
“Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms,”
49th International Conference on Parallel Processing (ICPP 2020), Edmonton, AB, Canada, ACM Press, 2020.
(804.96 KB)
“Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers,”
Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15), San Francisco, CA, ACM, February 2015.
(2.29 MB)
“End-user Tools for Application Performance Analysis, Using Hardware Counters,”
International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August 2001.
(306.54 KB)
“Elastic deep learning through resilient collective operations,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
“Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit,”
13th International Workshop on Parallel Tools for High Performance Computing, Dresden, Germany, Springer International Publishing, September 2020.
(738.47 KB)
“Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures,”
7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Berlin, Germany, September 2013.
(102.51 KB)
“Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(1.09 MB)
“Efficient Eigensolver Algorithms on Accelerator Based Architectures,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(6.98 MB)
“Efficient Communications in Training Large Scale Neural Networks,”
ACM MultiMedia Workshop 2017, Mountain View, CA, ACM, October 2017.
(1.41 MB)
“Efficiency of General Krylov Methods on GPUs – An Experimental Study,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), Chicago, IL, IEEE, May 2016.
(285.28 KB)
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, May 2014.
(490.08 KB)
“Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK,”
Computer Modeling and Intelligent Systems CMIS-2020, Zaporizhzhoa, March 2020.
(451.33 KB)
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
(360.72 KB)
“Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
“O(N) distributed direct factorization of structured dense matrices using runtime systems,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, August 2023.
“Diagnosis and Optimization of Application Prefetching Performance,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
(827.31 KB)
“On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,”
The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(708.62 KB)
“Designing LU-QR Hybrid Solvers for Performance and Stability,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(4.2 MB)
“Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,”
2020 IEEE High Performance Extreme Computing Virtual Conference: IEEE, September 2020.
(476.36 KB)
“Design for a Soft Error Resilient Dynamic Task-based Runtime,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(2.31 MB)
“