Publications
“Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints,”
40th IEEE Real-Time Systems Symposium (RTSS 2019), York, UK, IEEE Press, February 2020.
“Implicit Actions and Non-blocking Failure Recovery with MPI,”
2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, IEEE, January 2023.
DOI: 10.1109/FTXS56515.2022.00009
“Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,”
Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
“High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
“High-performance Matrix-matrix Multiplications of Very Small Matrices,”
22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016.
“High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation,”
8th Workshop on Irregular Applications: Architectures and Algorithms, 2018.
“High-performance Cholesky Factorization for GPU-only Execution,”
Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.
DOI: 10.1145/3038228.3038237
“High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), IEEE, November 2020.
“Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
“Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
“heFFTe: Highly Efficient FFT for Exascale,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
DOI: 10.1007/978-3-030-50371-0_19
“Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
DOI: 10.1109/SC.2018.00050
“Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
“HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
“GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3624247
“Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs,”
IEEE Cluster, Albuquerque, NM, IEEE, September 2019.
“Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC,”
ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
“A Generic Approach to Scheduling and Checkpointing Workflows,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
“Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment,”
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, IEEE, July 2022.
DOI: 10.1109/IPDPS53621.2022.00086
“From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
“A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), July 2022.
DOI: 10.1109/IPDPS53621.2022.00047
“Flexible Data Redistribution in a Task-Based Runtime System,”
IEEE International Conference on Cluster Computing (Cluster 2020), Kobe, Japan, IEEE, September 2020.
DOI: 10.1109/CLUSTER49012.2020.00032
“Flexible Batched Sparse Matrix-Vector Product on GPUs,”
8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), Denver, CO, ACM Press, November 2017.
DOI: 10.1145/3148226.3148230
“FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks,”
29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20), Stockholm, Sweden, ACM, June 2020.
DOI: 10.1145/3369583.3392681
“Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,”
33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
“Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications,”
Platform for Advanced Scientific Computing Conference (PASC20), Geneva, Switzerland, ACM, June 2020.
DOI: 10.1145/3394277.3401846
“Extending MAGMA Portability with OneAPI,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, November 2022.
“Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters,”
PADTAD Workshop, IPDPS 2003, Nice, France, IEEE, April 2003.
“Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization,”
PAW-ATM Workshop at SC19, Denver, CO, ACM, November 2019.
“Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations,”
2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), IEEE, November 2020.
“Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms,”
49th International Conference on Parallel Processing (ICPP 2020), Edmonton, AB, Canada, ACM Press, 2020.
“End-user Tools for Application Performance Analysis, Using Hardware Counters,”
International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August 2001.
“Elastic deep learning through resilient collective operations,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3626080
“Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit,”
13th International Workshop on Parallel Tools for High Performance Computing, Dresden, Germany, Springer International Publishing, September 2020.
“