Publications
“Assessing the Impact of ABFT and Checkpoint Composite Strategies,”
16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.02 MB)
“Assuming failure independence: are we right to be wrong?,”
The 3rd International Workshop on Fault Tolerant Systems (FTS), Honolulu, Hawaii, IEEE, September 2017.
(597.11 KB)
“Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs,”
International Supercomputing Conference (ISC 2015), Frankfurt, Germany, July 2015.
“Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,”
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
(440.7 KB)
“Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures,”
Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020), May 2020.
(188.51 KB)
“Automatic Blocking of QR and LU Factorizations for Locality,”
2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004), Washington, DC, ACM, June 2004.
(212.77 KB)
“Automatically Tuned Linear Algebra Software,”
1998 ACM/IEEE conference on Supercomputing (SC '98), Orlando, FL, IEEE Computer Society, November 1998.
“Automating the Large-Scale Collection and Analysis of Performance,”
5th LCI International Conference on Linux Clusters: The HPC Revolution, Austin, Texas, May 2004.
(511.6 KB)
“Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,”
Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017.
“Batched Matrix Computations on Hardware Accelerators,”
EuroMPI/Asia 2015 Workshop, Bordeaux, France, September 2015.
(589.05 KB)
“Batched Matrix Computations on Hardware Accelerators Based on GPUs,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(9.36 MB)
“Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations,”
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, IEEE, July 2022.
(1.26 MB)
“Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q,”
International Supercomputing Conference 2013 (ISC'13), Leipzig, Germany, Springer, June 2013.
(624.58 KB)
“Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL, IEEE, May 2017.
(328.15 KB)
“Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, IEEE, May 2018.
(1.31 MB)
“Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI,”
2019 International Conference on Parallel Computing (ParCo2019), Prague, Czech Republic, September 2019.
“Checkpointing Workflows for Fail-Stop Errors,”
IEEE Cluster, Honolulu, Hawaii, IEEE, September 2017.
(400.64 KB)
“Cholesky Across Accelerators,”
17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015), Elizabeth, NJ, IEEE, August 2015.
“clMAGMA: High Performance Dense Linear Algebra with OpenCL,”
International Workshop on OpenCL, Bristol University, England, May 2014.
(460.91 KB)
“Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime,”
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), New Orleans, LA, IEEE, May 2020.
(1.33 MB)
“Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(4.7 MB)
“A Comparison of Counting and Sampling Modes of Using Performance Monitoring Hardware,”
International Conference on Computational Science (ICCS 2002), Amsterdam, Netherlands, Springer, April 2002.
(122 KB)
“Composition of Algorithmic Building Blocks in Template Task Graphs,”
2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), Dallas, TX, USA, IEEE, January 2023.
(1015.99 KB)
“Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems,”
International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, Canada, August 2014.
(130.18 KB)
“Co-Scheduling Algorithms for Cache-Partitioned Systems,”
19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017.
(584.76 KB)
“Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
Cluster 2018, Belfast, UK, IEEE Computer Society Press, September 2018.
(423.75 KB)
“Counter Inspection Toolkit: Making Sense out of Hardware Performance Events,”
11th International Workshop on Parallel Tools for High Performance Computing, Dresden, Germany, Cham, Switzerland: Springer, February 2019.
(216.39 KB)
“A Data Flow Divide and Conquer Algorithm for Multicore Architecture,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(535.44 KB)
“Data Logistics: Toolkit and Applications,”
5th EAI International Conference on Smart Objects and Technologies for Social Good, Valencia, Spain, September 2019.
(6.71 MB)
“DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models,”
20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, IEEE, May 2020.
(424.19 KB)
“Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES,”
5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, November 2014.
(465.52 KB)
“Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs,”
22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(696.21 KB)
“Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,”
Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(398.16 KB)
“The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems,”
International Conference on Computational Science (ICCS 2017), Zürich, Switzerland, Elsevier, June 2017.
(446.14 KB)
“On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors,”
ISC High Performance 2015, Frankfurt, Germany, July 2015.
(1.49 MB)
“Design for a Soft Error Resilient Dynamic Task-based Runtime,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(2.31 MB)
“Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,”
2020 IEEE High Performance Extreme Computing Virtual Conference, IEEE, September 2020.
(476.36 KB)
“Designing LU-QR Hybrid Solvers for Performance and Stability,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(4.2 MB)
“On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,”
The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(708.62 KB)
“Diagnosis and Optimization of Application Prefetching Performance,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
(827.31 KB)
“O(N) distributed direct factorization of structured dense matrices using runtime systems,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, August 2023.
“Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
(360.72 KB)
“Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK,”
Computer Modeling and Intelligent Systems CMIS-2020, Zaporizhzhia, March 2020.
(451.33 KB)
“Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, May 2014.
(490.08 KB)
“Efficiency of General Krylov Methods on GPUs – An Experimental Study,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), Chicago, IL, IEEE, May 2016.
(285.28 KB)
“Efficient Communications in Training Large Scale Neural Networks,”
ACM MultiMedia Workshop 2017, Mountain View, CA, ACM, October 2017.
(1.41 MB)
“Efficient Eigensolver Algorithms on Accelerator Based Architectures,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(6.98 MB)
“Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(1.09 MB)
“