Publications
Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(550.96 KB)
“UCX: An Open Source Framework for HPC Network APIs and Beyond,”
2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, Santa Clara, CA, USA, IEEE, pp. 40-43, 2015.
“Failure Detection and Propagation in HPC Systems,”
Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
“Surviving Errors with OpenSHMEM,”
OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments, Baltimore, MD, USA, Springer International Publishing, pp. 66–81, 2016.
“A Framework for Out of Memory SVD Algorithms,”
ISC High Performance 2017, pp. 158–178, June 2017.
(393.22 KB)
“PMIx: Process Management for Exascale Environments,”
Proceedings of the 24th European MPI Users' Group Meeting, New York, NY, USA, ACM, pp. 14:1–14:10, 2017.
“Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 01, ICL-UT-17-02: Innovative Computing Laboratory, University of Tennessee, June 2017.
(2.8 MB)
“Data Movement Interfaces to Support Dataflow Runtimes,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-03: University of Tennessee, May 2018.
(210.94 KB)
“Distributed Termination Detection for HPC Task-Based Environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-14: University of Tennessee, June 2018.
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
(360.72 KB)
“Evaluating Contexts in OpenSHMEM-X Reference Implementation,”
OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, Cham, Springer International Publishing, pp. 50–62, 2018.
“A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
(1.04 MB)
“Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award, Vancouver, BC, Canada, IEEE, May 2018.
(899.3 KB)
“PMIx: Process Management for Exascale Environments,”
Parallel Computing, vol. 79, pp. 9–29, January 2018.
“Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,”
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
(440.7 KB)
“Checkpointing Strategies for Shared High-Performance Computing Platforms,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 28–52, 2019.
(490.5 KB)
“Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms,”
Parallel Computing, vol. 85, pp. 1–12, July 2019.
(865.18 KB)
“Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging,”
Future Generation Computer Systems, vol. 91, pp. 450-464, February 2019.
(1.16 MB)
“Performance of Asynchronous Optimized Schwarz with One-sided Communication,”
Parallel Computing, vol. 86, pp. 66-81, August 2019.
(3.09 MB)
“Runtime Level Failure Detection and Propagation in HPC Systems,”
European MPI Users' Group Meeting (EuroMPI '19), Zürich, Switzerland, ACM, September 2019.
(1.11 MB)
“System Software for Many-Core and Multi-Core Architectures,”
Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project, Singapore, Springer Singapore, pp. 59–75, 2019.
“Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution,”
Future Generation Computer Systems, vol. 106, pp. 467-481, May 2020.
(2.06 MB)
“Flexible Data Redistribution in a Task-Based Runtime System,”
IEEE International Conference on Cluster Computing (Cluster 2020), Kobe, Japan, IEEE, September 2020.
(354.8 KB)
“Overhead of Using Spare Nodes,”
The International Journal of High Performance Computing Applications, February 2020.
(2.15 MB)
“Revisiting Credit Distribution Algorithms for Distributed Termination Detection,”
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): IEEE, pp. 611–620, 2021.
“