Publications
“Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
“A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
“Resilient Co-Scheduling of Malleable Applications,”
International Journal of High Performance Computing Applications (IJHPCA), May 2017.
“Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
“Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
“Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
“Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
“Checkpointing à la Young/Daly: An Overview,”
IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, Noida, India, ACM Press, pp. 701-710, August 2022.
“Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
“Resilient scheduling heuristics for rigid parallel jobs,”
Int. J. of Networking and Computing, vol. 11, no. 1, pp. 2-26, 2021.
“The GrADS Project: Software Support for High-Level Grid Application Development,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 4, pp. 327-344, January 2001.
“The GrADS Project: Software Support for High-Level Grid Application Development,”
Technical Report, February 2000.
“New Grid Scheduling and Rescheduling Methods in the GrADS Project,”
International Journal of Parallel Programming, vol. 33, no. 2: Springer, pp. 209-229, June 2005.
“Scalability Study of a Quantum Simulation Code,”
PARA 2010, Reykjavik, Iceland, June 2010.
“A Survey of MPI Usage in the US Exascale Computing Project,”
Concurrency and Computation: Practice and Experience, September 2018.
“Atlanta Organizers Put Mathematics to Work For the Math Sciences Community,”
SIAM News, vol. 32, no. 6, January 1999.
“OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
“A Pattern-Based Approach to Automated Application Performance Analysis,”
Workshop on Patterns in High Performance Computing, University of Illinois at Urbana-Champaign, May 2005.
“Automatic Experimental Analysis of Communication Patterns in Virtual Topologies,”
In Proceedings of the International Conference on Parallel Processing, Oslo, Norway, IEEE Computer Society, June 2005.
“Application of Machine Learning to the Selection of Sparse Linear Solvers,”
International Journal of High Performance Computing Applications (submitted), 2006.
“An Updated Set of Basic Linear Algebra Subprograms (BLAS),”
ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135-151, December 2002.
“Basic Linear Algebra Subprograms (BLAS),”
(an update), submitted to ACM TOMS, February 2001.
“Post-failure recovery of MPI communication capability: Design and rationale,”
International Journal of High Performance Computing Applications, vol. 27, issue 3, pp. 244-254, January 2013.
“Enabling Application Resilience With and Without the MPI Standard,”
11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Ottawa, Canada, May 2012.
“An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
“Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-702, 2012.
“A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,”
University of Tennessee Electrical Engineering and Computer Science Technical Report, no. ut-cs-12-693: University of Tennessee, February 2012.
“Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
“An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“User Level Failure Mitigation in MPI,”
Euro-Par 2012: Parallel Processing Workshops, vol. 7640, Rhodes Island, Greece, Springer Berlin Heidelberg, pp. 499-504, August 2012.
“A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
“Distributed Storage in RIB,”
ICL Tech Report, no. ICL-UT-03-01, March 2003.
“Task-Based Programming for Seismic Imaging: Preliminary Results,”
2014 IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
ICL Technical Report, no. ICL-UT-04-04, January 2004.
“Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure,”
Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, vol. 6960, Santorini, Greece, Springer, pp. 342-344, September 2011.
“Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA,”
University of Tennessee Computer Science Technical Report, UT-CS-10-660, September 2010.
“A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 2005.
DTE: PaRSEC Enabled Libraries and Applications,
2021 Exascale Computing Project Annual Meeting, April 2021.
DTE: PaRSEC Systems and Interfaces (Poster),
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
“DAGuE: A generic distributed DAG engine for high performance computing,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-01, April 2010.
“Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,”
Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.
“Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-13: University of Tennessee, December 2018.
“Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,”
IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 2011.
“Constructing resilient communication infrastructure for runtime environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-09-02, July 2009.
DTE: PaRSEC Enabled Libraries and Applications (Poster),
Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
“Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms,”
International Journal of Networking and Computing, vol. 12, issue 1, pp. 26-46, January 2022.
“Revisiting Credit Distribution Algorithms for Distributed Termination Detection,”
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): IEEE, pp. 611–620, 2021.
“Hash Functions for Datatype Signatures in MPI,”
Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, vol. 3666, Sorrento (Naples), Italy, Springer-Verlag Berlin, pp. 76-83, September 2005.