Publications
Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27.
(754.6 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Replication is More Efficient Than You Think,”
The IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
(975.69 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Max-Stretch Minimization on an Edge-Cloud Platform,”
IPDPS'2021, the 34th IEEE International Parallel and Distributed Processing Symposium: IEEE Computer Society Press, 2021.
(4.94 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
(868.44 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
(573.71 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Resilient Co-Scheduling of Malleable Applications,”
International Journal of High Performance Computing Applications (IJHPCA), May 2017.
(1.62 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
(865.68 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
(1.39 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
(1.02 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
(397.75 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
The GrADS Project: Software Support for High-Level Grid Application Development,”
Technical Report, February 2000.
(347.41 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
New Grid Scheduling and Rescheduling Methods in the GrADS Project,”
International Journal of Parallel Programming, vol. 33, no. 2: Springer, pp. 209-229, June 2005.
(306.41 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
The GrADS Project: Software Support for High-Level Grid Application Development,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 4, pp. 327-344, January 2001.
(271.52 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Scalability Study of a Quantum Simulation Code,”
PARA 2010, Reykjavik, Iceland, June 2010.
“A Survey of MPI Usage in the US Exascale Computing Project,”
Concurrency Computation: Practice and Experience, September 2018.
(359.54 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Atlanta Organizers Put Mathematics to Work For the Math Sciences Community,”
SIAM News, vol. 32, no. 6, January 1999.
(45.98 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
OpenDIEL: A Parallel Workflow Engine and DataAnalytics Framework,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
(1.48 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Pattern-Based Approach to Automated Application Performance Analysis,”
Workshop on Patterns in High Performance Computing, University of Illinois at Urbana-Champaign, May 2005.
(3.47 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Automatic Experimental Analysis of Communication Patterns in Virtual Topologies,”
In Proceedings of the International Conference on Parallel Processing, Oslo, Norway, IEEE Computer Society, June 2005.
(227.13 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Application of Machine Learning to the Selection of Sparse Linear Solvers,”
International Journal of High Performance Computing Applications (submitted), 00 2006.
(392.96 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Basic Linear Algebra Subprograms (BLAS),”
(an update), submitted to ACM TOMS, February 2001.
(228.33 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
An Updated Set of Basic Linear Algebra Subprograms (BLAS),”
ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135-151, December 2002.
(228.33 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
User Level Failure Mitigation in MPI,”
Euro-Par 2012: Parallel Processing Workshops, vol. 7640, Rhodes Island, Greece, Springer Berlin Heidelberg, pp. 499-504, August 2012.
(136.15 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
(289.32 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Post-failure recovery of MPI communication capability: Design and rationale,”
International Journal of High Performance Computing Applications, vol. 27, issue 3, pp. 244 - 254, January 2013.
(285.77 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Enabling Application Resilience With and Without the MPI Standard,”
11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Ottawa, Canada, May 2012.
(262.93 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-702, 00 2012.
(422.76 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,”
University of Tennessee Electrical Engineering and Computer Science Technical Report, no. ut-cs-12-693: University of Tennessee, February 2012.
(159.46 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
(3.89 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
(311.23 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“Distributed Storage in RIB,”
ICL Tech Report, no. ICL-UT-03-01, March 2003.
(213.02 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Task-Based Programming for Seismic Imaging: Preliminary Results,”
2014 IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
(625.86 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Assessing the impact of ABFT and Checkpoint composite strategies,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
(968.47 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
University of Tennessee Computer Science Technical Report (also LAWN 269), no. UT-CS-12-697, June 2012.
(2.76 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
(1.04 MB)
“![application/pdf](/modules/file/icons/application-pdf.png)
DAGuE: A generic distributed DAG Engine for High Performance Computing.,”
Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 00 2012.
(830.85 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Failure Detection and Propagation in HPC Systems,”
Proceedings of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
ICL Technical Report, no. ICL-UT-04-04, January 2004.
(241.36 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 00 2005.
(241.36 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA,”
University of Tennessee Computer Science Technical Report, UT-CS-10-660, September 2010.
(366.26 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-13: University of Tennessee, December 2018.
(326.11 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
DTE: PaRSEC Systems and Interfaces (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(840.54 KB)
![application/pdf](/modules/file/icons/application-pdf.png)
Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure,”
Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, vol. 6960, Santorini, Greece, Springer, pp. 342-344, September 2011.
(115.75 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
DTE: PaRSEC Enabled Libraries and Applications
: 2021 Exascale Computing Project Annual Meeting, April 2021.
(3.24 MB)
![application/pdf](/modules/file/icons/application-pdf.png)
On Scalability for MPI Runtime Systems,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-11-05, Knoxville, TN, May 2011.
(898.76 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-02, 00 2010.
(400.75 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,”
Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.
(290.27 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)
Context Identifier Allocation in Open MPI,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-16-01: Innovative Computing Laboratory, University of Tennessee, January 2016.
(490.89 KB)
“![application/pdf](/modules/file/icons/application-pdf.png)