Publications
“Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging,”
Accepted for Euro PVM/MPI 2007: Springer, September 2007.
“Fault Tolerance Management for a Hierarchical GridRPC Middleware,”
8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon, France, January 2008.
(319.79 KB)
“Redesigning the Message Logging Model for High Performance,”
International Supercomputer Conference (ISC 2008), Dresden, Germany, January 2008.
(622.1 KB)
“Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery,”
CLUSTER '09, New Orleans, IEEE, August 2009.
DOI: 10.1109/CLUSTR.2009.5289157 (191.36 KB)
“DAGuE: A generic distributed DAG engine for high performance computing,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-01, April 2010.
(830.85 KB)
“Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA,”
University of Tennessee Computer Science Technical Report, UT-CS-10-660, September 2010.
(366.26 KB)
“Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-02, 2010.
(400.75 KB)
“Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols,”
Proceedings of EuroMPI 2010, Stuttgart, Germany, Springer, September 2010.
(202.87 KB)
“Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs,”
University of Tennessee Computer Science Technical Report, UT-CS-10-663, November 2010.
(384.75 KB)
“Locality and Topology aware Intra-node Communication Among Multicore CPUs,”
Proceedings of the 17th EuroMPI conference, Stuttgart, Germany, LNCS, September 2010.
(327.01 KB)
“Redesigning the Message Logging Model for High Performance,”
Concurrency and Computation: Practice and Experience (online version), June 2010.
(438.42 KB)
“Algorithm-based Fault Tolerance for Dense Matrix Factorizations,”
University of Tennessee Computer Science Technical Report, no. UT-CS-11-676, Knoxville, TN, August 2011.
(865.79 KB)
“Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Proceedings of 17th International Conference, Euro-Par 2011, Part II, vol. 6853, Bordeaux, France, Springer, pp. 51-64, August 2011.
(486.68 KB)
“DAGuE: A Generic Distributed DAG Engine for High Performance Computing,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1151-1158, 2011.
(830.85 KB)
“Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1432-1441, May 2011.
(1.26 MB)
“Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW,”
18th EuroMPI, Santorini, Greece, Springer, pp. 247-254, September 2011.
“Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs,”
Int'l Conference on Parallel Processing (ICPP '11), Taipei, Taiwan, September 2011.
“Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,”
IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 2011.
(290.98 KB)
“A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“Algorithm-Based Fault Tolerance for Dense Matrix Factorization,”
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, ACM, pp. 225-234, February 2012.
DOI: 10.1145/2145816.2145845 (865.79 KB)
“A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
(289.32 KB)
“DAGuE: A generic distributed DAG Engine for High Performance Computing,”
Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 2012.
(830.85 KB)
“An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,”
University of Tennessee Computer Science Technical Report, no. UT-CS-12-702, 2012.
(422.76 KB)
“From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
(203.08 KB)
“HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
(165.9 KB)
“A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,”
University of Tennessee Electrical Engineering and Computer Science Technical Report, no. UT-CS-12-693: University of Tennessee, February 2012.
(159.46 KB)
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
University of Tennessee Computer Science Technical Report (also LAWN 269), no. UT-CS-12-697, June 2012.
(2.76 MB)
“Assessing the impact of ABFT and Checkpoint composite strategies,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
(968.47 KB)
“Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Concurrency and Computation: Practice and Experience, vol. 25, issue 4, pp. 572-585, March 2013.
DOI: 10.1002/cpe.2859 (636.68 KB)
“Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,”
Scalable Computing and Communications: Theory and Practice: John Wiley & Sons, pp. 699-735, March 2013.
(1.01 MB)
“Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures,”
7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Berlin, Germany, September 2013.
(102.51 KB)
“An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
DOI: 10.1007/s00607-013-0331-3 (311.23 KB)
“Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
DOI: 10.1002/cpe.3100 (3.89 MB)
“Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms,”
Journal of Parallel and Distributed Computing, vol. 73, issue 7, pp. 1000-1010, July 2013.
DOI: 10.1016/j.jpdc.2013.01.015 (1.4 MB)
“Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-01, February 2013.
(497.64 KB)
“Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization,”
Euro-Par 2013, Aachen, Germany, Springer, August 2013.
(431.84 KB)
“PaRSEC: Exploiting Heterogeneity to Enhance Scalability,”
IEEE Computing in Science and Engineering, vol. 15, issue 6, pp. 36-45, November 2013.
DOI: 10.1109/MCSE.2013.98 (2.16 MB)
“Post-failure recovery of MPI communication capability: Design and rationale,”
International Journal of High Performance Computing Applications, vol. 27, issue 3, pp. 244-254, January 2013.
DOI: 10.1177/1094342013488238 (285.77 KB)
“Scalable Dense Linear Algebra on Heterogeneous Hardware,”
HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.
(760.32 KB)
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
Concurrency and Computation: Practice and Experience, November 2013.
DOI: 10.1002/cpe.3173 (894.61 KB)
“Assessing the Impact of ABFT and Checkpoint Composite Strategies,”
16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.02 MB)
“A Multithreaded Communication Substrate for OpenSHMEM,”
8th International Conference on Partitioned Global Address Space Programming Models (PGAS), Eugene, OR, October 2014.
(261.66 KB)
“PTG: An Abstraction for Unhindered Parallelism,”
International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), New Orleans, LA, IEEE Press, November 2014.
(480.05 KB)
“Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy,”
ACM Transactions on Parallel Computing, vol. 1, issue 2, no. 10, pp. 10:1-10:28, January 2015.
DOI: 10.1145/2686892 (1.14 MB)
“Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing,”
International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, January 2015.
(755.54 KB)
“From MPI to OpenSHMEM: Porting LAMMPS,”
OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, Annapolis, MD, USA, Springer International Publishing, pp. 121–137, 2015.
DOI: 10.1007/978-3-319-26428-8_8
“Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(1.11 MB)
“Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery,”
22nd European MPI Users' Group Meeting, Bordeaux, France, ACM, September 2015.
DOI: 10.1145/2802658.2802668 (543.32 KB)
“Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(550.96 KB)
“