Publications
“Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing,”
IEEE Transactions on Computers, vol. 58, no. 11, pp. 1512-1524, November 2009.
“Algorithm-Based Fault Tolerance for Fail-Stop Failures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 12, January 2008.
“Disaster Survival Guide in Petascale Computing: An Algorithmic Approach,”
in Petascale Computing: Algorithms and Applications (to appear), Chapman & Hall/CRC Press, 2007.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
SIAM SISC (to appear), May 2007.
“Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing,”
Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS, pp. 1-8, March 2007.
“Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources,”
IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, January 2006.
“Self Adapting Numerical Software SANS Effort,”
IBM Journal of Research and Development, vol. 50, no. 2/3, pp. 223-238, January 2006.
“Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources,”
University of Tennessee Computer Science Department Technical Report, UT-CS-05-561, November 2005.
“Condition Numbers of Gaussian Random Matrices,”
SIAM Journal on Matrix Analysis and Applications (to appear), January 2005.
“Condition Numbers of Gaussian Random Matrices,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-539, 2005.
“Fault Tolerant High Performance Computing by a Coding Approach,”
Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear), Chicago, Illinois, January 2005.
“Numerically Stable Real Number Codes Based on Random Matrices,”
The International Conference on Computational Science, Atlanta, GA, LNCS 3514, Springer-Verlag, January 2005.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 2005.
“Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,”
Proceedings of ISC2004 (to appear), Heidelberg, Germany, June 2004.
“LAPACK for Clusters Project: An Example of Self Adapting Numerical Software,”
Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04), vol. 9, Big Island, Hawaii, p. 90282, January 2004.
“Numerically Stable Real-Number Codes Based on Random Matrices,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-526, October 2004.
“Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,”
International Journal for High Performance Applications and Supercomputing (to appear), April 2004.
“Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
ICL Technical Report, no. ICL-UT-04-04, January 2004.
“Fault Tolerant Communication Library and Applications for High Performance Computing,”
Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented), Santa Fe, NM, October 2003.
“Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160),”
University of Tennessee Computer Science Technical Report, UT-CS-03-499, January 2003.
“Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,”
Parallel Computing, vol. 29, no. 11-12, pp. 1723-1743, November 2003.