Publications
Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 2, February 2013.
(439.46 KB)
“
LAPACK Users' Guide, 3rd ed.,”
Philadelphia: Society for Industrial and Applied Mathematics, January 1999.
“LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers
: LAPACK Working Note 164, January 2005.
(172.59 KB)

Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms,”
Journal of Parallel and Distributed Computing, vol. 73, issue 7, pp. 1000-1010, July 2013.
(1.4 MB)
“
Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community,”
IEEE Computing in Science & Engineering, vol. 13, issue 5, pp. 90-95, August 2011.
(932.57 KB)
“
JLAPACK - Compiling LAPACK Fortran to Java,”
Scientific Programming, vol. 7, no. 2, pp. 111-138, October 2002.
(307.46 KB)
“
Iterative Solver Benchmark (LAPACK Working Note 152),”
Scientific Programming, vol. 9, no. 4, pp. 223-231, 00 2001.
(168.05 KB)
“
An Iterative Solver Benchmark,”
Scientific Programming (to appear), 00 2002.
(142.67 KB)
“
I/O Performance Analysis for the Petascale Simulation Code FLASH,”
ISC'09, Hamburg, Germany, June 2009.
(88.88 KB)
“
Investigating Power Capping toward Energy-Efficient Scientific Applications,”
Concurrency Computation: Practice and Experience, vol. 2018, issue e4485, pp. 1-14, April 2018.
(1.2 MB)
“
Introduction to the HPC Challenge Benchmark Suite
, March 2005.
(124.86 KB)

The International Exascale Software Project Roadmap,”
International Journal of High Performance Computing, vol. 25, no. 1, pp. 3-60, January 2011.
(719.74 KB)
“
The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community,”
International Journal of High Performance Computing Applications (to appear), July 2009.
(203.04 KB)
“
Interactive Grid-Access Using Gridsolve and Giggle,”
Computing and Informatics, vol. 27, no. 2, pp. 233-248,ISSN1335-9150, 00 2008.
(533.4 KB)
“
Innovations of the NetSolve Grid Computing System,”
Concurrency: Practice and Experience, vol. 14, no. 13-15, pp. 1457-1479, January 2002.
(311.31 KB)
“
Incomplete Sparse Approximate Inverses for Parallel Preconditioning,”
Parallel Computing, vol. 71, pp. 1–22, January 2018.
(1.24 MB)
“
Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware,”
Parallel Processing Letters, vol. 17, no. 1, pp. 47-59, March 2007.
(718.4 KB)
“
Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Server,”
Parallel Processing Letters, vol. 17, no. 1, pp. 47-59, March 2006.
(718.4 KB)
“
An Improved MAGMA GEMM for Fermi GPUs,”
International Journal of High Performance Computing, vol. 24, no. 4, pp. 511-515, 00 2010.
“Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead,”
University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178, January 2006.
(304.4 KB)
“
Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures,”
IPDPS 2013 (submitted), Boston, MA, 00 2013.
(1.22 MB)
“
An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,”
Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 00 2012.
(623.5 KB)
“
Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor,”
University of Tennessee Computer Science Tech Report, no. UT-CS-06-580, LAPACK Working Note #177, September 2006.
(506.18 KB)
“
Implementation of Mixed Precision in Solving Systems of Linear Equations on the Cell Processor,”
Concurrency and Computation: Practice and Experience, vol. 19, no. 10, pp. 1371-1385, July 2007.
(453.78 KB)
“
Implementation and Usage of the PERUSE-Interface in Open MPI,”
Euro PVM/MPI 2006, Bonn, Germany, September 2006.
(310.76 KB)
“
Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs,”
IEEE Transactions on Parallel and Distributed Systems, no. 1045-9219, November 2015.
“The Impact of Multicore on Math Software,”
PARA 2006, Umea, Sweden, June 2006.
(223.53 KB)
“
The Impact of Multicore on Computational Science Software,”
CTWatch Quarterly, vol. 3, issue 1, February 2007.
“Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW,”
18th EuroMPI, Santorini, Greece, Springer, pp. 247-254, September 2011.
“Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW,”
18th EuroMPI, Santorini, Greece, Springer, pp. 247-254, September 2011.
“A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,”
in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 00 2011.
“Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,”
IEEE Transaction on Parallel and Distributed Systems (submitted), March 2010.
(3.75 MB)
“
HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
Scientific Programming, vol. 23, issue 1, January 2015.
(553.94 KB)
“
HPC Challenge: Design, History, and Implementation Highlights,”
On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear): Chapman & Hall/CRC Press, 00 2012.
(469.92 KB)
“
How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination,”
in Beautiful Code Leading Programmers Explain How They Think (Chapter 14), pp. 243-282, January 2008.
(257 KB)
“
How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination,”
in Beautiful Code Leading Programmers Explain How They Think: O'Reilly Media, Inc., June 2007.
(257 KB)
“
High-Performance High-Resolution Semi-Lagrangian Tracer Transport on a Sphere,”
Journal of Computational Physics, vol. 230, issue 17, pp. 6778-6799, July 2011.
(1.68 MB)
“
High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems,”
The International Journal of High Performance Computing Applications, 2015.
(336.19 KB)
“
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing,”
IEEE Transactions on Computers, vol. 58, issue 11, pp. 1512-1524, November 2009.
(1.81 MB)
“
High Performance Development for High End Computing with Python Language Wrapper (PLW),”
International Journal for High Performance Computer Applications, vol. 21, no. 3, pp. 360-369, 00 2007.
(179.32 KB)
“
High Performance Dense Linear System Solver with Soft Error Resilience,”
IEEE Cluster 2011, Austin, TX, September 2011.
(1.27 MB)
“
High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors,”
ICCS 2012, Omaha, NE, June 2012.
(1.27 MB)
“
High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems,”
International Journal of High Performance Computing Applications, vol. 30, issue 1, pp. 3 - 10, February 2016.
(277.51 KB)
“
High Performance Computing Trends,”
HERMIS, vol. 2, pp. 155-163, November 2001.
“High Performance Computing Systems: Status and Outlook,”
Acta Numerica, vol. 21, Cambridge, UK, Cambridge University Press, pp. 379-474, May 2012.
(1.48 MB)
“
High Performance Computing for Computational Science,”
Lecture Notes in Computer Science, vol. 2565, VECPAR 2002, 5th International Conference June 26-28, 2002, Springer-Verlag, Berlin, January 2003.
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 3, no. 16, 2013.
(665.7 KB)
“
HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
(165.9 KB)
“
Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems,”
Parallel Computing, vol. 39, issue 4-5, pp. 212-232, May 2013.
(1.43 MB)
“
HARNESS Fault Tolerant MPI Design, Usage and Performance Issues,”
Future Generation Computer Systems, vol. 18, no. 8, pp. 1127-1142, January 2002.
(403.41 KB)
“