Publications
Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
PPAM 2013, Warsaw, Poland, September 2013.
(284.97 KB)
“Post-failure recovery of MPI communication capability: Design and rationale,”
International Journal of High Performance Computing Applications, vol. 27, issue 3, pp. 244 - 254, January 2013.
(285.77 KB)
“Revisiting the Double Checkpointing Algorithm,”
University of Tennessee Computer Science Technical Report (LAWN 274), no. ut-cs-13-705, January 2013.
(682.22 KB)
“Revisiting the Double Checkpointing Algorithm,”
15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium, Boston, MA, May 2013.
(591.1 KB)
“Scalable Dense Linear Algebra on Heterogeneous Hardware,”
HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.
(760.32 KB)
“Soft Error Resilient QR Factorization for Hybrid System with GPGPU,”
Journal of Computational Science, vol. 4, issue 6, pp. 457–464, November 2013.
(995.45 KB)
“Standards for Graph Algorithm Primitives,”
17th IEEE High Performance Extreme Computing Conference (HPEC '13), Waltham, MA, IEEE, September 2013.
(108.86 KB)
“Toward a New Metric for Ranking High Performance Computing Systems,”
SAND2013 - 4744, June 2013.
(225.32 KB)
“Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
(1.27 MB)
“Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures,”
UT-CS-13-712: University of Tennessee Computer Science Technical Report, June 2013.
(206.42 KB)
“Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,”
Concurrency and Computation: Practice and Experience, October 2013.
(1.71 MB)
“Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster,”
The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2013.
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
Concurrency and Computation: Practice and Experience, November 2013.
(894.61 KB)
“Virtual Systolic Array for QR Decomposition,”
15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), Boston, MA, IEEE, May 2013.
(749.84 KB)
“Acceleration of the BLAST Hydro Code on GPU,”
Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
“Algorithm-Based Fault Tolerance for Dense Matrix Factorization,”
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, ACM, pp. 225-234, February 2012.
(865.79 KB)
“On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,”
University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.
(358.98 KB)
“Anatomy of a Globally Recursive Embedded LINPACK Benchmark,”
2012 IEEE High Performance Extreme Computing Conference, Waltham, MA, pp. 1-6, September 2012.
(204.74 KB)
“Autotuning GEMM Kernels for the Fermi GPU,”
IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, November 2012.
(742.5 KB)
“Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,”
ICCS 2012, Omaha, NE, June 2012.
(608.95 KB)
“A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
(289.32 KB)
“A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines,”
Proc. of the International Conference on Computational Science (ICCS), vol. 9, pp. 17-26, June 2012.
“A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction,”
IPDPS 2012, Shanghai, China, May 2012.
(480.43 KB)
“DAGuE: A generic distributed DAG Engine for High Performance Computing.,”
Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 00 2012.
(830.85 KB)
“Dense Linear Algebra on Accelerated Multicore Hardware,”
High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 00 2012.
“Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,”
SIAM Journal on Scientific Computing, vol. 34(2), pp. C70-C82, April 2012.
“Dynamic Task Execution on Shared and Distributed Memory Architectures
, 2012.
(3.29 MB)
An efficient distributed randomized solver with application to large dense linear systems,”
ICL Technical Report, no. ICL-UT-12-02, July 2012.
(626.26 KB)
“Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,”
26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.
(5.88 MB)
“Enabling Application Resilience With and Without the MPI Standard,”
11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Ottawa, Canada, May 2012.
(262.93 KB)
“Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture,”
The 2nd International Conference on Cloud and Green Computing (submitted), Xiangtan, Hunan, China, November 2012.
(329.5 KB)
“Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction,”
Lecture Notes in Computer Science, vol. 7203, pp. 661-670, September 2012.
(185.77 KB)
“An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-702, 00 2012.
(422.76 KB)
“From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,”
Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.
(1.64 MB)
“From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
(203.08 KB)
“The Future of Computing: Software Libraries
, Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
(6.76 MB)
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,”
EuroPar 2012 (also LAWN 260), Rhodes Island, Greece, August 2012.
(662.98 KB)
“Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, IEEE Computer Society Press, May 2012.
(405.71 KB)
“HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
(165.9 KB)
“High Performance Computing Systems: Status and Outlook,”
Acta Numerica, vol. 21, Cambridge, UK, Cambridge University Press, pp. 379-474, May 2012.
(1.48 MB)
“High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors,”
ICCS 2012, Omaha, NE, June 2012.
(1.27 MB)
“How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE,”
University of Tennessee Computer Science Technical Report (also LAWN 270), no. UT-CS-12-698, July 2012.
(501.53 KB)
“HPC Challenge: Design, History, and Implementation Highlights,”
On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear): Chapman & Hall/CRC Press, 00 2012.
(469.92 KB)
“An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,”
Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 00 2012.
(623.5 KB)
“Looking Back at Dense Linear Algebra Software,”
Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear), 00 2012.
(235.91 KB)
“MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
, San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
(9.23 MB)
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
(4.69 MB)
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
(6.4 MB)
MAGMA Tutorial
, Atlanta, GA, Keeneland Workshop, February 2012.
(2.47 MB)