Publications
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
(6.4 MB)
Matrices Over Runtime Systems at Exascale,”
Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,”
Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,”
The International Conference on Computational Science (ICCS), June 2012.
“Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators,”
VECPAR 2012, Kobe, Japan, July 2012.
(737.28 KB)
““Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011,”
Lecture Notes in Computer Science, vol. 7203, Torun, Poland, 00 2012.
A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures,”
IPDPS 2012, Shanghai, China, May 2012.
(544.09 KB)
“Performance evaluation of LU factorization through hardware counter measurements,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.
(794.82 KB)
“Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,”
Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.
(290.27 KB)
“Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,”
LAWN 267, 00 2012.
(1.14 MB)
“Programming the LU Factorization for a Multicore System with Accelerators,”
Proceedings of VECPAR’12, Kobe, Japan, April 2012.
(414.33 KB)
“A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,”
University of Tennessee Electrical Engineering and Computer Science Technical Report, no. ut-cs-12-693: University of Tennessee, February 2012.
(159.46 KB)
“Providing GPU Capability to LU and QR within the ScaLAPACK Framework,”
University of Tennessee Computer Science Technical Report (also LAWN 272), no. UT-CS-12-699, September 2012.
(7.48 MB)
““Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012,”
Lecture Notes in Computer Science, vol. 7490, Vienna, Austria, 00 2012.
Reducing the Amount of Pivoting in Symmetric Indefinite Systems,”
Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011), vol. 7203: Springer-Verlag Berlin Heidelberg, pp. 133-142, 00 2012.
(145.76 KB)
“Reducing the Amount of Pivoting in Symmetric Indefinite Systems,”
Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011), vol. 7203: Springer-Verlag Berlin Heidelberg, pp. 133-142, 00 2012.
(145.76 KB)
“A Scalable Framework for Heterogeneous GPU-Based Clusters,”
The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), Pittsburgh, PA, USA, ACM, June 2012.
(3.39 MB)
“Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices,”
SIAM Journal on Scientific Computing (Accepted), July 2012.
“Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
University of Tennessee Computer Science Technical Report (also LAWN 269), no. UT-CS-12-697, June 2012.
(2.76 MB)
“Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,”
Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper), Rhodes Island, Greece, August 2012.
(764.02 KB)
“Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems,”
SIAM Journal on Computing (submitted), March 2012.
(811.01 KB)
“Accelerating Linear System Solutions Using Randomization Techniques,”
ACM Transactions on Mathematical Software (also LAWN 246), vol. 39, issue 2, February 2013.
(358.79 KB)
“Assessing the impact of ABFT and Checkpoint composite strategies,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
(968.47 KB)
“Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q,”
International Supercomputing Conference 2013 (ISC'13), Leipzig, Germany, Springer, June 2013.
(624.58 KB)
“BlackjackBench: Portable Hardware Characterization with Automated Results Analysis,”
The Computer Journal, March 2013.
(408.45 KB)
“A Block-Asynchronous Relaxation Method for Graphics Processing Units,”
Journal of Parallel and Distributed Computing, vol. 73, issue 12, pp. 1613–1626, December 2013.
(1.08 MB)
“clMAGMA: High Performance Dense Linear Algebra with OpenCL,”
University of Tennessee Technical Report (Lawn 275), no. UT-CS-13-706: University of Tennessee, March 2013.
(526.6 KB)
“Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Concurrency and Computation: Practice and Experience, vol. 25, issue 4, pp. 572-585, March 2013.
(636.68 KB)
“CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience,”
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Montpellier, France, November 2013.
(238.58 KB)
“Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,”
Scalable Computing and Communications: Theory and Practice: John Wiley & Sons, pp. 699-735, March 2013.
(1.01 MB)
“Designing LU-QR hybrid solvers for performance and stability,”
University of Tennessee Computer Science Technical Report (also LAWN 282), no. ut-eecs-13-719: University of Tennessee, October 2013.
(4.11 MB)
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-713, July 2013.
(659.77 KB)
“Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures,”
7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Berlin, Germany, September 2013.
(102.51 KB)
“Enabling Workflows in GridSolve: Request Sequencing and Service Trading,”
Journal of Supercomputing, vol. 64, issue 3, pp. 1133-1152, June 2013.
(821.29 KB)
“An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
(311.23 KB)
“Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
(3.89 MB)
“Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems,”
Parallel Computing, vol. 39, issue 4-5, pp. 212-232, May 2013.
(1.43 MB)
“High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 3, no. 16, 2013.
(665.7 KB)
“HPC Challenge: Design, History, and Implementation Highlights,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
(790.01 KB)
“Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.
(866.68 KB)
“Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures,”
IPDPS 2013 (submitted), Boston, MA, 00 2013.
(1.22 MB)
“Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC,”
Lawn 277, no. UT-CS-13-709, May 2013.
(298.63 KB)
“An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
University of Tennessee Computer Science Technical Report (also LAWN 283), no. ut-eecs-13-720: University of Tennessee, October 2013.
(1.23 MB)
“An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
Supercomputing 2013, Denver, CO, November 2013.
“Keeneland: Computational Science Using Heterogeneous GPU Computing,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
(2.7 MB)
“Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms,”
Journal of Parallel and Distributed Computing, vol. 73, issue 7, pp. 1000-1010, July 2013.
(1.4 MB)
“LAPACK,”
Handbook of Linear Algebra, Second, Boca Raton, FL, CRC Press, 2013.
(223.21 KB)
“Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations,”
International Supercomputing Conference (ISC), Lecture Notes in Computer Science, vol. 7905, Leipzig, Germany, Springer Berlin Heidelberg, pp. 67-80, June 2013.
(2.14 MB)
“Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 2, February 2013.
(439.46 KB)
“LU Factorization with Partial Pivoting for a Multicore System with Accelerators,”
IEEE Transactions on Parallel and Distributed Computing, vol. 24, issue 8, pp. 1613-1621, August 2013.
(1.08 MB)
“