“Autotuning GEMM Kernels for the Fermi GPU,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, November 2012. DOI: 10.1109/TPDS.2011.311
“Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,” ICCS 2012, Omaha, NE, June 2012.
“A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines,” Proc. of the International Conference on Computational Science (ICCS), vol. 9, pp. 17-26, June 2012.
“Dense Linear Algebra on Accelerated Multicore Hardware,” High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 2012.
“Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,” SIAM Journal on Scientific Computing, vol. 34, no. 2, pp. C70-C82, April 2012.
“Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,” 26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.
“From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,” Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.
The Future of Computing: Software Libraries, Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
MAGMA: A Breakthrough in Solvers for Eigenvalue Problems, San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
“Matrices Over Runtime Systems at Exascale,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,” Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,” The International Conference on Computational Science (ICCS), June 2012.
“Performance Evaluation of LU Factorization through Hardware Counter Measurements,” University of Tennessee Computer Science Technical Report, no. UT-CS-12-700, October 2012.
“Power Aware Computing on GPUs,” SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.
“Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 2012.
“Providing GPU Capability to LU and QR within the ScaLAPACK Framework,” University of Tennessee Computer Science Technical Report (also LAWN 272), no. UT-CS-12-699, September 2012.
“Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,” Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper), Rhodes Island, Greece, August 2012.
“Accelerating Linear System Solutions Using Randomization Techniques,” INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11), Waterloo, Ontario, Canada, July 2011.
“Autotuning GEMMs for Fermi,” University of Tennessee Computer Science Technical Report, UT-CS-11-671 (also LAWN 245), April 2011.
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems, no. UT-CS-11-689, December 2011.
“A Block-Asynchronous Relaxation Method for Graphics Processing Units,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-687 / LAWN 258, November 2011.
“A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures,” Symposium for Application Accelerators in High Performance Computing (SAAHPC'11), Knoxville, TN, July 2011.
“Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,” University of Tennessee Computer Science Technical Report, UT-CS-11-668 (also LAWN 250), June 2011.
“A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,” in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 2011.
“LU Factorization for Accelerator-Based Systems,” IEEE/ACS AICCSA 2011, Sharm-El-Sheikh, Egypt, December 2011.
MAGMA - LAPACK for GPUs, Atlanta, GA, Keeneland GPU Tutorial, April 2011.
MAGMA - LAPACK for HPC on Heterogeneous Architectures, Oak Ridge, TN, Titan Summit at Oak Ridge National Laboratory, Presentation, August 2011.
Matrix Algebra on GPU and Multicore Architectures, Basel, Switzerland, Workshop on GPU-enabled Numerical Libraries, Presentation, May 2011.
“Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,” ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.
“Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,” International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, ACM, September 2011. DOI: 10.1109/ICPP.2011.71
“Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,” IEEE Cluster: Workshop on Parallel Programming on Accelerator Clusters (PPAC), June 2011.
Power-aware Computing on GPGPUs, Gatlinburg, TN, Fall Creek Falls Conference, Poster, September 2011.
“Soft Error Resilient QR Factorization for Hybrid System,” University of Tennessee Computer Science Technical Report, no. UT-CS-11-675, Knoxville, TN, July 2011.
“Soft Error Resilient QR Factorization for Hybrid System,” UT-CS-11-675 (also LAPACK Working Note #252), no. ICL-CS-11-675, July 2011.
“Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, Seattle, WA, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11, November 2011.
“A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,” IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“Accelerating GPU Kernels for Dense Linear Algebra,” Proc. of VECPAR'10, Berkeley, CA, June 2010.
Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial, July 2010.
“Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing,” Parallel Computing, vol. 36, no. 12, pp. 645-654, 2010.
Autotuning Dense Linear Algebra Libraries on GPUs, Basel, Switzerland, Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010), June 2010.
“BLAS for GPUs,” Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.
“Dense Linear Algebra for Hybrid GPU-based Systems,” Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.
Dense Linear Algebra Solvers for Multicore with GPU Accelerators, Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.
“Dense Linear Algebra Solvers for Multicore with GPU Accelerators,” Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, Atlanta, GA, pp. 1-8, 2010. DOI: 10.1109/IPDPSW.2010.5470941
“Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems,” SIAM Journal on Scientific Computing (submitted), August 2010.
“Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs,” LAPACK Working Note, no. 230, 2010.
“Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,” IEEE Transactions on Parallel and Distributed Systems (submitted), March 2010.
“An Improved MAGMA GEMM for Fermi GPUs,” International Journal of High Performance Computing, vol. 24, no. 4, pp. 511-515, 2010.