Introduction to the HPC Challenge Benchmark Suite, March 2005.
“Linear Algebra Software for Large-Scale Accelerated Multicore Computing,” Acta Numerica, vol. 25, pp. 1-160, May 2016. DOI: 10.1017/S0962492916000015
“The LINPACK Benchmark: Past, Present, and Future,” Concurrency: Practice and Experience, vol. 15, pp. 803-820, 2008.
“Looking Back at Dense Linear Algebra Software,” Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear), 2012.
“Looking Back at Dense Linear Algebra Software,” Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548–2560, July 2014. DOI: 10.1016/j.jpdc.2013.10.005
“LU Factorization with Partial Pivoting for a Multicore System with Accelerators,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, issue 8, pp. 1613-1621, August 2013. DOI: http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.242
“Materials fingerprinting classification,” Computer Physics Communications, pp. 108019, May Jan. DOI: 10.1016/j.cpc.2021.108019
“Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems,” International Journal of High Performance Computing Applications (to appear), August 2007.
“Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems,” Supercomputing Frontiers and Innovations, vol. 1, issue 1, 2014. DOI: http://dx.doi.org/10.14529/jsfi1401
“Multithreading in the PLASMA Library,” Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications: Taylor & Francis, 2013.
“A New Metric for Ranking High-Performance Computing Systems,” National Science Review, vol. 3, issue 1, pp. 30-35, January 2016. DOI: 10.1093/nsr/nwv084
“Parallel Programming in MATLAB,” The International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 277-283, July 2009.
“Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,” Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015. DOI: 10.14529/jsfi1504
“PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,” ACM Transactions on Mathematical Software, vol. 45, issue 2, June 2019. DOI: 10.1145/3264491
“The PlayStation 3 for High Performance Scientific Computing,” Computing in Science and Engineering, pp. 80-83, January 2008.
“Porting the PLASMA Numerical Library to the OpenMP Standard,” International Journal of Parallel Programming, June 2016. DOI: 10.1007/s10766-016-0441-6
“Power Aware Computing on GPUs,” SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.
“Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,” LAWN 267, 2012.
“Prospectus for the Next LAPACK and ScaLAPACK Libraries,” PARA 2006, Umea, Sweden, June 2006.
“Recursive Approach in Sparse Matrix LU Factorization,” Scientific Programming, vol. 9, no. 1, pp. 51-60, 2001.
“Self Adapting Numerical Software SANS Effort,” IBM Journal of Research and Development, vol. 50, no. 2/3, pp. 223-238, January 2006.
“Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,” Parallel Computing, vol. 29, no. 11-12, pp. 1723-1743, November 2003.
“A Set of Batched Basic Linear Algebra Subprograms,” ACM Transactions on Mathematical Software, October 2020.
“A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines,” ACM Transactions on Mathematical Software (TOMS), vol. 47, no. 3, pp. 1–23, 2021. DOI: 10.1145/3431921
“The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018. DOI: 10.1137/17M1117732
“Soft Error Resilient QR Factorization for Hybrid System,” UT-CS-11-675 (also LAPACK Working Note #252), no. ICL-CS-11-675, July 2011.
“Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, vol. 4, issue 6, pp. 457–464, November 2013. DOI: http://dx.doi.org/10.1016/j.jocs.2013.01.004
“Soft Error Resilient QR Factorization for Hybrid System with GPGPU,” Journal of Computational Science, Seattle, WA, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11, November 2011.
“A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,” Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015. DOI: 10.1002/cpe.3306
“Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP,” International Journal of Computational Science and Engineering (IJCSE), vol. 17, no. 3, October 2018. DOI: http://dx.doi.org/10.1504/IJCSE.2018.095851
“Translational process: Mathematical software perspective,” Journal of Computational Science, vol. 52, pp. 101216, 2021. DOI: 10.1016/j.jocs.2020.101216
“Translational Process: Mathematical Software Perspective,” Journal of Computational Science, September 2020. DOI: 10.1016/j.jocs.2020.101216
“Using MAGMA with PGI Fortran,” PGI Insider, November 2010.
“Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,” ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 17-22, 2008.
Clover: Computational Libraries Optimized via Exascale Research, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects, Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.
The PLASMA Library on CORAL Systems and Beyond (Poster), Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
Using Quantized Integer in LU Factorization with Partial Pivoting (Poster), Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs, Pittsburgh, Pennsylvania, SIAM Annual Meeting, July 2017.
MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi, Frankfurt, Germany, ISC High Performance (ISC15), Intel Booth Presentation, June 2015.
“Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization,” University of Tennessee Computer Science Technical Report (also as a LAWN), no. ICL-UT-11-08, September 2011.
“On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013.
“Analysis of the Communication and Computation Cost of FFT Libraries towards Exascale,” ICL Technical Report, no. ICL-UT-22-07: Innovative Computing Laboratory, July 2022.
“Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess,” Innovative Computing Laboratory (ICL) Technical Report, no. ICL-UT-10-03, June 2010.
“C++ API for Batch BLAS,” SLATE Working Notes, no. 04, ICL-UT-17-12: University of Tennessee, December 2017.
“C++ API for BLAS and LAPACK,” SLATE Working Notes, no. 02, ICL-UT-17-03: Innovative Computing Laboratory, University of Tennessee, June 2017.
“The Case for Directive Programming for Accelerator Autotuner Optimization,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-07: University of Tennessee, October 2017.