Publications
“Accelerating Linear System Solutions Using Randomization Techniques,”
ACM Transactions on Mathematical Software (also LAWN 246), vol. 39, issue 2, February 2013.
DOI: 10.1145/2427023.2427025
“Accelerating NWChem Coupled Cluster through dataflow-based Execution,”
The International Journal of High Performance Computing Applications, vol. 32, issue 4, pp. 540–551, July 2018.
DOI: 10.1177/1094342016672543
“Accelerating NWChem Coupled Cluster through Dataflow-Based Execution,”
The International Journal of High Performance Computing Applications, pp. 1–13, January 2017.
DOI: 10.1177/1094342016672543
“Accelerating Restarted GMRES with Mixed Precision Arithmetic,”
IEEE Transactions on Parallel and Distributed Systems, June 2021.
DOI: 10.1109/TPDS.2021.3090757
“Accelerating Scientific Computations with Mixed Precision Algorithms,”
Computer Physics Communications, vol. 180, issue 12, pp. 2526–2533, December 2009.
DOI: 10.1016/j.cpc.2008.11.005
“Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing,”
Parallel Computing, vol. 36, no. 12, pp. 645–654, 2010.
“Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,”
Journal of Computational Science, vol. 26, pp. 237–245, May 2018.
DOI: 10.1016/j.jocs.2018.01.007
“Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,”
Parallel Computing, vol. 74, pp. 3–18, May 2018.
DOI: 10.1016/j.parco.2017.10.004
“Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting,”
Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408–1431, May 2014.
DOI: 10.1002/cpe.3110
“Adaptive Precision in Block-Jacobi Preconditioning for Iterative Sparse Linear System Solvers,”
Concurrency and Computation: Practice and Experience, vol. 31, no. 6, e4460, March 2019.
DOI: 10.1002/cpe.4460
“Advancements of PAPI for the exascale generation,”
The International Journal of High Performance Computing Applications, December 2024.
DOI: 10.1177/10943420241303884
“Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Parallel Computing, vol. 81, pp. 1–21, January 2019.
DOI: 10.1016/j.parco.2018.10.003
“Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 12, pp. 2700–2712, December 2018.
DOI: 10.1109/TPDS.2018.2842785
“Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures,”
Submitted to Concurrency and Computation: Practice and Experience, November 2010.
“Argobots: A Lightweight Low-Level Threading and Tasking Framework,”
IEEE Transactions on Parallel and Distributed Systems, October 2017.
DOI: 10.1109/TPDS.2017.2766062
“An Asynchronous Algorithm on NetSolve Global Computing System,”
Future Generation Computer Systems, vol. 22, issue 3, pp. 279–290, February 2006.
DOI: 10.1016/j.future.2005.10.003
“Autotuning GEMM Kernels for the Fermi GPU,”
IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, November 2012.
DOI: 10.1109/TPDS.2011.311
“Autotuning in High-Performance Computing Applications,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2068–2083, November 2018.
DOI: 10.1109/JPROC.2018.2841200
“Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
DOI: 10.1109/JPROC.2018.2868961
“Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
DOI: 10.14529/jsfi180404
“Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,”
Journal of Computational Science, vol. 26, pp. 226–236, May 2018.
DOI: 10.1016/j.jocs.2018.01.005
“Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications,”
Future Generation Computer Systems, vol. 160, pp. 359–374, November 2024.
DOI: 10.1016/j.future.2024.06.004
“