Research Assistant Professor

Betancourt, F., Wong, K., Asemota, E., Marshall, Q., Nichols, D., Tomov, S. **"openDIEL: A Parallel Workflow Engine and Data Analytics Framework,"** *In Practice and Experience in Advanced Research Computing (PEARC ’19)*, ACM, Chicago, IL, July 28-August 1, 2019

Nichols, D., Wong, K., Tomov, S., Ng, L., Chen, S., Gessinger, A. **"MagmaDNN: Accelerated Deep Learning Using MAGMA,"** *In Practice and Experience in Advanced Research Computing (PEARC ’19)*, ACM, Chicago, IL, July 28-August 1, 2019

Wong, K., Tomov, S., Dongarra, J. **"Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,"** *ISC High Performance 2019, "HPC Education and Training for Emerging Technologies" workshop*, Springer International Publishing, Frankfurt, Germany, June 20, 2019

Nichols, D., Tomov, N.-S., Betancourt, F., Tomov, S., Wong, K., Dongarra, J. **"MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing,"** *ISC High Performance 2019, "Scalable Data Analytics in Scientific Computing" workshop*, Springer International Publishing, Frankfurt, Germany, June 20, 2019

Abdelfattah, A., Tomov, S., Dongarra, J. **"Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,"** *33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS)*, IEEE, Rio de Janeiro, Brazil, May 20-24, 2019

Tomov, S., Haidar, A., Ayala, A., Schultz, D., Dongarra, J. **"Design and Implementation for FFT-ECP on Distributed Accelerated Systems,"** *ECP WBS 2.3.3.09 Milestone Report*, Innovative Computing Laboratory, University of Tennessee, FFT-ECP ST-MS-10-1410, April 4, 2019

Tomov, S., Haidar, A., Schultz, D., Dongarra, J. **"Evaluation and Design of FFT for Distributed Accelerated Systems,"** *ECP WBS 2.3.3.09 Milestone Report*, Innovative Computing Laboratory, University of Tennessee, FFT-ECP ST-MS-10-1216, October 1, 2018

Yamazaki, I., Tomov, S., Dongarra, J. **"Sampling Algorithms to Update Truncated SVD,"** *IEEE International Conference on Big Data*, Boston, MA, December 11-14, 2017

Dongarra, J., Haidar, A., Hernandez, O., Tomov, S., Gorentla Venkata, M. **"POMPEI: Programming with OpenMP4 for Exascale Investigations,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-17-754, December 7, 2017

Haidar, A., Abdelfattah, A., Zounon, M., Tomov, S., Dongarra, J. **"A Guide For Achieving High Performance With Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations,"** *IEEE Transactions on Parallel and Distributed Systems*, DOI: 10.1109/TPDS.2017.2783929, December, 2017

Haidar, A., Wu, P., Tomov, S., Dongarra, J. **"Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers,"** *ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems*, ACM, Denver, Colorado, November 12-17, 2017

Gates, M., Tomov, S., Dongarra, J. **"Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,"** *Parallel Computing*, Vol. 71, November, 2017

Haidar, A., Jagode, H., YarKhan, A., Vaccaro, P., Tomov, S., Dongarra, J. **"Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi,"** *2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist*, IEEE, Waltham, MA, September 12-14, 2017

Haidar, A., Kabir, K., Fayad, D., Tomov, S., Dongarra, J. **"Out Of Memory SVD Solver for Big Data,"** *2017 IEEE High Performance Extreme Computing Conference (HPEC'17)*, IEEE, Waltham, MA, September 12-14, 2017

Kabir, K., Haidar, A., Tomov, S., Bouteiller, A., Dongarra, J. **"A Framework for Out of Memory SVD Algorithms,"** *ISC High Performance 2017*, Springer International Publishing, Frankfurt, Germany, pp. 158-178, June 19-21, 2017

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs,"** *International Conference on Supercomputing (ICS'17)*, ACM, Chicago, Illinois, pp. 1-10, June 14-16, 2017

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures,"** *International Conference on Computational Science (ICCS'17)*, Zurich, Switzerland, pp. 606-615, June 12-14, 2017

Dong, T., Haidar, A., Tomov, S., Dongarra, J. **"Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,"** *International Conference on Computational Science (ICCS'17)*, Zurich, Switzerland, pp. 1008-1018, June 12-14, 2017

Yamazaki, I., Nooshabadi, S., Tomov, S., Dongarra, J. **"Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,"** *IEEE Embedded Systems Letters*, IEEE, Vol. PP, No. 99, May 2, 2017

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,"** *Journal of Computational Science*, Elsevier, Vol. 20, pp. 85-93, May, 2017

Abdelfattah, A., Baboulin, M., Dobrev, V., Dongarra, J., Haidar, A., Karlin, I., Kolev, Tz., Masliah, I., Tomov, S. **"Small Tensor Operations on Advanced Architectures for High-order Applications,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-17-749, April 18, 2017

Haidar, A., Abdelfattah, A., Tomov, S., Dongarra, J. **"High-performance Cholesky Factorization for GPU-only Execution,"** *Proceedings of the General Purpose GPUs (GPGPU-10)*, ACM, Austin, TX, pp. 42-52, February 5, 2017

Baboulin, M., Dongarra, J., Remy, A., Tomov, S., Yamazaki, I. **"Solving dense symmetric indefinite systems using GPUs,"** *Concurrency and Computation: Practice and Experience*, Special Issue on Parallel Processing and Applied Mathematics (PPAM'15), Vol. 29, Issue 9, 2017

Lopez, M., Larrea, V., Joubert, W., Hernandez, O., Haidar, A., Tomov, S., Dongarra, J. **"Evaluation of Directive-based Performance Portable Programming Models,"** *International Journal of High Performance Computing and Networking (IJHPCN)*, (In Press), 2017

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-16-748, December 28, 2016

Haidar, A., Abdelfattah, A., Tomov, S., Dongarra, J. **"High-performance Cholesky factorization for GPU-only execution,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-16-747, December 26, 2016

Lopez, M., Larrea, V., Joubert, W., Hernandez, O., Haidar, A., Tomov, S., Dongarra, J. **"Towards Achieving Performance Portability Using Directives for Accelerators,"** *The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD)*, Salt Lake City, Utah, November 13-18, 2016

Haidar, A., Tomov, S., Arturov, K., Guney, M., Story, S., Dongarra, J. **"LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi,"** *IEEE High Performance Extreme Computing Conference (HPEC'16)*, Waltham, MA, September 13-15, 2016

Haidar, A., Brock, B., Tomov, S., Guidry, M., Billings, J., Shyles, D., Dongarra, J. **"Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations,"** *2016 IEEE High Performance Extreme Computing Conference (HPEC'16)*, September 13-15, 2016

Masliah, I., Abdelfattah, A., Haidar, A., Tomov, S., Baboulin, M., Falcou, J., Dongarra, J. **"High-performance matrix-matrix multiplications of very small matrices,"** *22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16)*, Grenoble, France, August 22-26, 2016

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Performance, Design, and Autotuning of Batched GEMM for GPUs,"** *The International Supercomputing Conference (ISC High Performance 2016)*, Frankfurt, Germany, June 19-23, 2016

Abdelfattah, A., Baboulin, M., Dobrev, V., Dongarra, J., Earl, C., Falcou, J., Haidar, A., Karlin, I., Kolev, Tz., Masliah, I., Tomov, S. **"High-Performance Tensor Contractions for GPUs,"** *International Conference on Computational Science (ICCS'16)*, San Diego, California, U.S.A., June 6-8, 2016

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs,"** *International Conference on Computational Science (ICCS'16)*, San Diego, California, U.S.A., June 6-8, 2016

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,"** *The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, IEEE*, Chicago, IL, USA, May 27, 2016

Newburn, C. J., Bansal, G., Wood, M., Crivelli, L., Planas, J., Duran, A., Souza, P., Borges, L., Luszczek, P., Tomov, S., Dongarra, J., Anzt, H., Gates, M., Haidar, A., Jia, Y., Kabir, K., Yamazaki, I., Labarta, J. **"Heterogeneous Streaming,"** *The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, IEEE*, Chicago, IL, USA, May 23, 2016

Abdelfattah, A., Haidar, A., Tomov, S., Dongarra, J. **"Performance, Design, and Autotuning of Batched GEMM for GPUs,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-16-739, February 1, 2016

Abdelfattah, A., Baboulin, M., Dobrev, V., Dongarra, J., Earl, C., Falcou, J., Haidar, A., Karlin, I., Kolev, Tz., Masliah, I., Tomov, S. **"High-Performance Tensor Contractions for GPUs,"** *University of Tennessee Computer Science Technical Report*, UT-EECS-16-738, January 21, 2016

Yamazaki, I., Tomov, S., Dongarra, J. **"Non-GPU-resident Dense Symmetric Indefinite Factorization,"** *Concurrency and Computation: Practice and Experience*, 2016

Haidar, A., Jia, Y., Luszczek, P., Tomov, S., YarKhan, A., Dongarra, J. **"Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators,"** *Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15)*, ACM, New York, NY, USA, No. 5, November 16, 2015

Mary, T., Yamazaki, I., Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J. **"Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs,"** *The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15)*, Austin, TX, November 15, 2015

Yamazaki, I., Tomov, S., Kurzak, J., Dongarra, J., Barlow, J. **"Mixed-precision Block Gram Schmidt Orthogonalization,"** *6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems*, Austin, TX, November, 2015

Baboulin, M., Dongarra, J., Remy, A., Tomov, S., Yamazaki, I. **"Dense Symmetric Indefinite Factorization on GPU accelerated architectures,"** *International Conference on Parallel Processing and Applied Mathematics (PPAM)*, Krakow, Poland, September 6-9, 2015

Haidar, A., Luszczek, P., Tomov, S., Dongarra, J. **"Batched Matrix Computations on Hardware Accelerators,"** *EuroMPI/Asia 2015 Workshop*, Bordeaux, France, September, 2015

Haidar, A., Tomov, S., Luszczek, P., Dongarra, J. **"MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing,"** *19th IEEE High Performance Extreme Computing Conference (HPEC 2015), Best Paper Award*, IEEE, Waltham, MA, September, 2015

YarKhan, A., Haidar, A., Cao, C., Luszczek, P., Tomov, S., Dongarra, J. **"Cholesky Across Accelerators,"** *17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015)*, IEEE, Elizabeth, NJ, August, 2015

Kabir, K., Haidar, A., Tomov, S., Dongarra, J. **"On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors,"** *ISC High Performance 2015*, Frankfurt, Germany, July 12-16, 2015

Haidar, A., Dong, T., Tomov, S., Luszczek, P., Dongarra, J. **"Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations,"** *ISC HPC*, Springer LNCS, Frankfurt, Germany, July 12-16, 2015

Kabir, K., Haidar, A., Tomov, S., Dongarra, J. **"Performance Analysis and Optimisation of Two-Sided Factorization Algorithms for Heterogeneous Platform,"** *The International Conference on Computational Science (ICCS 2015)*, Reykjavík, Iceland, June 1-3, 2015

Kabir, K., Haidar, A., Tomov, S., Dongarra, J. **"Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures,"** *The Spring Simulation Multi-Conference 2015 (SpringSim'15)*, Alexandria, VA, April 12-15, 2015

Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J. **"Batched matrix computations on hardware accelerators based on GPUs,"** *International Journal of High Performance Computing Applications*, Sage Publications, Inc., February 9, 2015

Haidar, A., Dong, T., Luszczek, P., Tomov, S., Dongarra, J. **"Optimization for performance and energy for batched matrix computations on GPUs,"** *GPGPU 2015 Proceedings of the 8th Workshop on General Purpose Processing using GPUs*, ACM, San Francisco, CA, pp. 59-69, February 7, 2015

Anzt, H., Tomov, S., Dongarra, J. **"Energy efficiency and performance frontiers for sparse computations on GPU supercomputers,"** *Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15)*, ACM, San Francisco, CA, February, 2015

Haidar, A., Dongarra, J., Kabir, K., Gates, M., Luszczek, P., Tomov, S., Jia, Y. **"HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,"** *Scientific Computing*, IOS Press, Vol. 23, No. 1, January, 2015

Yamazaki, I., Tomov, S., Dongarra, J. **"Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations,"** *Scientific Programming*, 2015

Yamazaki, I., Tomov, S., Dongarra, J. **"Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs,"** *SIAM Journal on Scientific Computing*, Vol. 37, No. 3, pp. C307-C330, 2015

Anzt, H., Sawyer, W., Tomov, S., Luszczek, P., Dongarra, J. **"Acceleration of GPU-based Krylov solvers via Data Transfer Reduction,"** *IJHPCA special issue for AsHES workshop*, 2015

Abalenkovs, M., Abdelfattah, A., Dongarra, J., Gates, M., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., Yamazaki, I., YarKhan, A. **"Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,"** *Supercomputing Frontiers and Innovations*, Vol. 2, No. 4, pp. 67-86, 2015

Yamazaki, I., Tomov, S., Dongarra, J. **"Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES,"** *5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems*, New Orleans, LA, November 17, 2014

Haidar, A., Cao, C., Yamazaki, I., Dongarra, J., Gates, M., Luszczek, P., Tomov, S. **"Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors,"** *ScalA 2014*, ACM, New Orleans, LA, November 17, 2014

Yamazaki, I., Rajamanickam, S., Boman, E., Hoemmen, M., Heroux, M., Tomov, S. **"Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,"** *The International Conference for High Performance Computing, Networking, Storage and Analysis (SC)*, New Orleans, LA, November, 2014

Anzt, H., Tomov, S., Dongarra, J. **"Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product,"** *University of Tennessee Computer Science Technical Report*, University of Tennessee, Knoxville, TN, UT-EECS-14-731, October 17, 2014

Dong, T., Haidar, A., Tomov, S., Dongarra, J. **"A Fast Batched Cholesky Factorization on a GPU,"** *2014 International Conference on Parallel Processing (ICPP-2014)*, Minneapolis, MN, September, 2014

Dong, T., Haidar, A., Luszczek, P., Harris, J., Tomov, S., Dongarra, J. **"LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU,"** *16th IEEE International Conference on High Performance Computing and Communications (HPCC)*, Paris, France, pp. 157-161, August 20-22, 2014

Dongarra, J., Gates, M., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., Yamazaki, I. **"Accelerating Numerical Dense Linear Algebra Calculations with GPUs,"** *Numerical Calculations with GPUs*, Kindratenko, V., ed., Springer International Publishing, pp. 3-28, July, 2014

Anzt, H., Lukarski, D., Tomov, S., Dongarra, J. **"Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures,"** *VECPAR 2014*, Eugene, OR, June, 2014

Haidar, A., Luszczek, P., Tomov, S., Dongarra, J. **"Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments,"** *VECPAR 2014*, Eugene, OR, June, 2014

Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S., YarKhan, A. **"Model-Driven One-Sided Factorizations on Multicore Accelerated Systems,"** *International Journal on Supercomputing Frontiers and Innovations*, Vol. 1, No. 1, June, 2014

Cao, C., Dongarra, J., Du, P., Gates, M., Luszczek, P., Tomov, S. **"clMAGMA: High Performance Dense Linear Algebra with OpenCL,"** *International Workshop on OpenCL*, Bristol University, England, May 12-13, 2014

Anzt, H., Tomov, S., Luszczek, P., Yamazaki, I., Dongarra, J., Sawyer, W. **"Optimizing Krylov Subspace Solvers on Graphics Processing Units,"** *Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014*, IEEE, Phoenix, AZ, May, 2014

Donfack, S., Tomov, S., Dongarra, J. **"Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,"** *Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014*, IEEE, Phoenix, AZ, May, 2014

Dong, T., Haidar, A., Tomov, S., Dongarra, J. **"Batched Cholesky Factorization on a GPU,"** *VECPAR 2014 (Submitted)*, Eugene, OR, January, 2014

Du, P., Luszczek, P., Tomov, S., Dongarra, J. **"Soft Error Resilient QR Factorization for Hybrid System with GPGPU,"** *Journal of Computational Science*, Alexandrov, V., ed., Elsevier B.V., Vol. 4, No. 6, pp. 457-464, November, 2013

Dongarra, J., Gates, M., Haidar, A., Jia, Y., Kabir, K., Luszczek, P., Tomov, S. **"Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,"** *PPAM 2013*, Warsaw, Poland, September, 2013

Haidar, A., Tomov, S., Dongarra, J., Solca, R., Schulthess, T. **"A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,"** *International Journal of High Performance Computing Applications*, August, 2013

Anzt, H., Tomov, S., Dongarra, J., Heuveline, V. **"A Block-Asynchronous Relaxation Method for Graphics Processing Units,"** *Journal of Parallel and Distributed Computing*, June, 2013

Haidar, A., Solca, R., Gates, M., Tomov, S., Schulthess, T., Dongarra, J. **"Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations,"** *International Supercomputing Conference ISC, Lecture Notes in Computer Science*, Leipzig, Germany, Vol. 7905, pp. 67-80, June, 2013

Cao, C., Dongarra, J., Du, P., Gates, M., Luszczek, P., Tomov, S. **"clMAGMA: High Performance Dense Linear Algebra with OpenCL,"** *University of Tennessee Computer Science Technical Report (LAWN 275)*, UT-CS-13-706, March, 2013

Baboulin, M., Dongarra, J., Herrmann, J., Tomov, S. **"Accelerating linear system solutions using randomization techniques,"** *ACM Transactions on Mathematical Software (TOMS)*, Vol. 39, No. 2, February, 2013

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J. **"Scalable Dense Linear Algebra on Heterogeneous Hardware,"** *HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing*, IOS Press, 2013

Vetter, J., Glassbrook, R., Schwan, K., Yalamanchili, S., Horton, M., Gavrilovska, A., Slawinska, M., Meredith, J., Roth, P., Spafford, K., Tomov, S., Wynkoop, J. **"Keeneland: Computational Science using Heterogeneous GPU Computing,"** *Contemporary High Performance Computing: From Petascale Toward Exascale*, Vetter, J., ed., Taylor and Francis, CRC Computational Science Series, Boca Raton, FL, Chapter 7, 2013

Solcà, R., Haidar, A., Tomov, S., Dongarra, J., Schulthess, T. **"A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,"** *Supercomputing '12 (poster)*, Salt Lake City, Utah, November, 2012

Agullo, E., Bosilca, G., Castagnède, C., Dongarra, J., Ltaief, H., Tomov, S. **"Matrices Over Runtime Systems at Exascale,"** *Supercomputing '12 (poster)*, Salt Lake City, Utah, November, 2012

Dong, T., Kolev, T., Rieben, R., Dobrev, V., Tomov, S., Dongarra, J. **"Acceleration of the BLAST Hydro Code on GPU,"** *Supercomputing '12 (poster)*, Salt Lake City, Utah, November, 2012

Donfack, S., Tomov, S., Dongarra, J. **"Performance evaluation of LU factorization through hardware counter measurements,"** *University of Tennessee Computer Science Technical Report*, UT-CS-12-700, October, 2012

Anzt, H., Tomov, S., Dongarra, J., Heuveline, V. **"A Block-Asynchronous Relaxation Method for Graphics Processing Units,"** *Journal of Parallel and Distributed Computing (submitted)*, October, 2012

Du, P., Tomov, S., Dongarra, J. **"Providing GPU Capability to LU and QR within the ScaLAPACK Framework,"** *University of Tennessee Computer Science Technical Report (also LAWN 272)*, UT-CS-12-699, September 12, 2012

Anzt, H., Tomov, S., Dongarra, J., Heuveline, V. **"Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,"** *Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper)*, Rhodes Island, Greece, August, 2012

Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J. **"From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,"** *Parallel Computing*, Vol. 38, No. 8, pp. 391-407, August, 2012

Kasichayanula, K., Terpstra, D., Luszczek, P., Tomov, S., Moore, S., Peterson, G. **"Power Aware Computing on GPUs,"** *SAAHPC '12 (Best Paper Award)*, Argonne, IL, July 10-11, 2012

Yamazaki, I., Tomov, S., Dongarra, J. **"One-sided dense matrix factorizations on a multicore with multiple GPU accelerators,"** *The International Conference on Computational Science (ICCS)*, June 4, 2012

Song, F., Tomov, S., Dongarra, J. **"Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,"** *26th ACM International Conference on Supercomputing (ICS 2012)*, ACM, San Servolo Island, Venice, Italy, June, 2012

Baboulin, M., Donfack, S., Dongarra, J., Grigori, L., Remy, A., Tomov, S. **"A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines,"** *Proc. of the International Conference on Computational Science (ICCS)*, Vol. 9, pp. 17-26, June, 2012

Anzt, H., Tomov, S., Gates, M., Dongarra, J., Heuveline, V. **"Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,"** *ICCS 2012*, Omaha, NE, June, 2012

Vomel, C., Tomov, S., Dongarra, J. **"Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,"** *SIAM Journal on Scientific Computing*, Vol. 34, No. 2, pp. C70-C82, April 12, 2012

Baboulin, M., Dongarra, J., Herrmann, J., Tomov, S. **"Accelerating Linear System Solutions Using Randomization Techniques,"** *ACM Transactions on Mathematical Software (accepted) (also LAWN 246)*, March, 2012

Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S. **"Dense Linear Algebra on Accelerated Multicore Hardware,"** *High Performance Scientific Computing: Algorithms and Applications*, Berry, M., et al., eds., Springer-Verlag, London, UK, 2012

Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J. **"Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,"** *LAWN 267*, 2012

Anzt, H., Tomov, S., Gates, M., Dongarra, J., Heuveline, V. **"Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,"** UT-CS-11-689, December 6, 2011

Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Langou, J., Ltaief, H., Tomov, S. **"LU Factorization for Accelerator-based Systems,"** *IEEE/ACS AICCSA 2011*, Sharm-El-Sheikh, Egypt, December, 2011

Anzt, H., Tomov, S., Dongarra, J., Heuveline, V. **"A Block-Asynchronous Relaxation Method for Graphics Processing Units,"** *University of Tennessee Computer Science Technical Report*, UT-CS-11-687 / LAWN 258, November 30, 2011

Nath, R., Tomov, S., Dong, T., Dongarra, J. **"Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,"** *ACM/IEEE Conference on Supercomputing (SC'11)*, Seattle, WA, November 12-18, 2011

Malony, A., Biersdorff, S., Shende, S., Jagode, H., Tomov, S., Juckeland, G., Dietrich, R., Poole, D., Lamb, C. **"Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,"** *International Conference on Parallel Processing (ICPP'11)*, Taipei, Taiwan, September 13-16, 2011

Baboulin, M., Dongarra, J., Herrmann, J., Tomov, S. **"Accelerating Linear System Solutions Using Randomization Techniques,"** *INRIA RR-7616 / LAWN #246 (presented at International AMMCS'11)*, Waterloo, Ontario, Canada, July 25-29, 2011

Horton, M., Tomov, S., Dongarra, J. **"A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures,"** *Symposium for Application Accelerators in High Performance Computing (SAAHPC'11)*, Knoxville, TN, July 19-20, 2011

Du, P., Luszczek, P., Tomov, S., Dongarra, J. **"Soft Error Resilient QR Factorization for Hybrid System,"** *UT-CS-11-675 (also LAPACK Working Note #252)*, ICL-CS-11-675, July 1, 2011

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. **"Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,"** *IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC)*, June 24, 2011

Song, F., Tomov, S., Dongarra, J. **"Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,"** *University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also LAWN 250)*, June 16, 2011

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. **"A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,"** *IEEE International Parallel and Distributed Processing Symposium (submitted)*, Anchorage, AK, May 16-20, 2011

Kurzak, J., Tomov, S., Dongarra, J. **"Autotuning GEMMs for Fermi,"** *University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also LAWN 245)*, April 18, 2011

Malony, A., Biersdorff, S., Shende, S., Jagode, H., Tomov, S., Juckeland, G., Dietrich, R., Poole, D., Lamb, C. **"Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,"** *ICPP 2011 (submitted)*, Taipei, Taiwan, 2011

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S. **"A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,"** *in GPU Computing Gems, Jade Edition*, Hwu, W., ed., Elsevier, Vol. 2, pp. 473-484, 2011

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S. **"GPU Computing Gems, Jade Edition,"** *ISBN: 9780123859631*, Hwu, W., ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 473-484 (Chapter 34), 2011

Nath, R., Tomov, S., Dongarra, J. **"BLAS for GPUs,"** *Scientific Computing with Multicore and Accelerators*, Kurzak, J., Bader, D., Dongarra, J., eds., Chapman & Hall/CRC Computational Science, December 7, 2010

Tomov, S., Dongarra, J. **"Dense Linear Algebra for Hybrid GPU-based Systems,"** *Scientific Computing with Multicore and Accelerators*, Kurzak, J., Bader, D., Dongarra, J., eds., Chapman & Hall/CRC Computational Science, December 7, 2010

Tomov, S., Faverge, M., Luszczek, P., Dongarra, J. **"Using MAGMA with PGI Fortran,"** *PGI Insider*, November 15, 2010

Agullo, E., Augonnet, C., Dongarra, J., Faverge, M., Ltaief, H., Thibault, S., Tomov, S. **"QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators,"** *Proceedings of IPDPS 2011*, Anchorage, AK, ICL-UT-10-04, October 1, 2010

Du, P., Luszczek, P., Tomov, S., Dongarra, J. **"Mixed-Tool Performance Analysis on Hybrid Multicore Architectures,"** *First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010)*, San Diego, CA, September 13-16, 2010

Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J. **"From CUDA to OpenCL: Towards a Performance-portable Solution for Multiplatform GPU Programming,"** *Parallel Computing (submitted)*, August, 2010

Vomel, C., Tomov, S., Dongarra, J. **"Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems,"** *SIAM Journal on Scientific Computing (submitted)*, August, 2010

Nath, R., Tomov, S., Dongarra, J. **"An Improved MAGMA GEMM for Fermi GPUs,"** *University of Tennessee Computer Science Technical Report*, UT-CS-10-655 (also LAPACK Working Note 227), July 29, 2010

Ltaief, H., Tomov, S., Nath, R., Du, P., Dongarra, J. **"A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators,"** *Proc. of VECPAR'10 (to appear)*, Berkeley, CA, June 22-25, 2010

Nath, R., Tomov, S., Dongarra, J. **"Accelerating GPU Kernels for Dense Linear Algebra,"** *Proc. of VECPAR'10*, Berkeley, CA, June 22-25, 2010

Bernholc, J., Hodak, M., Lu, W., Moore, S., Tomov, S. **"Scalability Study of a Quantum Simulation Code,"** *PARA 2010*, Reykjavik, Iceland, June 6-9, 2010

Tomov, S., Lu, W., Bernholc, J., Moore, S., Dongarra, J.**"Performance Evaluation for Petascale Quantum Simulation Tools,"** *Proceedings of the Cray Users' Group Meeting*, Atlanta, GA, May 4, 2010 [bibtex]

Ltaief, H., Tomov, S., Nath, R., Dongarra, J.**"Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,"** *IEEE Transactions on Parallel and Distributed Systems (submitted)*, March 26, 2010 [pdf] [bibtex]

Tomov, S., Nath, R., Ltaief, H., Dongarra, J.**"Dense Linear Algebra Solvers for Multicore with GPU Accelerators,"** *Proc. of IPDPS'10*, Atlanta, GA, January 15, 2010 [pdf] [bibtex]

Tomov, S., Nath, R., Dongarra, J.**"Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing,"** *Parallel Computing*, vol. 36, number 12, pp. 645-654, June 19, 2010 [pdf] [bibtex]

Nath, R., Tomov, S., Dongarra, J.**"An Improved MAGMA GEMM for Fermi GPUs,"** *International Journal of High Performance Computing Applications*, vol. 24, no. 4, pp. 511-515, November 18, 2010 [bibtex]

Tomov, S., Dongarra, J., Baboulin, M.**"Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,"** *Parallel Computing*, Vol. 36, Number 5-6, pp. 232-240, 2010 [pdf] [bibtex]

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S.**"Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs,"** *LAPACK Working Note 230*, 2010 [pdf] [bibtex]

Li, Y., Dongarra, J., Tomov, S.**"A note on auto-tuning GEMM for GPUs,"** *Proc. of ICCS'09*, Baton Rouge, LA, UT-CS-09-635, May 25-27, 2009 [pdf] [bibtex]

Li, Y., Dongarra, J., Tomov, S.**"A Note on Auto-tuning GEMM for GPUs,"** *Computational Science – ICCS 2009, Proceedings of the 9th International Conference, Lecture Notes in Computer Science: Theoretical Computer Science and General Issues*, Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. eds. Baton Rouge, LA, Parts I-II, Vols. 5544-5545, pp. 884-892, May 25-27, 2009 [bibtex]

Tomov, S., Dongarra, J.**"Accelerating the Reduction to Upper Hessenberg Form Through Hybrid GPU-based Computing,"** *University of Tennessee Computer Science Technical Report, UT-CS-09-642 (also LAPACK Working Note 219)*, May 24, 2009 [pdf] [bibtex]

Tomov, S., Lu, W., Bernholc, J., Moore, S., Dongarra, J.**"Performance evaluation for petascale quantum simulation tools,"** *Proceedings of CUG09*, Atlanta, GA, May 4-7, 2009 [pdf] [bibtex]

Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., Tomov, S.**"Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects,"** *Journal of Physics: Conference Series*, Vol. 180, 2009 [pdf] [bibtex]

Canning, A., Dongarra, J., Langou, J., Marques, O., Tomov, S., Voemel, C., Wang, L.-W.**"Interior State Computation of Nano Structures,"** *PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing*, Trondheim, Norway, May 13-16, 2008 [pdf] [bibtex]

Baboulin, M., Tomov, S., Dongarra, J.**"Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures,"** *PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing*, Trondheim, Norway, May 13-16, 2008 [bibtex]

Baboulin, M., Dongarra, J., Tomov, S.**"Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures,"** *University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200)*, May 6, 2008 [pdf] [bibtex]

Dongarra, J., Moore, S., Peterson, G., Tomov, S., Allred, J., Natoli, V., Richie, D.**"Exploring New Architectures in Accelerating CFD for Air Force Applications,"** *Proceedings of the DoD HPCMP User Group Conference*, Seattle, Washington, July 14-17, 2008 [pdf] [bibtex]

Tomov, S., Dongarra, J., Baboulin, M.**"Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,"** *University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210)*, October 17, 2008 [pdf] [bibtex]

Vomel, C., Tomov, S., Marques, O., Canning, A., Wang, L.-W., Dongarra, J.**"State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems,"** *Journal of Computational Physics*, Vol. 227, Issue 15, pp. 7113-7124, July, 2008 [bibtex]

Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, Jn., Luszczek, P., Tomov, S.**"Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,"** *in High Performance Computing and Grids in Action*, Grandinetti, L. eds. IOS Press, Amsterdam, 2008 [pdf] [bibtex]

Buttari, A., Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S.**"Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,"** *ACM Transactions on Mathematical Software*, Vol. 34, Number 4, pp. 17-22, 2008 [pdf] [bibtex]

Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, Jn., Luszczek, P., Tomov, S.**"Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,"** *In High Performance Computing and Grids in Action (to appear)*, Grandinetti, L. eds. IOS Press, Amsterdam, 2007 [pdf] [bibtex]

Vömel, C., Tomov, S., Wang, L-W., Marques, O., Dongarra, J.**"The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot,"** *Journal of Computational Physics*, Volume 223, pp. 774-782, 2007 [pdf] [bibtex]

Demmel, J., Dongarra, J., Parlett, B., Kahan, W., Gu, M., Bindel, D., Hida, Y., Li, X., Marques, O., Riedy, E. J., Voemel, C., Langou, J., Luszczek, P., Kurzak, J., Buttari, A., Langou, J., Tomov, S.**"Prospectus for the Next LAPACK and ScaLAPACK Libraries,"** *PARA 2006*, Umea, Sweden, June, 2006 [pdf] [bibtex]

Canning, A., Dongarra, J., Langou, J., Marques, O., Tomov, S., Voemel, C., Wang, L-W.**"Towards bulk based preconditioning for quantum dot computations,"** *IEEE/ACM Proceedings of HPCNano SC06 (to appear)*, 2006 [pdf] [bibtex]

Canning, A., Dongarra, J., Langou, J., Marques, O., Tomov, S., Voemel, C., Wang, L-W.**"Performance evaluation of eigensolvers in nano-structure computations,"** *IEEE/ACM Proceedings of HPCNano SC06 (to appear)*, 2006 [pdf] [bibtex]

Voemel, C., Tomov, S., Wang, L-W., Marques, O., Dongarra, J.**"The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot,"** *Journal of Computational Physics (submitted)*, 2006 [pdf] [bibtex]

Zunger, A., Franceschetti, A., Bester, G., Jones, W. B., Kim, K., Graf, P. A., Wang, L-W., Canning, A., Marques, O., Voemel, C., Dongarra, J., Langou, J., Tomov, S.**"Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures,"** *J. Phys.: Conf. Ser.*, Vol. 46, pp. 292-298, doi:10.1088/1742-6596/46/1/040, 2006 [pdf] [bibtex]

Tomov, S., Langou, J., Dongarra, J., Canning, A., Wang, L-W.**"Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,"** *International Journal of Computational Science and Engineering*, Volume 2, Number 3/4, pp. 205-212, 2006 [pdf] [bibtex]

Tomov, S., Langou, J., Canning, A., Wang, L.-W., Dongarra, J.**"Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures,"** *Proceedings of 5th International Conference on Computational Science (ICCS)*, Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. eds. Springer's Lecture Notes in Computer Science, Atlanta, GA, USA, Part III, pp. 317-325, May 22-25, 2005 [pdf] [bibtex]

Tomov, S., Langou, J., Canning, A., Wang, L-W., Dongarra, J.**"Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,"** *International Journal of Computational Science and Engineering (to appear)*, June, 2005 [pdf] [bibtex]

