“Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure,” 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
DTE: PaRSEC Enabled Libraries and Applications, 2021 Exascale Computing Project Annual Meeting, April 2021.
“Dynamic DAG scheduling under memory constraints for shared-memory platforms,” Int. J. of Networking and Computing, vol. 11, no. 1, pp. 27-49, 2021.
“DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models,” 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, IEEE, May 2020. DOI: 10.1109/CCGrid49817.2020.00-76
“Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs,” 22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
“Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,” 2020 IEEE High Performance Extreme Computing Virtual Conference: IEEE, September 2020.
“Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-12: University of Tennessee, August 2020.
“Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK,” Computer Modeling and Intelligent Systems CMIS-2020, Zaporizhzhia, March 2020.
DTE: PaRSEC Enabled Libraries and Applications (Poster), Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
DTE: PaRSEC Systems and Interfaces (Poster), Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
“Data Logistics: Toolkit and Applications,” 5th EAI International Conference on Smart Objects and Technologies for Social Good, Valencia, Spain, September 2019.
“Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.
“Distributed-Memory Lattice H-Matrix Factorization,” The International Journal of High Performance Computing Applications, vol. 33, issue 5, pp. 1046–1063, August 2019. DOI: 10.1177/1094342019861139
Does your tool support PAPI SDEs yet?, Tahoe City, CA, 13th Scalable Tools Workshop, July 2019.
“Data Movement Interfaces to Support Dataflow Runtimes,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-03: University of Tennessee, May 2018.
“The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,” International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018. DOI: 10.1007/978-3-319-93698-7_45
“Distributed Termination Detection for HPC Task-Based Environments,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-14: University of Tennessee, June 2018.
“Do moldable applications perform better on failure-prone HPC platforms?,” 11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
“Dataflow Programming Paradigms for Computational Chemistry Methods,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-01, Knoxville, TN, University of Tennessee, May 2017.
“Design and Implementation of the PULSAR Programming System for Large Scale Computing,” Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017. DOI: 10.14529/jsfi170101
“The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems,” International Conference on Computational Science (ICCS 2017), Zürich, Switzerland, Elsevier, June 2017. DOI: 10.1016/j.procs.2017.05.138
“Designing SLATE: Software for Linear Algebra Targeting Exascale,” SLATE Working Notes, no. 03, ICL-UT-17-06: Innovative Computing Laboratory, University of Tennessee, October 2017.
“Dynamic Task Discovery in PaRSEC- A data-flow task-based Runtime,” ScalA17, Denver, ACM, September 2017. DOI: 10.1145/3148226.3148233
“Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures,” Lecture Notes in Computer Science, vol. 9573: Springer International Publishing, pp. 86-95, September 2015, 2016. DOI: 10.1007/978-3-319-32149-3_9
“On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,” The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, Chicago, IL, IEEE, May 2016.
“Domain Overlap for Iterative Sparse Triangular Solves on GPUs,” Software for Exascale Computing - SPPEXA, vol. 113: Springer International Publishing, pp. 527–545, September 2016. DOI: 10.1007/978-3-319-40528-5_24
“A Data Flow Divide and Conquer Algorithm for Multicore Architecture,” 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation, Oak Ridge, TN, Joint Institute for Computational Sciences Seminar Series, Presentation, September 2015.
“On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors,” ISC High Performance 2015, Frankfurt, Germany, July 2015.
“Design for a Soft Error Resilient Dynamic Task-based Runtime,” 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
“Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES,” 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, November 2014.
“Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,” Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Design for a Soft Error Resilient Dynamic Task-based Runtime,” ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, November 2014.
“Designing LU-QR Hybrid Solvers for Performance and Stability,” IPDPS 2014, Phoenix, AZ, IEEE, May 2014. DOI: 10.1109/IPDPS.2014.108
“Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,” The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,” Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, May 2014.
“Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,” Scalable Computing and Communications: Theory and Practice: John Wiley & Sons, pp. 699-735, March 2013.
“Designing LU-QR hybrid solvers for performance and stability,” University of Tennessee Computer Science Technical Report (also LAWN 282), no. ut-eecs-13-719: University of Tennessee, October 2013.
“Diagnosis and Optimization of Application Prefetching Performance,” Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013. DOI: 10.1145/2464996.2465014
“Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,” University of Tennessee Computer Science Technical Report, no. ut-cs-13-713, July 2013.
“DAGuE: A generic distributed DAG Engine for High Performance Computing,” Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 2012.
“Dense Linear Algebra on Accelerated Multicore Hardware,” High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 2012.
“Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,” SIAM Journal on Scientific Computing, vol. 34(2), pp. C70-C82, April 2012.
“DAGuE: A Generic Distributed DAG Engine for High Performance Computing,” Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1151-1158, 2011.
“The Design of an Auto-tuning I/O Framework on Cray XT5 System,” Cray Users Group Conference (CUG'11) (Best Paper Finalist), Fairbanks, Alaska, May 2011.
“DAGuE: A generic distributed DAG engine for high performance computing,” Innovative Computing Laboratory Technical Report, no. ICL-UT-10-01, April 2010.
“Dense Linear Algebra for Hybrid GPU-based Systems,” Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.
Dense Linear Algebra Solvers for Multicore with GPU Accelerators, Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.
“Dense Linear Algebra Solvers for Multicore with GPU Accelerators,” Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, Atlanta, GA, pp. 1-8, 2010. DOI: 10.1109/IPDPSW.2010.5470941