Publications
“Virtual Systolic Array for QR Decomposition,”
15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), Boston, MA, IEEE, May 2013.
“Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic,”
2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, IEEE, September 2017.
“SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,”
International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM, November 2019.
“Search Space Pruning Constraints Visualization,”
VISSOFT'14: 2nd IEEE Working Conference on Software Visualization, Victoria, BC, Canada, IEEE, September 2014.
“Search Space Generation and Pruning System for Autotuners,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
“Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds,”
IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017), Boston, MA, IEEE, December 2017.
“Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
“Mixed-precision Block Gram Schmidt Orthogonalization,”
6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Austin, TX, ACM, November 2015.
“Massively Parallel Automated Software Tuning,”
48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, ACM Press, August 2019.
“An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
Supercomputing 2013, Denver, CO, November 2013.
“Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,”
Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“A Data Flow Divide and Conquer Algorithm for Multicore Architecture,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
“Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,”
Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017.
“Access-averse Framework for Computing Low-rank Matrix Approximations,”
First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, October 2014.
“Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs,”
2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, IEEE, November 2015.
“Scalable Dense Linear Algebra on Heterogeneous Hardware,”
HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.
“Bringing High Performance Computing to Big Data Algorithms,”
Handbook of Big Data Technologies: Springer, 2017.
“Accelerating Numerical Dense Linear Algebra Calculations with GPUs,”
Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.