Publications
A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 14, pp. 3702-3723, September 2015.
DOI: 10.1002/cpe.3403 (8.16 MB)
“Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores,”
International conference on Supercomputing, Munich, Germany, ACM, pp. 333-342, June 2014.
DOI: 10.1145/2597652.2597670 (2.9 MB)
“Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,”
26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.
(5.88 MB)
“A Scalable Framework for Heterogeneous GPU-Based Clusters,”
The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), Pittsburgh, PA, USA, ACM, June 2012.
(3.39 MB)
“Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250), June 2011.
(5.93 MB)
“Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems,”
SC'10, New Orleans, LA, ACM SIGARCH/ IEEE Computer Society, November 2010.
(3.42 MB)
“Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems,”
University of Tennessee Computer Science Technical Report, vol. –10-653, April 2010.
(3.42 MB)
“Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems,”
IEEE Cluster 2009, New Orleans, August 2009.
(395.53 KB)
“Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems,”
International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09), Portland, OR, November 2009.
(502.49 KB)
“A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling,”
The International Conference on Computational Science 2009 (ICCS 2009), vol. 5544, Baton Rouge, LA, pp. 195-204, May 2009.
(228.45 KB)
“Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms,”
University of Tennessee Computer Science Technical Report, UT-CS-08-626, January 2008.
(650.75 KB)
“Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications,”
Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming, vol. 4315: Springer Berlin / Heidelberg, 00 2008.
(350.9 KB)
“Feedback-Directed Thread Scheduling with Memory Considerations,”
IEEE International Symposium on High Performance Distributed Computing, Monterey Bay, CA, June 2007.
(297.24 KB)
“L2 Cache Modeling for Scientific Applications on Chip Multi-Processors,”
Proceedings of the 2007 International Conference on Parallel Processing, Xi'an, China, IEEE Computer Society, January 2007.
(654.11 KB)
“Experiments with Strassen's Algorithm: From Sequential to Parallel,”
18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted), Dallas, Texas, January 2006.
(514.33 KB)
“Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors,”
University of Tennessee Computer Science Technical Report, no. UT-CS-06-583, January 2006.
(652.93 KB)
“Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications,”
Second International Workshop on OpenMP, Reims, France, January 2006.
(350.9 KB)
“Automatic Experimental Analysis of Communication Patterns in Virtual Topologies,”
In Proceedings of the International Conference on Parallel Processing, Oslo, Norway, IEEE Computer Society, June 2005.
(227.13 KB)
“An Algebra for Cross-Experiment Performance Analysis,”
2004 International Conference on Parallel Processing (ICCP-04), Montreal, Quebec, Canada, August 2004.
(166.12 KB)
“Automating the Large-Scale Collection and Analysis of Performance,”
5th LCI International Conference on Linux Clusters: The HPC Revolution, Austin, Texas, May 2004.
(511.6 KB)
“CUBE User Manual,”
ICL Technical Report, no. ICL-UT-04-01, February 2004.
(429.12 KB)
“