ICL Publications
2025
“Evolution of the computational science community: The dynamics of topics and collaborations in 24 years of ICCS and JoCS publications,”
Journal of Computational Science, vol. 89, July 2025.
DOI: 10.1016/j.jocs.2025.102609
“Accelerating Homotopy Continuation with GPUs: Application to Trifocal Pose Estimation,”
2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Milano, Italy, IEEE, July 2025.
DOI: 10.1109/IPDPS64566.2025.00110
Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic,
arXiv, June 2025.
SpikeRL: A Scalable and Energy-efficient Framework for Deep Spiking Reinforcement Learning,
arXiv, February 2025.
2024
“Advancements of PAPI for the exascale generation,”
The International Journal of High Performance Computing Applications, December 2024.
DOI: 10.1177/10943420241303884
Hardware Trends Impacting Floating-Point Computations In Scientific Applications,
arXiv, December 2024.
“Accelerating Fusion Plasma Collision Operator Solves with Portable Batched Iterative Solvers on GPUs,”
ISC High Performance 2024 International Workshops, vol. 15058, Hamburg, Germany, Springer, Cham, pp. 127 - 140, December 2024.
DOI: 10.1007/978-3-031-73716-9
“Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications,”
Future Generation Computer Systems, vol. 160, pp. 359 - 374, November 2024.
DOI: 10.1016/j.future.2024.06.004
Interface for Sparse Linear Algebra Operations,
November 2024.
DOI: 10.48550/arXiv.2411.13259
“PaRSEC: Scalability, flexibility, and hybrid architecture support for task-based applications in ECP,”
The International Journal of High Performance Computing Applications, October 2024.
DOI: 10.1177/10943420241290520
“Evolution of the SLATE linear algebra library,”
The International Journal of High Performance Computing Applications, September 2024.
DOI: 10.1177/10943420241286531
“Numerical eigen-spectrum slicing, accurate orthogonal eigen-basis, and mixed-precision eigenvalue refinement using OpenMP data-dependent tasks and accelerator offload,”
The International Journal of High Performance Computing Applications, September 2024.
DOI: 10.1177/10943420241281050
“Ginkgo - A math library designed to accelerate Exascale Computing Project science applications,”
The International Journal of High Performance Computing Applications, August 2024.
DOI: 10.1177/10943420241268323
“The co-evolution of computational physics and high-performance computing,”
Nature Reviews Physics, August 2024.
DOI: 10.1038/s42254-024-00750-z
“A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?,”
Future Generation Computer Systems, July 2024.
DOI: 10.1016/j.future.2024.07.022
“Multi-GPU work sharing in a task-based dataflow programming model,”
Future Generation Computer Systems, vol. 156, pp. 313 - 324, July 2024.
DOI: 10.1016/j.future.2024.03.017
“XaaS: Acceleration as a Service to Enable Productive High-Performance Cloud Computing,”
Computing in Science & Engineering, vol. 26, issue 3, pp. 40 - 51, July 2024.
DOI: 10.1109/MCSE.2024.3382154
“Taking the MPI standard and the open MPI library to exascale,”
The International Journal of High Performance Computing Applications, July 2024.
DOI: 10.1177/10943420241265936
“Computation at the Cutting Edge of Science,”
Journal of Computational Science, June 2024.
DOI: 10.1016/j.jocs.2024.102379
“MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures,”
The International Journal of High Performance Computing Applications, June 2024.
DOI: 10.1177/10943420241261960
“Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Applications,”
2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, IEEE, May 2024.
DOI: 10.1109/IPDPSW63119.2024.00193
“Automated Data Analysis for Defining Performance Metrics from Raw Hardware Events,”
2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, IEEE, May 2024.
DOI: 10.1109/IPDPSW63119.2024.00134
“Then and Now: Improving Software Portability, Productivity, and 100× Performance,”
Computing in Science & Engineering, pp. 1 - 10, April 2024.
DOI: 10.1109/MCSE.2024.3387302
CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT),
arXiv, February 2024.
“Economical Quasi-Newton Unitary Optimization of Electronic Orbitals,”
Physical Chemistry Chemical Physics, December 2023, 2024.
DOI: 10.1039/D3CP05557D
“Evaluating PaRSEC Through Matrix Computations in Scientific Applications,”
Asynchronous Many-Task Systems and Applications - Second International Workshop, WAMTA 2024, Knoxville, TN, USA, February 14-16, 2024, Proceedings, vol. 14626: Springer, pp. 22–33, 2024.
DOI: 10.1007/978-3-031-61763-8_3
“Towards Scalable and Efficient Spiking Reinforcement Learning for Continuous Control Tasks,”
2024 International Conference on Neuromorphic Systems (ICONS), Arlington, VA, USA, IEEE, 2024.
DOI: 10.1109/ICONS62911.2024.00057
2023
“Using Ginkgo's memory accessor for improving the accuracy of memory‐bound low precision BLAS,”
Software: Practice and Experience, vol. 53, issue 1, pp. 81 - 98, January 2023.
DOI: 10.1002/spe.3041
Generalizing Random Butterfly Transforms to Arbitrary Matrix Sizes,
arXiv, December 2023.
“Reducing Data Motion and Energy Consumption of Geospatial Modeling Applications Using Automated Precision Conversion,”
2023 IEEE International Conference on Cluster Computing (CLUSTER), Santa Fe, NM, USA, IEEE, November 2023.
DOI: 10.1109/CLUSTER52292.2023.00035
“GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3624247
“Parallel Symbolic Cholesky Factorization,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3624253
“Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators,”
SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3624248
“Elastic deep learning through resilient collective operations,”
SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023.
DOI: 10.1145/3624062.3626080
“Performance Insights into Device-initiated RMA Using Kokkos Remote Spaces,”
2023 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), Santa Fe, NM, USA, IEEE, November 2023.
DOI: 10.1109/CLUSTERWorkshops61457.2023.00028
“Direct Determination of Optimal Real-Space Orbitals for Correlated Electronic Structure of Molecules,”
Journal of Chemical Theory and Computation, vol. 19, issue 20, pp. 7230 - 7241, October 2023.
DOI: 10.1021/acs.jctc.3c00732
Earth Virtualization Engines - A Technical Perspective,
September 2023.
“Improving the Scaling of an Asynchronous Many-Task Runtime with a Lightweight Communication Engine,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, September 2023.
DOI: 10.1145/3605573.3605642
“Synchronizing MPI Processes in Space and Time,”
EUROMPI '23: 30th European MPI Users' Group Meeting, Bristol, United Kingdom, ACM, September 2023.
DOI: 10.1145/3615318.3615325
“Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors,”
ACM Transactions on Mathematical Software, vol. 49, issue 3, pp. 1 - 29, September 2023.
DOI: 10.1145/3595178
“When to checkpoint at the end of a fixed-length reservation?,”
Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop, Denver, United States, August 2023.
“Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors,”
Concurrency and Computation: Practice and Experience, August 2023.
DOI: 10.1002/cpe.7871
“Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements,”
2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), St. Petersburg, Florida, IEEE, August 2023.
DOI: 10.1109/IPDPSW59300.2023.00070
“O(N) distributed direct factorization of structured dense matrices using runtime systems,”
52nd International Conference on Parallel Processing (ICPP 2023), Salt Lake City, Utah, ACM, August 2023.
DOI: 10.1145/3605573.3605606
“Three-precision algebraic multigrid on GPUs,”
Future Generation Computer Systems, July 2023.
DOI: 10.1016/j.future.2023.07.024
“Using Additive Modifications in LU Factorization Instead of Pivoting,”
37th ACM International Conference on Supercomputing (ICS'23), Orlando, FL, ACM, June 2023.
DOI: 10.1145/3577193.3593731
Memory Traffic and Complete Application Profiling with PAPI Multi-Component Measurements,
St. Petersburg, FL, 28th HIPS Workshop, May 2023.
“Mixed Precision Algebraic Multigrid on GPUs,”
Parallel Processing and Applied Mathematics (PPAM 2022), vol. 13826, Cham, Springer International Publishing, April 2023.
DOI: 10.1007/978-3-031-30442-2_9
“Combining multitask and transfer learning with deep Gaussian processes for autotuning-based performance engineering,”
The International Journal of High Performance Computing Applications, March 2023.
DOI: 10.1177/10943420231166365
“Revisiting I/O bandwidth-sharing strategies for HPC applications,”
INRIA Research Report, no. RR-9502: INRIA, March 2023.
“MPI Continuations And How To Invoke Them,”
Sustained Simulation Performance 2021, Cham, Springer International Publishing, pp. 67 - 83, February 2023.
DOI: 10.1007/978-3-031-18046-0_5
“AI Benchmarking for Science: Efforts from the MLCommons Science Working Group,”
Lecture Notes in Computer Science, vol. 13387: Springer International Publishing, pp. 47 - 64, January 2023.
DOI: 10.1007/978-3-031-23220-6_4
“Preconditioners for Batched Iterative Linear Solvers on GPUs,”
Smoky Mountains Computational Sciences and Engineering Conference, vol. 1690: Springer Nature Switzerland, pp. 38 - 53, January 2023.
DOI: 10.1007/978-3-031-23606-8_3
“HPC Forecast: Cloudy and Uncertain,”
Communications of the ACM, vol. 66, issue 2, pp. 82 - 90, January 2023.
DOI: 10.1145/3552309
“PAQR: Pivoting Avoiding QR factorization,”
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, IEEE, 2023.
DOI: 10.1109/IPDPS54959.2023.00040
2022
“Composition of Algorithmic Building Blocks in Template Task Graphs,”
2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), Dallas, TX, USA, IEEE, January 2023, 2022.
DOI: 10.1109/PAW-ATM56565.2022.00008
“Implicit Actions and Non-blocking Failure Recovery with MPI,”
2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, TX, USA, IEEE, January 2023, 2022.
DOI: 10.1109/FTXS56515.2022.00009
“Performance Application Programming Interface,”
Accelerated Computing with HIP: Sun, Baruah and Kaeli, December 2022.
“A Python Library for Matrix Algebra on GPU and Multicore Architectures,”
2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, IEEE, December 2022.
DOI: 10.1109/MASS56207.2022.00121
“Evaluations of molecular modeling and machine learning for predictive capabilities in binding of lanthanum and actinium with carboxylic acids,”
Journal of Radioanalytical and Nuclear Chemistry, December 2022.
DOI: 10.1007/s10967-022-08620-7
“The evolution of mathematical software,”
Communications of the ACM, vol. 65, issue 12, pp. 66 - 72, December 2022.
DOI: 10.1145/3554977
“Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era,”
2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Dallas, TX, USA, IEEE, November 2022.
DOI: 10.1109/P3HPC56579.2022.00009
“Randomized Numerical Linear Algebra: A Perspective on the Field with an Eye to Software,”
University of California, Berkeley EECS Technical Report, no. UCB/EECS-2022-258: University of California, Berkeley, November 2022.
DOI: 10.48550/arXiv.2302.11474
“Reshaping Geostatistical Modeling and Prediction for Extreme-Scale Environmental Applications,”
2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22), Dallas, TX, IEEE Press, November 2022.
“Threshold Pivoting for Dense LU Factorization,”
ScalAH22: 13th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, Dallas, Texas, IEEE, November 2022.
DOI: 10.1109/ScalAH56622.2022.00010
“Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers,”
2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC22), Dallas, TX, IEEE Computer Society, pp. 354-367, November 2022.
“Extending MAGMA Portability with OneAPI,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, November 2022.
Extending MAGMA Portability with OneAPI,
Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), ACM Student Research Competition, November 2022.
“Providing performance portable numerics for Intel GPUs,”
Concurrency and Computation: Practice and Experience, vol. 17, October 2022.
DOI: 10.1002/cpe.7400
“Deep Gaussian process with multitask and transfer learning for performance optimization,”
2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7, September 2022.
DOI: 10.1109/HPEC55821.2022.9926396
“Lossy all-to-all exchange for accelerating parallel 3-D FFTs on hybrid architectures with GPUs,”
2022 IEEE International Conference on Cluster Computing (CLUSTER), pp. 152-160, September 2022.
DOI: 10.1109/CLUSTER51413.2022.00029
“Pushing the Boundaries of Small Tasks: Scalable Low-Overhead Data-Flow Programming in TTG,”
2022 IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, IEEE, September 2022.
DOI: 10.1109/CLUSTER51413.2022.00026
“Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach,”
2022 IEEE International Conference on Cluster Computing (CLUSTER 2022), Heidelberg, Germany, September 2022.
“Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance,”
2022 IEEE High Performance Extreme Computing Conference (HPEC): IEEE, September 2022.
DOI: 10.1109/HPEC55821.2022.9926401
“Evaluating Data Redistribution in PaRSEC,”
IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 8, pp. 1856-1872, August 2022.
DOI: 10.1109/TPDS.2021.3131657
“Checkpointing à la Young/Daly: An Overview,”
IC3-2022: Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, Noida, India, ACM Press, pp. 701-710, August 2022.
DOI: 10.1145/3549206
“Performance Analysis of Parallel FFT on Large Multi-GPU Systems,”
2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lyon, France, IEEE, August 2022.
DOI: 10.1109/IPDPSW55747.2022.00072
“A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), July 2022.
DOI: 10.1109/IPDPS53621.2022.00047
“Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment,”
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, IEEE, July 2022.
DOI: 10.1109/IPDPS53621.2022.00086
“Computational science for a better future,”
Journal of Computational Science, vol. 62, pp. 101745, July 2022.
DOI: 10.1016/j.jocs.2022.101745
“Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations,”
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, IEEE, July 2022.
DOI: 10.1109/IPDPS53621.2022.00024
“Analysis of the Communication and Computation Cost of FFT Libraries towards Exascale,”
ICL Technical Report, no. ICL-UT-22-07: Innovative Computing Laboratory, July 2022.
“Porting Sparse Linear Algebra to Intel GPUs,”
Euro-Par 2021: Parallel Processing Workshops, vol. 13098, Lisbon, Portugal, Springer International Publishing, pp. 57 - 68, June 2022.
DOI: 10.1007/978-3-031-06156-1_5
“PAQR: Pivoting Avoiding QR factorization,”
ICL Technical Report, no. ICL-UT-22-06, June 2022.
“Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units,”
Concurrency and Computation: Practice and Experience, vol. 34, issue 14, June 2022.
DOI: 10.1002/cpe.6515
“Batch QR Factorization on GPUs: Design, Optimization, and Tuning,”
Lecture Notes in Computer Science, vol. 13350, Cham, Springer International Publishing, June 2022.
DOI: 10.1007/978-3-031-08751-6_5
“Report on the Oak Ridge National Laboratory's Frontier System,”
ICL Technical Report, no. ICL-UT-22-05, May 2022.
“Mixed precision and approximate 3D FFTs: Speed for accuracy trade-off with GPU-aware MPI and run-time data compression,”
ICL Technical Report, no. ICL-UT-22-04, May 2022.
“Compressed basis GMRES on high-performance graphics processing units,”
The International Journal of High Performance Computing Applications, May 2022.
DOI: 10.1177/10943420221115140
“Accelerating Geostatistical Modeling and Prediction With Mixed-Precision Computations: A High-Productivity Approach With PaRSEC,”
IEEE Transactions on Parallel and Distributed Systems, vol. 33, issue 4, pp. 964 - 976, April 2022.
DOI: 10.1109/TPDS.2021.3084071
“Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing,”
ACM Transactions on Mathematical Software, vol. 48, issue 1, pp. 1 - 33, March 2022.
DOI: 10.1145/3480935
“Reinventing High Performance Computing: Challenges and Opportunities,”
ICL Technical Report, no. ICL-UT-22-03, March 2022.
“Resiliency in numerical algorithm design for extreme scale simulations,”
The International Journal of High Performance Computing Applications, vol. 36, issue 2, pp. 251 - 285, March 2022.
DOI: 10.1177/10943420211055188
“Using long vector extensions for MPI reductions,”
Parallel Computing, vol. 109, pp. 102871, March 2022.
DOI: 10.1016/j.parco.2021.102871
“FFT Benchmark Performance Experiments on Systems Targeting Exascale,”
ICL Technical Report, no. ICL-UT-22-02, March 2022.
“Optimal Checkpointing Strategies for Iterative Applications,”
IEEE Transactions on Parallel and Distributed Systems, vol. 33, issue 3, pp. 507-522, March 2022.
DOI: 10.1109/TPDS.2021.3099440
“OpenMP application experiences: Porting to accelerated nodes,”
Parallel Computing, vol. 109, March 2022.
DOI: 10.1016/j.parco.2021.102856
“Ginkgo—A math library designed for platform portability,”
Parallel Computing, vol. 111, pp. 102902, February 2022.
DOI: 10.1016/j.parco.2022.102902
“Comparing Distributed Termination Detection Algorithms for Modern HPC Platforms,”
International Journal of Networking and Computing, vol. 12, issue 1, pp. 26 - 46, January 2022.
DOI: 10.15803/ijnc.12.1_26
“Approximate Computing for Scientific Applications,”
Approximate Computing Techniques, 322: Springer International Publishing, pp. 415 - 465, January 2022.
DOI: 10.1007/978-3-030-94705-7_14
“Communication Avoiding LU with Tournament Pivoting in SLATE,”
SLATE Working Notes, no. 18, ICL-UT-22-01, January 2022.
“Prediction of Optimal Solvers for Sparse Linear Systems Using Deep Learning,”
2022 SIAM Conference on Parallel Processing for Scientific Computing (PP), Philadelphia, PA, Society for Industrial and Applied Mathematics, pp. 14 - 24.
DOI: 10.1137/1.9781611977141.2
2021
“Callback-based completion notification using MPI Continuations,”
Parallel Computing, pp. 102793, May 2021.
DOI: 10.1016/j.parco.2021.102793
“Materials fingerprinting classification,”
Computer Physics Communications, pp. 108019, May 2021.
DOI: 10.1016/j.cpc.2021.108019
“Accelerating Multi-Process Communication for Parallel 3-D FFT,”
2021 Workshop on Exascale MPI (ExaMPI), St. Louis, MO, USA, IEEE, December 2021.
DOI: 10.1109/ExaMPI54564.2021.00011
“An international survey on MPI users,”
Parallel Computing, vol. 108, December 2021.
DOI: 10.1016/j.parco.2021.102853
“Rare Earth Elements and Critical Materials: Uses and Availability,”
Rare Earth Elements and Actinides: Progress in Computational Science Applications, vol. 1388, Washington, DC, American Chemical Society, pp. 63-74, October 2021.
DOI: 10.1021/bk-2021-1388.ch003
“Rare Earth Elements and Actinides: Progress in Computational Science Applications,”
ACS Symposium Series, vol. 1388, Washington, DC, American Chemical Society, October 2021.
DOI: 10.1021/bk-2021-1388
“An Introduction to High Performance Computing and Its Intersection with Advances in Modeling Rare Earth Elements and Actinides,”
Rare Earth Elements and Actinides: Progress in Computational Science Applications, vol. 1388, Washington, DC, American Chemical Society, pp. 3-53, October 2021.
DOI: 10.1021/bk-2021-1388.ch001
“A More Portable HeFFTe: Implementing a Fallback Algorithm for Scalable Fourier Transforms,”
ICL Technical Report, no. ICL-UT-21-04: University of Tennessee, August 2021.
“Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices,”
ICL Technical Report, no. ICL-UT-21-05, August 2021.
MAGMA: Evolution and Revolution,
Knoxville, TN, ICL Lunch Talk Seminar, July 2021.
“Interim Report on Benchmarking FFT Libraries on High Performance Systems,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-21-03: University of Tennessee, July 2021.
“Accelerating Restarted GMRES with Mixed Precision Arithmetic,”
IEEE Transactions on Parallel and Distributed Systems, June 2021.
DOI: 10.1109/TPDS.2021.3090757
“Leveraging PaRSEC Runtime Support to Tackle Challenging 3D Data-Sparse Matrix Problems,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
“Distributed-Memory Multi-GPU Block-Sparse Tensor Contraction for Electronic Structure,”
35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021), Portland, OR, IEEE, May 2021.
Ginkgo: A Sparse Linear Algebra Library for HPC,
2021 ECP Annual Meeting, April 2021.
“P1673R3: A Free Function Linear Algebra Interface Based on the BLAS,”
ISO JTC1 SC22 WG21, no. P1673R3: ISO, April 2021.
DTE: PaRSEC Enabled Libraries and Applications,
2021 Exascale Computing Project Annual Meeting, April 2021.
“SLATE Performance Improvements: QR and Eigenvalues,”
SLATE Working Notes, no. 17, ICL-UT-21-02, April 2021.
“SLATE Port to AMD and Intel Platforms,”
SLATE Working Notes, no. 16, ICL-UT-21-01, April 2021.
“A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines,”
ACM Transactions on Mathematical Software (TOMS), vol. 47, no. 3, pp. 1–23, 2021.
DOI: 10.1145/3431921
Lecture Notes in Computer Science: High Performance Computing,
vol. 12761: Springer International Publishing, 2021.
DOI: 10.1007/978-3-030-90539-2
“Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems,”
IEEE Access, 2021.
DOI: 10.1109/ACCESS.2021.3106054
“Resilient scheduling heuristics for rigid parallel jobs,”
Int. J. of Networking and Computing, vol. 11, no. 1, pp. 2-26, 2021.
“Effortless Monitoring of Arithmetic Intensity with PAPI’s Counter Analysis Toolkit,”
Tools for High Performance Computing 2018/2019: Springer, pp. 195–218, 2021.
DOI: 10.1007/978-3-030-66057-4_11
“Task-graph scheduling extensions for efficient synchronization and communication,”
Proceedings of the ACM International Conference on Supercomputing, pp. 88–101, 2021.
DOI: 10.1145/3447818.3461616
“Budget-aware scheduling algorithms for scientific workflows with stochastic task weights on IaaS Cloud platforms,”
Concurrency and Computation: Practice and Experience, vol. 33, no. 17, pp. e6065, 2021.
DOI: 10.1002/cpe.6065
“Evaluating Task Dropping Strategies for Overloaded Real-Time Systems (Work-In-Progress),”
42nd Real Time Systems Symposium (RTSS): IEEE Computer Society Press, 2021.
Accelerating FFT towards Exascale Computing,
NVIDIA GPU Technology Conference (GTC2021), 2021.
“Revisiting Credit Distribution Algorithms for Distributed Termination Detection,”
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): IEEE, pp. 611–620, 2021.
DOI: 10.1109/IPDPSW52791.2021.00095
“A survey of numerical linear algebra methods utilizing mixed-precision arithmetic,”
The International Journal of High Performance Computing Applications, vol. 35, no. 4, pp. 344–369, 2021.
DOI: 10.1177/10943420211003313
“Translational process: Mathematical software perspective,”
Journal of Computational Science, vol. 52, pp. 101216, 2021.
DOI: 10.1016/j.jocs.2020.101216
“Max-Stretch Minimization on an Edge-Cloud Platform,”
IPDPS'2021, the 35th IEEE International Parallel and Distributed Processing Symposium: IEEE Computer Society Press, 2021.
“Scalability Issues in FFT Computation,”
International Conference on Parallel Computing Technologies: Springer, pp. 279–287, 2021.
DOI: 10.1007/978-3-030-86359-3_21
“20 years of computational science: Selected papers from 2020 International Conference on Computational Science,”
Journal of Computational Science, vol. 53, pp. 101395–101395, 2021.
DOI: 10.1016/j.jocs.2021.101395
“libCEED: Fast algebra for high-order element-based discretizations,”
Journal of Open Source Software, vol. 6, no. 63, pp. 2945, 2021.
DOI: 10.21105/joss.02945
“Quo Vadis MPI RMA? Towards a More Efficient Use of MPI One-Sided Communication,”
EuroMPI'21, Garching, Munich, Germany, 2021.
“Efficient exascale discretizations: High-order finite element methods,”
The International Journal of High Performance Computing Applications, pp. 10943420211020803, 2021.
DOI: 10.1177/10943420211020803
“GPU algorithms for Efficient Exascale Discretizations,”
Parallel Computing, vol. 108, pp. 102841, 2021.
DOI: 10.1016/j.parco.2021.102841
“Dynamic DAG scheduling under memory constraints for shared-memory platforms,”
Int. J. of Networking and Computing, vol. 11, no. 1, pp. 27-49, 2021.
2020
“Prospectus for the Next LAPACK and ScaLAPACK Libraries: Basic ALgebra LIbraries for Sustainable Technology with Interdisciplinary Collaboration (BALLISTIC),”
LAPACK Working Notes, no. 297, ICL-UT-20-07: University of Tennessee.
Performance Application Programming Interface for Extreme-Scale Environments (PAPI-EX) (Poster),
Seattle, WA, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, 2020.
Integrating Deep Learning in Domain Science at Exascale (MagmaDNN),
virtual, DOD HPCMP seminar, December 2020.
“MAGMA Templates for Scalable Linear Algebra on Emerging Architectures,”
The International Journal of High Performance Computing Applications, vol. 34, issue 6, pp. 645-658, November 2020.
DOI: 10.1177/1094342020938421
“Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance,”
International Conference for High Performance Computing Networking, Storage, and Analysis (SC20): ACM, November 2020.
“Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations,”
2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS): IEEE, November 2020.
“Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), Atlanta, GA, IEEE, November 2020.
“Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems,”
Proceedings of the Royal Society A, vol. 476, issue 2243, November 2020.
DOI: 10.1098/rspa.2020.0110
“The Template Task Graph (TTG) - An Emerging Practical Dataflow Programming Paradigm for Scientific Simulation at Extreme Scale,”
2020 IEEE/ACM 5th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2): IEEE, November 2020.
DOI: 10.1109/ESPM251964.2020.00011
“Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions,”
Journal of Parallel and Distributed Computing, vol. 145, pp. 188-201, November 2020.
DOI: 10.1016/j.jpdc.2020.07.001
“High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,”
2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA): IEEE, November 2020.
heFFTe: Highly Efficient FFT for Exascale (Poster),
NVIDIA GPU Technology Conference (GTC2020), October 2020.
“A Set of Batched Basic Linear Algebra Subprograms,”
ACM Transactions on Mathematical Software, October 2020.
“SLATE Performance Report: Updates to Cholesky and LU Factorizations,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-14: University of Tennessee, October 2020.
“Using Advanced Vector Extensions AVX-512 for MPI Reduction,”
EuroMPI/USA '20: 27th European MPI Users' Group Meeting, Austin, TX, September 2020.
DOI: 10.1145/3416315.3416316
“Effortless Monitoring of Arithmetic Intensity with PAPI's Counter Analysis Toolkit,”
13th International Workshop on Parallel Tools for High Performance Computing, Dresden, Germany, Springer International Publishing, September 2020.
“Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,”
2020 IEEE High Performance Extreme Computing Virtual Conference: IEEE, September 2020.
“Mixed Precision LU Factorization on GPU Tensor Cores: Reducing Data Movement and Memory Footprint,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-13: University of Tennessee, September 2020.
“Scalable Data Generation for Evaluating Mixed-Precision Solvers,”
2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, IEEE, September 2020.
DOI: 10.1109/HPEC43674.2020.9286145
“HAN: A Hierarchical AutotuNed Collective Communication Framework,”
IEEE Cluster Conference, Kobe, Japan, Best Paper Award, IEEE Computer Society Press, September 2020.
A Report of the MPI International Survey (Poster),
Austin, TX, EuroMPI/USA '20: 27th European MPI Users' Group Meeting, September 2020.
“Translational Process: Mathematical Software Perspective,”
Journal of Computational Science, September 2020.
DOI: 10.1016/j.jocs.2020.101216
“Predicting MPI Collective Communication Performance Using Machine Learning,”
2020 IEEE International Conference on Cluster Computing (CLUSTER), Kobe, Japan, IEEE, September 2020.
DOI: 10.1109/CLUSTER49012.2020.00036
Using Advanced Vector Extensions AVX-512 for MPI Reduction (Poster)
, Austin, TX, EuroMPI/USA '20: 27th European MPI Users' Group Meeting, September 2020.
(708.68 KB)

Flexible Data Redistribution in a Task-Based Runtime System,”
IEEE International Conference on Cluster Computing (Cluster 2020), Kobe, Japan, IEEE, September 2020.
DOI: 10.1109/CLUSTER49012.2020.00032
(354.8 KB)
“

Evaluating Asynchronous Schwarz Solvers on GPUs,”
International Journal of High Performance Computing Applications, August 2020.
DOI: 10.1177/1094342020946814
“
Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-12: University of Tennessee, August 2020.
(476.36 KB)
“

Integrating Deep Learning in Domain Sciences at Exascale,”
2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020), August 2020.
“
Ginkgo: A High Performance Numerical Linear Algebra Library,”
Journal of Open Source Software, vol. 5, issue 52, August 2020.
DOI: 10.21105/joss.02260
(721.84 KB)
“

ASCR@40: Four Decades of Department of Energy Leadership in Advanced Scientific Computing Research
: Advanced Scientific Computing Advisory Committee (ASCAC), US Department of Energy, August 2020.
Improving the Performance of the GMRES Method using Mixed-Precision Techniques,”
Smoky Mountains Computational Sciences & Engineering Conference (SMC2020), August 2020.
(600.33 KB)
“

Translational Process: Mathematical Software Perspective,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-11, August 2020.
(752.59 KB)
“

Multiprecision Block-Jacobi for Iterative Triangular Solves,”
European Conference on Parallel Processing (Euro-Par 2020): Springer, August 2020.
DOI: 10.1007/978-3-030-57675-2_34
“
Robustness of the Young/Daly Formula for Stochastic Iterative Applications,”
49th International Conference on Parallel Processing (ICPP 2020), Edmonton, AB, Canada, ACM Press, August 2020.
(1.11 MB)
“

Integrating Deep Learning in Domain Sciences at Exascale,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-10: University of Tennessee, August 2020.
(1.09 MB)
“

hipMAGMA v2.0
: Zenodo, July 2020.
DOI: 10.5281/zenodo.3928667
A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic,”
SLATE Working Notes, no. 15, ICL-UT-20-08: University of Tennessee, July 2020.
(3.98 MB)
“

How to Build Your Own Deep Neural Network
: PEARC20, July 2020.
(18.8 MB)

SLATE Users' Guide,”
SLATE Working Notes, no. 10, ICL-UT-19-01: Innovative Computing Laboratory, University of Tennessee, July 2020.
(1.51 MB)
“

Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part V,”
Lecture Notes in Computer Science, 1, no. 12141: Springer International Publishing, pp. 618, June 2020.
DOI: 10.1007/978-3-030-50426-7
“
Sparse Linear Algebra on AMD and NVIDIA GPUs—The Race is On,”
ISC High Performance: Springer, June 2020.
DOI: 10.1007/978-3-030-50743-5_16
(5.63 MB)
“

Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV,”
Lecture Notes in Computer Science, 1, no. 12140: Springer International Publishing, pp. 668, June 2020.
DOI: 10.1007/978-3-030-50423-6
“
Twenty Years of Computational Science,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
(149.66 KB)
“

Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part III,”
Lecture Notes in Computer Science, 1, no. 12139: Springer International Publishing, pp. 648, June 2020.
DOI: 10.1007/978-3-030-50420-5
“
Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part II,”
Lecture Notes in Computer Science, 1, no. 12138: Springer International Publishing, pp. 697, June 2020.
DOI: 10.1007/978-3-030-50417-5
“
Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VII,”
Lecture Notes in Computer Science, 1, no. 12143: Springer International Publishing, pp. 775, June 2020.
DOI: 10.1007/978-3-030-50436-6
“
FFT-Based Gradient Sparsification for the Distributed Training of Deep Neural Networks,”
29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20), Stockholm, Sweden, ACM, June 2020.
DOI: 10.1145/3369583.3392681
(4.72 MB)
“

ASCR@40: Highlights and Impacts of ASCR’s Programs
: US Department of Energy’s Office of Advanced Scientific Computing Research, June 2020.
DOI: 10.2172/1631812
heFFTe: Highly Efficient FFT for Exascale,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020.
DOI: 10.1007/978-3-030-50371-0_19
(2.62 MB)
“

Extreme-Scale Task-Based Cholesky Factorization Toward Climate and Weather Prediction Applications,”
Platform for Advanced Scientific Computing Conference (PASC20), Geneva, Switzerland, ACM, June 2020.
DOI: 10.1145/3394277.3401846
(2.71 MB)
“

Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part I,”
Lecture Notes in Computer Science, 1, no. 12137: Springer International Publishing, pp. 707, June 2020.
DOI: 10.1007/978-3-030-50371-0
“
Computational Science – ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part VI,”
Lecture Notes in Computer Science, 1, no. 12142: Springer International Publishing, pp. 667, June 2020.
DOI: 10.1007/978-3-030-50433-5
“
Report on the Fujitsu Fugaku System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-06: University of Tennessee, June 2020.
(3.3 MB)
“

Investigating the Benefit of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Definite Matrices using GPUs,”
International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, Springer, Cham, June 2020.
DOI: 10.1007/978-3-030-50417-5_18
(702.38 KB)
“

Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures,”
Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020), May 2020.
(188.51 KB)
“

Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime,”
2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), New Orleans, LA, IEEE, May 2020.
DOI: 10.1109/IPDPSW50202.2020.00127
(1.33 MB)
“

Revisiting Dynamic DAG Scheduling under Memory Constraints for Shared-Memory Platforms,”
22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(317.93 KB)
“

Using Arm Scalable Vector Extension to Optimize Open MPI,”
20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020), Melbourne, Australia, IEEE/ACM, May 2020.
DOI: 10.1109/CCGrid49817.2020.00-71
(359.95 KB)
“

Reservation and Checkpointing Strategies for Stochastic Jobs,”
34th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(692.4 KB)
“

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models,”
20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), Melbourne, VIC, Australia, IEEE, May 2020.
DOI: 10.1109/CCGrid49817.2020.00-76
(424.19 KB)
“

CEED ECP Milestone Report: Improve Performance and Capabilities of CEED-Enabled ECP Applications on Summit/Sierra,”
ECP Milestone Reports: Zenodo, May 2020.
DOI: 10.5281/zenodo.3860804
(28.12 MB)
“

Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-05: University of Tennessee, May 2020.
(1.03 MB)
“

Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs,”
22nd Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2020), New Orleans, LA, IEEE Computer Society Press, May 2020.
(696.21 KB)
“

Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution,”
Future Generation Computer Systems, vol. 106, pp. 467-481, May 2020.
DOI: 10.1016/j.future.2020.01.026
(2.06 MB)
“

Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD,”
Concurrency and Computation: Practice and Experience, April 2020.
DOI: 10.1002/cpe.5754
(1.43 MB)
“

Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-04: University of Tennessee, Knoxville, March 2020.
(188.51 KB)
“

Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part II,”
Lecture Notes in Computer Science, no. 12044: Springer International Publishing, pp. 503, March 2020.
DOI: 10.1007/978-3-030-43222-5
“
Docker Container based PaaS Cloud Computing Comprehensive Benchmarks using LAPACK,”
Computer Modeling and Intelligent Systems CMIS-2020, Zaporizhzhia, March 2020.
(451.33 KB)
“

Parallel Processing and Applied Mathematics: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I,”
Lecture Notes in Computer Science, 1, no. 12043: Springer International Publishing, pp. 581, March 2020.
DOI: 10.1007/978-3-030-43229-4
“
Load-Balancing Sparse Matrix Vector Product Kernels on GPUs,”
ACM Transactions on Parallel Computing, vol. 7, issue 1, March 2020.
DOI: 10.1145/3380930
(5.67 MB)
“

hipMAGMA v1.0
: Zenodo, March 2020.
DOI: 10.5281/zenodo.3908549
Using Quantized Integer in LU Factorization with Partial Pivoting (Poster)
, Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
(6.65 MB)

Improved Energy-Aware Strategies for Periodic Real-Time Tasks under Reliability Constraints,”
40th IEEE Real-Time Systems Symposium (RTSS 2019), York, UK, IEEE Press, February 2020.
“
DTE: PaRSEC Enabled Libraries and Applications (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(979.27 KB)

MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
, Seattle, WA, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, February 2020.
(2.28 MB)

SLATE Tutorial
, Houston, TX, 2020 ECP Annual Meeting, February 2020.
(12.14 MB)

heFFTe: Highly Efficient FFT for Exascale (Poster)
, Seattle, WA, SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP20), February 2020.
(1.54 MB)

Clover: Computational Libraries Optimized via Exascale Research
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(872 KB)

xSDK4ECP: Extreme-scale Scientific Software Development Kit for ECP (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(1.54 MB)

heFFTe: Highly Efficient FFT for Exascale (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(6.2 MB)

Redesigning PAPI's High-Level API,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-20-03: University of Tennessee, February 2020.
(356.41 KB)
“

The PLASMA Library on CORAL Systems and Beyond (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(550.86 KB)

SLATE: Software for Linear Algebra Targeting Exascale (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(546.56 KB)

PULSE: PAPI Unifying Layer for Software-Defined Events (Poster)
, Seattle, WA, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, February 2020.
(1.86 MB)

Exa-PAPI: The Exascale Performance API with Modern C++
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(556.78 KB)

Overhead of Using Spare Nodes,”
The International Journal of High Performance Computing Applications, February 2020.
DOI: 10.1177/1094342020901885
(2.15 MB)
“

Ginkgo: A Node-Level Sparse Linear Algebra Library for HPC (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(699 KB)

DTE: PaRSEC Systems and Interfaces (Poster)
, Houston, TX, 2020 Exascale Computing Project Annual Meeting, February 2020.
(840.54 KB)

Formulation of Requirements for New PAPI++ Software Package: Part I: Survey Results,”
PAPI++ Working Notes, no. 1, ICL-UT-20-02: Innovative Computing Laboratory, University of Tennessee Knoxville, January 2020.
(1.49 MB)
“

Project-Based Research and Training in High Performance Data Sciences, Data Analytics, and Machine Learning,”
The Journal of Computational Science Education, vol. 11, issue 1, pp. 36-44, January 2020.
DOI: 10.22369/issn.2153-4136/11/1/7
(4.4 MB)
“

Performance Tuning SLATE,”
SLATE Working Notes, no. 14, ICL-UT-20-01: Innovative Computing Laboratory, University of Tennessee, January 2020.
(1.29 MB)
“

FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs,”
ECP Milestone Report, no. FFT-ECP STML13-27: Innovative Computing Laboratory, University of Tennessee, January 2020.
(9.71 MB)
“

Interoperable Convergence of Storage, Networking, and Computation,”
Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC), no. 2: Springer International Publishing, pp. 667-690, 2020.
(1.8 MB)
“

Harnessing the Computing Continuum for Programming Our World,”
Fog Computing: Theory and Practice: John Wiley & Sons, Inc., 2020.
DOI: 10.1002/9781119551713.ch7
(1.4 MB)
“

Numerical Algorithms for High-Performance Computational Science,”
Philosophical Transactions of the Royal Society A, vol. 378, issue 2166, 2020.
DOI: 10.1098/rsta.2019.0066
(724.37 KB)
“

Energy-Aware Strategies for Reliability-Oriented Real-Time Task Allocation on Heterogeneous Platforms,”
49th International Conference on Parallel Processing (ICPP 2020), Edmonton, AB, Canada, ACM Press, 2020.
(804.96 KB)
“

2019
Parallel Selection on GPUs,”
Parallel Computing, vol. 91, March 2020, 2019.
DOI: 10.1016/j.parco.2019.102588
(1.43 MB)
“

Evaluation of Directive-Based Performance Portable Programming Models,”
International Journal of High Performance Computing and Networking, vol. 14, issue 2, pp. 165-182.
DOI: 10.1504/IJHPCN.2017.10009064
(1.12 MB)
“

SLATE Developers' Guide,”
SLATE Working Notes, no. 11, ICL-UT-19-02: Innovative Computing Laboratory, University of Tennessee, December 2019.
(1.68 MB)
“

Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization,”
PAW-ATM Workshop at SC19, Denver, CO, ACM, November 2019.
(4.51 MB)
“

Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019.
DOI: 10.1177/1094342019846956
(930.28 KB)
“

Replication is More Efficient Than You Think,”
The IEEE/ACM Conference on High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
(975.69 KB)
“

SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,”
International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM, November 2019.
DOI: 10.1145/3295500.3356223
(2.01 MB)
“

PAPI Software-Defined Events for in-Depth Performance Analysis,”
The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1113-1127, November 2019.
(442.39 KB)
“

Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training,”
2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), Denver, CO, IEEE, November 2019.
DOI: 10.1109/MLHPC49564.2019.00006
(696.89 KB)
“

Toward a Modular Precision Ecosystem for High-Performance Computing,”
The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1069-1078, November 2019.
DOI: 10.1177/1094342019846547
(1.93 MB)
“

Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility,”
Proceedings in Applied Mathematics and Mechanics, vol. 19, issue 1, November 2019.
DOI: 10.1002/pamm.201900490
“
SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library
, Denver, CO, International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), November 2019.
(16.19 MB)

Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,”
Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
(1.6 MB)
“

Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC,”
ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
(260.69 KB)
“

Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,”
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
(440.7 KB)
“

Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools,”
Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19, Denver, CO, ACM, November 2019.
(429.55 KB)
“

A Generic Approach to Scheduling and Checkpointing Workflows,”
International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1255-1274, November 2019.
DOI: 10.1177/1094342019866891
(555.01 KB)
“

Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs,”
ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
(523.87 KB)
(3.42 MB)
“


CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps
: Zenodo, October 2019.
DOI: 10.5281/zenodo.3477618
(8.31 MB)

New Robust ScaLAPACK Routine for Computing the QR Factorization with Column Pivoting,”
LAPACK Working Note, no. LAWN 296, ICL-UT-19-14: University of Tennessee, October 2019.
(454.83 KB)
“

FFT-ECP Implementation Optimizations and Features Phase,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.
(4.14 MB)
“

A Collection of White Papers from the BDEC2 Workshop in San Diego, CA,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-13: University of Tennessee, October 2019.
(8.25 MB)
“

BDEC2 Platform White Paper,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-11: University of Tennessee, September 2019.
(30.16 KB)
“

SLATE Working Note 13: Implementing Singular Value and Symmetric/Hermitian Eigenvalue Solvers,”
SLATE Working Notes, no. 13, ICL-UT-19-07: Innovative Computing Laboratory, University of Tennessee, September 2019.
(3.47 MB)
“

Data Logistics: Toolkit and Applications,”
5th EAI International Conference on Smart Objects and Technologies for Social Good, Valencia, Spain, September 2019.
(6.71 MB)
“

Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms,”
IEEE Cluster 2019, Albuquerque, New Mexico, IEEE Computer Society Press, September 2019.
(651 KB)
“

Characterization of Power Usage and Performance in Data-Intensive Applications using MapReduce over MPI,”
2019 International Conference on Parallel Computing (ParCo2019), Prague, Czech Republic, September 2019.
“
Progressive Optimization of Batched LU Factorization on GPUs,”
IEEE High Performance Extreme Computing Conference (HPEC’19), Waltham, MA, IEEE, September 2019.
(299.38 KB)
“

PAPI's new Software-Defined Events for in-depth Performance Analysis
, Dresden, Germany, 13th Parallel Tools Workshop, September 2019.
(3.14 MB)

Runtime Level Failure Detection and Propagation in HPC Systems,”
European MPI Users' Group Meeting (EuroMPI '19), Zürich, Switzerland, ACM, September 2019.
DOI: 10.1145/3343211.3343225
(1.11 MB)
“

An Empirical View of SLATE Algorithms on Scalable Hybrid System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-08: University of Tennessee, Knoxville, September 2019.
(441.16 KB)
“

GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,”
EuroMPI'19 Posters, Zurich, Switzerland, no. ICL-UT-19-06: ICL, September 2019.
(2.25 MB)
“

Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators,”
IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist, Waltham, MA, IEEE, September 2019.
(470.21 KB)
“

Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs,”
IEEE Cluster, Albuquerque, NM, IEEE, September 2019.
(220.84 KB)
“

Linear Systems Solvers for Distributed-Memory Machines with GPU Accelerators,”
Euro-Par 2019: Parallel Processing, vol. 11725: Springer, pp. 495–506, August 2019.
DOI: 10.1007/978-3-030-29400-7_35
“
Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring,”
2019 European Conference on Parallel Processing (Euro-Par 2019), Göttingen, Germany, Springer, August 2019.
DOI: 10.1007/978-3-030-29400-7_4
(1.07 MB)
“

Distributed-Memory Lattice H-Matrix Factorization,”
The International Journal of High Performance Computing Applications, vol. 33, issue 5, pp. 1046–1063, August 2019.
DOI: 10.1177/1094342019861139
(1.14 MB)
“

Performance of Asynchronous Optimized Schwarz with One-sided Communication,”
Parallel Computing, vol. 86, pp. 66-81, August 2019.
DOI: 10.1016/j.parco.2019.05.004
(3.09 MB)
“

Massively Parallel Automated Software Tuning,”
48th International Conference on Parallel Processing (ICPP 2019), Kyoto, Japan, ACM Press, August 2019.
DOI: 10.1145/3337821.3337908
(911.88 KB)
“

What it Takes to keep PAPI Instrumental for the HPC Community
, Collegeville, MN, The 2019 Collegeville Workshop on Sustainable Scientific Software (CW3S19), July 2019.
(3.29 MB)

Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms,”
Parallel Computing, vol. 85, pp. 1–12, July 2019.
DOI: 10.1016/j.parco.2019.02.002
(865.18 KB)
“

Does your tool support PAPI SDEs yet?
, Tahoe City, CA, 13th Scalable Tools Workshop, July 2019.
(3.09 MB)

OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
(1.48 MB)
“

MagmaDNN: Accelerated Deep Learning Using MAGMA,”
Practice and Experience in Advanced Research Computing (PEARC ’19), Chicago, IL, ACM, July 2019.
(1.09 MB)
“

What it Takes to keep PAPI Instrumental for the HPC Community,”
1st Workshop on Sustainable Scientific Software (CW3S19), Collegeville, Minnesota, July 2019.
(50.57 KB)
“

Is your scheduling good? How would you know?
, Bordeaux, France, 14th Scheduling for Large Scale Systems Workshop, June 2019.
(2.5 MB)

MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
DOI: 10.1007/978-3-030-34356-9_37
(1.37 MB)
(8.72 MB)
“


Towards Continuous Benchmarking,”
Platform for Advanced Scientific Computing Conference (PASC 2019), Zurich, Switzerland, ACM Press, June 2019.
DOI: 10.1145/3324989.3325719
(1.51 MB)
“

Least Squares Solvers for Distributed-Memory Machines with GPU Accelerators,”
ACM International Conference on Supercomputing (ICS '19), Phoenix, Arizona, ACM, pp. 117–126, June 2019.
DOI: 10.1145/3330345.3330356
(1.63 MB)
“

PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP,”
ACM Transactions on Mathematical Software, vol. 45, issue 2, June 2019.
DOI: 10.1145/3264491
(7.5 MB)
“

SLATE Working Note 12: Implementing Matrix Inversions,”
SLATE Working Notes, no. 12, ICL-UT-19-04: Innovative Computing Laboratory, University of Tennessee, June 2019.
(1.95 MB)
“

Scheduling Independent Stochastic Tasks under Deadline and Budget Constraints,”
International Journal of High Performance Computing Applications, vol. 34, issue 2, pp. 246-264, June 2019.
DOI: 10.1177/1094342019852135
(427.92 KB)
“

Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,”
ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019.
(1016.52 KB)
“

Are we Doing the Right Thing? – A Critical Analysis of the Academic HPC Community,”
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, IEEE, May 2019.
DOI: 10.1109/IPDPSW.2019.00122
(622.32 KB)
“

Matrix Powers Kernels for Thick-Restart Lanczos with Explicit External Deflation,”
International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
(480.73 KB)
“

Software-Defined Events through PAPI,”
2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, IEEE, May 2019.
DOI: 10.1109/IPDPSW.2019.00069
(446.41 KB)
“

A Collection of White Papers from the BDEC2 Workshop in Poznan, Poland,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-10: University of Tennessee, Knoxville, May 2019.
(5.82 MB)
“

Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,”
33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
(675.5 KB)
“

Reservation Strategies for Stochastic Jobs,”
33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), Rio de Janeiro, Brazil, IEEE Computer Society Press, May 2019.
(808.93 KB)
“

Computing Dense Tensor Decompositions with Optimal Dimension Trees,”
Algorithmica, vol. 81, issue 5, pp. 2092–2121, May 2019.
DOI: 10.1007/s00453-018-0525-3
(638.4 KB)
“

ParILUT – A Parallel Threshold ILU for GPUs,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019.
DOI: 10.1109/IPDPS.2019.00033
(505.95 KB)
“

Approximate and Exact Selection on GPUs,”
2019 IEEE International Parallel and Distributed Processing Symposium Workshops, Rio de Janeiro, Brazil, IEEE, May 2019.
DOI: 10.1109/IPDPSW.2019.00088
(440.71 KB)
“

Solving Linear Diophantine Systems on Parallel Architectures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 30, issue 5, pp. 1158-1169, May 2019.
DOI: 10.1109/TPDS.2018.2873354
(802.97 KB)
“

CEED ECP Milestone Report: Public release of CEED 2.0
: Zenodo, April 2019.
DOI: 10.5281/zenodo.2641316
(4.98 MB)

Design and Implementation for FFT-ECP on Distributed Accelerated Systems,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.
(3.19 MB)
“

SLATE Mixed Precision Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-03: University of Tennessee, April 2019.
(1.04 MB)
“

Understanding Native Event Semantics
, Knoxville, TN, 9th JLESC Workshop, April 2019.
(2.33 MB)

Adaptive Precision in Block-Jacobi Preconditioning for Iterative Sparse Linear System Solvers,”
Concurrency and Computation: Practice and Experience, vol. 31, no. 6, pp. e4460, March 2019.
DOI: 10.1002/cpe.4460
(341.54 KB)
“

Race to Exascale,”
Computing in Science and Engineering, vol. 21, issue 1, pp. 4-5, March 2019.
DOI: 10.1109/MCSE.2018.2882574
(106.97 KB)
“

Optimizing Batch HGEMM on Small Sizes Using Tensor Cores
, San Jose, CA, GPU Technology Conference (GTC), March 2019.
(2.47 MB)

Local Rollback for Resilient MPI Applications with Application-Level Checkpointing and Message Logging,”
Future Generation Computer Systems, vol. 91, pp. 450-464, February 2019.
DOI: 10.1016/j.future.2018.09.041
(1.16 MB)
“

A Collection of Presentations from the BDEC2 Workshop in Kobe, Japan,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-19-09: University of Tennessee, Knoxville, February 2019.
(58.85 MB)
“

Counter Inspection Toolkit: Making Sense out of Hardware Performance Events,”
11th International Workshop on Parallel Tools for High Performance Computing, Dresden, Germany, Cham, Switzerland: Springer, February 2019.
DOI: 10.1007/978-3-030-11987-4_2
(216.39 KB)
“

FFT-ECP Fast Fourier Transform
, Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
(1.51 MB)

MagmaDNN 0.2 High-Performance Data Analytics for Manycore GPUs and CPUs
: University of Tennessee, January 2019.
DOI: 10.13140/RG.2.2.14906.64961
(7.84 MB)

Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Parallel Computing, vol. 81, pp. 1–21, January 2019.
DOI: 10.1016/j.parco.2018.10.003
(3.27 MB)
“

A Customized Precision Format Based on Mantissa Segmentation for Accelerating Sparse Linear Algebra,”
Concurrency and Computation: Practice and Experience, vol. 40319, issue 262, January 2019.
DOI: 10.1002/cpe.5418
“
Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,”
Parallel Computing, vol. 81, pp. 131-146, January 2019.
DOI: 10.1016/j.parco.2017.12.006
(1.9 MB)
“

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 2-27.
(754.6 KB)
“

System Software for Many-Core and Multi-Core Architectures,”
Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project, Singapore, Springer Singapore, pp. 59–75, 2019.
DOI: 10.1007/978-981-13-1924-2_4
“
A Generic Approach to Scheduling and Checkpointing Workflows,”
International Journal of High Performance Computing Applications, vol. 33, no. 6, pp. 1255-1274, 2019.
(555.01 KB)
“

Checkpointing Strategies for Shared High-Performance Computing Platforms,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 28–52, 2019.
(490.5 KB)
“

2018
Autotuning Techniques for Performance-Portable Point Set Registration in 3D,”
Supercomputing Frontiers and Innovations, vol. 5, no. 4, December 2018.
DOI: 10.14529/jsfi180404
(720.15 KB)
“

Coping with Silent and Fail-Stop Errors at Scale by Combining Replication and Checkpointing,”
Journal of Parallel and Distributed Computing, vol. 122, pp. 209–225, December 2018.
DOI: 10.1016/j.jpdc.2018.08.002
(837 KB)
“

Software-Defined Events (SDEs) in MAGMA-Sparse,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-12: University of Tennessee, December 2018.
(481.69 KB)
“

Initial Integration and Evaluation of SLATE and STRUMPACK,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-11: University of Tennessee, December 2018.
(249.78 KB)
“

Least Squares Performance Report,”
SLATE Working Notes, no. 09, ICL-UT-18-10: Innovative Computing Laboratory, University of Tennessee, December 2018.
(1.76 MB)
“

Tensor Contraction on Distributed Hybrid Architectures using a Task-Based Runtime System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-13: University of Tennessee, December 2018.
(326.11 KB)
“

Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 12, pp. 2700–2712, December 2018.
DOI: 10.1109/TPDS.2018.2842785
(2.53 MB)
“

Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2040–2055, November 2018.
DOI: 10.1109/JPROC.2018.2868961
(2.53 MB)
“

Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers,”
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Dallas, TX, IEEE, November 2018.
DOI: 10.1109/SC.2018.00050
(642.51 KB)
“

The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,”
SIAM Review, vol. 60, issue 4, pp. 808–865, November 2018.
DOI: 10.1137/17M1117732
(2.5 MB)
“

A Collection of White Papers from the BDEC2 Workshop in Bloomington, IN,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-15: University of Tennessee, Knoxville, November 2018.
(9.26 MB)
“

Autotuning in High-Performance Computing Applications,”
Proceedings of the IEEE, vol. 106, issue 11, pp. 2068–2083, November 2018.
DOI: 10.1109/JPROC.2018.2841200
(2.5 MB)
“

Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision
, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster, November 2018.
(740.37 KB)

Using Jacobi Iterations and Blocking for Solving Sparse Triangular Systems in Incomplete Factorization Preconditioning,”
Journal of Parallel and Distributed Computing, vol. 119, pp. 219–230, November 2018.
DOI: 10.1016/j.jpdc.2018.04.017
(273.53 KB)
“

The 30th Anniversary of the Supercomputing Conference: Bringing the Future Closer—Supercomputing History and the Immortality of Now,”
Computer, vol. 51, issue 10, pp. 74–85, November 2018.
DOI: 10.1109/MC.2018.3971352
(1.73 MB)
“

MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines
, Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, November 2018.
(2.55 MB)

Evaluation and Design of FFT for Distributed Accelerated Systems,”
ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.
(7.53 MB)
“

Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP,”
International Journal of Computational Science and Engineering (IJCSE), vol. 17, no. 3, October 2018.
DOI: 10.1504/IJCSE.2018.095851
“
Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-09: Innovative Computing Laboratory, University of Tennessee, September 2018.
(3.74 MB)
“

Variable-Size Batched Condition Number Calculation on GPUs,”
SBAC-PAD, Lyon, France, September 2018.
(509.3 KB)
“

Linear Systems Performance Report,”
SLATE Working Notes, no. 08, ICL-UT-18-08: Innovative Computing Laboratory, University of Tennessee, September 2018.
(1.64 MB)
“

Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,”
Cluster 2018, Belfast, UK, IEEE Computer Society Press, September 2018.
(423.75 KB)
“

Multi-Level Checkpointing and Silent Error Detection for Linear Workflows,”
Journal of Computational Science, vol. 28, pp. 398–415, September 2018.
“
PAPI's New Software-Defined Events for In-Depth Performance Analysis
, Lyon, France, CCDSC 2018: Workshop on Clusters, Clouds, and Data for Scientific Computing, September 2018.
Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,”
IEEE High Performance Extreme Computing Conference (HPEC’18), Waltham, MA, IEEE, September 2018.
(729.87 KB)
“

A Survey of MPI Usage in the US Exascale Computing Project,”
Concurrency and Computation: Practice and Experience, September 2018.
DOI: 10.1002/cpe.4851
(359.54 KB)
“

Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
(360.72 KB)
“

Computational Benefit of GPU Optimization for Atmospheric Chemistry Modeling,”
Journal of Advances in Modeling Earth Systems, vol. 10, issue 8, pp. 1952–1969, August 2018.
DOI: 10.1029/2018MS001276
(3.4 MB)
“

Symmetric Indefinite Linear Solver using OpenMP Task on Multicore Architectures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 8, pp. 1879–1892, August 2018.
DOI: 10.1109/TPDS.2018.2808964
(2.88 MB)
“

A Performance Model to Execute Workflows on High-Bandwidth Memory Architectures,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
(868.44 KB)
“

Checkpointing Workflows for Fail-Stop Errors,”
IEEE Transactions on Computers, vol. 67, issue 8, pp. 1105–1120, August 2018.
“
A Generic Approach to Scheduling and Checkpointing Workflows,”
The 47th International Conference on Parallel Processing (ICPP 2018), Eugene, OR, IEEE Computer Society Press, August 2018.
(737.11 KB)
“

ParILUT - A New Parallel Threshold ILU,”
SIAM Journal on Scientific Computing, vol. 40, issue 4: SIAM, pp. C503–C519, July 2018.
DOI: 10.1137/16M1079506
(19.26 MB)
“

Accelerating NWChem Coupled Cluster through dataflow-based Execution,”
The International Journal of High Performance Computing Applications, vol. 32, issue 4, pp. 540–551, July 2018.
DOI: 10.1177/1094342016672543
(1.68 MB)
“

Software-Defined Events through PAPI for In-Depth Analysis of Application Performance
, Basel, Switzerland, 5th Platform for Advanced Scientific Computing Conference (PASC18), July 2018.
Batched BLAS (Basic Linear Algebra Subprograms) 2018 Specification
, July 2018.
(483.05 KB)

Big Data and Extreme-Scale Computing: Pathways to Convergence - Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry,”
The International Journal of High Performance Computing Applications, vol. 32, issue 4, pp. 435–479, July 2018.
DOI: 10.1177/1094342018778123
(1.29 MB)
“

Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors,”
Parallel Computing, vol. 75, pp. 41–60, July 2018.
DOI: 10.1016/j.parco.2018.03.004
(2.56 MB)
“

Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption,”
ISC High Performance (ISC'18), Best Poster, Frankfurt, Germany, June 2018.
(3.01 MB)
“

Implementation of the C++ API for Batch BLAS,”
SLATE Working Notes, no. 07, ICL-UT-18-04: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.07 MB)
“

The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques,”
International Conference on Computational Science (ICCS 2018), vol. 10860, Wuxi, China, Springer, pp. 586–600, June 2018.
DOI: 10.1007/978-3-319-93698-7_45
(487.88 KB)
“

Initial Integration and Evaluation of SLATE Parallel BLAS in LATTE,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-07: Innovative Computing Laboratory, University of Tennessee, June 2018.
(366.6 KB)
“

Parallel Norms Performance Report,”
SLATE Working Notes, no. 06, ICL-UT-18-06: Innovative Computing Laboratory, University of Tennessee, June 2018.
(1.13 MB)
“

Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption
, Frankfurt, Germany, ISC High Performance (ISC18), Best Poster Award, June 2018.
(3.01 MB)

ADAPT: An Event-Based Adaptive Collective Communication Framework,”
The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18), Tempe, Arizona, ACM Press, June 2018.
DOI: 10.1145/3208040.3208054
(493.65 KB)
“

Distributed Termination Detection for HPC Task-Based Environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-14: University of Tennessee, June 2018.
“
Solver Interface & Performance on Cori,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-05: University of Tennessee, June 2018.
(188.05 KB)
“

Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,”
Journal of Computational Science, vol. 26, pp. 226–236, May 2018.
DOI: 10.1016/j.jocs.2018.01.005
(3.73 MB)
“

Budget-Aware Scheduling Algorithms for Scientific Workflows with Stochastic Task Weights on Heterogeneous IaaS Cloud Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Vancouver, BC, Canada, IEEE, May 2018.
DOI: 10.1109/IPDPSW.2018.00014
(1.31 MB)
“

A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,”
IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973–984, May 2018.
DOI: 10.1109/TPDS.2017.2783929
(832.92 KB)
“

Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award, Vancouver, BC, Canada, IEEE, May 2018.
DOI: 10.1109/IPDPSW.2018.00127
(899.3 KB)
“

Evaluation of Dataflow Programming Models for Electronic Structure Theory,”
Concurrency and Computation: Practice and Experience: Special Issue on Parallel and Distributed Algorithms, vol. 2018, issue e4490, pp. 1–20, May 2018.
DOI: 10.1002/cpe.4490
(1.69 MB)
“

Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, IEEE, May 2018.
(1.37 MB)
“

Data Movement Interfaces to Support Dataflow Runtimes,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-18-03: University of Tennessee, May 2018.
(210.94 KB)
“

Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,”
Journal of Computational Science, vol. 26, pp. 237–245, May 2018.
DOI: 10.1016/j.jocs.2018.01.007
(2.18 MB)
“

Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,”
Parallel Computing, vol. 74, pp. 3–18, May 2018.
DOI: 10.1016/j.parco.2017.10.004
(1.34 MB)
“

Parallel BLAS Performance Report,”
SLATE Working Notes, no. 05, ICL-UT-18-01: University of Tennessee, April 2018.
(4.39 MB)
“

PAPI: Counting outside the Box
, Barcelona, Spain, 8th JLESC Meeting, April 2018.
MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR)
, Washington, DC, NSF PI Meeting, Poster, April 2018.
DOI: 10.6084/m9.figshare.6174143.v3
(2.4 MB)

Bidiagonal SVD Computation via an Associated Tridiagonal Eigenproblem,”
LAPACK Working Note, no. LAWN 295, ICL-UT-18-02: University of Tennessee, April 2018.
(1.53 MB)
“

Investigating Power Capping toward Energy-Efficient Scientific Applications,”
Concurrency and Computation: Practice and Experience, vol. 2018, issue e4485, pp. 1–14, April 2018.
DOI: 10.1002/cpe.4485
(1.2 MB)
“

Production Implementations of Pipelined & Communication-Avoiding Iterative Linear Solvers
, Tokyo, Japan, SIAM Conference on Parallel Processing for Scientific Computing, March 2018.
(2.34 MB)

Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100
, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
(2.96 MB)

Optimization and Performance Evaluation of the IDR Iterative Krylov Solver on GPUs,”
The International Journal of High Performance Computing Applications, vol. 32, no. 2, pp. 220–230, March 2018.
DOI: 10.1177/1094342016646844
(2.08 MB)
“

Tensor Contractions using Optimized Batch GEMM Routines
, San Jose, CA, GPU Technology Conference (GTC), Poster, March 2018.
(1.64 MB)

Accelerating Linear Algebra with MAGMA
, Knoxville, TN, ECP Annual Meeting 2018, Tutorial, February 2018.
(35.27 MB)

A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
DOI: 10.1177/1094342017711505
(1.04 MB)
“

Incomplete Sparse Approximate Inverses for Parallel Preconditioning,”
Parallel Computing, vol. 71, pp. 1–22, January 2018.
DOI: 10.1016/j.parco.2017.10.003
(1.24 MB)
“

PMIx: Process Management for Exascale Environments,”
Parallel Computing, vol. 79, pp. 9–29, January 2018.
DOI: 10.1016/j.parco.2018.08.002
“
Co-Scheduling Amdahl Applications on Cache-Partitioned Systems,”
International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 123–138, January 2018.
DOI: 10.1177/1094342017710806
(672.52 KB)
“

High-Performance GPU Implementation of PageRank with Reduced Precision based on Mantissa Segmentation,”
8th Workshop on Irregular Applications: Architectures and Algorithms, 2018.
“
Evaluating Contexts in OpenSHMEM-X Reference Implementation,”
OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, Cham, Springer International Publishing, pp. 50–62, 2018.
DOI: 10.1007/978-3-319-73814-7_4
“
A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs,”
SBAC-PAD, Lyon, France, IEEE, 2018.
(237.68 KB)
“

Scheduling for Fault-Tolerance: An Introduction,”
Topics in Parallel and Distributed Computing: Springer International Publishing, pp. 143–170, 2018.
DOI: 10.1007/978-3-319-93109-8
“
2017
LAWN 294: Aasen's Symmetric Indefinite Linear Solvers in LAPACK,”
LAPACK Working Note, no. LAWN 294, ICL-UT-17-13: University of Tennessee, December 2017.
(854.1 KB)
“

Scaling Point Set Registration in 3D Across Thread Counts on Multicore and Hardware Accelerator Platforms through Autotuning for Large Scale Analysis of Scientific Point Clouds,”
IEEE International Workshop on Benchmarking, Performance Tuning and Optimization for Big Data Applications (BPOD 2017), Boston, MA, IEEE, December 2017.
DOI: 10.1109/BigData.2017.8258258
(6.71 MB)
“

POMPEI: Programming with OpenMP4 for Exascale Investigations,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-09: University of Tennessee, December 2017.
(1.1 MB)
“

C++ API for Batch BLAS,”
SLATE Working Notes, no. 04, ICL-UT-17-12: University of Tennessee, December 2017.
(1.89 MB)
“

Sampling Algorithms to Update Truncated SVD,”
IEEE International Conference on Big Data, Boston, MA, IEEE, December 2017.
(700.79 KB)
“

MagmaDNN – High-Performance Data Analytics for Manycore GPUs and CPUs
, Knoxville, TN, 2017 Summer Research Experiences for Undergraduate (REU), Presentation, December 2017.
(5.06 MB)

Performance Analysis and Debugging Tools at Scale,”
Exascale Scientific Applications: Scalability and Performance Portability: Chapman & Hall / CRC Press, pp. 17-50, November 2017.
DOI: 10.1201/b21930
“
Flexible Batched Sparse Matrix-Vector Product on GPUs,”
8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), Denver, CO, ACM Press, November 2017.
DOI: 10.1145/3148226.3148230
(583.4 KB)
“

BDEC Pathways to Convergence: Toward a Shaping Strategy for a Future Software and Data Ecosystem for Scientific Inquiry,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-08: University of Tennessee, November 2017.
Flexible Batched Sparse Matrix Vector Product on GPUs
, Denver, Colorado, ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, November 2017.
(16.8 MB)

Efficient Communications in Training Large Scale Neural Networks,”
ACM MultiMedia Workshop 2017, Mountain View, CA, ACM, October 2017.
(1.41 MB)
“

The Case for Directive Programming for Accelerator Autotuner Optimization,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-07: University of Tennessee, October 2017.
(341.52 KB)
“

Argobots: A Lightweight Low-Level Threading and Tasking Framework,”
IEEE Transactions on Parallel and Distributed Systems, October 2017.
DOI: 10.1109/TPDS.2017.2766062
“
Designing SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 03, ICL-UT-17-06: Innovative Computing Laboratory, University of Tennessee, October 2017.
(2.8 MB)
“

MAGMA-sparse Interface Design Whitepaper,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-05, September 2017.
(1.28 MB)
“

Report on the TianHe-2A System,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-04: University of Tennessee, September 2017.
(7.15 MB)
“

Assuming failure independence: are we right to be wrong?,”
The 3rd International Workshop on Fault Tolerant Systems (FTS), Honolulu, Hawaii, IEEE, September 2017.
(597.11 KB)
“

Using Software-Based Performance Counters to Expose Low-Level Open MPI Performance Information,”
EuroMPI, Chicago, IL, ACM, September 2017.
DOI: 10.1145/3127024.3127039
(745.58 KB)
“

Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi,”
2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist, Waltham, MA, IEEE, September 2017.
DOI: 10.1109/HPEC.2017.8091085
(908.84 KB)
“

Checkpointing Workflows for Fail-Stop Errors,”
IEEE Cluster, Honolulu, Hawaii, IEEE, September 2017.
(400.64 KB)
“

Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic,”
2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, IEEE, September 2017.
DOI: 10.1109/HPEC.2017.8091031
(1.67 MB)
“

Out of Memory SVD Solver for Big Data,”
2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Waltham, MA, IEEE, September 2017.
(1.33 MB)
“

Dynamic Task Discovery in PaRSEC: A data-flow task-based Runtime,”
ScalA17, Denver, ACM, September 2017.
DOI: 10.1145/3148226.3148233
(1.15 MB)
“

Variable-Size Batched LU for Small Matrices and Its Integration into Block-Jacobi Preconditioning,”
46th International Conference on Parallel Processing (ICPP), Bristol, United Kingdom, IEEE, August 2017.
DOI: 10.1109/ICPP.2017.18
“
Optimized Batched Linear Algebra for Modern Architectures,”
Euro-Par 2017, Santiago de Compostela, Spain, Springer, August 2017.
DOI: 10.1007/978-3-319-64203-1_37
(618.33 KB)
“

Resilience for Stencil Computations with Latent Errors,”
International Conference on Parallel Processing (ICPP), Bristol, UK, IEEE Computer Society Press, August 2017.
(1.19 MB)
“

Towards Optimal Multi-Level Checkpointing,”
IEEE Transactions on Computers, vol. 66, issue 7, pp. 1212–1226, July 2017.
DOI: 10.1109/TC.2016.2643660
(1.39 MB)
“

Comparing performance of s-step and pipelined GMRES on distributed-memory multicore CPUs
, Pittsburgh, Pennsylvania, SIAM Annual Meeting, July 2017.
(748 KB)

Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives,”
Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award, Orlando, FL, June 2017.
DOI: 10.1109/IPDPSW.2017.65
(453.66 KB)
“

Optimal Checkpointing Period with replicated execution on heterogeneous platforms,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, IEEE Computer Society Press, June 2017.
DOI: 10.1145/3086157.3086165
(1.02 MB)
“

PLASMA 17.1 Functionality Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-10: University of Tennessee, June 2017.
(1.8 MB)
“

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning,”
International Conference on Computational Science (ICCS 2017), vol. 108, Zurich, Switzerland, Procedia Computer Science, pp. 1783-1792, June 2017.
DOI: 10.1016/j.procs.2017.05.186
(512.57 KB)
“

C++ API for BLAS and LAPACK,”
SLATE Working Notes, no. 02, ICL-UT-17-03: Innovative Computing Laboratory, University of Tennessee, June 2017.
(1.12 MB)
“

Power-Aware HPC on Intel Xeon Phi KNL Processors
, Frankfurt, Germany, ISC High Performance (ISC17), Intel Booth Presentation, June 2017.
(5.87 MB)

Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale,”
SLATE Working Notes, no. 01, ICL-UT-17-02: Innovative Computing Laboratory, University of Tennessee, June 2017.
(2.8 MB)
“

Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,”
International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017.
DOI: 10.1016/j.procs.2017.05.237
(364.95 KB)
“

Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices,”
Parallel and Distributed Processing Symposium Workshops (IPDPSW), Orlando, FL, IEEE, June 2017.
DOI: 10.1109/IPDPSW.2017.18
“
Preconditioned Krylov Solvers on GPUs,”
Parallel Computing, June 2017.
DOI: 10.1016/j.parco.2017.05.006
(1.19 MB)
“

Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs,”
International Conference on Supercomputing (ICS '17), Chicago, Illinois, ACM, June 2017.
DOI: 10.1145/3079079.3079103
(1.04 MB)
“

Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures,”
Procedia Computer Science, vol. 108, pp. 606–615, June 2017.
DOI: 10.1016/j.procs.2017.05.250
(643.44 KB)
“

The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems,”
International Conference on Computational Science (ICCS 2017), Zürich, Switzerland, Elsevier, June 2017.
DOI: 10.1016/j.procs.2017.05.138
(446.14 KB)
“

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale,”
2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, Washington, DC, ACM, June 2017.
DOI: 10.1145/3086157.3086162
(865.68 KB)
“

PLASMA 17 Performance Report,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-11: University of Tennessee, June 2017.
(7.57 MB)
“

A Framework for Out of Memory SVD Algorithms,”
ISC High Performance 2017, pp. 158–178, June 2017.
DOI: 10.1007/978-3-319-58667-0_9
(393.22 KB)
“

With Extreme Computing, the Rules Have Changed,”
Computing in Science & Engineering, vol. 19, issue 3, pp. 52-62, May 2017.
DOI: 10.1109/MCSE.2017.48
(485.34 KB)
“

MAGMA Tensors and Batched Computing for Accelerating Applications on GPUs
, San Jose, CA, GPU Technology Conference (GTC17), Presentation in Session S7728, May 2017.
(11.12 MB)

Co-Scheduling Algorithms for Cache-Partitioned Systems,”
19th Workshop on Advances in Parallel and Distributed Computational Models, Orlando, FL, IEEE Computer Society Press, May 2017.
DOI: 10.1109/IPDPSW.2017.60
(584.76 KB)
“

Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,”
IEEE Embedded Systems Letters, vol. 9, issue 3, pp. 61–64, May 2017.
DOI: 10.1109/LES.2017.2700401
(339.11 KB)
“

Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation,”
IEEE International Parallel and Distributed Processing Symposium (IPDPS), Orlando, FL, IEEE, May 2017.
DOI: 10.1109/IPDPS.2017.46
(328.15 KB)
“

Dataflow Programming Paradigms for Computational Chemistry Methods,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-17-01, Knoxville, TN, University of Tennessee, May 2017.
“
Resilient Co-Scheduling of Malleable Applications,”
International Journal of High Performance Computing Applications (IJHPCA), May 2017.
DOI: 10.1177/1094342017704979
(1.62 MB)
“

Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,”
Journal of Computational Science, vol. 20, pp. 85–93, May 2017.
DOI: 10.1016/j.jocs.2016.12.009
(3.6 MB)
“

Small Tensor Operations on Advanced Architectures for High-Order Applications,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-17-749: Innovative Computing Laboratory, University of Tennessee, April 2017.
(1.09 MB)
“

Solving Dense Symmetric Indefinite Systems using GPUs,”
Concurrency and Computation: Practice and Experience, vol. 29, issue 9, March 2017.
DOI: 10.1002/cpe.4055
(1.94 MB)
“

Accelerating Tensor Contractions in High-Order FEM with MAGMA Batched
, Atlanta, GA, SIAM Conference on Computer Science and Engineering (SIAM CSE17), Presentation, March 2017.
(9.29 MB)

Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs,”
Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, New York, NY, USA, ACM, pp. 1–10, February 2017.
DOI: 10.1145/3026937.3026940
(552.62 KB)
“

High-performance Cholesky Factorization for GPU-only Execution,”
Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017.
DOI: 10.1145/3038228.3038237
(872.18 KB)
“

Accelerating NWChem Coupled Cluster through Dataflow-Based Execution,”
The International Journal of High Performance Computing Applications, pp. 1–13, January 2017.
DOI: 10.1177/1094342016672543
(4.07 MB)
“

Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers,”
ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, ACM.
(766.35 KB)
“

Bringing High Performance Computing to Big Data Algorithms,”
Handbook of Big Data Technologies: Springer, 2017.
DOI: 10.1007/978-3-319-49340-4
(1.22 MB)
“

A Look Back on 30 Years of the Gordon Bell Prize,”
International Journal of High Performance Computing and Networking, vol. 31, issue 6, pp. 469–484, 2017.
“
PMIx: Process Management for Exascale Environments,”
Proceedings of the 24th European MPI Users' Group Meeting, New York, NY, USA, ACM, pp. 14:1–14:10, 2017.
DOI: 10.1145/3127024.3127027
“
Design and Implementation of the PULSAR Programming System for Large Scale Computing,”
Supercomputing Frontiers and Innovations, vol. 4, issue 1, 2017.
DOI: 10.14529/jsfi170101
(764.96 KB)
“

2016
Failure Detection and Propagation in HPC Systems,”
Proceedings of The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
“
On block-asynchronous execution on GPUs,”
LAPACK Working Note, no. 291, November 2016.
(1.05 MB)
“

Towards Achieving Performance Portability Using Directives for Accelerators,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD), Salt Lake City, Utah, Innovative Computing Laboratory, University of Tennessee, November 2016.
(567.02 KB)
“

Batched Generation of Incomplete Sparse Approximate Inverses on GPUs,”
Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp. 49–56, November 2016.
DOI: 10.1109/ScalA.2016.11
“
Non-GPU-resident Dense Symmetric Indefinite Factorization,”
Concurrency and Computation: Practice and Experience, November 2016.
DOI: 10.1002/cpe.4012
“
Fine-grained Bit-Flip Protection for Relaxation Methods,”
Journal of Computational Science, November 2016.
DOI: 10.1016/j.jocs.2016.11.013
(1.47 MB)
“

Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU,”
ACM Transactions on Mathematical Software (TOMS), vol. 43, issue 2, October 2016.
“
High Performance Realtime Convex Solver for Embedded Systems,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-745, October 2016.
(225.43 KB)
“

On the performance and energy efficiency of sparse linear algebra on GPUs,”
International Journal of High Performance Computing Applications, October 2016.
DOI: 10.1177/1094342016672081
(1.19 MB)
“

Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations,”
2016 IEEE High Performance Extreme Computing Conference (HPEC '16), Waltham, MA, IEEE, September 2016.
(480.29 KB)
“

Sunway TaihuLight Supercomputer Makes Its Appearance,”
National Science Review, vol. 3, issue 3, pp. 256-266, September 2016.
DOI: 10.1093/nsr/nww044
(292.11 KB)
“

Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs
, Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster, September 2016.
(4.29 MB)

2016 Dense Linear Algebra Software Packages Survey,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-744 / LAWN 290: University of Tennessee, September 2016.
(366.43 KB)
“

Domain Overlap for Iterative Sparse Triangular Solves on GPUs,”
Software for Exascale Computing - SPPEXA, vol. 113: Springer International Publishing, pp. 527–545, September 2016.
DOI: 10.1007/978-3-319-40528-5_24
“
LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi,”
IEEE High Performance Extreme Computing Conference (HPEC'16), Waltham, MA, IEEE, September 2016.
(943.23 KB)
“

MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-16-02: University of Tennessee, August 2016.
(929.79 KB)
“

Assessing General-purpose Algorithms to Cope with Fail-stop and Silent Errors,”
ACM Transactions on Parallel Computing, August 2016.
DOI: 10.1145/2897189
(573.71 KB)
“

High-performance Matrix-matrix Multiplications of Very Small Matrices,”
22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016.
“
The HPL Benchmark: Past, Present & Future
, ISC High Performance, Frankfurt, Germany, July 2016.
(3.41 MB)

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(626.21 KB)
“

Porting the PLASMA Numerical Library to the OpenMP Standard,”
International Journal of Parallel Programming, June 2016.
DOI: 10.1007/s10766-016-0441-6
(1.66 MB)
“

High-Performance Tensor Contractions for GPUs,”
International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016.
(2.36 MB)
“

Report on the Sunway TaihuLight System,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-742: University of Tennessee, June 2016.
“
GPU-Aware Non-contiguous Data Movement In Open MPI,”
25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16), Kyoto, Japan, ACM, June 2016.
DOI: 10.1145/2907294.2907317
(482.32 KB)
“

Performance, Design, and Autotuning of Batched GEMM for GPUs,”
The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany, June 2016.
(1.27 MB)
“

Efficiency of General Krylov Methods on GPUs – An Experimental Study,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), Chicago, IL, IEEE, May 2016.
DOI: 10.1109/IPDPSW.2016.45
(285.28 KB)
“

Search Space Generation and Pruning System for Autotuners,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(555.44 KB)
“

Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs,”
Concurrency and Computation: Practice and Experience, vol. 28, issue 12, pp. 3447 - 3465, May 2016.
DOI: 10.1002/cpe.3874
(3.21 MB)
“

Optimal Resilience Patterns to Cope with Fail-stop and Silent Errors,”
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
DOI: 10.1109/IPDPS.2016.39
(603.58 KB)
“

Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures,”
30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016.
(535.72 KB)
“

Efficiency of General Krylov Methods on GPUs – An Experimental Study,”
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 683-691, May 2016.
DOI: 10.1109/IPDPSW.2016.45
“
Performance Analysis and Modeling of Task-Based Runtimes,”
PhD Dissertation, Department of Electrical Engineering and Computer Science, Knoxville, University of Tennessee, May 2016.
(5.14 MB)
“

On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,”
The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(708.62 KB)
“

Linear Algebra Software for Large-Scale Accelerated Multicore Computing,”
Acta Numerica, vol. 25, pp. 1-160, May 2016.
DOI: 10.1017/S0962492916000015
“
Heterogeneous Streaming,”
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016.
(2.73 MB)
“

Cholesky Factorization on Batches of Matrices with Fixed and Variable Sizes
, San Jose, CA, GPU Technology Conference (GTC16), Poster, April 2016.
(480.51 KB)

A Standard for Batched BLAS Routines
, Paris, France, 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16), April 2016.
(1.93 MB)

Updating Incomplete Factorization Preconditioners for Model Order Reduction,”
Numerical Algorithms, vol. 73, issue 3, pp. 611–630, February 2016.
DOI: 10.1007/s11075-016-0110-2
(565.34 KB)
“

Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results,”
Parallel Computing, vol. 52, pp. 22-41, February 2016.
DOI: 10.1016/j.parco.2015.09.005
(2.06 MB)
“

High Performance Conjugate Gradient Benchmark: A new Metric for Ranking High Performance Computing Systems,”
International Journal of High Performance Computing Applications, vol. 30, issue 1, pp. 3 - 10, February 2016.
DOI: 10.1177/1094342015593158
(277.51 KB)
“

Performance, Design, and Autotuning of Batched GEMM for GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-739: University of Tennessee, February 2016.
(1.27 MB)
“

A New Metric for Ranking High-Performance Computing Systems,”
National Science Review, vol. 3, issue 1, pp. 30-35, January 2016.
DOI: 10.1093/nsr/nwv084
(393.55 KB)
“

Context Identifier Allocation in Open MPI,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-16-01: Innovative Computing Laboratory, University of Tennessee, January 2016.
(490.89 KB)
“

High-Performance Tensor Contractions for GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-16-738: University of Tennessee, January 2016.
(2.36 MB)
“

Dense Symmetric Indefinite Factorization on GPU Accelerated Architectures,”
Lecture Notes in Computer Science, vol. 9573: Springer International Publishing, pp. 86-95, September 2015, 2016.
DOI: 10.1007/978-3-319-32149-3_9
(327.14 KB)
“

Surviving Errors with OpenSHMEM,”
OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments, Baltimore, MD, USA, Springer International Publishing, pp. 66–81, 2016.
“
Power Management and Event Verification in PAPI,”
Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany, Springer International Publishing, pp. 41-51, 2016.
DOI: 10.1007/978-3-319-39589-0_4
(565.14 KB)
“

Scheduling Computational Workflows on Failure-prone Platforms,”
International Journal of Networking and Computing, vol. 6, no. 1, pp. 2-26, 2016.
(503.81 KB)
“

Performance, Design, and Autotuning of Batched GEMM for GPUs,”
High Performance Computing: 31st International Conference, ISC High Performance 2016, Frankfurt, Germany, June 19-23, 2016, Proceedings, no. 9697: Springer International Publishing, pp. 21–38, 2016.
DOI: 10.1007/978-3-319-41321-1_2
(1.98 MB)
“

2015
UCX: An Open Source Framework for HPC Network APIs and Beyond,”
2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, Santa Clara, CA, USA, IEEE, pp. 40-43, 2015.
DOI: 10.1109/HOTI.2015.13
“
Experiences in autotuning matrix multiplication for energy minimization on GPUs,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 17, pp. 5096-5113, December 2015.
DOI: 10.1002/cpe.3516
(1.98 MB)
“

Mixed-precision Block Gram Schmidt Orthogonalization,”
6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Austin, TX, ACM, November 2015.
(235.69 KB)
“

The TOP500 List and Progress in High-Performance Computing,”
IEEE Computer, vol. 48, issue 11, pp. 42-49, November 2015.
DOI: 10.1109/MC.2015.338
“
Randomized Algorithms to Update Partial Singular Value Decomposition on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“
Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“
GPU-accelerated Co-design of Induced Dimension Reduction: Algorithmic Fusion and Kernel Overlap,”
2nd International Workshop on Hardware-Software Co-Design for High Performance Computing, Austin, TX, ACM, November 2015.
(1.46 MB)
“

Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(550.96 KB)
“

Adaptive Precision Solvers for Sparse Linear Systems,”
3rd International Workshop on Energy Efficient Supercomputing (E2SC '15), Austin, TX, ACM, November 2015.
“
Mixing LU-QR Factorization Algorithms to Design High-Performance Dense Linear Algebra Solvers,”
Journal of Parallel and Distributed Computing, vol. 85, pp. 32-46, November 2015.
DOI: 10.1016/j.jpdc.2015.06.007
(5.06 MB)
“

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs,”
IEEE Transactions on Parallel and Distributed Systems, no. 1045-9219, November 2015.
“
Tuning Stationary Iterative Solvers for Fault Resilience,”
6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA15), Austin, TX, ACM, November 2015.
(1.28 MB)
“

Weighted Dynamic Scheduling with Many Parallelism Grains for Offloading of Numerical Workloads to Multiple Varied Accelerators,”
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA'15), vol. No. 5, Austin, TX, ACM, November 2015.
(347.6 KB)
“

Visualizing Execution Traces with Task Dependencies,”
2nd Workshop on Visual Performance Analysis (VPA '15), Austin, TX, ACM, November 2015.
(927.5 KB)
“

Accelerating Collaborative Filtering for Implicit Feedback Datasets using GPUs,”
2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, IEEE, November 2015.
(1.02 MB)
“

Efficient Implementation Of Quantum Materials Simulations On Distributed CPU-GPU Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
(1.09 MB)
“

Comparing Hybrid CPU-GPU and Native GPU-only Acceleration for Linear Algebra,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(4.7 MB)
“

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems,”
Supercomputing Frontiers and Innovations, vol. 2, no. 4, October 2015.
DOI: 10.14529/jsfi1504
(3.68 MB)
“

Batched Matrix Computations on Hardware Accelerators Based on GPUs,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(9.36 MB)
“

Efficient Eigensolver Algorithms on Accelerator Based Architectures,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(6.98 MB)
“

Random-Order Alternating Schwarz for Sparse Triangular Solves,”
2015 SIAM Conference on Applied Linear Algebra (SIAM LA), Atlanta, GA, SIAM, October 2015.
(1.53 MB)
“

Mixed-precision orthogonalization process Performance on multicore CPUs with GPUs,”
2015 SIAM Conference on Applied Linear Algebra, Atlanta, GA, SIAM, October 2015.
(301.01 KB)
“

On the Design, Autotuning, and Optimization of GPU Kernels for Kinetic Network Simulations Using Fast Explicit Integration and GPU Batched Computation
, Oak Ridge, TN, Joint Institute for Computational Sciences Seminar Series, Presentation, September 2015.
(17.25 MB)

Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery,”
22nd European MPI Users' Group Meeting, Bordeaux, France, ACM, September 2015.
DOI: 10.1145/2802658.2802668
(543.32 KB)
“

A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 14, pp. 3702-3723, September 2015.
DOI: 10.1002/cpe.3403
(8.16 MB)
“

Batched Matrix Computations on Hardware Accelerators,”
EuroMPI/Asia 2015 Workshop, Bordeaux, France, September 2015.
(589.05 KB)
“

MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing,”
2015 IEEE High Performance Extreme Computing Conference (HPEC ’15), (Best Paper Award), Waltham, MA, IEEE, September 2015.
(678.86 KB)
“

Accelerating NWChem Coupled Cluster through dataflow-based Execution,”
11th International Conference on Parallel Processing and Applied Mathematics (PPAM 2015), Krakow, Poland, Springer International Publishing, September 2015.
(452.82 KB)
“

PaRSEC in Practice: Optimizing a Legacy Chemistry Application through Distributed Task-Based Execution,”
2015 IEEE International Conference on Cluster Computing, Chicago, IL, IEEE, September 2015.
(1.77 MB)
“

Towards a High-Performance Tensor Algebra Package for Accelerators
, Gatlinburg, TN, Smoky Mountains Computational Sciences and Engineering Conference (SMC15), September 2015.
(1.76 MB)

Iterative Sparse Triangular Solves for Preconditioning,”
EuroPar 2015, Vienna, Austria, Springer Berlin, August 2015.
DOI: 10.1007/978-3-662-48096-0_50
(322.36 KB)
“

Cholesky Across Accelerators,”
17th IEEE International Conference on High Performance Computing and Communications (HPCC 2015), Elizabeth, NJ, IEEE, August 2015.
“
Flexible Linear Algebra Development and Scheduling with Cholesky Factorization,”
17th IEEE International Conference on High Performance Computing and Communications, Newark, NJ, August 2015.
(494.31 KB)
“

Efficient Checkpoint/Verification Patterns,”
International Journal of High Performance Computing Applications, July 2015.
DOI: 10.1177/1094342015594531
(392.76 KB)
“

Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs,”
International Supercomputing Conference (ISC 2015), Frankfurt, Germany, July 2015.
“
Exascale Computing and Big Data,”
Communications of the ACM, vol. 58, no. 7: ACM, pp. 56-68, July 2015.
DOI: 10.1145/2699414
(7.3 MB)
“

Framework for Batched and GPU-resident Factorization Algorithms to Block Householder Transformations,”
ISC High Performance, Frankfurt, Germany, Springer, July 2015.
(778.26 KB)
“

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors,”
ISC High Performance 2015, Frankfurt, Germany, July 2015.
(1.49 MB)
“

Performance Analysis and Optimization of Two-Sided Factorization Algorithms for Heterogeneous Platform,”
International Conference on Computational Science (ICCS 2015), Reykjavík, Iceland, June 2015.
(1.12 MB)
“

Linear Algebra Software for High-Performance Computing (Part 2: Software for Hardware Accelerators and Coprocessors)
, Frankfurt, Germany, ISC High Performance (ISC18), Tutorial Presentation, June 2015.
(15.41 MB)

MAGMA MIC: Optimizing Linear Algebra for Intel Xeon Phi
, Frankfurt, Germany, ISC High Performance (ISC15), Intel Booth Presentation, June 2015.
(2.03 MB)

Mixed-Precision Cholesky QR Factorization and its Case Studies on Multicore CPU with Multiple GPUs,”
SIAM Journal on Scientific Computing, vol. 37, no. 3, pp. C203-C330, May 2015.
DOI: 10.1137/14M0973773
(374.8 KB)
“

Fault Tolerance Techniques for High-performance Computing,”
University of Tennessee Computer Science Technical Report (also LAWN 289), no. UT-EECS-15-734: University of Tennessee, May 2015.
“
A Data Flow Divide and Conquer Algorithm for Multicore Architecture,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(535.44 KB)
“

Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(1.11 MB)
“

Design for a Soft Error Resilient Dynamic Task-based Runtime,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
(2.31 MB)
“

Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems: Formal Proof,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-15-01, April 2015.
(570.97 KB)
“

Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product,”
Spring Simulation Multi-Conference 2015 (SpringSim'15), Alexandria, VA, SCS, April 2015.
(1.46 MB)
“

Performance Analysis and Design of a Hessenberg Reduction using Stabilized Blocked Elementary Transformations for New Architectures,”
The Spring Simulation Multi-Conference 2015 (SpringSim'15), Best Paper Award, Alexandria, VA, April 2015.
(608.44 KB)
“

A Survey of Recent Developments in Parallel Implementations of Gaussian Elimination,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 5, pp. 1292-1309, April 2015.
DOI: 10.1002/cpe.3306
(783.45 KB)
“

Towards Batched Linear Solvers on Accelerated Hardware Platforms,”
8th Workshop on General Purpose Processing Using GPUs (GPGPU 8) co-located with PPOPP 2015, San Francisco, CA, ACM, February 2015.
(403.74 KB)
“

Batched matrix computations on hardware accelerators based on GPUs,”
International Journal of High Performance Computing Applications, February 2015.
DOI: 10.1177/1094342014567546
(2.16 MB)
“

Optimization for Performance and Energy for Batched Matrix Computations on GPUs,”
8th Workshop on General Purpose Processing Using GPUs (GPGPU 8), San Francisco, CA, ACM, February 2015.
DOI: 10.1145/2716282.2716288
(699.5 KB)
“

Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers,”
Sixth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '15), San Francisco, CA, ACM, February 2015.
DOI: 10.1145/2712386.2712387
(2.29 MB)
“

HPCG Benchmark: a New Metric for Ranking High Performance Computing Systems,”
University of Tennessee Computer Science Technical Report, no. ut-eecs-15-736: University of Tennessee, January 2015.
“
Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy,”
ACM Transactions on Parallel Computing, vol. 1, issue 2, no. 10, pp. 10:1-10:28, January 2015.
DOI: 10.1145/2686892
(1.14 MB)
“

Scheduling for fault-tolerance: an introduction,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-15-02: University of Tennessee, January 2015.
(416.37 KB)
“

Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing,”
International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, January 2015.
(755.54 KB)
“

HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
Scientific Programming, vol. 23, issue 1, January 2015.
DOI: 10.3233/SPR-140404
(553.94 KB)
“

Experiences in Autotuning Matrix Multiplication for Energy Minimization on GPUs,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 17, pp. 5096 - 5113, Oct 12, 2015.
DOI: 10.1002/cpe.3516
(1.99 MB)
“

Acceleration of GPU-based Krylov solvers via Data Transfer Reduction,”
International Journal of High Performance Computing Applications, 2015.
“
High-Performance Conjugate-Gradient Benchmark: A New Metric for Ranking High-Performance Computing Systems,”
The International Journal of High Performance Computing Applications, 2015.
DOI: 10.1177/1094342015593158
(336.19 KB)
“

From MPI to OpenSHMEM: Porting LAMMPS,”
OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, Annapolis, MD, USA, Springer International Publishing, pp. 121–137, 2015.
DOI: 10.1007/978-3-319-26428-8_8
“
Computing Low-rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and its Application to Solving a Hierarchically Semiseparable Linear System of Equations,”
Scientific Programming, 2015.
(648.87 KB)
“

High-Performance Computing,”
The Princeton Companion to Applied Mathematics, Princeton, New Jersey, Princeton University Press, pp. 839-842, 2015.
“
2014
Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads Across Accelerators, Coprocessors, and Multicore Processors,”
5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '14), New Orleans, LA, IEEE, November 2014.
DOI: 10.1109/ScalA.2014.8
(407.5 KB)
“

PTG: An Abstraction for Unhindered Parallelism,”
International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), New Orleans, LA, IEEE Press, November 2014.
(480.05 KB)
“

PULSAR Users’ Guide, Parallel Ultra-Light Systolic Array Runtime,”
University of Tennessee EECS Technical Report, no. UT-EECS-14-733: University of Tennessee, November 2014.
(561.56 KB)
“

Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES,”
5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, New Orleans, LA, November 2014.
(465.52 KB)
“

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 14), New Orleans, LA, IEEE, November 2014.
“
Design for a Soft Error Resilient Dynamic Task-based Runtime,”
ICL Technical Report, no. ICL-UT-14-04: University of Tennessee, November 2014.
(2.61 MB)
“

Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-14-731: University of Tennessee, October 2014.
(1.83 MB)
“

Access-averse Framework for Computing Low-rank Matrix Approximations,”
First International Workshop on High Performance Big Graph Data Management, Analysis, and Mining, Washington, DC, October 2014.
“
A Multithreaded Communication Substrate for OpenSHMEM,”
8th International Conference on Partitioned Global Address Space Programming Models (PGAS), Eugene, OR, October 2014.
(261.66 KB)
“

A Fast Batched Cholesky Factorization on a GPU,”
International Conference on Parallel Processing (ICPP-2014), Minneapolis, MN, September 2014.
(1.37 MB)
“

Search Space Pruning Constraints Visualization,”
VISSOFT'14: 2nd IEEE Working Conference on Software Visualization, Victoria, BC, Canada, IEEE, September 2014.
(1.32 MB)
“

Power Monitoring with PAPI for Extreme Scale Architectures and Dataflow-based Programming Models,”
2014 IEEE International Conference on Cluster Computing, no. ICL-UT-14-04, Madrid, Spain, IEEE, September 2014.
DOI: 10.1109/CLUSTER.2014.6968672
(3.45 MB)
“

Utilizing Dataflow-based Execution for Coupled Cluster Methods,”
2014 IEEE International Conference on Cluster Computing, no. ICL-UT-14-02, Madrid, Spain, IEEE, September 2014.
(260.23 KB)
“

Unveiling the Performance-energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors,”
Concurrency and Computation: Practice and Experience, vol. 27, issue 4, pp. 885-904, September 2014.
DOI: 10.1002/cpe.3341
(1.83 MB)
“

Computing Least Squares Condition Numbers on Hybrid Multicore/GPU Systems,”
International Interdisciplinary Conference on Applied Mathematics, Modeling and Computational Science (AMMCS), Waterloo, Ontario, CA, August 2014.
(130.18 KB)
“

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU,”
16th IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
(684.73 KB)
“

Assembly Operations for Multicore Architectures using Task-Based Runtime Systems,”
Euro-Par 2014, Porto, Portugal, Springer International Publishing, August 2014.
(481.52 KB)
“

Task-Based Programming for Seismic Imaging: Preliminary Results,”
2014 IEEE International Conference on High Performance Computing and Communications (HPCC), Paris, France, IEEE, August 2014.
(625.86 KB)
“

Communication-Avoiding Symmetric-Indefinite Factorization,”
SIAM Journal on Matrix Analysis and Applications, vol. 35, issue 4, pp. 1364-1406, July 2014.
(593.18 KB)
“

Looking Back at Dense Linear Algebra Software,”
Journal of Parallel and Distributed Computing, vol. 74, issue 7, pp. 2548–2560, July 2014.
DOI: 10.1016/j.jpdc.2013.10.005
(1.79 MB)
“

An Efficient Distributed Randomized Algorithm for Solving Large Dense Symmetric Indefinite Linear Systems,”
Parallel Computing, vol. 40, issue 7, pp. 213-223, July 2014.
DOI: 10.1016/j.parco.2013.12.003
(1.42 MB)
“

Improving the Energy Efficiency of Sparse Linear System Solvers on Multicore and Manycore Systems,”
Philosophical Transactions of the Royal Society A -- Mathematical, Physical and Engineering Sciences, vol. 372, issue 2018, July 2014.
DOI: 10.1098/rsta.2013.0279
(779.57 KB)
“

Heterogeneous Acceleration for Linear Algebra in Multi-Coprocessor Environments,”
VECPAR 2014, Eugene, OR, June 2014.
(276.52 KB)
“

Self-Adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures,”
VECPAR 2014, Eugene, OR, June 2014.
(430.56 KB)
“

Mixed-precision orthogonalization scheme and adaptive step size for CA-GMRES on GPUs,”
VECPAR 2014 (Best Paper), Eugene, OR, June 2014.
(438.54 KB)
“

Performance of Various Computers Using Standard Linear Equations Software, (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, no. CS-89-85: University of Tennessee, June 2014.
(514.64 KB)
“

Scaling Up Matrix Computations on Shared-Memory Manycore Systems with 1000 CPU Cores,”
International Conference on Supercomputing, Munich, Germany, ACM, pp. 333-342, June 2014.
DOI: 10.1145/2597652.2597670
(2.9 MB)
“

Accelerating Eigenvector Computation in the Nonsymmetric Eigenvalue Problem,”
VECPAR 2014, Eugene, OR, June 2014.
(199.44 KB)
“

Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, May 2014.
(490.08 KB)
“

New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem,”
Workshop on Parallel and Distributed Scientific and Engineering Computing, IPDPS 2014 (Best Paper), Phoenix, AZ, IEEE, May 2014.
DOI: 10.1109/IPDPSW.2014.130
(2.33 MB)
“

Efficient checkpoint/verification patterns for silent error detection,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-14-03: University of Tennessee, May 2014.
(397.75 KB)
“

Optimizing Krylov Subspace Solvers on Graphics Processing Units,”
Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(536.32 KB)
“

Improving the performance of CA-GMRES on multicores with multiple GPUs,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(333.82 KB)
“

Designing LU-QR Hybrid Solvers for Performance and Stability,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
DOI: 10.1109/IPDPS.2014.108
(4.2 MB)
“

A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,”
International Journal of High Performance Computing Applications, vol. 28, issue 2, pp. 196-209, May 2014.
DOI: 10.1177/1094342013502097
(1.74 MB)
“

Hybrid Multi-Elimination ILU Preconditioners on GPUs,”
International Heterogeneity in Computing Workshop (HCW), IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.67 MB)
“

Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime,”
Workshop on Large-Scale Parallel Processing, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(398.16 KB)
“

clMAGMA: High Performance Dense Linear Algebra with OpenCL,”
International Workshop on OpenCL, Bristol University, England, May 2014.
(460.91 KB)
“

A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.01 MB)
“

Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes,”
23rd International Heterogeneity in Computing Workshop, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(807.33 KB)
“

Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting,”
Concurrency and Computation: Practice and Experience, vol. 26, issue 7, pp. 1408-1431, May 2014.
DOI: 10.1002/cpe.3110
(1.96 MB)
“

Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment,”
IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.51 MB)
“

Assessing the Impact of ABFT and Checkpoint Composite Strategies,”
16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
(1.02 MB)
“

Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ formats on NVIDIA GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-EECS-14-727: University of Tennessee, April 2014.
(578.11 KB)
“

MIAMI: A Framework for Application Performance Diagnosis,”
ISPASS-2014, Monterey, CA, IEEE, March 2014.
DOI: 10.1109/ISPASS.2014.6844480
(1010.75 KB)
“

Performance Analysis of the MPAS-Ocean Code using HPCToolkit and MIAMI,”
ICL Technical Report, no. ICL-UT-14-01: University of Tennessee, February 2014.
(894.39 KB)
“

Analyzing PAPI Performance on Virtual Machines,”
VMWare Technical Journal, vol. Winter 2013, January 2014.
“
Performance and Reliability Trade-offs for the Double Checkpointing Algorithm,”
International Journal of Networking and Computing, vol. 4, no. 1, pp. 32-41.
(859.04 KB)
“

Model-Driven One-Sided Factorizations on Multicore, Accelerated Systems,”
Supercomputing Frontiers and Innovations, vol. 1, issue 1, 2014.
DOI: 10.14529/jsfi1401
(1.86 MB)
“

Accelerating Numerical Dense Linear Algebra Calculations with GPUs,”
Numerical Computations with GPUs: Springer International Publishing, pp. 3-28, 2014.
DOI: 10.1007/978-3-319-06548-9_1
(1.06 MB)
“

2013
A Block-Asynchronous Relaxation Method for Graphics Processing Units,”
Journal of Parallel and Distributed Computing, vol. 73, issue 12, pp. 1613–1626, December 2013.
DOI: 10.1016/j.jpdc.2013.05.008
(1.08 MB)
“

An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
DOI: 10.1007/s00607-013-0331-3
(311.23 KB)
“

PaRSEC: Exploiting Heterogeneity to Enhance Scalability,”
IEEE Computing in Science and Engineering, vol. 15, issue 6, pp. 36-45, November 2013.
DOI: 10.1109/MCSE.2013.98
(2.16 MB)
“

An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
Supercomputing 2013, Denver, CO, November 2013.
“
Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
Concurrency and Computation: Practice and Experience, November 2013.
DOI: 10.1002/cpe.3173
(894.61 KB)
“

Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance,”
International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013, Denver, CO, November 2013.
(147.09 KB)
“

CPU-GPU Hybrid Bidiagonal Reduction With Soft Error Resilience,”
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Montpellier, France, November 2013.
(238.58 KB)
“

Soft Error Resilient QR Factorization for Hybrid System with GPGPU,”
Journal of Computational Science, vol. 4, issue 6, pp. 457–464, November 2013.
DOI: 10.1016/j.jocs.2013.01.004
(995.45 KB)
“

Optimal Checkpointing Period: Time vs. Energy,”
University of Tennessee Computer Science Technical Report (also LAWN 281), no. ut-eecs-13-718: University of Tennessee, October 2013.
(440.13 KB)
“

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems,”
Concurrency and Computation: Practice and Experience, October 2013.
(1.71 MB)
“

An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware,”
University of Tennessee Computer Science Technical Report (also LAWN 283), no. ut-eecs-13-720: University of Tennessee, October 2013.
(1.23 MB)
“

Designing LU-QR hybrid solvers for performance and stability,”
University of Tennessee Computer Science Technical Report (also LAWN 282), no. ut-eecs-13-719: University of Tennessee, October 2013.
(4.11 MB)
“

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi,”
PPAM 2013, Warsaw, Poland, September 2013.
(284.97 KB)
“

Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures,”
7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Berlin, Germany, September 2013.
(102.51 KB)
“

Standards for Graph Algorithm Primitives,”
17th IEEE High Performance Extreme Computing Conference (HPEC '13), Waltham, MA, IEEE, September 2013.
DOI: 10.1109/HPEC.2013.6670338
(108.86 KB)
“

Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization,”
Euro-Par 2013, Aachen, Germany, Springer, August 2013.
(431.84 KB)
“

LU Factorization with Partial Pivoting for a Multicore System with Accelerators,”
IEEE Transactions on Parallel and Distributed Systems, vol. 24, issue 8, pp. 1613-1621, August 2013.
DOI: 10.1109/TPDS.2012.242
(1.08 MB)
“

Analyzing PAPI Performance on Virtual Machines,”
ICL Technical Report, no. ICL-UT-13-02, August 2013.
(437.37 KB)
“

Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
DOI: 10.1002/cpe.3100
(3.89 MB)
“

Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-714, July 2013.
(866.68 KB)
“

Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms,”
Journal of Parallel and Distributed Computing, vol. 73, issue 7, pp. 1000-1010, July 2013.
DOI: 10.1016/j.jpdc.2013.01.015
(1.4 MB)
“

Dynamically balanced synchronization-avoiding LU factorization with multicore and GPUs,”
University of Tennessee Computer Science Technical Report, no. ut-cs-13-713, July 2013.
(659.77 KB)
“

Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
DOI: 10.1145/2464996.2465438
(1.27 MB)
“

Enabling Workflows in GridSolve: Request Sequencing and Service Trading,”
Journal of Supercomputing, vol. 64, issue 3, pp. 1133-1152, June 2013.
DOI: 10.1007/s11227-010-0549-1
(821.29 KB)
“

Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q,”
International Supercomputing Conference 2013 (ISC'13), Leipzig, Germany, Springer, June 2013.
(624.58 KB)
“

Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures,”
UT-CS-13-712: University of Tennessee Computer Science Technical Report, June 2013.
(206.42 KB)
“

On the Combination of Silent Error Detection and Checkpointing,”
UT-CS-13-710: University of Tennessee Computer Science Technical Report, June 2013.
(1.29 MB)
“

A Parallel Solver for Incompressible Fluid Flows,”
International Conference on Computational Science (ICCS 2013), Barcelona, Spain, Elsevier B.V., June 2013.
DOI: 10.1016/j.procs.2013.05.207
(588.79 KB)
“

Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations,”
International Supercomputing Conference (ISC), Lecture Notes in Computer Science, vol. 7905, Leipzig, Germany, Springer Berlin Heidelberg, pp. 67-80, June 2013.
DOI: 10.1007/978-3-642-38750-0_6
(2.14 MB)
“

Toward a New Metric for Ranking High Performance Computing Systems,”
SAND2013 - 4744, June 2013.
(225.32 KB)
“

Diagnosis and Optimization of Application Prefetching Performance,”
Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013.
DOI: 10.1145/2464996.2465014
(827.31 KB)
“

Virtual Systolic Array for QR Decomposition,”
15th Workshop on Advances in Parallel and Distributed Computational Models, IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013), Boston, MA, IEEE, May 2013.
DOI: 10.1109/IPDPS.2013.119
(749.84 KB)
“

Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster,”
The Third International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2013.
“
Hierarchical QR Factorization Algorithms for Multi-core Cluster Systems,”
Parallel Computing, vol. 39, issue 4-5, pp. 212-232, May 2013.
(1.43 MB)
“

Revisiting the Double Checkpointing Algorithm,”
15th Workshop on Advances in Parallel and Distributed Computational Models, at the IEEE International Parallel & Distributed Processing Symposium, Boston, MA, May 2013.
(591.1 KB)
“

Implementing a systolic algorithm for QR factorization on multicore clusters with PaRSEC,”
Lawn 277, no. UT-CS-13-709, May 2013.
(298.63 KB)
“

PAPI 5: Measuring Power, Energy, and the Cloud
, Austin, TX, 2013 IEEE International Symposium on Performance Analysis of Systems and Software, April 2013.
(78.39 KB)

Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations,”
2013 IEEE International Symposium on Performance Analysis of Systems and Software, Austin, TX, IEEE, April 2013.
(307.24 KB)
“

clMAGMA: High Performance Dense Linear Algebra with OpenCL,”
University of Tennessee Technical Report (Lawn 275), no. UT-CS-13-706: University of Tennessee, March 2013.
(526.6 KB)
“

BlackjackBench: Portable Hardware Characterization with Automated Results Analysis,”
The Computer Journal, March 2013.
DOI: 10.1093/comjnl/bxt057
(408.45 KB)
“

Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Concurrency and Computation: Practice and Experience, vol. 25, issue 4, pp. 572-585, March 2013.
DOI: 10.1002/cpe.2859
(636.68 KB)
“

Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,”
Scalable Computing and Communications: Theory and Practice: John Wiley & Sons, pp. 699-735, March 2013.
(1.01 MB)
“

Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-01, February 2013.
(497.64 KB)
“

Performance of Various Computers Using Standard Linear Equations Software,”
University of Tennessee Computer Science Technical Report, no. cs-89-85, February 2013.
(539.24 KB)
“

Level-3 Cholesky Factorization Routines Improve Performance of Many Cholesky Algorithms,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 2, February 2013.
DOI: 10.1145/2427023.2427026
(439.46 KB)
“

Accelerating Linear System Solutions Using Randomization Techniques,”
ACM Transactions on Mathematical Software (also LAWN 246), vol. 39, issue 2, February 2013.
DOI: 10.1145/2427023.2427025
(358.79 KB)
“

Revisiting the Double Checkpointing Algorithm,”
University of Tennessee Computer Science Technical Report (LAWN 274), no. ut-cs-13-705, January 2013.
(682.22 KB)
“

Post-failure recovery of MPI communication capability: Design and rationale,”
International Journal of High Performance Computing Applications, vol. 27, issue 3, pp. 244 - 254, January 2013.
DOI: 10.1177/1094342013488238
(285.77 KB)
“

Implementing a Blocked Aasen’s Algorithm with a Dynamic Scheduler on Multicore Architectures,”
IPDPS 2013 (submitted), Boston, MA, 2013.
(1.22 MB)
“

Multithreading in the PLASMA Library,”
Multi and Many-Core Processing: Architecture, Programming, Algorithms, & Applications: Taylor & Francis, 2013.
(536.28 KB)
“

High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
ACM Transactions on Mathematical Software (TOMS), vol. 39, issue 3, no. 16, 2013.
DOI: 10.1145/2450153.2450154
(665.7 KB)
“

Assessing the impact of ABFT and Checkpoint composite strategies,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-13-03, 2013.
(968.47 KB)
“

HPC Challenge: Design, History, and Implementation Highlights,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
(790.01 KB)
“

Scalable Dense Linear Algebra on Heterogeneous Hardware,”
HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.
(760.32 KB)
“

LAPACK,”
Handbook of Linear Algebra, Second, Boca Raton, FL, CRC Press, 2013.
(223.21 KB)
“

Keeneland: Computational Science Using Heterogeneous GPU Computing,”
Contemporary High Performance Computing: From Petascale Toward Exascale, Boca Raton, FL, Taylor and Francis, 2013.
(2.7 MB)
“

2012
On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,”
University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013, 2012.
(358.98 KB)
“

Autotuning GEMM Kernels for the Fermi GPU,”
IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, November 2012.
DOI: 10.1109/TPDS.2011.311
(742.5 KB)
“

Matrices Over Runtime Systems at Exascale,”
Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“
A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks,”
Supercomputing '12 (poster), Salt Lake City, Utah, November 2012.
“
MAGMA: A New Generation of Linear Algebra Library for GPU and Multicore Architectures
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), Presentation, November 2012.
(4.69 MB)

MAGMA MIC: Linear Algebra Library for Intel Xeon Phi Coprocessors
, Salt Lake City, UT, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC12), November 2012.
(6.4 MB)

Energy Footprint of Advanced Dense Numerical Linear Algebra using Tile Algorithms on Multicore Architecture,”
The 2nd International Conference on Cloud and Green Computing (submitted), Xiangtan, Hunan, China, November 2012.
(329.5 KB)
“

Acceleration of the BLAST Hydro Code on GPU,”
Supercomputing '12 (poster), Salt Lake City, Utah, SC12, November 2012.
“
Performance evaluation of LU factorization through hardware counter measurements,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-700, October 2012.
(794.82 KB)
“

Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures using Tree Reduction,”
Lecture Notes in Computer Science, vol. 7203, pp. 661-670, September 2012.
(185.77 KB)
“

An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“
Anatomy of a Globally Recursive Embedded LINPACK Benchmark,”
2012 IEEE High Performance Extreme Computing Conference, Waltham, MA, pp. 1-6, September 2012.
DOI: 10.1109/HPEC.2012.6408679
(204.74 KB)
“

Measuring Energy and Power with PAPI,”
International Workshop on Power-Aware Systems and Architectures, Pittsburgh, PA, September 2012.
DOI: 10.1109/ICPPW.2012.39
(146.79 KB)
“

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems,”
Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September 2012.
(290.27 KB)
“

Providing GPU Capability to LU and QR within the ScaLAPACK Framework,”
University of Tennessee Computer Science Technical Report (also LAWN 272), no. UT-CS-12-699, September 2012.
(7.48 MB)
“

PAPI-V: Performance Monitoring for Virtual Machines,”
CloudTech-HPC 2012, Pittsburgh, PA, September 2012.
DOI: 10.1109/ICPPW.2012.29
(2.69 MB)
“

From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming,”
Parallel Computing, vol. 38, no. 8, pp. 391-407, August 2012.
(1.64 MB)
“

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems,”
Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (Best Paper), Rhodes Island, Greece, August 2012.
(764.02 KB)
“

User Level Failure Mitigation in MPI,”
Euro-Par 2012: Parallel Processing Workshops, vol. 7640, Rhodes Island, Greece, Springer Berlin Heidelberg, pp. 499-504, August 2012.
(136.15 KB)
“

GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,”
EuroPar 2012 (also LAWN 260), Rhodes Island, Greece, August 2012.
(662.98 KB)
“

A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
(289.32 KB)
“

From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
(203.08 KB)
“

Power Aware Computing on GPUs,”
SAAHPC '12 (Best Paper Award), Argonne, IL, July 2012.
(658.06 KB)
“

How LAPACK library enables Microsoft Visual Studio support with CMake and LAPACKE,”
University of Tennessee Computer Science Technical Report (also LAWN 270), no. UT-CS-12-698, July 2012.
(501.53 KB)
“

An efficient distributed randomized solver with application to large dense linear systems,”
ICL Technical Report, no. ICL-UT-12-02, July 2012.
(626.26 KB)
“

Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices,”
SIAM Journal on Scientific Computing (Accepted), July 2012.
“
Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators,”
VECPAR 2012, Kobe, Japan, July 2012.
(737.28 KB)
“

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors,”
ICCS 2012, Omaha, NE, June 2012.
(1.27 MB)
“

A Scalable Framework for Heterogeneous GPU-Based Clusters,”
The 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), Pittsburgh, PA, USA, ACM, June 2012.
(3.39 MB)
“

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems,”
ICCS 2012, Omaha, NE, June 2012.
(608.95 KB)
“

A Class of Communication-Avoiding Algorithms for Solving General Dense Linear Systems on CPU/GPU Parallel Machines,”
Proc. of the International Conference on Computational Science (ICCS), vol. 9, pp. 17-26, June 2012.
“
Unified Model for Assessing Checkpointing Protocols at Extreme-Scale,”
University of Tennessee Computer Science Technical Report (also LAWN 269), no. UT-CS-12-697, June 2012.
(2.76 MB)
“

One-Sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators,”
The International Conference on Computational Science (ICCS), June 2012.
“
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems,”
26th ACM International Conference on Supercomputing (ICS 2012), San Servolo Island, Venice, Italy, ACM, June 2012.
(5.88 MB)
“

Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
IPDPS 2012, the 26th IEEE International Parallel and Distributed Processing Symposium, Shanghai, China, IEEE Computer Society Press, May 2012.
(405.71 KB)
“

MAGMA: A Breakthrough in Solvers for Eigenvalue Problems
, San Jose, CA, GPU Technology Conference (GTC12), Presentation, May 2012.
(9.23 MB)

A Parallel Tiled Solver for Symmetric Indefinite Systems On Multicore Architectures,”
IPDPS 2012, Shanghai, China, May 2012.
(544.09 KB)
“

Enabling Application Resilience With and Without the MPI Standard,”
11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Ottawa, Canada, May 2012.
(262.93 KB)
“

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction,”
IPDPS 2012, Shanghai, China, May 2012.
(480.43 KB)
“

HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
(165.9 KB)
“

High Performance Computing Systems: Status and Outlook,”
Acta Numerica, vol. 21, Cambridge, UK, Cambridge University Press, pp. 379-474, May 2012.
(1.48 MB)
“

Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems,”
SIAM Journal on Scientific Computing, vol. 34(2), pp. C70-C82, April 2012.
“
Programming the LU Factorization for a Multicore System with Accelerators,”
Proceedings of VECPAR’12, Kobe, Japan, April 2012.
(414.33 KB)
“

Weighted Block-Asynchronous Relaxation for GPU-Accelerated Systems,”
SIAM Journal on Computing (submitted), March 2012.
(811.01 KB)
“

The Future of Computing: Software Libraries
, Savannah, GA, DOD CREATE Developers' Review, Keynote Presentation, February 2012.
(6.76 MB)

A Proposal for User-Level Failure Mitigation in the MPI-3 Standard,”
University of Tennessee Electrical Engineering and Computer Science Technical Report, no. ut-cs-12-693: University of Tennessee, February 2012.
(159.46 KB)
“

Algorithm-Based Fault Tolerance for Dense Matrix Factorization,”
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, ACM, pp. 225-234, February 2012.
DOI: 10.1145/2145816.2145845
(865.79 KB)
“

MAGMA Tutorial
, Atlanta, GA, Keeneland Workshop, February 2012.
(2.47 MB)

Dense Linear Algebra on Accelerated Multicore Hardware,”
High Performance Scientific Computing: Algorithms and Applications, London, UK, Springer-Verlag, 2012.
“
DAGuE: A generic distributed DAG Engine for High Performance Computing,”
Parallel Computing, vol. 38, no. 1-2: Elsevier, pp. 27-51, 2012.
(830.85 KB)
“

Preliminary Results of Autotuning GEMM Kernels for the NVIDIA Kepler Architecture,”
LAWN 267, 2012.
(1.14 MB)
“

Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,”
University of Tennessee Computer Science Technical Report, no. ut-cs-12-702, 2012.
(422.76 KB)
“

An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs,”
Applied Parallel and Scientific Computing, vol. 7133, pp. 248-257, 2012.
(623.5 KB)
“

HPC Challenge: Design, History, and Implementation Highlights,”
On the Road to Exascale Computing: Contemporary Architectures in High Performance Computing (to appear): Chapman & Hall/CRC Press, 2012.
(469.92 KB)
“

“Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012,”
Lecture Notes in Computer Science, vol. 7490, Vienna, Austria, 2012.
Reducing the Amount of Pivoting in Symmetric Indefinite Systems,”
Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (PPAM 2011), vol. 7203: Springer-Verlag Berlin Heidelberg, pp. 133-142, 2012.
(145.76 KB)
“

Looking Back at Dense Linear Algebra Software,”
Perspectives on Parallel and Distributed Processing: Looking Back and What's Ahead (to appear), 2012.
(235.91 KB)
“

“Parallel Processing and Applied Mathematics, 9th International Conference, PPAM 2011,”
Lecture Notes in Computer Science, vol. 7203, Torun, Poland, 2012.
Performance Counter Monitoring for the Blue Gene/Q Architecture,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-12-01, 2012.
(92.5 KB)
“

Dynamic Task Execution on Shared and Distributed Memory Architectures
, 2012.
(3.29 MB)

2011
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement,”
University of Tennessee Computer Science Technical Report UT-CS-11-690 (also Lawn 260), December 2011.
(662.98 KB)
“

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
, no. UT-CS-11-689, December 2011.
(608.95 KB)

LU Factorization for Accelerator-Based Systems,”
IEEE/ACS AICCSA 2011, Sharm-El-Sheikh, Egypt, December 2011.
(234.86 KB)
“

High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures,”
Proceedings of MTAGS11, Seattle, WA, November 2011.
(879.49 KB)
“

Soft Error Resilient QR Factorization for Hybrid System with GPGPU,”
Journal of Computational Science, Seattle, WA, Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems at SC11, November 2011.
(965.88 KB)
“

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs,”
ACM/IEEE Conference on Supercomputing (SC’11), Seattle, WA, November 2011.
(630.63 KB)
“

A Block-Asynchronous Relaxation Method for Graphics Processing Units,”
University of Tennessee Computer Science Technical Report, no. UT-CS-11-687 / LAWN 258, November 2011.
(1.08 MB)
“

Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels,”
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Seattle, WA, November 2011.
(636.01 KB)
“

A parallel tiled solver for dense symmetric indefinite systems on multicore architectures,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-11-07, October 2011.
(544.2 KB)
“

Hierarchical QR Factorization Algorithms for Multi-Core Cluster Systems,”
University of Tennessee Computer Science Technical Report (also Lawn 257), no. UT-CS-11-684, October 2011.
(405.71 KB)
“

On Scalability for MPI Runtime Systems,”
International Conference on Cluster Computing (CLUSTER), Austin, TX, USA, IEEE, pp. 187-195, September 2011.
(898.76 KB)
“

Power-aware Computing on GPGPUs
, Gatlinburg, TN, Fall Creek Falls Conference, Poster, September 2011.
(2.89 MB)

OMPIO: A Modular Software Architecture for MPI I/O,”
18th EuroMPI, Santorini, Greece, Springer, pp. 81-89, September 2011.
“
High Performance Dense Linear System Solver with Soft Error Resilience,”
IEEE Cluster 2011, Austin, TX, September 2011.
(1.27 MB)
“

Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW,”
18th EuroMPI, Santorini, Greece, Springer, pp. 247-254, September 2011.
“
Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure,”
Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, vol. 6960, Santorini, Greece, Springer, pp. 342-344, September 2011.
(115.75 KB)
“

Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs,”
International Conference on Parallel Processing (ICPP'11), Taipei, Taiwan, ACM, September 2011.
DOI: 10.1109/ICPP.2011.71
(1.41 MB)
“

Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization,”
University of Tennessee Computer Science Technical Report (also as a LAWN), no. ICL-UT-11-08, September 2011.
(618.53 KB)
“

Profiling High Performance Dense Linear Algebra Algorithms on Multicore Architectures for Power and Energy Efficiency,”
International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011), Hamburg, Germany, September 2011.
(1.27 MB)
“

Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs,”
Int'l Conference on Parallel Processing (ICPP '11), Taipei, Taiwan, September 2011.
“
An open-source tool-chain for performance analysis,”
Parallel Tools Workshop, Dresden, Germany, September 2011.
(622.1 KB)
“

Power-Aware Prediction Models of Hybrid (MPI/OpenMP) Scientific Applications,”
International Conference on Energy-Aware High Performance Computing (EnA-HPC 2011), Hamburg, Germany, September 2011.
(479.49 KB)
“

Evaluation of the HPC Challenge Benchmarks in Virtualized Environments,”
6th Workshop on Virtualization in High-Performance Cloud Computing, Bordeaux, France, August 2011.
(114.73 KB)
“

Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community,”
IEEE Computing in Science & Engineering, vol. 13, issue 5, pp. 90-95, August 2011.
DOI: 10.1109/MCSE.2011.83
(932.57 KB)
“

MAGMA - LAPACK for HPC on Heterogeneous Architectures
, Oak Ridge, TN, Titan Summit at Oak Ridge National Laboratory, Presentation, August 2011.
(20.43 MB)

Algorithm-based Fault Tolerance for Dense Matrix Factorizations,”
University of Tennessee Computer Science Technical Report, no. UT-CS-11-676, Knoxville, TN, August 2011.
(865.79 KB)
“

Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Proceedings of 17th International Conference, Euro-Par 2011, Part II, vol. 6853, Bordeaux, France, Springer, pp. 51-64, August 2011.
(486.68 KB)
“

Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels,”
University of Tennessee Computer Science Technical Report, UT-CS-11-677, (also Lawn 254), August 2011.
(636.01 KB)
“

Soft Error Resilient QR Factorization for Hybrid System,”
UT-CS-11-675 (also LAPACK Working Note #252), no. ICL-CS-11-675, July 2011.
(1.39 MB)
“

A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures,”
Symposium for Application Accelerators in High Performance Computing (SAAHPC'11), Knoxville, TN, July 2011.
(329.68 KB)
“

Autotuned Parallel I/O for Highly Scalable Biosequence Analysis,”
TeraGrid'11, Salt Lake City, Utah, July 2011.
(275.34 KB)
“

Accelerating Linear System Solutions Using Randomization Techniques,”
INRIA RR-7616 / LAWN #246 (presented at International AMMCS’11), Waterloo, Ontario, Canada, July 2011.
(358.79 KB)
“

Soft Error Resilient QR Factorization for Hybrid System,”
University of Tennessee Computer Science Technical Report, no. UT-CS-11-675, Knoxville, TN, July 2011.
(1.39 MB)
“

High-Performance High-Resolution Semi-Lagrangian Tracer Transport on a Sphere,”
Journal of Computational Physics, vol. 230, issue 17, pp. 6778-6799, July 2011.
DOI: 10.1016/j.jcp.2011.05.008
(1.68 MB)
“

Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-668, (also Lawn 250), June 2011.
(5.93 MB)
“

Performance Portability of a GPU Enabled Factorization with the DAGuE Framework,”
IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 2011.
(290.98 KB)
“

Two-stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1432-1441, May 2011.
(1.26 MB)
“

User-Defined Events for Hardware Performance Monitoring,”
Procedia Computer Science, vol. 4: Elsevier, pp. 2096-2104, May 2011.
DOI: 10.1016/j.procs.2011.04.229
(361.76 KB)
“

BlackjackBench: Hardware Characterization with Portable Micro-Benchmarks and Automatic Statistical Analysis of Results,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“
High Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-673, (also Lawn 247), May 2011.
(424.93 KB)
“

Overlapping Computation and Communication for Advection on a Hybrid Parallel Computer,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“
The Design of an Auto-tuning I/O Framework on Cray XT5 System,”
Cray Users Group Conference (CUG'11) (Best Paper Finalist), Fairbanks, Alaska, May 2011.
(459.57 KB)
“

Reducing the Amount of Pivoting in Symmetric Indefinite Systems,”
University of Tennessee Innovative Computing Laboratory Technical Report, no. ICL-UT-11-06, Knoxville, TN, Submitted to PPAM 2011, May 2011.
(145.76 KB)
“

Matrix Algebra on GPU and Multicore Architectures
, Basel, Switzerland, Workshop on GPU-enabled Numerical Libraries, Presentation, May 2011.
(49.27 MB)

A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“
On Scalability for MPI Runtime Systems,”
University of Tennessee Computer Science Technical Report, no. ICL-UT-11-05, Knoxville, TN, May 2011.
(898.76 KB)
“

Autotuning GEMMs for Fermi,”
University of Tennessee Computer Science Technical Report, UT-CS-11-671, (also Lawn 245), April 2011.
(397.45 KB)
“

Exploiting Fine-Grain Parallelism in Recursive LU Factorization,”
Proceedings of PARCO'11, no. ICL-UT-11-04, Gent, Belgium, April 2011.
“
MAGMA - LAPACK for GPUs
, Atlanta, GA, Keeneland GPU Tutorial, April 2011.
(742.14 KB)

Towards a Parallel Tile LDL Factorization for Multicore Architectures,”
ICL Technical Report, no. ICL-UT-11-03, Seattle, WA, April 2011.
(425.45 KB)
“

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-11-666, (also Lawn 243), March 2011.
(1.65 MB)
“

Algebraic Schwarz Preconditioning for the Schur Complement: Application to the Time-Harmonic Maxwell Equations Discretized by a Discontinuous Galerkin Method,”
The Twentieth International Conference on Domain Decomposition Methods, La Jolla, California, February 2011.
“
QCG-OMPI: MPI Applications on Grids,”
Future Generation Computer Systems, vol. 27, no. 4, pp. 435-369, January 2011.
(1.48 MB)
“

The International Exascale Software Project Roadmap,”
International Journal of High Performance Computing Applications, vol. 25, no. 1, pp. 3-60, January 2011.
DOI: 10.1177/1094342010391989
(719.74 KB)
“

Process Distance-aware Adaptive MPI Collective Communications,”
IEEE Int'l Conference on Cluster Computing (Cluster 2011), Austin, Texas, 2011.
“
A Hybridization Methodology for High-Performance Linear Algebra Software for GPUs,”
in GPU Computing Gems, Jade Edition, vol. 2: Elsevier, pp. 473-484, 2011.
“
Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems,”
International Journal of High Performance Computing Applications, vol. 25, no. 3, pp. 342-350, 2011.
(467.18 KB)
“

DAGuE: A Generic Distributed DAG Engine for High Performance Computing,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1151-1158, 2011.
(830.85 KB)
“

Toward High Performance Divide and Conquer Eigensolver for Dense Symmetric Matrices,”
Submitted to SIAM Journal on Scientific Computing (SISC), 2011.
“
Three-dimensional parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver,”
To appear in Geophysical Prospecting, 2011.
(1.04 MB)
“

QUARK Users' Guide: QUeueing And Runtime for Kernels,”
University of Tennessee Innovative Computing Laboratory Technical Report, no. ICL-UT-11-02, 2011.
(247.12 KB)
“

3-D parallel frequency-domain visco-acoustic wave modelling based on a hybrid direct/iterative solver,”
73rd EAGE Conference & Exhibition incorporating SPE EUROPEC 2011, Vienna, Austria, 23-26 May 2011.
“
Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, no. CS-89-85, 2011.
(6.42 MB)
“

Changes in Dense Linear Algebra Kernels - Decades Long Perspective,”
in Solving the Schrödinger Equation: Has everything been tried? (to appear): Imperial College Press, 2011.
“
Parallel algebraic domain decomposition solver for the solution of augmented systems,”
Parallel, Distributed, Grid and Cloud Computing for Engineering, Ajaccio, Corsica, France, 12-15 April 2011.
“
2010
Can Hardware Performance Counters Produce Expected, Deterministic Results?,”
3rd Workshop on Functionality of Hardware Performance Monitoring, Atlanta, GA, December 2010.
(392.71 KB)
“

EZTrace: a generic framework for performance analysis,”
ICL Technical Report, no. ICL-UT-11-01, December 2010.
“
Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems,”
SC'10, New Orleans, LA, ACM SIGARCH/IEEE Computer Society, November 2010.
(3.42 MB)
“

Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures,”
Submitted to Concurrency and Computation: Practice and Experience, November 2010.
(1.65 MB)
“

Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs,”
University of Tennessee Computer Science Technical Report, UT-CS-10-663, November 2010.
(384.75 KB)
“

Using MAGMA with PGI Fortran,”
PGI Insider, November 2010.
(176.67 KB)
“

Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modelling,”
University of Tennessee Computer Science Technical Report, no. UT-CS-10-661, October 2010.
(287.87 KB)
“

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators,”
Proceedings of IPDPS 2011, no. ICL-UT-10-04, Anchorage, AK, October 2010.
(468.17 KB)
“

Tuning Principal Component Analysis for GRASS GIS on Multi-core and GPU Architectures,”
FOSS4G 2010, Barcelona, Spain, September 2010.
(1.57 MB)
“

Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA,”
University of Tennessee Computer Science Technical Report, UT-CS-10-660, September 2010.
(366.26 KB)
“

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols,”
Proceedings of EuroMPI 2010, Stuttgart, Germany, Springer, September 2010.
(202.87 KB)
“

Locality and Topology aware Intra-node Communication Among Multicore CPUs,”
Proceedings of the 17th EuroMPI conference, Stuttgart, Germany, LNCS, September 2010.
(327.01 KB)
“

“Recent Advances in the Message Passing Interface, Lecture Notes in Computer Science (LNCS),”
EuroMPI 2010 Proceedings, vol. 6305, Stuttgart, Germany, Springer, September 2010.
“8th International Conference on Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science (LNCS),”
PPAM 2009 Proceedings, vol. 6067, Wroclaw, Poland, Springer, September 2010.
Mixed-Tool Performance Analysis on Hybrid Multicore Architectures,”
First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010), San Diego, CA, September 2010.
(1.24 MB)
“

Divide & Conquer on Hybrid GPU-Accelerated Multicore Systems,”
SIAM Journal on Scientific Computing (submitted), August 2010.
“
OpenCL Evaluation for Numerical Linear Algebra Library Development,”
Symposium on Application Accelerators in High-Performance Computing (SAAHPC '10), Knoxville, TN, July 2010.
(2.69 MB)
“

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
: 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Tutorial, July 2010.
(499.51 KB)

An Improved MAGMA GEMM for Fermi GPUs,”
University of Tennessee Computer Science Technical Report, no. UT-CS-10-655 (also LAPACK working note 227), July 2010.
(486.71 KB)
“

Scheduling Cholesky Factorization on Multicore Architectures with GPU Accelerators
, Knoxville, TN, 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC'10), Poster, July 2010.
(3.86 MB)

Scalability Study of a Quantum Simulation Code,”
PARA 2010, Reykjavik, Iceland, June 2010.
“
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators,”
Proc. of VECPAR'10 (to appear), Berkeley, CA, June 2010.
(870.46 KB)
“

Accelerating GPU Kernels for Dense Linear Algebra,”
Proc. of VECPAR'10, Berkeley, CA, June 2010.
(615.07 KB)
“

An Introduction to the MAGMA project - Acceleration of Dense Linear Algebra
: NVIDIA Webinar, June 2010.
MaPHyS or the Development of a Parallel Algebraic Domain Decomposition Solver in the Course of the Solstice Project,”
Sparse Days 2010 Meeting at CERFACS, Toulouse, France, June 2010.
“
Towards a Complexity Analysis of Sparse Hybrid Linear Solvers,”
PARA 2010, Reykjavik, Iceland, June 2010.
“
Autotuning Dense Linear Algebra Libraries on GPUs
, Basel, Switzerland, Sixth International Workshop on Parallel Matrix Algorithms and Applications (PMAA 2010), June 2010.

LINPACK on Future Manycore and GPU Based Systems,”
PARA 2010, Reykjavik, Iceland, June 2010.
“
Redesigning the Message Logging Model for High Performance,”
Concurrency and Computation: Practice and Experience (online version), June 2010.
“

Analysis of Various Scalar, Vector, and Parallel Implementations of RandomAccess,”
Innovative Computing Laboratory (ICL) Technical Report, no. ICL-UT-10-03, June 2010.
“

Intelligent Service Trading and Brokering for Distributed Network Services in GridSolve,”
VECPAR 2010, 9th International Meeting on High Performance Computing for Computational Science, Berkeley, CA, June 2010.
“

Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI,”
Proceedings of International Conference on Computational Science, ICCS 2010 (to appear), Amsterdam, The Netherlands, Elsevier, June 2010.
“

“Proceedings of the International Conference on Computational Science,”
ICCS 2010, Amsterdam, Elsevier, May 2010.
Collecting Performance Data with PAPI-C,”
Tools for High Performance Computing 2009, 3rd Parallel Tools Workshop, Dresden, Germany, Springer Berlin / Heidelberg, pp. 157-173, May 2010.
DOI: 10.1007/978-3-642-11261-4_11
“

Performance Evaluation for Petascale Quantum Simulation Tools,”
Proceedings of the Cray Users' Group Meeting, Atlanta, GA, May 2010.
“
International Exascale Software Project Roadmap v1.0,”
University of Tennessee Computer Science Technical Report, UT-CS-10-654, May 2010.
“

DAGuE: A generic distributed DAG engine for high performance computing,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-01, April 2010.
“

Dense Linear Algebra Solvers for Multicore with GPU Accelerators
, Atlanta, GA, International Parallel and Distributed Processing Symposium (IPDPS 2010), April 2010.

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment,”
24th IEEE International Parallel and Distributed Processing Symposium (also LAWN 224), Atlanta, GA, April 2010.
“

Rectangular Full Packed Format for Cholesky’s Algorithm: Factorization, Solution, and Inversion,”
ACM Transactions on Mathematical Software (TOMS), vol. 37, no. 2, Atlanta, GA, April 2010.
“

Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems,”
University of Tennessee Computer Science Technical Report, no. UT-CS-10-653, April 2010.
“

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion,”
ACM Transactions on Mathematical Software (TOMS), vol. 37, no. 2, April 2010.
“

Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures,”
IEEE Transactions on Parallel and Distributed Systems, pp. 417-423, April 2010.
“

QCG-OMPI: MPI Applications on Grids,”
Future Generation Computer Systems, vol. 27, no. 4, pp. 357-369, March 2010.
“

Hybrid Multicore Cholesky Factorization with Multiple GPU Accelerators,”
IEEE Transactions on Parallel and Distributed Systems (submitted), March 2010.
“

Self-Healing Network for Scalable Fault-Tolerant Runtime Environments,”
Future Generation Computer Systems, vol. 26, no. 3, pp. 479-485, March 2010.
“

SmartGridRPC: The new RPC model for high performance Grid Computing and Its Implementation in SmartGridSolve,”
Concurrency and Computation: Practice and Experience (to appear), January 2010.
“

Scheduling Dense Linear Algebra Operations on Multicore Processors,”
Concurrency and Computation: Practice and Experience, vol. 22, no. 1, pp. 15-44, January 2010.
“

Empirical Performance Tuning of Dense Linear Algebra Software,”
in Performance Tuning of Scientific Applications (to appear), 2010.
“
Level-3 Cholesky Kernel Subroutine of a Fully Portable High Performance Minimal Storage Hybrid Format Cholesky Algorithm,”
ACM TOMS (submitted), also LAPACK Working Note (LAWN) 211, 2010.
“

Scheduling Two-sided Transformations using Tile Algorithms on Multicore Architectures,”
Journal of Scientific Computing, vol. 18, no. 1, pp. 33-50, 2010.
“

Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,”
Parallel Computing, vol. 36, no. 5-6, pp. 232-240, 2010.
“

Faster, Cheaper, Better - A Hybridization Methodology to Develop Linear Algebra Software for GPUs,”
LAPACK Working Note, no. 230, 2010.
“

Sparse approximations of the Schur complement for parallel algebraic hybrid solvers in 3D,”
Numerical Mathematics: Theory, Methods and Applications, vol. 3, no. 3, Beijing, Global Science Press, pp. 64-82, 2010.
“
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures,”
Parallel Computing (to appear), 2010.
“

Accelerating the Reduction to Upper Hessenberg, Tridiagonal, and Bidiagonal Forms through Hybrid GPU-Based Computing,”
Parallel Computing, vol. 36, no. 12, pp. 645-654, 2010.
“

Using multiple levels of parallelism to enhance the performance of domain decomposition solvers,”
Parallel Computing, vol. 36, no. 5-6: Elsevier journals, pp. 285-296, 2010.
“

QR Factorization for the CELL Processor,”
Scientific Programming, vol. 17, no. 1-2, pp. 31-42, 2010.
“

Trace-based Performance Analysis for the Petascale Simulation Code FLASH,”
International Journal of High Performance Computing Applications (to appear), 2010.
“

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures,”
24th IEEE International Parallel and Distributed Processing Symposium (submitted), 2010.
“

An Improved MAGMA GEMM for Fermi GPUs,”
International Journal of High Performance Computing Applications, vol. 24, no. 4, pp. 511-515, 2010.
“
Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, UT-CS-89-85, 2010.
“

Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-10-02, 2010.
“

Dense Linear Algebra Solvers for Multicore with GPU Accelerators,”
Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, Atlanta, GA, pp. 1-8, 2010.
DOI: 10.1109/IPDPSW.2010.5470941
“

Constructing Resiliant Communication Infrastructure for Runtime Environments in Advances in Parallel Computing,”
Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale, vol. 19, pp. 441-451, 2010.
DOI: 10.3233/978-1-60750-530-3-441
“
Dense Linear Algebra for Hybrid GPU-based Systems,”
Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.
“
BLAS for GPUs,”
Scientific Computing with Multicore and Accelerators, Boca Raton, Florida, CRC Press, 2010.
“

2009
Enhancing Parallelism of Tile QR Factorization for Multicore Architectures,”
Submitted to Transactions on Parallel and Distributed Systems, December 2009.
“

Accelerating Scientific Computations with Mixed Precision Algorithms,”
Computer Physics Communications, vol. 180, issue 12, pp. 2526-2533, December 2009.
DOI: 10.1016/j.cpc.2008.11.005
“

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures,”
accepted in 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Atlanta, GA, December 2009.
“
Dynamic Task Scheduling for Linear Algebra Algorithms on Distributed-Memory Multicore Systems,”
International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09), Portland, OR, November 2009.
“

Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing,”
IEEE Transactions on Computers, vol. 58, issue 11, pp. 1512-1524, November 2009.
DOI: 10.1109/TC.2009.42
“

Numerical Linear Algebra on Hybrid Architectures: Recent Developments in the MAGMA Project
, Portland, Oregon, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.

Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects
, Portland, OR, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC09), November 2009.

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures,”
Innovative Computing Laboratory Technical Report (also LAPACK Working Note 222 and CS Tech Report UT-CS-09-645), no. ICL-UT-09-03, September 2009.
“

Dependency-Driven Scheduling of Dense Matrix Factorizations on Shared-Memory Systems,”
PPAM 2009, Poland, September 2009.
“
Constructing Resilient Communication Infrastructure for Runtime Environments,”
ParCo 2009, Lyon, France, September 2009.
“
Impact of Quad-core Cray XT4 System and Software Stack on Scientific Computation,”
Euro-Par 2009, Lecture Notes in Computer Science, vol. 5704/2009, Delft, The Netherlands, Springer Berlin / Heidelberg, pp. 334-344, August 2009.
“

Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery,”
CLUSTER '09, New Orleans, IEEE, August 2009.
DOI: 10.1109/CLUSTR.2009.5289157
“

Analytical Modeling and Optimization for Affinity Based Thread Scheduling on Multicore Systems,”
IEEE Cluster 2009, New Orleans, August 2009.
“

Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team,”
SciDAC 2009, Journal of Physics: Conference Series, vol. 180(2009)012039, San Diego, California, IOP Publishing, July 2009.
“

Constructing resiliant communication infrastructure for runtime environments,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-09-02, July 2009.
“

Parallel Programming in MATLAB,”
The International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 277-283, July 2009.
“

The International Exascale Software Project: A Call to Cooperative Action by the Global High Performance Community,”
International Journal of High Performance Computing Applications (to appear), July 2009.
“

Making Performance Analysis and Tuning Part of the Software Development Cycle,”
Proceedings of DoD HPCMP UGC 2009, San Diego, CA, IEEE, June 2009.
“
I/O Performance Analysis for the Petascale Simulation Code FLASH,”
ISC'09, Hamburg, Germany, June 2009.
“

MPI-aware Compiler Optimizations for Improving Communication-Computation Overlap,”
Proceedings of the 23rd annual International Conference on Supercomputing (ICS '09), Yorktown Heights, NY, USA, ACM, pp. 316-325, June 2009.
“

A Holistic Approach for Performance Measurement and Analysis for Petascale Applications,”
ICCS 2009 Joint Workshop: Tools for Program Development and Analysis in Computational Science and Software Engineering for Large-Scale Computing, vol. 2009, Baton Rouge, Louisiana, Springer-Verlag Berlin Heidelberg 2009, pp. 686-695, May 2009.
“

Performance evaluation for petascale quantum simulation tools,”
Proceedings of CUG09, Atlanta, GA, May 2009.
“

Accelerating the Reduction to Upper Hessenberg Form through Hybrid GPU-Based Computing,”
University of Tennessee Computer Science Technical Report, UT-CS-09-642 (also LAPACK Working Note 219), May 2009.
“

“Computational Science – ICCS 2009, Proceedings of the 9th International Conference,”
Lecture Notes in Computer Science: Theoretical Computer Science and General Issues, vol. -, no. 5544-5545, Baton Rouge, LA, May 2009.
Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures,”
IEEE Transactions on Parallel and Distributed Systems (to appear), May 2009.
“

A Note on Auto-tuning GEMM for GPUs,”
9th International Conference on Computational Science (ICCS 2009), no. 5544-5545, Baton Rouge, LA, pp. 884-892, May 2009.
DOI: 10.1007/978-3-642-01970-8_89
“

A Scalable Non-blocking Multicast Scheme for Distributed DAG Scheduling,”
The International Conference on Computational Science 2009 (ICCS 2009), vol. 5544, Baton Rouge, LA, pp. 195-204, May 2009.
“

Trace-based Performance Analysis for the Petascale Simulation Code FLASH,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-09-01, April 2009.
“

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion,”
ACM TOMS (to appear), 2009.
“

Reliability and Performance Modeling and Analysis for Grid Computing,”
in Handbook of Research on Scalable Computing Technologies (to appear): IGI Global, pp. 219-245, 2009.
“

Parallel Dense Linear Algebra Software in the Multicore Era,”
in Cyberinfrastructure Technologies and Applications: Nova Science Publishers, Inc., pp. 9-24, 2009.
“
Accelerating Time-To-Solution for Computational Science and Engineering,”
SciDAC Review, 2009.
“

Scheduling Linear Algebra Operations on Multicore Processors,”
Concurrency Practice and Experience (to appear), 2009.
“

Optimizing Matrix Multiplication for a Short-Vector SIMD Architecture - CELL Processor,”
Parallel Computing, vol. 35, pp. 138-150, 2009.
“

Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software,”
Cluster Computing Journal: Special Issue on High Performance Distributed Computing, vol. 12, no. 2: Springer Netherlands, pp. 101-122, 2009.
“

Capturing and Analyzing the Execution Control Flow of OpenMP Applications,”
International Journal of Parallel Programming, vol. 37, no. 3, pp. 266-276, 2009.
“
Transparent Cross-Platform Access to Software Services using GridSolve and GridRPC,”
in Cloud Computing and Software Services: Theory and Techniques (to appear): CRC Press, 2009.
“
Grid Computing applied to the Boundary Element Method,”
Proceedings of the First International Conference on Parallel, Distributed and Grid Computing for Engineering, vol. 27, no. :104203/9027, Stirlingshire, UK, Civil-Comp Press, 2009.
“
Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA Projects,”
Journal of Physics: Conference Series, vol. 180, 2009.
“

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures,”
Parallel Computing, vol. 35, pp. 38-53, 2009.
“

The Problem with the Linpack Benchmark Matrix Generator,”
International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-14, 2009.
“

Recording the Control Flow of Parallel Applications to Determine Iterative and Phase-Based Behavior,”
Future Generation Computer Systems, vol. 26, pp. 162-166, 2009.
“
VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance,”
SC’09 The International Conference for High Performance Computing, Networking, Storage and Analysis (to appear), Portland, OR, 2009.
“

QR Factorization for the CELL Processor,”
Scientific Programming (to appear), 2009.
“

Algorithmic Based Fault Tolerance Applied to High Performance Computing,”
Journal of Parallel and Distributed Computing, vol. 69, pp. 410-416, 2009.
“

Recent Trends in High Performance Computing,”
in Birth of Numerical Analysis (to appear), 2009.
“
Scheduling Linear Algebra Operations on Multicore Processors,”
University of Tennessee Computer Science Department Technical Report, UT-CS-09-636 (Also LAPACK Working Note 213), 2009.
“

Computing the Conditioning of the Components of a Linear Least-squares Solution,”
Numerical Linear Algebra with Applications, vol. 16, no. 7, pp. 517-533, 2009.
“

Fully Dynamic Scheduler for Numerical Computing on Multicore Processors,”
University of Tennessee Computer Science Department Technical Report, UT-CS-09-643 (Also LAPACK Working Note 220), 2009.
“

Towards Efficient MapReduce Using MPI,”
Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface - 16th European PVM/MPI Users' Group Meeting, vol. 5759, Espoo, Finland, Springer Berlin / Heidelberg, pp. 240-249, 2009.
“
Comparative Study of One-Sided Factorizations with Multiple Software Packages on Multi-Core Hardware,”
2009 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '09) (to appear), 2009.
“

2008
Revisiting Matrix Product on Master-Worker Platforms,”
International Journal of Foundations of Computer Science (IJFCS), vol. 19, no. 6, pp. 1317-1336, December 2008.
“

Enhancing the Performance of Dense Linear Algebra Solvers on GPUs (in the MAGMA Project)
, Austin, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC08), November 2008.

Request Sequencing: Enabling Workflow for Efficient Problem Solving in GridSolve,”
International Conference on Grid and Cooperative Computing (GCC 2008) (submitted), Shenzhen, China, October 2008.
“

A Comparison of Search Heuristics for Empirical Code Optimization,”
The 3rd International Workshop on Automatic Performance Tuning, Tsukuba, Japan, October 2008.
“

Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited,”
University of Tennessee Computer Science Technical Report, UT-CS-08-624 (also LAPACK Working Note 208), August 2008.
“

The Problem with the Linpack Benchmark Matrix Generator,”
University of Tennessee Computer Science Technical Report, UT-CS-08-621 (also LAPACK Working Note 206), June 2008.
“

The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Computational Kernels and Software,”
ACM/IEEE International Symposium on High Performance Distributed Computing, Boston, MA, June 2008.
“

Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures,”
PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, Trondheim, Norway, May 2008.
“
Interior State Computation of Nano Structures,”
PARA 2008, 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, Trondheim, Norway, May 2008.
“

QR Factorization for the CELL Processor,”
University of Tennessee Computer Science Technical Report, UT-CS-08-616 (also LAPACK Working Note 201), May 2008.
“

Request Sequencing: Enabling Workflow for Efficient Parallel Problem Solving in GridSolve,”
ICL Technical Report, no. ICL-UT-08-01, April 2008.
“

Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion,”
University of Tennessee Computer Science Technical Report, UT-CS-08-614 (also LAPACK Working Note 199), April 2008.
“

Algorithmic Based Fault Tolerance Applied to High Performance Computing,”
University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205), January 2008.
“

Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,”
in High Performance Computing and Grids in Action, Amsterdam, IOS Press, January 2008.
“

Algorithm-Based Fault Tolerance for Fail-Stop Failures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 12, January 2008.
“

Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures,”
University of Tennessee Computer Science Technical Report, UT-CS-08-615 (also LAPACK Working Note 200), January 2008.
“

Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems,”
University of Tennessee Computer Science Technical Report, UT-CS-08-632 (also LAPACK Working Note 210), January 2008.
“

Visualizing the Program Execution Control Flow of OpenMP Applications,”
Proc. 4th International Workshop on OpenMP (IWOMP 2008), West Lafayette, Indiana, Lecture Notes in Computer Science 5004, pp. 181-190, January 2008.
“

Netlib and NA-Net: Building a Scientific Computing Community,”
IEEE Annals of the History of Computing, vol. 30, no. 2, pp. 30-41, January 2008.
“

A Tribute to Gene Golub,”
Computing in Science and Engineering: IEEE, pp. 5, January 2008.
“
Using dual techniques to derive componentwise and mixed condition numbers for a linear functional of a linear least squares solution,”
University of Tennessee Computer Science Technical Report, UT-CS-08-622 (also LAPACK Working Note 207), January 2008.
“

OpenMP-centric Performance Analysis of Hybrid Applications,”
Proc. 2008 IEEE International Conference on Cluster Computing (CLUSTER 2008), Tsukuba, Japan, January 2008.
“

State-of-the-Art Eigensolvers for Electronic Structure Calculations of Large Scale Nano-Systems,”
Journal of Computational Physics, vol. 227, no. 15, pp. 7113-7124, January 2008.
“
The PlayStation 3 for High Performance Scientific Computing,”
University of Tennessee Computer Science Technical Report, no. UT-CS-08-608, January 2008.
“

Detection and Analysis of Iterative Behavior in Parallel Applications,”
Proceedings of the 2008 International Conference on Computational Science (ICCS 2008), vol. 5103, Krakow, Poland, pp. 261-267, January 2008.
“

Task placement of parallel multi-dimensional FFTs on a mesh communication network,”
University of Tennessee Computer Science Technical Report, no. UT-CS-08-613, January 2008.
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, CS-89-85, January 2008.
“

Computing the Conditioning of the Components of a Linear Least Squares Solution,”
VECPAR '08, High Performance Computing for Computational Science, Toulouse, France, January 2008.
“

Parallel Tiled QR Factorization for Multicore Architectures,”
Concurrency and Computation: Practice and Experience, vol. 20, pp. 1573-1590, January 2008.
“

Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the CELL Processor,”
University of Tennessee Computer Science Technical Report, no. UT-CS-08-609 (also LAPACK Working Note 189), January 2008.
“

Fault Tolerance Management for a Hierarchical GridRPC Middleware,”
8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon, France, January 2008.
“

DARPA's HPCS Program: History, Models, Tools, Languages,”
in Advances in Computers, vol. 72: Elsevier, January 2008.
“

Exploring New Architectures in Accelerating CFD for Air Force Applications,”
Proceedings of the DoD HPCMP User Group Conference, Seattle, Washington, January 2008.
“

Redesigning the Message Logging Model for High Performance,”
International Supercomputer Conference (ISC 2008), Dresden, Germany, January 2008.
“

Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization,”
IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 9, pp. 1-11, January 2008.
“

PERI Auto-tuning,”
Proc. SciDAC 2008, vol. 125, Seattle, Washington, Journal of Physics, January 2008.
“

HPCS Library Study Effort,”
University of Tennessee Computer Science Technical Report, UT-CS-08-617, January 2008.
“

How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination,”
in Beautiful Code: Leading Programmers Explain How They Think (Chapter 14), pp. 243-282, January 2008.
“

Analytical Modeling for Affinity-Based Thread Scheduling on Multicore Platforms,”
University of Tennessee Computer Science Technical Report, UT-CS-08-626, January 2008.
“

Matrix Product on Heterogeneous Master Worker Platforms,”
2008 PPoPP Conference, Salt Lake City, Utah, January 2008.
“
Usage of the Scalasca Toolset for Scalable Performance Analysis of Large-scale Parallel Applications,”
Proceedings of the 2nd International Workshop on Tools for High Performance Computing, Stuttgart, Germany, Springer, pp. 157-167, January 2008.
“

The PlayStation 3 for High Performance Scientific Computing,”
Computing in Science and Engineering, pp. 80-83, January 2008.
“

Custom assignment of MPI ranks for parallel multi-dimensional FFTs: Evaluation of BG/P versus BG/L,”
Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-08), Sydney, Australia, IEEE Computer Society, pp. 271-283, January 2008.
“

Interactive Grid-Access Using Gridsolve and Giggle,”
Computing and Informatics, vol. 27, no. 2, pp. 233-248, ISSN 1335-9150, 2008.
“

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy,”
ACM Transactions on Mathematical Software, vol. 34, no. 4, pp. 17-22, 2008.
“

The LINPACK Benchmark: Past, Present, and Future,”
Concurrency: Practice and Experience, vol. 15, pp. 803-820, 2008.
“

Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications,”
Lecture Notes in Computer Science, OpenMP Shared Memory Parallel Programming, vol. 4315: Springer Berlin / Heidelberg, 2008.
“

High Performance GridRPC Middleware,”
Recent developments in Grid Technology and Applications: Nova Science Publishers, 2008.
“

2007
Optimal Routing in Binomial Graph Networks,”
The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Adelaide, Australia, IEEE Computer Society, December 2007.
“
Self-Healing in Binomial Graph Networks,”
2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007), Vilamoura, Algarve, Portugal, November 2007.
“

An Evaluation of Open MPI's Matching Transport Layer on the Cray XT,”
EuroPVM/MPI 2007, September 2007.
“

Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging,”
Accepted for Euro PVM/MPI 2007: Springer, September 2007.
“
Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology,”
Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07), Niagara Falls, Canada, Springer, August 2007.
“

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems,”
International Journal of High Performance Computing Applications (to appear), August 2007.
“

Automatic Analysis of Inefficiency Patterns in Parallel Applications,”
Concurrency and Computation: Practice and Experience, vol. 19, no. 11, pp. 1481-1496, August 2007.
“

Netlib and NA-Net: building a scientific computing community,”
In IEEE Annals of the History of Computing (to appear), August 2007.
“

Decision Trees and MPI Collective Algorithm Selection Problem,”
Euro-Par 2007, Rennes, France, Springer, pp. 105–115, August 2007.
“

Implementation of Mixed Precision in Solving Systems of Linear Equations on the Cell Processor,”
Concurrency and Computation: Practice and Experience, vol. 19, no. 10, pp. 1371-1385, July 2007.
“

Bi-objective Scheduling Algorithms for Optimizing Makespan and Reliability on Heterogeneous Systems,”
19th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (submitted), San Diego, CA, June 2007.
“

How Elegant Code Evolves With Hardware: The Case Of Gaussian Elimination,”
in Beautiful Code: Leading Programmers Explain How They Think: O'Reilly Media, Inc., June 2007.
“

Creating Software Technology to Harness the Power of Leadership-class Computing Systems,”
DOE SciDAC Review (to appear), June 2007.
“

Performance Analysis of MPI Collective Operations,”
Cluster Computing, vol. 10, no. 2: Springer Netherlands, pp. 127-143, June 2007.
“

Feedback-Directed Thread Scheduling with Memory Considerations,”
IEEE International Symposium on High Performance Distributed Computing, Monterey Bay, CA, June 2007.
“

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
SIAM SISC (to appear), May 2007.
“

A Comparison of Application Performance Using Open MPI and Cray MPI,”
Cray User Group, CUG 2007, May 2007.
“

Reliability Analysis of Self-Healing Network using Discrete-Event Simulation,”
Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07): IEEE Computer Society, pp. 437-444, May 2007.
“
Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing,”
Proceedings of Workshop on Self Adapting Application Level Fault Tolerance for Parallel and Distributed Computing at IPDPS, pp. 1-8, March 2007.
“

Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Servers Middleware,”
Parallel Processing Letters, vol. 17, no. 1, pp. 47-59, March 2007.
“

Numerical Metadata API Reference,”
Innovative Computing Laboratory Technical Report, February 2007.
“

The Impact of Multicore on Computational Science Software,”
CTWatch Quarterly, vol. 3, issue 1, February 2007.
“
Specification and detection of performance problems with ASL,”
Concurrency and Computation: Practice and Experience, vol. 19, no. 11: John Wiley and Sons Ltd., pp. 1451-1464, January 2007.
“
Memory Leak Detection in Fortran Applications using TAU,”
Proc. DoD HPCMP Users Group Conference (HPCMP-UGC'07), Pittsburgh, PA, IEEE Computer Society, January 2007.
“
Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization,”
UT Computer Science Technical Report (Also LAPACK Working Note 184), no. UT-CS-07-596, January 2007.
“

Computing the Conditioning of the Components of a Linear Least Squares Solution,”
University of Tennessee Computer Science Technical Report, no. UT-CS-07-604 (also LAPACK Working Note 193), January 2007.
“

Automated Empirical Tuning of a Multiresolution Analysis Kernel,”
ICL Technical Report, no. ICL-UT-07-01, pp. 10, January 2007.
“

Continuous Runtime Profiling of OpenMP Applications,”
Proceedings of the 2007 Conference on Parallel Computing (PARCO 2007), Juelich and Aachen, Germany, January 2007.
“

Multithreading for synchronization tolerance in matrix factorization,”
Journal of Physics: Conference Series, SciDAC 2007, vol. 78, no. 2007, January 2007.
“

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures,”
University of Tennessee Computer Science Technical Report, no. UT-CS-07-600 (also LAPACK Working Note 191), January 2007.
“

On Using Incremental Profiling for the Performance Analysis of Shared Memory Parallel Applications,”
Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par '07), Rennes, France, Springer LNCS, January 2007.
“
Empirical Tuning of a Multiresolution Analysis Kernel using a Specialized Code Generator,”
ICL Technical Report, no. ICL-UT-07-02, January 2007.
“

Results of the PERI survey of SciDAC applications,”
Journal of Physics: Conference Series, SciDAC 2007, vol. 78, no. 2007, January 2007.
(692.83 KB)
“

L2 Cache Modeling for Scientific Applications on Chip Multi-Processors,”
Proceedings of the 2007 International Conference on Parallel Processing, Xi'an, China, IEEE Computer Society, January 2007.
(654.11 KB)
“

Exploiting Mixed Precision Floating Point Hardware in Scientific Computations,”
In High Performance Computing and Grids in Action (to appear), Amsterdam, IOS Press, 2007.
(122.01 KB)
“

GridSolve: The Evolution of Network Enabled Solver,”
Grid-Based Problem Solving Environments: IFIP TC2/WG 2.5 Working Conference on Grid-Based Problem Solving Environments (Prescott, AZ, July 2006): Springer, pp. 215-226, 2007.
(377.48 KB)
“

Revisiting Matrix Product on Master-Worker Platforms,”
International Journal of Foundations of Computer Science (IJFCS) (accepted), 2007.
(248.66 KB)
“

SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3,”
University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595, 2007.
(1.74 MB)
“

The Use of Bulk States to Accelerate the Band Edge State Calculation of a Semiconductor Quantum Dot,”
Journal of Computational Physics, vol. 223, pp. 774-782, 2007.
(452.6 KB)
“

Remembering Ken Kennedy,”
SciDAC Review, vol. 5, no. 2007, 2007.
(519.68 KB)
“

Parallel Tiled QR Factorization for Multicore Architectures,”
University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-598 (also LAPACK Working Note 190), 2007.
(277.92 KB)
“

Limitations of the Playstation 3 for High Performance Cluster Computing,”
University of Tennessee Computer Science Technical Report, UT-CS-07-597 (Also LAPACK Working Note 185), 2007.
(171.01 KB)
“

Disaster Survival Guide in Petascale Computing: An Algorithmic Approach,”
in Petascale Computing: Algorithms and Applications (to appear): Chapman & Hall - CRC Press, 2007.
(260.18 KB)
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Dept. Technical Report CS-89-85, 2007.
(6.42 MB)
“

High Performance Development for High End Computing with Python Language Wrapper (PLW),”
International Journal of High Performance Computing Applications, vol. 21, no. 3, pp. 360-369, 2007.
(179.32 KB)
“

MPI Collective Algorithm Selection and Quadtree Encoding,”
Parallel Computing (Special Edition: EuroPVM/MPI 2006): Elsevier, 2007.
(308.39 KB)
“

Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors,”
Proceedings of the 2007 International Conference on Computational Science (ICCS 2007), vol. 4487-4490, Beijing, China, Springer LNCS, pp. 815-822, 2007.
DOI: 10.1007/978-3-540-72586-2_115
(145.84 KB)
“

A New Approach to MPI Collective Communication Implementations,”
Distributed and Parallel Systems: Springer US, pp. 45-54, 2007.
DOI: 10.1007/978-0-387-69858-8_5
(140.2 KB)
“

2006
The HPC Challenge (HPCC) Benchmark Suite,”
SC06 Conference Tutorial, Tampa, Florida, IEEE, November 2006.
(1.08 MB)
“

MPI Collective Algorithm Selection and Quadtree Encoding,”
Lecture Notes in Computer Science, vol. 4192, no. ICL-UT-06-13: Springer Berlin / Heidelberg, pp. 40-48, September 2006.
(308.39 KB)
“

A High-Performance, Heterogeneous MPI,”
HeteroPar 2006, Barcelona, Spain, September 2006.
(193.73 KB)
“

High Performance RDMA Protocols in HPC,”
Euro PVM/MPI 2006, Bonn, Germany, September 2006.
(1.06 MB)
“

Implementation and Usage of the PERUSE-Interface in Open MPI,”
Euro PVM/MPI 2006, Bonn, Germany, September 2006.
(310.76 KB)
“

Implementation of the Mixed-Precision High Performance LINPACK Benchmark on the CELL Processor,”
University of Tennessee Computer Science Tech Report, no. UT-CS-06-580, LAPACK Working Note #177, September 2006.
(506.18 KB)
“

Prospectus for the Next LAPACK and ScaLAPACK Libraries,”
PARA 2006, Umea, Sweden, June 2006.
(460.11 KB)
“

The Impact of Multicore on Math Software,”
PARA 2006, Umea, Sweden, June 2006.
(223.53 KB)
“

A Systematic Multi-step Methodology for Performance Analysis of Communication Traces of Distributed Applications based on Hierarchical Clustering,”
Proc. of the 5th International Workshop on Performance Modeling, Evaluation, and Organization of Parallel and Distributed Systems (PMEO-PDS 2006), no. ICL-UT-05-06, Rhodes Island, Greece, IEEE Computer Society, April 2006.
(1.02 MB)
“

Exploiting the Performance of 32 bit Floating Point Arithmetic in Obtaining 64 bit Accuracy,”
University of Tennessee Computer Science Tech Report, no. UT-CS-06-574, LAPACK Working Note #175, April 2006.
(221.39 KB)
“

Large Event Traces in Parallel Performance Analysis,”
8th Workshop 'Parallel Systems and Algorithms' (PASA), Lecture Notes in Informatics, no. ICL-UT-06-08, Frankfurt/Main, Germany, Gesellschaft für Informatik, March 2006.
(92.47 KB)
“

Improved Runtime and Transfer Time Prediction Mechanisms in a Network Enabled Server,”
Parallel Processing Letters, vol. 17, no. 1, pp. 47-59, March 2006.
(718.4 KB)
“

An Asynchronous Algorithm on NetSolve Global Computing System,”
Future Generation Computer Systems, vol. 22, issue 3, pp. 279-290, February 2006.
DOI: 10.1016/j.future.2005.10.003
(568.92 KB)
“

Towards bulk based preconditioning for quantum dot computations,”
IEEE/ACM Proceedings of HPCNano SC06 (to appear), January 2006.
(172.46 KB)
“

Self-Healing Network for Scalable Fault Tolerant Runtime Environments,”
DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, Innsbruck, Austria, January 2006.
(162.83 KB)
“

Technical Comparison between several representative checkpoint/rollback solutions for MPI programs,”
ICL Technical Report, no. ICL-UT-06-09, January 2006.
(84.67 KB)
“

Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead,”
University of Tennessee Computer Science Tech Report, UT-CS-06-581, LAPACK Working Note #178, January 2006.
(304.4 KB)
“

Experiments with Strassen's Algorithm: From Sequential to Parallel,”
18th IASTED International Conference on Parallel and Distributed Computing and Systems PDCS 2006 (submitted), Dallas, Texas, January 2006.
(514.33 KB)
“

Proposal of MPI operation level Checkpoint/Rollback and one implementation,”
Proceedings of IEEE CCGrid 2006: IEEE Computer Society, January 2006.
(277.27 KB)
“

Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors,”
University of Tennessee Computer Science Technical Report, no. UT-CS-06-583, January 2006.
(652.93 KB)
“

Flexible collective communication tuning architecture applied to Open MPI,”
2006 Euro PVM/MPI (submitted), Bonn, Germany, January 2006.
(206.58 KB)
“

Predicting the electronic properties of 3D, million-atom semiconductor nanostructure architectures,”
Journal of Physics: Conference Series, vol. 46, pp. 292-298, January 2006.
DOI: 10.1088/1742-6596/46/1/040
(644.1 KB)
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-526, vol. –89-95, January 2006.
(6.42 MB)
“

Performance Instrumentation and Compiler Optimizations for MPI/OpenMP Applications,”
Second International Workshop on OpenMP, Reims, France, January 2006.
(350.9 KB)
“

The use of bulk states to accelerate the band edge state calculation of a semiconductor quantum dot,”
Journal of Computational Physics (submitted), January 2006.
(337.08 KB)
“

Self Adapting Numerical Software (SANS) Effort,”
IBM Journal of Research and Development, vol. 50, no. 2/3, pp. 223-238, January 2006.
(357.53 KB)
“

ATLAS on the BlueGene/L – Preliminary Results,”
ICL Technical Report, no. ICL-UT-06-10, January 2006.
(46.19 KB)
“

Performance evaluation of eigensolvers in nano-structure computations,”
IEEE/ACM Proceedings of HPCNano SC06 (to appear), January 2006.
(120.61 KB)
“

Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources,”
IPDPS 2006, 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, January 2006.
(266.54 KB)
“

FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study,”
Lecture Notes in Computer Science, vol. 4192, no. ICL-UT-06-14: Springer Berlin / Heidelberg, pp. 133-140, 2006.
(362.44 KB)
“

Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,”
International Journal of Computational Science and Engineering, vol. 2, no. 3/4, pp. 205-212, 2006.
(428.21 KB)
“

Scalable Fault Tolerant Protocol for Parallel Runtime Environments,”
2006 Euro PVM/MPI, no. ICL-UT-06-12, Bonn, Germany, 2006.
(149.07 KB)
“

High Performance Development for High End Computing with Python Language Wrapper (PLW),”
International Journal of High Performance Computing Applications (to appear), 2006.
(179.32 KB)
“

Application of Machine Learning to the Selection of Sparse Linear Solvers,”
International Journal of High Performance Computing Applications (submitted), 2006.
(392.96 KB)
“

Recent Developments in GridSolve,”
International Journal of High Performance Computing Applications (Special Issue: Scheduling for Large-Scale Heterogeneous Platforms), vol. 20, no. 1: Sage Science Press, 2006.
(496.69 KB)
“

MPI Collective Algorithm Selection and Quadtree Encoding,”
ICL Technical Report, no. ICL-UT-06-11, 2006.
(308.39 KB)
“

Twenty-Plus Years of Netlib and NA-Net,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-526, 2006.
(62.79 KB)
“

2005
Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-05-561, November 2005.
(266.54 KB)
“

Performance Profiling Overhead Compensation for MPI Programs,”
In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference: Springer LNCS, September 2005.
(220.26 KB)
“

Hash Functions for Datatype Signatures in MPI,”
Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, vol. 3666, Sorrento (Naples), Italy, Springer-Verlag Berlin, pp. 76-83, September 2005.
(304.2 KB)
“

Trace-Based Parallel Performance Overhead Compensation,”
In Proc. of the International Conference on High Performance Computing and Communications (HPCC), Sorrento (Naples), Italy, September 2005.
(306.88 KB)
“

Performance Analysis of One-sided Communication Mechanisms,”
Mini-Symposium "Tools Support for Parallel Programming", Proceedings of Parallel Computing (ParCo), no. ICL-UT-06-07, Malaga, Spain, September 2005.
(121.49 KB)
“

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm,”
Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, vol. 3666, Sorrento (Naples), Italy, Springer-Verlag Berlin, pp. 67, September 2005.
(144.86 KB)
“

A Scalable Approach to MPI Application Performance Analysis,”
In Proc. of the 12th European Parallel Virtual Machine and Message Passing Interface Conference: Springer LNCS, September 2005.
(988.58 KB)
“

PerfMiner: Cluster-Wide Collection, Storage and Presentation of Application Level Hardware Performance Data,”
European Conference on Parallel Processing (Euro-Par 2005), Monte de Caparica, Portugal, Springer, September 2005.
DOI: 10.1007/11549468_1
(205.45 KB)
“

Event-based Measurement and Analysis of One-sided Communication,”
In Proceedings of the European Conference on Parallel Computing (Euro-Par), Lisbon, Portugal, Springer, August 2005.
(403.44 KB)
“

Performance Profiling and Analysis of DoD Applications using PAPI and TAU,”
Proceedings of DoD HPCMP UGC 2005, Nashville, TN, IEEE, June 2005.
(322.56 KB)
“

Automatic Experimental Analysis of Communication Patterns in Virtual Topologies,”
In Proceedings of the International Conference on Parallel Processing, Oslo, Norway, IEEE Computer Society, June 2005.
(227.13 KB)
“

New Grid Scheduling and Rescheduling Methods in the GrADS Project,”
International Journal of Parallel Programming, vol. 33, no. 2: Springer, pp. 209-229, June 2005.
(306.41 KB)
“

NanoPSE: A Nanoscience Problem Solving Environment for Atomistic Electronic Structure of Semiconductor Nanostructures,”
Journal of Physics: Conference Series, issue 16, pp. 277-282, June 2005.
DOI: 10.1088/1742-6596/16/1/038
(476.64 KB)
“

Remote Software Toolkit Installer,”
ICL Technical Report, no. ICL-UT-05-04, June 2005.
(490.6 KB)
“

Biological Sequence Alignment on the Computational Grid Using the GrADS Framework,”
Future Generation Computer Systems, vol. 21, no. 6: Elsevier, pp. 980-986, June 2005.
(147.29 KB)
“

Performance Analysis of GYRO: A Tool Evaluation,”
In Proceedings of the 2005 SciDAC Conference, San Francisco, CA, June 2005.
(172.07 KB)
“

The Component Structure of a Self-Adapting Numerical Software System,”
International Journal of Parallel Programming, vol. 33, no. 2, June 2005.
(64.88 KB)
“

A Pattern-Based Approach to Automated Application Performance Analysis,”
Workshop on Patterns in High Performance Computing, University of Illinois at Urbana-Champaign, May 2005.
(3.47 MB)
“

Performance Analysis of MPI Collective Operations,”
4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05), Denver, Colorado, April 2005.
(1018.28 KB)
“

Introduction to the HPC Challenge Benchmark Suite
, March 2005.
(124.86 KB)

Improving Time to Solution with Automated Performance Analysis,”
Second Workshop on Productivity and Performance in High-End Computing (P-PHEC) at 11th International Symposium on High Performance Computer Architecture (HPCA-2005), San Francisco, February 2005.
(112.63 KB)
“

Comparison of Nonlinear Conjugate-Gradient methods for computing the Electronic Properties of Nanostructure Architectures,”
Proceedings of 5th International Conference on Computational Science (ICCS), Atlanta, GA, USA, Springer's Lecture Notes in Computer Science, pp. 317-325, January 2005.
(172.86 KB)
“

Analysis and Optimization of Yee_Bench using Hardware Performance Counters,”
Proceedings of Parallel Computing 2005 (ParCo), Malaga, Spain, January 2005.
(72.27 KB)
“

Performance Analysis of MPI Collective Operations,”
Cluster Computing Journal (to appear), January 2005.
(1018.28 KB)
“

On the Parallel Solution of Large Industrial Wave Propagation Problems,”
Journal of Computational Acoustics (to appear), January 2005.
(1.08 MB)
“

An Effective Empirical Search Method for Automatic Software Tuning,”
ICL Technical Report, no. ICL-UT-05-02, January 2005.
(74.66 KB)
“

Condition Numbers of Gaussian Random Matrices,”
SIAM Journal on Matrix Analysis and Applications (to appear), January 2005.
(186.46 KB)
“

Rounding Error Analysis of the Classical Gram-Schmidt Orthogonalization Process,”
Numerische Mathematik, vol. 101, no. 1, pp. 87-100, January 2005.
(157.48 KB)
“

Towards an Accurate Model for Collective Communications,”
ICL Technical Report, no. ICL-UT-05-03, January 2005.
(250.73 KB)
“

Optimization Problem Solving System Using GridRPC,”
IEEE Transactions on Parallel and Distributed Systems (submitted), January 2005.
(740.57 KB)
“

Fault Tolerant High Performance Computing by a Coding Approach,”
Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear), Chicago, Illinois, January 2005.
(209.37 KB)
“

Introduction to the HPCChallenge Benchmark Suite,”
ICL Technical Report, no. ICL-UT-05-01, January 2005.
(124.86 KB)
“

Dynamic Process Management for Pipelined Applications,”
Proceedings of DoD HPCMP UGC 2005 (to appear), Nashville, TN, IEEE, January 2005.
“
LAPACK 2005 Prospectus: Reliable and Scalable Software for Linear Algebra Computations on High End Computers
: LAPACK Working Note 164, January 2005.
(172.59 KB)

HPC Challenge v1.x Benchmark Suite,”
SC|05 Tutorial - S13, Seattle, Washington, January 2005.
(2.94 MB)
“

Numerically Stable Real Number Codes Based on Random Matrices,”
The International Conference on Computational Science, Atlanta, GA, LNCS 3514, Springer-Verlag, January 2005.
(166.2 KB)
“

Conjugate-Gradient Eigenvalue Solvers in Computing Electronic Properties of Nanostructure Architectures,”
International Journal of Computational Science and Engineering (to appear), January 2005.
(428.21 KB)
“

Condition Numbers of Gaussian Random Matrices,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-04-539, 2005.
(186.46 KB)
“

A Not So Simple Matter of Software,”
NCSA Access Online: NCSA, 2005.
(457.69 KB)
“

NetSolve: Grid Enabling Scientific Computing Environments,”
Grid Computing and New Frontiers of High Performance Processing, no. 14: Elsevier, 2005.
(425 KB)
“

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 2005.
(241.36 KB)
“

Automatic analysis of inefficiency patterns in parallel applications,”
Concurrency and Computation: Practice and Experience, Special issue "Automatic Performance Analysis" (submitted), 2005.
(233.31 KB)
“

Self Adaptivity in Grid Computing,”
Concurrency and Computation: Practice and Experience, Special Issue: Grid Performance, vol. 17, no. 2-4, pp. 235-257, 2005.
(394.66 KB)
“

2004
Numerically Stable Real-Number Codes Based on Random Matrices,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-04-526, October 2004.
(91.66 KB)
“

EARL - API Documentation,”
ICL Technical Report, no. ICL-UT-04-03, October 2004.
(111.36 KB)
“

An Algebra for Cross-Experiment Performance Analysis,”
2004 International Conference on Parallel Processing (ICPP-04), Montreal, Quebec, Canada, August 2004.
(166.12 KB)
“

Efficient Pattern Search in Large Traces through Successive Refinement,”
Proceedings of Euro-Par 2004, Pisa, Italy, Springer-Verlag, August 2004.
(177.46 KB)
“

Design of an Interactive Environment for Numerically Intensive Parallel Linear Algebra Calculations,”
International Conference on Computational Science, Poland, Springer Verlag, June 2004.
DOI: 10.1007/978-3-540-25944-2_35
(88.31 KB)
“

Accurate Cache and TLB Characterization Using Hardware Counters,”
International Conference on Computational Science (ICCS 2004), Krakow, Poland, Springer, June 2004.
DOI: 10.1007/978-3-540-24688-6_57
(167.1 KB)
“

Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,”
Proceedings of ISC2004 (to appear), Heidelberg, Germany, June 2004.
(548.38 KB)
“

Automatic Blocking of QR and LU Factorizations for Locality,”
2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004), Washington, DC, ACM, June 2004.
DOI: 10.1145/1065895.1065898
(212.77 KB)
“

Automating the Large-Scale Collection and Analysis of Performance,”
5th LCI International Conference on Linux Clusters: The HPC Revolution, Austin, Texas, May 2004.
(511.6 KB)
“

Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,”
International Journal for High Performance Applications and Supercomputing (to appear), April 2004.
(186.9 KB)
“

An Asynchronous Algorithm on NetSolve Global Computing System,”
PRiSM - Laboratoire de recherche en informatique, Université de Versailles St-Quentin Technical Report, March 2004.
(377.33 KB)
“

CUBE User Manual,”
ICL Technical Report, no. ICL-UT-04-01, February 2004.
(429.12 KB)
“

NetBuild: Automated Installation and Use of Network-Accessible Software Libraries,”
ICL Technical Report, no. ICL-UT-04-02, January 2004.
(80.52 KB)
“

Memory Bandwidth and the Performance of Scientific Applications: A Study of the AMD Opteron Processor,”
2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (submitted), January 2004.
(210.29 KB)
“

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment,”
ICL Technical Report, no. ICL-UT-04-04, January 2004.
(241.36 KB)
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Department Technical Report, CS-89-85, January 2004.
(6.42 MB)
“

Towards an Accurate Model for Collective Communications,”
International Journal of High Performance Computing Applications, Special Issue: Automatic Performance Tuning, vol. 18, no. 1, pp. 159-167, January 2004.
(250.73 KB)
“

The Virtual Instrument: Support for Grid-enabled Scientific Simulations,”
International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 3-17, January 2004.
(282.16 KB)
“

Recommendations for Automatic Responses to Electronic Mail,”
RFC 3834: Internet Engineering Task Force (IETF), January 2004.
(174.76 KB)
“

LAPACK for Clusters Project: An Example of Self Adapting Numerical Software,”
Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04), vol. 9, Big Island, Hawaii, pp. 90282, January 2004.
(80.97 KB)
“

Active Logistical State Management in the GridSolve/L,”
4th International Symposium on Cluster Computing and the Grid (CCGrid 2004) (submitted), Chicago, Illinois, January 2004.
(123.69 KB)
“

Cray X1 Evaluation Status Report,”
Oak Ridge National Laboratory Report, no. ORNL/TM-2004/13, January 2004.
(817.33 KB)
“

Trends in High Performance Computing,”
The Computer Journal, vol. 47, no. 4: The British Computer Society, pp. 399-403, 2004.
(455.96 KB)
“

Self Adapting Linear Algebra Algorithms and Software,”
IEEE Proceedings (to appear), 2004.
(587.67 KB)
“

Improvements in the Efficient Composition of Applications,”
IPDPS 2004, NGS Workshop (to appear), Santa Fe, 2004.
(42.85 KB)
“

Building and using a Fault Tolerant MPI implementation,”
International Journal of High Performance Applications and Supercomputing (to appear), 2004.
“
An Overview of Heterogeneous High Performance and Grid Computing,”
Engineering the Grid (to appear): Nova Science Publishers, Inc., 2004.
(199.93 KB)
“

Performance Optimization and Modeling of Blocked Sparse Kernels,”
ICL Technical Report, no. ICL-UT-04-05, 2004.
(229.58 KB)
“

2003
Hardware-Counter Based Automatic Performance Analysis of Parallel Programs,”
Advances in Parallel Computing, vol. 13, Dresden, Germany, Elsevier, pp. 753-760, January 2004, 2003.
DOI: 10.1016/S0927-5452(04)80092-3
“
Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,”
Parallel Computing, vol. 29, no. 11-12, pp. 1723-1743, November 2003.
(343.44 KB)
“

Automatic performance analysis of hybrid MPI/OpenMP applications,”
Journal of Systems Architecture, Special Issue 'Evolutions in parallel distributed and network-based processing', vol. 49(10-11): Elsevier, pp. 421-439, November 2003.
“
A Proposed Standard for Matrix Metadata,”
Innovative Computing Laboratory Technical Report, no. ICL-UT-03-02, Submitted to ACM TOMS, November 2003.
(13.39 KB)
“

Fault Tolerant Communication Library and Applications for High Performance Computing,”
Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented), Santa Fe, NM, October 2003.
(146.05 KB)
“

Evaluating The Performance Of MPI-2 Dynamic Communicators And One-Sided Communication,”
Lecture Notes in Computer Science, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI User's Group Meeting, vol. 2840, Venice, Italy, Springer-Verlag, Berlin, pp. 88-97, September 2003.
(254.08 KB)
“

KOJAK - A Tool Set for Automatic Performance Analysis of Parallel Applications,”
Proc. of the European Conference on Parallel Computing (EuroPar), vol. 2790, Klagenfurt, Austria, Springer-Verlag, pp. 1301-1304, August 2003.
(196.05 KB)
“

Self-Adapting Numerical Software and Automatic Tuning of Heuristics,”
Lecture Notes in Computer Science, vol. 2660, Melbourne, Australia, Springer Verlag, pp. 759-770, June 2003.
(45.95 KB)
“

SRS - A Framework for Developing Malleable and Migratable Parallel Software,”
Parallel Processing Letters, vol. 13, no. 2, pp. 291-312, June 2003.
(211.6 KB)
“

A Fault-Tolerant Communication Library for Grid Environments,”
17th Annual ACM International Conference on Supercomputing (ICS'03) International Workshop on Grid Computing and e-Science, San Francisco, June 2003.
(377.14 KB)
“

Performance Instrumentation and Measurement for Terascale Systems,”
ICCS 2003 Terascale Workshop, Melbourne, Australia, Springer, Berlin, Heidelberg, June 2003.
DOI: 10.1007/3-540-44864-0_6
(5.36 MB)
“

Computational Science — ICCS 2003,”
Lecture Notes in Computer Science, vol. 2657-2660, ICCS 2003, International Conference. Melbourne, Australia, Springer-Verlag, Berlin, June 2003.
“
A Performance Oriented Migration Framework for the Grid,”
Proceedings of the 3rd International Symposium on Cluster Computing and the Grid, Tokyo, Japan, pp. 130-137, May 2003.
(113.6 KB)
“

Experiences and Lessons Learned with a Portable Interface to Hardware Performance Counters,”
PADTAD Workshop, IPDPS 2003, Nice, France, IEEE, April 2003.
(432.57 KB)
“

A Simple Installation and Administration Tool for Large-scaled PC Cluster System,”
ClusterWorld Conference and Expo, San Jose, CA, March 2003.
(275.97 KB)
“

Scheduling in the Grid Application Development Software Project,”
Resource Management in the Grid: Kluwer Publishers, March 2003.
(375.92 KB)
“

Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover,”
Special Issue on Biological Applications of Genetic and Evolutionary Computation (submitted), March 2003.
(438.68 KB)
“

Applying Aspect-Oriented Programming Concepts to a Component-based Programming Model,”
IPDPS 2003, Workshop on NSF-Next Generation Software, Nice, France, March 2003.
(66.99 KB)
“

Scalable, Trustworthy Network Computing Using Untrusted Intermediaries: A Position Paper,”
DOE/NSF Workshop on New Directions in Cyber-Security in Large-Scale Networks: Development Obstacles, National Conference Center - Lansdowne, Virginia, March 2003.
(54.62 KB)
“

Self Adaptability in Grid Computing,”
Concurrency: Practice and Experience (submitted), March 2003.
(258.89 KB)
“

Distributed Storage in RIB,”
ICL Tech Report, no. ICL-UT-03-01, March 2003.
(213.02 KB)
“

Optimization Problem Solving System using Grid RPC,”
3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, Tokyo, Japan, March 2003.
(71.6 KB)
“

GrADSolve - A Grid-based RPC System for Remote Invocation of Parallel Software,”
Journal of Parallel and Distributed Computing (submitted), March 2003.
(241.3 KB)
“

Distributed Probabilistic Model-Building Genetic Algorithm,”
Lecture Notes in Computer Science, vol. 2723: Springer-Verlag, Heidelberg, pp. 1015-1028, January 2003.
(288.91 KB)
“

Self Adapting Numerical Algorithm for Next Generation Applications,”
International Journal of High Performance Computing Applications, vol. 17, no. 2, pp. 125-132, January 2003.
(479.18 KB)
“

Self Adapting Software for Numerical Linear Algebra and LAPACK for Clusters (LAPACK Working Note 160),”
University of Tennessee Computer Science Technical Report, UT-CS-03-499, January 2003.
(343.44 KB)
“

High Performance Computing for Computational Science,”
Lecture Notes in Computer Science, vol. 2565, VECPAR 2002, 5th International Conference June 26-28, 2002, Springer-Verlag, Berlin, January 2003.
“
High Performance Computing Trends and Self Adapting Numerical Software,”
Lecture Notes in Computer Science, High Performance Computing, 5th International Symposium ISHPC, vol. 2858, Tokyo-Odaiba, Japan, Springer-Verlag, Heidelberg, pp. 1-9, January 2003.
“
Finite-choice Algorithm Optimization in Conjugate Gradients (LAPACK Working Note 159),”
University of Tennessee Computer Science Technical Report, UT-CS-03-502, January 2003.
(64.52 KB)
“

GrADSolve - RPC for High Performance Computing on the Grid,”
Lecture Notes in Computer Science, Proceedings of the 9th International Euro-Par Conference, vol. 2790, Klagenfurt, Austria, Springer-Verlag, Berlin, pp. 394-403, January 2003.
DOI: 10.1007/978-3-540-45209-6_58
(125.96 KB)
“

Optimization of Injection Schedule of Diesel Engine Using GridRPC,”
Information Processing Society of Japan Symposium Series, vol. 2003, no. 14, pp. 189-197, January 2003.
(520.96 KB)
“

Optimizing Performance and Reliability in Distributed Computing Systems Through Wide Spectrum Storage,”
Proceedings of the IPDPS 2003, NGS Workshop, Nice, France, pp. 209, January 2003.
“
High Performance Computing Trends, Supercomputers, Clusters, and Grids,”
Information Processing Society of Japan Symposium Series, vol. 2003, no. 14, pp. 55-58, January 2003.
“
Recent Advances in Parallel Virtual Machine and Message Passing Interface,”
Lecture Notes in Computer Science, vol. 2840: Springer-Verlag, Berlin, January 2003.
“
Static Scheduling for ScaLAPACK on the Grid Using Genetic Algorithm,”
Information Processing Society of Japan Symposium Series, vol. 2003, no. 14, pp. 3-10, January 2003.
(506.42 KB)
“

A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures,”
SIAM Journal on Scientific Computing, vol. 24, no. 1, pp. 284-311, January 2003.
(224.7 KB)
“

The Future of Supercomputing: An Interim Report,”
National Research Council, Washington, D.C., The National Academies Press, January 2003.
“
The Semantic Conference Organizer,”
Statistical Data Mining and Knowledge Discovery: CRC Press, 2003.
(998.12 KB)
“

VisPerf: Monitoring Tool for Grid Computing,”
Lecture Notes in Computer Science, vol. 2659: Springer Verlag, Heidelberg, pp. 233-243, 2003.
(835.09 KB)
“

NetSolve: Past, Present, and Future - A Look at a Grid Enabled Server,”
Making the Global Infrastructure a Reality: Wiley Publishing, 2003.
(158.19 KB)
“

Automatic Translation of Fortran to JVM Bytecode,”
Concurrency and Computation: Practice and Experience, vol. 15, no. 3-5, pp. 202-207, 2003.
(185.8 KB)
“

2002
An Updated Set of Basic Linear Algebra Subprograms (BLAS),”
ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 135-151, December 2002.
DOI: 10.1145/567806.567807
(228.33 KB)
“

Experiments with Scheduling Using Simulated Annealing in a Grid Environment,”
Grid Computing - GRID 2002, Third International Workshop, vol. 2536, Baltimore, MD, Springer, pp. 232-242, November 2002.
(66.91 KB)
“

GridRPC: A Remote Procedure Call API for Grid Computing,”
ICL Technical Report, no. ICL-UT-02-06, November 2002.
(287.73 KB)
“

NetBuild: Transparent Cross-Platform Access to Computational Software Libraries,”
Concurrency and Computation: Practice and Experience, Special Issue: Grid Computing Environments, vol. 14, no. 13-15, pp. 1445-1456, November 2002.
(74.84 KB)
“

Parallelizing the Divide and Conquer Algorithm for the Symmetric Tridiagonal Eigenvalue Problem on Distributed Memory Architectures,”
SIAM Journal on Scientific Computing, vol. 6, no. 20, pp. 2223-2236, October 2002.
(321.36 KB)
“

Optimization System Using Grid RPC,”
Meeting of the Japan Society of Mechanical Engineers, Kyoto University, Kyoto, Japan, October 2002.
“
The Internet BackPlane Protocol: A Study in Resource Sharing,”
Proceedings of the second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID 2002), Berlin, Germany, October 2002.
“
JLAPACK - Compiling LAPACK Fortran to Java,”
Scientific Programming, vol. 7, no. 2, pp. 111-138, October 2002.
(307.46 KB)
“

Truss Structural Optimization Using NetSolve System,”
Meeting of the Japan Society of Mechanical Engineers, Kyoto University, Kyoto, Japan, October 2002.
(450.65 KB)
“

Algorithmic Redistribution Methods for Block Cyclic Decompositions,”
IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 12, pp. 201-220, October 2002.
(524.82 KB)
“

The Virtual Instrument: Support for Grid-enabled Scientific Simulations,”
Journal of Parallel and Distributed Computing (submitted), October 2002.
(282.16 KB)
“

The Marketplace for High-Performance Computers,”
Parallel Computing, vol. 25, no. 13-14, pp. 1517-1545, October 2002.
(285.78 KB)
“

Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments,”
Journal of Parallel and Distributed Computing, vol. 98, no. 1, pp. 68-91, October 2002.
(266.82 KB)
“

A Comparison of Parallel Solvers for General Narrow Banded Linear Systems,”
Parallel and Distributed Computing Practices, vol. 2, pp. 385-400, October 2002.
(304.96 KB)
“

Biannual Top-500 Computer Lists Track Changing Environments for Scientific Computing,”
SIAM News, vol. 34, no. 9, October 2002.
(2.62 MB)
“

A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures,”
SIAM Journal on Scientific Computing, vol. 16, no. 2, pp. 284-311, October 2002.
(224.7 KB)
“

Adaptive Scheduling for Task Farming with Grid Middleware,”
International Journal of Supercomputer Applications and High-Performance Computing, vol. 13, no. 3, pp. 231-240, October 2002.
(461.08 KB)
“

Numerical Libraries and Tools for Scalable Parallel Cluster Computing,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 2, pp. 175-180, October 2002.
(37.38 KB)
“

Automatic Optimisation of Parallel Linear Algebra Routines in Systems with Variable Load,”
EuroPar 2002, Paderborn, Germany, August 2002.
(92.59 KB)
“

Middleware for the Use of Storage in Communication,”
Parallel Computing, vol. 28, no. 12, pp. 1773-1788, August 2002.
(87.97 KB)
“

A Metascheduler For The Grid,”
Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002), Edinburgh, Scotland, IEEE Computer Society, pp. 343-351, July 2002.
(99.53 KB)
“

Users' Guide to NetSolve v1.4.1,”
ICL Technical Report, no. ICL-UT-02-05, June 2002.
(328.01 KB)
“

Development of the PICMSS NetSolve Service,”
ICL Technical Report, no. ICL-UT-02-04, April 2002.
(328.44 KB)
“

Toward a Framework for Preparing and Executing Adaptive Grid Programs,”
International Parallel and Distributed Processing Symposium: IPDPS 2002 Workshops, Fort Lauderdale, FL, pp. 0171, April 2002.
(64.5 KB)
“

A Comparison of Counting and Sampling Modes of Using Performance Monitoring Hardware,”
International Conference on Computational Science (ICCS 2002), Amsterdam, Netherlands, Springer, April 2002.
DOI: 10.1007/3-540-46080-2_95
(122 KB)
“

Innovations of the NetSolve Grid Computing System,”
Concurrency: Practice and Experience, vol. 14, no. 13-15, pp. 1457-1479, January 2002.
(311.31 KB)
“

HARNESS Fault Tolerant MPI Design, Usage and Performance Issues,”
Future Generation Computer Systems, vol. 18, no. 8, pp. 1127-1142, January 2002.
(403.41 KB)
“

Polynomial Acceleration of Optimised Multi-grid Smoothers; Basic Theory,”
ICL Technical Report, vol. 156, no. ICL-UT-02-03, January 2002.
(100.66 KB)
“

Basic Linear Algebra Subprograms Technical (BLAST) Forum Standard,”
International Journal of High Performance Computing Applications: Special Issue - Part I & II, vol. 16, no. 1-2, pp. 1-199, January 2002.
“
Hardware Software Server in NetSolve,”
ICL Technical Report, no. ICL-UT-02-02, January 2002.
(221.4 KB)
“

Overview of GridRPC: A Remote Procedure Call API for Grid Computing,”
Proceedings of the Third International Workshop on Grid Computing, pp. 274-278, January 2002.
(221.82 KB)
“

Deploying Parallel Numerical Library Routines to Cluster Computing in a Self Adapting Fashion,”
Parallel Computing: Advances and Current Issues: Proceedings of the International Conference ParCo2001, London, England, Imperial College Press, January 2002.
(381.89 KB)
“

Active Netlib: An Active Mathematical Software Collection for Inquiry-based Computational Science and Engineering Education,”
Journal of Digital Information special issue on Interactivity in Digital Libraries, vol. 2, no. 4, 2002.
(182.59 KB)
“

An Iterative Solver Benchmark,”
Scientific Programming (to appear), 2002.
(142.67 KB)
“

Self-adapting Numerical Software for Next Generation Applications (LAPACK Working Note 157),”
ICL Technical Report, no. ICL-UT-02-07, 2002.
(475.94 KB)
“

2001
Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries,”
Journal of Parallel and Distributed Computing, vol. 61, no. 12, pp. 1803-1826, December 2001.
(386.37 KB)
“

Network-Enabled Solvers: A Step Toward Grid-Based Computing,”
SIAM News, vol. 34, no. 10, December 2001.
“
High Performance Computing Trends,”
HERMIS, vol. 2, pp. 155-163, November 2001.
“
Logistical Computing and Internetworking: Middleware for the Use of Storage in Communication,”
submitted to SC2001, Denver, Colorado, November 2001.
(41.79 KB)
“

Performance Modeling for Self Adapting Collective Communications for MPI,”
LACSI Symposium 2001, Santa Fe, NM, October 2001.
(105.49 KB)
“

Review of Performance Analysis Tools for MPI Parallel Programs,”
European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting, Lecture Notes in Computer Science 2131, Greece, Springer Verlag, Berlin, pp. 241-248, September 2001.
DOI: 10.1007/3-540-45417-9_34
(39.61 KB)
“

Parallel IO Support for Meta-Computing Applications: MPI_Connect IO Applied to PACX-MPI,”
8th European PVM/MPI User's Group Meeting, Lecture Notes in Computer Science, vol. 2131, Greece, Springer Verlag, Berlin, September 2001.
(129.3 KB)
“

End-user Tools for Application Performance Analysis, Using Hardware Counters,”
International Conference on Parallel and Distributed Computing Systems, Dallas, TX, August 2001.
(306.54 KB)
“

Automatic Translation of Fortran to JVM Bytecode,”
Joint ACM Java Grande - ISCOPE 2001 Conference (submitted), Stanford University, California, June 2001.
(185.8 KB)
“

Using PAPI for Hardware Performance Monitoring on Linux Systems,”
Conference on Linux Clusters: The HPC Revolution, Urbana, Illinois, Linux Clusters Institute, June 2001.
(422.35 KB)
“

The PAPI Cross-Platform Interface to Hardware Performance Counters,”
Department of Defense Users' Group Conference Proceedings, Biloxi, Mississippi, June 2001.
(328.56 KB)
“

Metacomputing Support for the SARA3D Structural Acoustics Application,”
Department of Defense Users' Group Conference (to appear), Biloxi, Mississippi, June 2001.
(64.58 KB)
“

Parallel I/O for EQM Applications,”
Department of Defense Users' Group Conference Proceedings (to appear), Biloxi, Mississippi, June 2001.
(81.41 KB)
“

The Quest for Petascale Computing,”
Computing in Science and Engineering, vol. 3, no. 3, pp. 32-39, May 2001.
(178.3 KB)
“

Enabling Full Service Surrogates Using the Portable Channel Representation,”
Tenth International World Wide Web Conference Proceedings (to appear), Hong Kong, May 2001.
(267.23 KB)
“

Network-Enabled Server Systems: Deploying Scientific Simulations on the Grid,”
2001 High Performance Computing Symposium (HPC'01), part of the Advanced Simulation Technologies Conference, Seattle, Washington, April 2001.
(175.23 KB)
“

Grid-Enabling Problem Solving Environments: A Case Study of SCIRUN and NetSolve,”
Proceedings of the High Performance Computing Symposium (HPC 2001) in 2001 Advanced Simulation Technologies Conference, Seattle, Washington, Society for Modeling and Simulation International, April 2001.
(144.19 KB)
“

Basic Linear Algebra Subprograms (BLAS),”
(an update), submitted to ACM TOMS, February 2001.
(228.33 KB)
“

Internet Backplane Protocol: API 1.0,”
University of Tennessee Computer Science Technical Report, no. UT-CS-01-464, January 2001.
(55.33 KB)
“

NetBuild,”
University of Tennessee Computer Science Technical Report, no. UT-CS-01-461, January 2001.
(17.71 KB)
“

On the Convergence of Computational and Data Grids,”
Parallel Processing Letters, vol. 11, no. 2-3, pp. 187-202, January 2001.
(213.35 KB)
“

Overview of High Performance Computers,”
Handbook of Massive Data Sets: Kluwer Academic Publishers, pp. 791-852, January 2001.
(442.71 KB)
“

HARNESS and Fault Tolerant MPI,”
Parallel Computing, vol. 27, no. 11, pp. 1479-1496, January 2001.
(164.2 KB)
“

Automated Empirical Optimization of Software and the ATLAS Project,”
Parallel Computing, vol. 27, no. 1-2, pp. 3-25, January 2001.
(370.71 KB)
“

Numerical Libraries and The Grid: The GrADS Experiments with ScaLAPACK,”
University of Tennessee Computer Science Technical Report, no. UT-CS-01-460, January 2001.
(91.78 KB)
“

The GrADS Project: Software Support for High-Level Grid Application Development,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 4, pp. 327-344, January 2001.
(271.52 KB)
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Technical Report, no. CS-89-85, January 2001.
(6.42 MB)
“

Numerical Libraries and The Grid,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 4, pp. 359-374, January 2001.
(67.09 KB)
“

Internet Backplane Protocol - Test Language v. 1.0,”
University of Tennessee Computer Science Technical Report, no. UT-CS-01-464, January 2001.
(22.43 KB)
“

Numerical Libraries and Tools for Scalable Parallel Cluster Computing,”
International Journal of High Performance Applications and Supercomputing, vol. 15, no. 2, pp. 175-180, January 2001.
(37.38 KB)
“

Automatic Determination of Matrix-Blocks,”
LAPACK Working Note 151, University of Tennessee Computer Science Technical Report, no. UT-CS-01-458, January 2001.
(1.15 MB)
“

Fault Tolerant MPI for the HARNESS Meta-Computing System,”
Proceedings of International Conference of Computational Science - ICCS 2001, Lecture Notes in Computer Science, vol. 2073, Berlin, Springer Verlag, pp. 355-366, 2001.
DOI: 10.1007/3-540-45545-0_44
“
RIBAPI - Repository in a Box Application Programmer's Interface,”
University of Tennessee Computer Science Technical Report, no. UT-CS-00-438, 2001.
(57.5 KB)
“

Recursive Approach in Sparse Matrix LU Factorization,”
Scientific Programming, vol. 9, no. 1, pp. 51-60, 2001.
(217.16 KB)
“

Measuring Computer Performance: A Practitioner's Guide,”
SIAM Review (book review), vol. 43, no. 2, pp. 383-384, 2001.
(558.9 KB)
“

Iterative Solver Benchmark (LAPACK Working Note 152),”
Scientific Programming, vol. 9, no. 4, pp. 223-231, 2001.
(168.05 KB)
“

Repository in a Box Toolkit for Software and Resource Sharing,”
University of Tennessee Computer Science Department Technical Report, no. ICL-UT-05-05, 2001.
(195.96 KB)
“

2000
A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters,”
Proceedings of SuperComputing 2000 (SC'00), Dallas, TX, November 2000.
(178.15 KB)
“

Automatically Tuned Collective Communications,”
Proceedings of SuperComputing 2000 (SC'2000), Dallas, TX, November 2000.
(232.69 KB)
“

Developing an Architecture to Support the Implementation and Development of Scientific Computing Applications,”
to appear in Proceedings of Working Conference 8: Software Architecture for Scientific Computing Applications, Ottawa, Canada, October 2000.
(176.25 KB)
“

Automated Empirical Optimizations of Software and the ATLAS Project (LAPACK Working Note 147),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-00-448, September 2000.
(373.69 KB)
“

A Portable Programming Interface for Performance Evaluation on Modern Processors,”
The International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189-204, September 2000.
DOI: 10.1177/109434200001400303
(655.17 KB)
“

The NetSolve Environment: Progressing Towards the Seamless Grid,”
2000 International Conference on Parallel Processing (ICPP-2000), Toronto, Canada, August 2000.
(148.85 KB)
“

Seamless Access to Adaptive Solver Algorithms,”
Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation, Lausanne, Switzerland, August 2000.
(151.42 KB)
“

A New Recursive Implementation of Sparse Cholesky Factorization,”
Proceedings of 16th IMACS World Congress 2000 on Scientific Computing, Applications Mathematics and Simulation, Lausanne, Switzerland, August 2000.
“
Metacomputing: An Evaluation of Emerging Systems,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-00-445, July 2000.
(280.21 KB)
“

Secure Remote Access to Numerical Software and Computation Hardware,”
University of Tennessee Computer Science Technical Report, no. UT-CS-00-446, July 2000.
(402.31 KB)
“

A Portable Programming Interface for Performance Evaluation on Modern Processors,”
University of Tennessee Computer Science Technical Report, no. UT-CS-00-444, July 2000.
(655.17 KB)
“

Top500 Supercomputer Sites (15th edition),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-00-442, June 2000.
(278.88 KB)
“

Secure Remote Access to Numerical Software and Computational Hardware,”
Proceedings of the DoD HPC Users Group Conference (HPCUG) 2000, Albuquerque, NM, June 2000.
(172.6 KB)
“

Design and Implementation of NetSolve using DCOM as the Remoting Layer,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-00-440, May 2000.
(65.45 KB)
“

Providing Infrastructure and Interface to High Performance Applications in a Distributed Setting,”
ASTC-HPC 2000, Washington, DC, April 2000.
(96.04 KB)
“

The GrADS Project: Software Support for High-Level Grid Application Development,”
Technical Report, February 2000.
(347.41 KB)
“

Recent Advances in Parallel Virtual Machine and Message Passing Interface,”
Lecture Notes in Computer Science: Proceedings of 7th European PVM/MPI Users' Group Meeting 2000, (Hungary: Springer Verlag), pp. V1908, January 2000.
“
Recursive approach in sparse matrix LU factorization,”
Proceedings of 1st SGI Users Conference, Cracow, Poland (ACC Cyfronet UMM, 2000), pp. 409-418, January 2000.
(176.14 KB)
“

FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World,”
Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000, (Hungary: Springer Verlag, 2000), pp. V1908,346-353, January 2000.
(51.95 KB)
“

Performance of Various Computers Using Standard Linear Equations Software (Linpack Benchmark Report),”
University of Tennessee Computer Science Department Technical Report, no. CS-89-85, January 2000.
(354.1 KB)
“

The Design and Implementation of the Parallel Out of Core ScaLAPACK LU, QR, and Cholesky Factorization Routines,”
Concurrency: Practice and Experience, vol. 12, no. 15, pp. 1481-1493, January 2000.
(374.18 KB)
“

Request Sequencing: Optimizing Communication for the Grid,”
Lecture Notes in Computer Science: Proceedings of 6th International Euro-Par Conference 2000, Parallel Processing, (Germany: Springer Verlag 2000), pp. V1900,1213-1222, January 2000.
(165.92 KB)
“

High Performance Computing Today,”
FOMMS 2000: Foundations of Molecular Modeling and Simulation Conference (to appear), January 2000.
(66 KB)
“

Logistical Networking: Sharing More Than the Wires,”
In Active Middleware Services, Ed. Salim Hariri, Craig A. Lee, Cauligi S. Raghavendra (2000), Kluwer Academic, Norwell, MA, January 2000.
(84.69 KB)
“

Message Passing Software Systems,”
Encyclopedia of Electrical and Electronics Engineering, Supplement 1: John Wiley & Sons, Inc., 2000.
(289.38 KB)
“

1999
The 'Weighted Modification' Incomplete Factorisation Method,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-436, December 1999.
(198.71 KB)
“

On the Existence Problem of Incomplete Factorisation Methods,”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-435, December 1999.
(222.2 KB)
“

Top500 Supercomputer Sites (14th edition),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-434, November 1999.
(281.81 KB)
“

Numerical Linear Algebra Algorithms and Software,”
Journal of Computational and Applied Mathematics, vol. 123, no. 1-2, pp. 489-514, October 1999.
(258.62 KB)
“

Deploying Fault-tolerance and Task Migration with NetSolve,”
Future Generation Computer Systems, vol. 15, no. 5-6: Elsevier, pp. 745-755, October 1999.
(236 KB)
“

Numerical Linear Algebra,”
Encyclopedia of Computer Science and Technology, eds. Kent, A., Williams, J., vol. 41, pp. 207-233, August 1999.
(262 KB)
“

PAPI: A Portable Interface to Hardware Performance Counters,”
Proceedings of Department of Defense HPCMP Users Group Conference, June 1999.
(57.77 KB)
“

Top500 Supercomputer Sites (13th edition),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-425, June 1999.
(278.51 KB)
“

A Numerical Linear Algebra Problem Solving Environment Designer's Perspective (LAPACK Working Note 139),”
SIAM Annual Meeting, Atlanta, GA, May 1999.
(319.71 KB)
“

Portable Representation of Internet Content Channels in I2-DSI,”
4th Intl. Web Caching Workshop, San Diego, CA, March 1999.
“
Experiences with Windows 95/NT as a Cluster Computing Platform for Parallel Computing,”
Parallel and Distributed Computing Practices, Special Issue: Cluster Computing, vol. 2, no. 2: Nova Science Publishers, USA, pp. 119-128, February 1999.
(164.04 KB)
“

IBP - Internet Backplane Protocol: Infrastructure for Distributed Storage (V 0.2),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-430, February 1999.
(37.72 KB)
“

A Comparison of Parallel Solvers for Diagonally Dominant and General Narrow Banded Linear Systems II (LAPACK Working Note 143),”
University of Tennessee Computer Science Department Technical Report, no. UT-CS-99-415, January 1999.
(174.46 KB)
“

HARNESS: A Next Generation Distributed Virtual Machine,”
International Journal on Future Generation Computer Systems, vol. 15, no. 5-6, pp. 571-582, January 1999.
(183.78 KB)
“

Numerical Libraries and Tools for Scalable Parallel Cluster Computing,”
IEEE Cluster Computing BOF at SC99, Portland, Oregon, January 1999.
(37.38 KB)
“

LAPACK Users' Guide, 3rd ed.,”
Philadelphia: Society for Industrial and Applied Mathematics, January 1999.
“
Parallel and Distributed Scientific Computing: A Numerical Linear Algebra Problem Solving Environment Designer's Perspective,”
Handbook on Parallel and Distributed Processing, January 1999.
(323.01 KB)
“

Towards An Efficient, Scalable Replication Mechanism for the I2-DSI Project,”
University of North Carolina School of Library and Information Science Technical Report, no. TR-1999-01, January 1999.
“
Logistical Quality of Service in NetSolve,”
Computer Communications, vol. 22, no. 11, pp. 1034-1044, January 1999.
(168.39 KB)
“

Static Tiling for Heterogeneous Computing Platforms,”
Parallel Computing, vol. 25, no. 5, pp. 547-568, January 1999.
(301.17 KB)
“

Stochastic Performance Prediction for Iterative Algorithms in Distributed Environments,”
Journal of Parallel and Distributed Computing, vol. 98, no. 1, pp. 68-91, January 1999.
(257.5 KB)
“

Scalable Networked Information Processing Environment (SNIPE),”
Journal on Future Generation Computer Systems, vol. 15, no. 5-6, pp. 595-605, January 1999.
(189.21 KB)
“

Algorithmic Issues on Heterogeneous Computing Platforms,”
Parallel Processing Letters, vol. 9, no. 2, pp. 197-213, January 1999.
(301.17 KB)
“

Tiling on Systems with Communication/Computation Overlap,”
Concurrency: Practice and Experience, vol. 11, no. 3, pp. 139-153, January 1999.
(286.14 KB)
“

A Comparison of Parallel Solvers for General Narrow Banded Linear Systems (LAPACK Working Note 142),”
University of Tennessee Computer Science Technical Report, no. UT-CS-99-414, January 1999.
(304.96 KB)
“

Atlanta Organizers Put Mathematics to Work For the Math Sciences Community,”
SIAM News, vol. 32, no. 6, January 1999.
(45.98 KB)
“

1998
Automatically Tuned Linear Algebra Software,”
1998 ACM/IEEE conference on Supercomputing (SC '98), Orlando, FL, IEEE Computer Society, November 1998.
“
MPI - The Complete Reference, Volume 1: The MPI Core
Second ed., Cambridge, MA, USA, MIT Press, pp. 426, August 1998.
National HPCC Software Exchange (NHSE): Uniting the High Performance Computing and Communications Community,”
D-Lib Magazine, January 1998.
(56.15 KB)
“

Numerical Linear Algebra for High-Performance Computers,”
Software, Environments and Tools: SIAM, 1998.
DOI: 10.1137/1.9780898719611
“
1996
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance,”
Computer Physics Communications, vol. 97, issue 1-2, pp. 1-15, August 1996.
DOI: 10.1016/0010-4655(96)00017-3
“