Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators

Sukkari, Dalal; Gates, Mark; Al Farhan, Mohammed; Anzt, Hartwig; Dongarra, Jack

Title	Task-Based Polar Decomposition Using SLATE on Massively Parallel Systems with Hardware Accelerators
Publication Type	Conference Paper
Year of Publication	2023
Authors	Sukkari, D., M. Gates, M. Al Farhan, H. Anzt, and J. Dongarra
Conference Name	SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
Date Published	2023-11
Publisher	ACM
Conference Location	Denver, CO
ISBN Number	9798400707858
Abstract	We investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks mainly consist of compute-bound matrix operations, allowing for high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which uses efficient techniques in dense linear algebra, such as 2D block-cyclic data distribution and communication-avoiding algorithms, as well as modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies. We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU-accelerated implementation compared to the existing state-of-the-art implementation for the polar decomposition.
URL	https://dl.acm.org/doi/proceedings/10.1145/3624062
DOI	10.1145/3624062.3624248
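
Illustrative sketch (not from the paper): the abstract names the QR-based Dynamically-Weighted Halley (QDWH) iteration as the core of the polar decomposition A = U H. The single-node NumPy code below shows the textbook form of that iteration on a small dense matrix. It is not the SLATE implementation and uses no tiling, task scheduling, or GPUs. The function name qdwh_polar, the Frobenius-norm scaling, the exact computation of the smallest singular value, and the stopping test are all assumptions chosen for brevity; a production code would use cheap estimates instead. The input is assumed to be a real, square, well-conditioned matrix.

```python
import numpy as np


def qdwh_polar(A, tol=1e-12, max_iter=30):
    """Polar decomposition A = U @ H via the QR-based QDWH iteration.

    Hypothetical helper for illustration only. Assumes A is real, square,
    and nonsingular.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[1]
    I = np.eye(n)

    # Scale so that all singular values of X lie in (0, 1].
    # The Frobenius norm is a simple upper bound on the 2-norm.
    alpha = np.linalg.norm(A, "fro")
    X = A / alpha

    # Lower bound on the smallest singular value of X.
    # Computed exactly here (ord=-2); an assumption made for simplicity.
    l = min(np.linalg.norm(X, -2), 1.0)

    for _ in range(max_iter):
        # Dynamically chosen weights a, b, c for this step.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        s = np.sqrt(1.0 + d)
        a = s + 0.5 * np.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * s))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:-n], Q[-n:]

        # Inverse-free update, equivalent to
        # X (a I + b X^T X) (I + c X^T X)^{-1}.
        X_new = (b / c) * X + ((a - b / c) / np.sqrt(c)) * (Q1 @ Q2.T)

        # Update the lower bound; clamp to 1 to guard against rounding.
        l = min(l * (a + b * l2) / (1.0 + c * l2), 1.0)

        converged = (np.linalg.norm(X_new - X, "fro")
                     <= tol * np.linalg.norm(X_new, "fro"))
        X = X_new
        if converged:
            break

    U = X
    H = U.T @ A
    H = 0.5 * (H + H.T)  # enforce symmetry of the positive semidefinite factor
    return U, H
```

A quick check on a small matrix is to confirm that U.T @ U is close to the identity and that U @ H reproduces A. Each step consists of a QR factorization and matrix-matrix products, which is consistent with the abstract's remark that the building blocks are mainly compute-bound matrix operations.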

Project Tags:

slate
