HPL-MxP NOVEMBER 2022

<table>
<thead>
<tr>
<th>RANK</th>
<th>COMPUTER</th>
<th>SITE</th>
<th>CORES</th>
<th>HPL-MxP (EFLOPS)</th>
<th>TOP500 RANK</th>
<th>HPL Rmax (EFLOPS)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Frontier HPE Cray EX235a</td>
<td>DOE/SC/ORNL USA</td>
<td>8,730,112</td>
<td>7.942</td>
<td>1</td>
<td>1.1020</td>
</tr>
<tr>
<td>2</td>
<td>LUMI HPE Cray EX235a</td>
<td>EuroHPC/CSC FINLAND</td>
<td>2,174,976</td>
<td>2.168</td>
<td>3</td>
<td>0.3091</td>
</tr>
<tr>
<td>3</td>
<td>Fugaku Fujitsu A64FX</td>
<td>RIKEN Center for Computational Science JAPAN</td>
<td>7,630,848</td>
<td>2.000</td>
<td>2</td>
<td>0.4420</td>
</tr>
<tr>
<td>4</td>
<td>Leonardo Bull Sequana XH2000</td>
<td>EuroHPC/CINECA ITALY</td>
<td>1,463,616</td>
<td>1.842</td>
<td>4</td>
<td>0.1682</td>
</tr>
<tr>
<td>5</td>
<td>Summit AC922 IBM POWER9</td>
<td>DOE/SC/ORNL USA</td>
<td>2,414,592</td>
<td>1.411</td>
<td>5</td>
<td>0.1486</td>
</tr>
<tr>
<td>6</td>
<td>Selene DGX A100</td>
<td>NVIDIA USA</td>
<td>555,520</td>
<td>0.630</td>
<td>9</td>
<td>0.0630</td>
</tr>
<tr>
<td>7</td>
<td>Perlmutter HPE Cray EX235n</td>
<td>DOE/SC/LBNL/NERSC USA</td>
<td>761,856</td>
<td>0.590</td>
<td>8</td>
<td>0.0709</td>
</tr>
<tr>
<td>8</td>
<td>JUWELS Booster Module Bull Sequana XH2000</td>
<td>Forschungszentrum Jülich (FZJ) GERMANY</td>
<td>449,280</td>
<td>0.470</td>
<td>12</td>
<td>0.0440</td>
</tr>
<tr>
<td>9</td>
<td>Adastra HPE Cray EX235a</td>
<td>GENCI-CINES FRANCE</td>
<td>319,072</td>
<td>0.303</td>
<td>11</td>
<td>0.0461</td>
</tr>
<tr>
<td>10</td>
<td>Setonix HPE Cray EX235a</td>
<td>Pawsey Supercomputing Centre AUSTRALIA</td>
<td>181,248</td>
<td>0.175</td>
<td>15</td>
<td>0.0272</td>
</tr>
</tbody>
</table>

OVERVIEW

The HPL-MxP benchmark seeks to highlight the emerging convergence of high-performance computing (HPC) and artificial intelligence (AI) workloads. Traditional HPC has focused on simulation runs that model phenomena in physics, chemistry, biology, and so on, and the mathematical models that drive these computations require, for the most part, 64-bit accuracy. In contrast, the machine learning methods that fuel advances in AI achieve the desired results with 32-bit and even lower-precision floating-point formats. This lesser demand for accuracy has fueled a resurgence of interest in new hardware platforms that deliver a mix of unprecedented performance levels and energy savings while still achieving the classification and recognition fidelity afforded by higher-accuracy formats.

HPL-MxP strives to unite these two realms by delivering a blend of modern algorithms and contemporary hardware while simultaneously connecting to the solver formulation of the decades-old HPL framework for benchmarking the largest supercomputing installations in the world. The solver method of choice is LU factorization followed by iterative refinement, which brings the solution back to 64-bit accuracy. The innovation of HPL-MxP lies in dropping the requirement of 64-bit computation throughout the entire solution process, opting instead for low-precision (likely 16-bit) arithmetic in the LU factorization and a sophisticated iteration to recover the accuracy lost in the factorization. The iterative method, which is guaranteed to be numerically stable, is the generalized minimal residual method (GMRES), with the L and U factors applied as a preconditioner. The combination of these algorithms is demonstrably sufficient for high accuracy and may be implemented in a way that takes advantage of current and upcoming devices for accelerating AI workloads.

PERFORMANCE

Xgetrf routine

- FP16-TC (Tensor Cores) hgetrf LU
- FP16 hgetrf LU
- FP32 sgetrf LU
- FP64 dgetrf LU

PUBLICATIONS

Azzam Haidar, Stanimire Tomov, Jack Dongarra, Nicholas J. Higham
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
https://dl.acm.org/citation.cfm?id=3291719

Erin Carson and Nicholas J. Higham
Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions

Erin Carson and Nicholas J. Higham
A New Analysis of Iterative Refinement and its Application to Accurate Solution of Ill-Conditioned Sparse Linear Systems
2017.
http://eprints.maths.manchester.ac.uk/2537/

Nicholas J. Higham, Srikara Pranesh, and Mawusi Zounon
Squeezing a Matrix into Half Precision, with an Application to Solving Linear Systems
2018.
http://eprints.maths.manchester.ac.uk/2678/

Pierre Blanchard, Nicholas J. Higham, Florent Lopez, Theo Mary, and Srikara Pranesh
Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores
2018.
http://eprints.maths.manchester.ac.uk/2733/

MIXED-PRECISION BENCHMARK

RULES

The idea of the benchmark is to solve a system of linear equations to 64-bit floating-point accuracy in three steps. First, factor the matrix in mixed precision (LU decomposition). Second, compute an approximate solution from the low-precision factors. Third, refine that approximate solution with an iterative method such as GMRES in 64-bit precision until it reaches the accuracy that LU decomposition in 64-bit floating-point arithmetic would have achieved. The low-precision LU factors should be used as a preconditioner in the iterative algorithm.

The benchmark should use the HPL benchmark harness (https://www.netlib.org/benchmark/hpl/) with a modified matrix generator. The generator will produce a non-symmetric matrix in which each diagonal entry is the sum of the off-diagonal entries in its row; this forces the matrix to be diagonally dominant.
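As a rough illustration of such a generator, the sketch below fills the off-diagonal entries with uniform random values and then sets each diagonal entry to the sum of the absolute values of the off-diagonal entries in its row. The function names, the value range, the seed handling, and the use of absolute values are all assumptions made for this example; the generator in an actual HPL-MxP implementation may differ.

```python
import random


def generate_matrix(n, seed=0):
    """Return an n x n non-symmetric, diagonally dominant matrix as a list of rows."""
    rng = random.Random(seed)
    a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    for i in range(n):
        # Assumption: absolute values are summed so that
        # |a[i][i]| >= sum over j != i of |a[i][j]|.
        a[i][i] = sum(abs(a[i][j]) for j in range(n) if j != i)
    return a


def is_diagonally_dominant(a):
    """Check (weak) row diagonal dominance."""
    return all(
        abs(row[i]) >= sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(a)
    )
```

Diagonal dominance matters here because it keeps the matrix well conditioned, which is what lets a low-precision factorization serve as an effective preconditioner.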

\[ r_n \rightarrow \frac{2}{3} n^3 \cdot \frac{1}{T_{\text{tot}}} \]

In an attempt to obtain uniformity across all computers in performance reporting, the algorithm used in solving the low-precision system of equations in the benchmark procedure must numerically conform to an LU factorization with partial pivoting. In particular, the operation count for the algorithm must be \( \frac{2}{3} n^3 + O(n^2) \) double precision floating point operations even though double-precision arithmetic is not required.

The HPL harness computes a backward error:
\[ \frac{\| Ax - b \|}{\| A \| \| x \| + \| b \|} \times (n \times \varepsilon)^{-1} \]

where \(\varepsilon\) is the machine precision in 64-bit floating point arithmetic (on IEEE machines this is \(\varepsilon = 2^{-53}\)) and \( n \) is the size of the problem. There is no restriction on the problem size.
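A small sketch of how such a scaled residual can be evaluated, assuming the customary HPL form \( \| Ax - b \|_\infty / ((\| A \|_\infty \| x \|_\infty + \| b \|_\infty) \, n \, \varepsilon) \) with infinity norms throughout. The helper names are hypothetical, and the dense list-of-rows representation is chosen only to keep the example self-contained.

```python
EPS = 2.0 ** -53  # 64-bit machine precision as defined in the text


def norm_inf_vec(v):
    """Infinity norm of a vector: the largest absolute entry."""
    return max(abs(t) for t in v)


def norm_inf_mat(a):
    """Infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(t) for t in row) for row in a)


def scaled_residual(a, x, b):
    """||Ax - b|| / ((||A|| ||x|| + ||b||) * n * eps), all infinity norms."""
    n = len(b)
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(a, b)]
    denom = (norm_inf_mat(a) * norm_inf_vec(x) + norm_inf_vec(b)) * n * EPS
    return norm_inf_vec(r) / denom
```

A value of order one indicates that the computed solution is as accurate as 64-bit arithmetic allows. HPL conventionally accepts a run when this quantity falls below a small threshold (16.0 in the default HPL.dat); whether a given HPL-MxP implementation uses the same threshold should be checked against that implementation.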

The implementation is allowed to do balancing to get the numbers within range of the floating point format, but the time to do the balancing must be included in the time to solution.

The factorization can use mixed precision during its construction, e.g., the panel factorization and triangular solves can be done in 32-bit arithmetic and the Schur complement (matrix-matrix multiply) can be computed in 16-bit arithmetic with 32-bit accumulation.
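The following sketch emulates a Schur complement update with 16-bit inputs and 32-bit accumulation. It is only an emulation under stated assumptions: roundings are simulated with the standard `struct` module rather than performed by hardware, and the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def schur_update(c, a, b):
    """Return C - A @ B with A and B rounded to FP16 and sums kept in FP32."""
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    out = []
    for i, c_row in enumerate(c):
        out_row = []
        for j, c_ij in enumerate(c_row):
            acc = to_fp32(c_ij)
            for k in range(len(b16)):
                # The product of two FP16 values is exact in FP64; only the
                # running sum is rounded, and it is rounded to FP32.
                acc = to_fp32(acc - a16[i][k] * b16[k][j])
            out_row.append(acc)
        out.append(out_row)
    return out
```

In this emulation, `struct.pack` raises `OverflowError` for values outside the FP16 range, which is the situation that balancing is meant to prevent.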

The computation rate is based on the time to solve the problem: factor the matrix in lower precision, perhaps balance the matrix to prevent overflow, perform GMRES in 64-bit floating point arithmetic using the LU factors as a preconditioner. If the implementation takes more than 50 iterations, the method should trigger a failure and the run is not valid.
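The overall procedure can be sketched in a few dozen lines of plain Python. Several simplifications are assumed and should be read as such: the factorization is rounded to FP32 rather than FP16, no balancing is performed, and classical iterative refinement (a Richardson iteration preconditioned by the LU factors) stands in for GMRES. The function names and the tolerance are hypothetical; only the cap of 50 iterations comes from the rule above.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def lu_low_precision(a):
    """LU with partial pivoting; every stored entry is rounded to FP32."""
    n = len(a)
    lu = [[to_fp32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_fp32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_fp32(lu[i][j] - lu[i][k] * lu[k][j])
    return lu, perm


def apply_lu(lu, perm, b):
    """Solve L U y = P b in FP64 using the low-precision factors."""
    n = len(b)
    y = [b[p] for p in perm]
    for i in range(n):  # forward substitution with a unit lower triangle
        y[i] -= sum(lu[i][j] * y[j] for j in range(i))
    for i in reversed(range(n)):  # back substitution
        y[i] = (y[i] - sum(lu[i][j] * y[j] for j in range(i + 1, n))) / lu[i][i]
    return y


def solve_mixed(a, b, max_iter=50, tol=1e-14):
    """Return (x, iterations); raise if max_iter refinement steps are not enough."""
    lu, perm = lu_low_precision(a)
    x = apply_lu(lu, perm, b)
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_b = max(abs(v) for v in b)
    for it in range(max_iter + 1):
        # Residual in FP64, the working precision of the refinement.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]
        norm_x = max(abs(v) for v in x)
        if max(abs(v) for v in r) <= tol * (norm_a * norm_x + norm_b):
            return x, it
        if it == max_iter:
            break
        d = apply_lu(lu, perm, r)  # correction from the low-precision factors
        x = [xi + di for xi, di in zip(x, d)]
    raise RuntimeError("run invalid: more than %d iterations" % max_iter)
```

For a well-conditioned, diagonally dominant matrix, a handful of refinement steps is typically enough; GMRES is preferred in the benchmark because it remains robust when the low-precision factors are a poorer approximation.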

In computing a rate of execution, \( \frac{2}{3} n^3 + \frac{3}{2} n^2 \) operations (\( \frac{2}{3} n^3 - \frac{1}{2} n^2 \) accounts for the LU factorization and \( 2n^2 \) for the subsequent back- and forward-solves) will be divided by the complete time to solution to obtain operations per second.
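The arithmetic can be written out as a short helper. This is a minimal sketch with hypothetical function names; the time passed in is assumed to be the complete time to solution in seconds.

```python
def hpl_mxp_flops(n):
    """Operation count credited to a run of size n: 2/3 n^3 + 3/2 n^2."""
    lu_flops = (2.0 * n**3) / 3.0 - (n**2) / 2.0  # LU factorization
    solve_flops = 2.0 * n**2  # forward and back solves
    return lu_flops + solve_flops


def rate_flops_per_second(n, seconds):
    """Credited operations divided by the complete time to solution."""
    return hpl_mxp_flops(n) / seconds
```

For a hypothetical run with n = 1,000,000 that finishes in 100 seconds, `rate_flops_per_second(1_000_000, 100.0)` gives roughly 6.7e15 operations per second. The same operation count is credited regardless of the precision actually used or the number of refinement iterations performed.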

As part of the submission of results we expect the submitter to provide a detailed explanation of the algorithm used in the submission.

We have provided a reference implementation whose purpose is to show how the benchmark could be implemented. We do not expect it to be used for actual benchmark runs; optimizations should be applied to achieve higher performance than the reference implementation can deliver. The reference implementation can be found on Bitbucket (https://bitbucket.org/icl/hpl-ai/).