# Agenda

- Part 1: Heterogeneous Computing
- Part 2: Mixed-Precision Scientific Computing with Tensor Cores
- Closing Remarks

#### Heterogeneous Hardware

Specialization of hardware components

- heterogeneous
- adjective, US: /ˌhetjə.roʊˈdʒiː.ni.əs/ UK: /ˌhet.ər.əˈdʒiː.ni.əs/
- consisting of parts or things that are very different from each other

[Figure: specialized hardware components. GPU (L0 instruction cache, dispatch unit (32 thread/clk), register file (16,384 x 32-bit), INT32/FP32/FP64 units, 4th-generation Tensor Core), CPU, memory, and communication (multi-chip: NVLink/NVSwitch; multi-node: NVLink/NVSwitch, InfiniBand)]

#### Global Access to All Data

Cache-coherent access via NVLink C2C from either processor to either physical memory

Grace directly reading Hopper's memory:

- CPU fetches GPU data into CPU L3 cache
- Cache remains **coherent** with GPU memory
- Changes to GPU memory **evict** cache line

Hopper directly reading Grace's memory:

- GPU loads CPU data via CPU L3 cache
- CPU and GPU can both hit on cached data
- Changes to CPU memory update cache line

### Grace/Hopper Unified Memory

Address Translation Service (ATS) allows full access to all CPU & GPU allocations

- ATS creates a single page table for the whole system
- NVLink C2C allows access to all physical memory without migration
- Hopper can access Grace memory at **full CPU memory speed** of 500 GB/s
- But Hopper can access its own memory at **full HBM speed** of 4000 GB/s
- The system can **automatically** migrate both managed and CPU-allocated memory to optimize access speed
- The ATS shared page table means that both CPU and GPU automatically access X in its new location after migration

### Grace/Hopper SuperChip Performance On Different Workloads

# Multi-Chip Systems

### NVLink Connects Up To 256 Superchips

- Multi-die: NVLink C2C
- Multi-chip: NVLink + NVSwitch
- Multi-node (MNNVL): NVLink + NVSwitch up to 256 GPUs + InfiniBand

#### Multi-Node NVLink
cuFFTMp

#### Performance Projections

- MNNVL offers significant performance gains on multi-GPU multi-node (MGMN) FFTs
- Performance drops between 128 and 256 GPUs because InfiniBand (IB) is used (green bars, left to right)
- Using CPU memory on Grace Hopper allows for bigger sizes: 32k<sup>3</sup> starting at 1024 GPUs

#### Tensor Core Performance Across GPU Generations

## What types of things can we do to take advantage of Tensor Cores?

#### Non-Exhaustive Short List

- Tensor Cores have given mixed-precision algorithm developers increased opportunities and motivation [1]
- Algorithms that facilitate drop-in replacement for common functions
  - Matrix-multiply implementations
    - Emulate full range and accuracy
      - Works for all use cases and can be made default
    - Emulate partial range and accuracy [2]
      - Works for some use cases but cannot be made default
  - FFTs [3]
- New algorithms that use mixed precision
  - Iterative refinement linear system solver with fallbacks [4]
    - Can be a drop-in for LAPACK <T>GETRF/S

<sup>[1]</sup> Abdelfattah A, Anzt H, Boman EG, et al. A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. The International Journal of High Performance Computing Applications. 2021;35(4):344-369. doi:10.1177/10943420211003313

<sup>[2]</sup> Ootomo H, Yokota R. Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance. 2022. https://arxiv.org/pdf/2203.03341.pdf

<sup>[3]</sup> Pisha L, Ligowski Ł. Accelerating non-power-of-2 size Fourier transforms with GPU Tensor Cores. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Portland, OR, USA, 2021, pp. 507-516. doi:10.1109/IPDPS49936.2021.00059

<sup>[4]</sup> Haidar A, Bayraktar H, Tomov S, Dongarra J, Higham NJ. Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems. Proc. R. Soc. A. 2020;476:20200110. <a href="https://royalsocietypublishing.org/doi/10.1098/rspa.2020.0110">https://royalsocietypublishing.org/doi/10.1098/rspa.2020.0110</a>

#### Tensor Core Accelerated Iterative Refinement Solver

FP64 accuracy linear system solution

- Iterative refinement method [1]
  - Dense linear solver for Ax=b
  - Can substitute <T>GETRF/S from LAPACK
- Main idea:
  - Move the O(n³) operations of the LU factorization to mixed-precision Tensor Cores
  - Use the factors as a preconditioner for an iterative solver at FP64, where operations have O(n²) complexity
- Reduces reliance on FP64 compute throughput while delivering better performance
- Top500 benchmark HPL-MxP (formerly HPL-AI) uses this approach

#### Tensor Core Accelerated Iterative Refinement Solver

Speed-ups relative to full FP64 baseline solver

#### Goal

#### Accelerate single precision (FP32) matrix multiplies without any loss of range or accuracy

- Algorithm that uses BF16 Tensor Cores
- Study accuracy of implementation for corner cases
- Prototype implementation in cuBLAS and cuTENSOR
- Validate accuracy with real-world applications
  - Weather Forecasting
  - Quantum Computing
  - Condensed Matter Physics

Baseline implementations in cuBLAS and cuTENSOR use single precision IEEE 754 FMA instructions

### Algorithm Description

BF16/9 Algorithm

- The FP32 inputs are decomposed into 3 scaled BF16 components

$$a = a_0 + 2^{-8} a_1 + 2^{-16} a_2$$

$$b = b_0 + 2^{-8} b_1 + 2^{-16} b_2$$

- The inputs are decomposed using the CUDA cores
- The multiply-add operation is computed as a sum of 9 scaled partial products

$$\begin{aligned}
a \cdot b + c ={}& a_0 b_0 + 2^{-8} a_0 b_1 + 2^{-16} a_0 b_2 \\
&+ 2^{-8} a_1 b_0 + 2^{-16} a_1 b_1 + 2^{-24} a_1 b_2 \\
&+ 2^{-16} a_2 b_0 + 2^{-24} a_2 b_1 + 2^{-32} a_2 b_2 + c
\end{aligned}$$

- The partial products are computed in the BF16 Tensor Cores
- The partial products are scaled appropriately in the CUDA cores
- The Tensor Cores and CUDA cores work in parallel
- The effective FP32 FLOP rate is 1/9<sup>th</sup> of the BF16 Tensor Core FLOP rate
- On H100: 119 TFLOP/s effective vs 67 TFLOP/s native FP32 → ~1.8X maximum speed-up

### Measuring Accuracy

Smoke test: testing different uniform-exponent combinations

Matrix multiplication dimensions: $[512 \times 2048] = [512 \times 1024] * [1024 \times 2048]$

$$RMS = \sqrt{\frac{\sum_{r,c} \left(Result_{r,c} - Result^{fp64}_{r,c}\right)^{2}}{\sum_{r,c} \left(Result^{fp64}_{r,c}\right)^{2}}} \qquad SNR = -20 \log_{10} \sqrt{\frac{P_{err}}{P_{ref}}} = -20 \log_{10} RMS$$

#### Numerical Accuracy Study

Results for two different data sets

[1] Hiroyuki Ootomo and Rio Yokota, Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance, 2022 <a href="https://arxiv.org/pdf/2203.03341.pdf">https://arxiv.org/pdf/2203.03341.pdf</a>

#### Weather Forecast Simulation

Accelerate Spectral Transforms

- IFS: Integrated Forecasting System from ECMWF
- Weather models are moving to higher and higher resolution (e.g. TCo3999 ~ 2.5 km global resolution). The spectral transform is a major bottleneck (>50% of the cost)
- Spectral transform = All2All + FFT + All2All + GEMM
  - Scaling bottlenecks are All2All and GEMMs
- Mid-term, we aim to run weather models at a resolution of ~1 km because that would allow explicitly resolving convection
- Ectrans is the spectral transform library extracted from the full model.
  <a href="https://github.com/ecmwf-ifs/ectrans">https://github.com/ecmwf-ifs/ectrans</a>
- We use Ectrans to check the accuracy of our emulated FP32 matrix multiply implementation

#### Weather Forecast Simulation

#### Accelerate Spectral Transforms

- Absolute deviations after 1000 iterations of forward and backward spherical transforms
- Ground truth is the original input
- In all cases, BF16X9 gives at least similar results, and for some variables significantly superior ones
- Velocities can behave differently from temperature because they have an additional conversion from u/v (in grid-point space) to divergence/vorticity (in spectral space)
- TF32 has significant (too large) deviations for temperature and explodes for velocities
- The TF32/3 implementation here is without scaling

#### Quantum Computing Simulations

Accelerate Tensor Network Contractions

- Validation of a quantum processor requires cross-entropy benchmarking against a simulation of a random quantum circuit of specific depth
- Simulation of a quantum circuit can be reduced to the contraction of a tensor network representing the circuit
- The cuTensorNet library from the cuQuantum SDK accelerates tensor network contractions on NVIDIA GPUs
- A tensor network contraction is expressed as a sequence of pairwise tensor contractions executed by the cuTENSOR library
- Algorithm development and validation of modern quantum chips is an extremely computationally demanding task that benefits significantly from GPU acceleration

#### Google's Sycamore Quantum Chip (53 qubits)

Arute, F., Arya, K., Babbush, R. *et al.* Quantum supremacy using a programmable superconducting processor. *Nature* **574**, 505–510 (2019).
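The statement that a tensor network contraction is a sequence of pairwise contractions can be illustrated with a minimal pure-Python sketch. It is not the cuTensorNet or cuTENSOR API: the helper names `contract_pair` and `full_trace` and the tiny tensors `A`, `B`, `C` are hypothetical, chosen only to show a closed three-tensor network reduced to a scalar along two different contraction paths.

```python
def contract_pair(a, b):
    """Pairwise contraction over the shared index j: out[i][k] = sum_j a[i][j] * b[j][k]."""
    inner = len(b)
    return [[sum(a[i][j] * b[j][k] for j in range(inner))
             for k in range(len(b[0]))]
            for i in range(len(a))]


def full_trace(t, c):
    """Close the network: sum over i and k of t[i][k] * c[k][i], leaving a scalar."""
    return sum(t[i][k] * c[k][i]
               for i in range(len(t))
               for k in range(len(c)))


# Hypothetical rank-2 tensors forming the closed network A[i,j] B[j,k] C[k,i].
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

# Two contraction paths through the same network.
path_1 = full_trace(contract_pair(A, B), C)  # contract A with B first
path_2 = full_trace(A, contract_pair(B, C))  # contract B with C first

print(path_1, path_2)
```

Each pairwise step here is an ordinary matrix multiply, which is why such contractions map onto GEMM-style hardware. With integer inputs both paths agree exactly; with floating-point inputs the two paths need not agree bit-for-bit, because the rounding order differs.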
#### Quantum Computing Simulations

#### Accelerate Tensor Network Contractions

We simulated the 53-qubit Sycamore quantum chip with 12 layers of random gates and computed probability amplitudes for 64 bit-strings

$$\beta = \langle \psi_r | U | \psi_0 \rangle$$

- 649,216 total pairwise tensor contractions
- 1.7% of the tensor contractions account for 95% of the total 0.83 PFLOPs (k-dim >= 16)
  - We offload these to BF16/9
- The relative error of the computed probability amplitudes, compared to the FP64 baseline, is smaller with BF16/9 than with FP32
- The variation of amplitude values due to the use of different tensor network contraction paths for FP32 compute introduces larger differences than BF16/9

#### Condensed Matter Physics Simulations

#### Accelerate Tensor Network Contractions

- Simulating the electronic structure of materials is an extremely complex and computationally demanding task
- Phenomenological models and Hamiltonians are widely used to reduce the complexity of the task
- The dimensionality of the corresponding linear Hilbert space grows **exponentially** with the number of simulated spins or electrons, thus mandating approximate solutions
- Tensor network theory provides a powerful, systematic theoretical framework for approximating spin or electronic states of materials in high-dimensional linear spaces
- The regular **linear** and **eigen**-solvers can be reformulated in the language of tensor network theory, resulting in a drastic reduction of the computational cost due to the use of **tensor factorization**
- The transverse-field Ising spin Hamiltonian is used as a paradigmatic model for simulating a broad range of quantum phenomena

Transverse-field spin Ising Hamiltonian

$$H = -c \sum_{i} z_{i} z_{i+1} - g \sum_{i} x_{i}$$

#### Condensed Matter Physics Simulations

Accelerate Tensor Network Contractions

- We simulated the 16-site transverse-field Ising Hamiltonian
- The ground spin state is factorized as a binary tensor tree with maximal bond dimension of 16
- The variational optimization of the ground spin state involves two main numerical steps: (1) tensor network contraction; (2) modified Gram-Schmidt orthogonalization
- The modified Gram-Schmidt orthogonalization step must be computed in FP64, otherwise the solver may diverge
- Most FLOPs are spent in FP32 tensor network contractions → offload to BF16/9
- The ground state energy is computed as a reduction over many terms → expect cancellation of individual term errors
- The resulting ground state energy is the same up to the 5th digit after the decimal point: FP32 and BF16/9 produce about the same error (< 1e-5) compared to full FP64

[Figure label: Individual spin sites (physical degrees of freedom)]

# Ground state energy of the 16-site Ising Hamiltonian

All runs converged to the same 2E-6 tolerance

| Precision | Energy       | Error |
|-----------|--------------|-------|
| FP64      | -17.02418(9) | 0     |
| FP32      | -17.02419(7) | 8e-6  |
| BF16/9    | -17.02418(3) | 6e-6  |

#### Concluding Remarks

#### Heterogeneous Computing At Multiple Levels

- Heterogeneity for HPC is a reality at many different levels
  - Within a server and processor
  - Data storage, locality and access
  - Across the network that connects processors
  - Software stack
  - Within algorithms
- There are many challenges and opportunities for developing high-performance software
  - ... computing is not a chip problem. It's a software and chip problem
- Tensor Cores that power AI can also be leveraged to transparently accelerate applications that require higher precision without any loss of accuracy
- TF32/3, BF16/9, and similar algorithms can be extended to create new range and precision modes that can be tailored to applications for further acceleration

Are we ready to reap the benefits of higher-performance non-IEEE computations without sacrificing accuracy?

It's also good for our planet!
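As a closing illustration, the BF16/9 decomposition from the Algorithm Description can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the cuBLAS or cuTENSOR implementation: the helper names (`to_fp32`, `to_bf16`, `split3`, `mul9`) are hypothetical, BF16 conversion is assumed to truncate (zeroing the low 16 bits of the FP32 encoding), and the nine partial products are accumulated in Python's FP64 floats rather than in the FP32 accumulators of real Tensor Cores. It only shows that the three-way split loses no input bits.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Truncate an FP32 value to BF16 by zeroing the low 16 bits (assumed rounding mode)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def split3(a):
    """Return (a0, a1, a2) such that a = a0 + 2^-8 * a1 + 2^-16 * a2."""
    a = to_fp32(a)
    a0 = to_bf16(a)                    # leading 8 significand bits
    r1 = (a - a0) * 2.0**8             # residual, rescaled
    a1 = to_bf16(r1)                   # next 8 significand bits
    a2 = to_bf16((r1 - a1) * 2.0**8)   # remaining bits
    return a0, a1, a2


def mul9(a, b):
    """Sum the 9 scaled partial products a_i * b_j * 2^(-8 * (i + j))."""
    total = 0.0  # FP64 accumulation: a simplification of this sketch
    for i, ai in enumerate(split3(a)):
        for j, bj in enumerate(split3(b)):
            total += ai * bj * 2.0 ** (-8 * (i + j))
    return total


a, b = 3.14159, 2.71828
a0, a1, a2 = split3(a)
reconstructed = a0 + a1 * 2.0**-8 + a2 * 2.0**-16

exact = to_fp32(a) * to_fp32(b)        # product of the FP32 inputs, held in FP64
single_bf16 = to_bf16(a) * to_bf16(b)  # one BF16 product, for contrast

print("split is lossless:", reconstructed == to_fp32(a))
print("BF16/9 error:     ", abs(mul9(a, b) - exact))
print("single BF16 error:", abs(single_bf16 - exact))
```

Truncation is used here because it keeps every residual exactly representable, which makes the lossless property easy to check; a production kernel may round differently and must also handle special values and exponent-range corner cases that this sketch ignores.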