Overview

The fast Fourier transform (FFT) is used in many application domains, including molecular dynamics, spectrum estimation, fast convolution and correlation, signal modulation, and wireless multimedia applications. For example, the distributed 3-D FFT is one of the most important kernels in molecular dynamics computations, and its performance can affect an application’s scalability on larger machines. Similarly, the performance of first-principles calculations depends strongly on the performance of the FFT solver. Within the US Department of Energy (DOE) specifically, we found that more than a dozen Exascale Computing Project (ECP) applications use FFT in their codes.

The current state-of-the-art FFT libraries are not scalable on large heterogeneous machines with many nodes, or even on a single node with multiple high-performance GPUs (e.g., several NVIDIA V100 GPUs). Furthermore, these libraries require large FFTs in order to deliver acceptable performance on one GPU. Efforts to simply enhance classical and existing FFT packages with optimization tools and techniques, such as autotuning and code generation, have so far not produced an efficient, high-performance FFT library capable of harnessing the power of supercomputers with heterogeneous GPU-accelerated nodes. In particular, ECP applications that require FFT-based solvers might suffer from the lack of fast and scalable 3-D FFT routines for distributed heterogeneous parallel systems, which is the very type of system planned for upcoming exascale machines.

We believe that the design of the existing libraries should be revisited and studied in order to develop a GPU-based, distributed, 3-D FFT library that can deliver high performance on current and future supercomputers. The main objective of the FFT-ECP project is to design and implement a fast and robust 2-D and 3-D FFT library that targets large-scale heterogeneous systems with multi-core processors and hardware accelerators, and to do so as a co-design activity with other ECP application developers. The work involves studying and analyzing current FFT software from vendors and open-source developers in order to understand, design, and develop a 3-D FFT-ECP library that either builds on these existing optimized FFT kernels or relies on new optimized kernels developed under this framework. We will also study ECP application needs and define a suitable modular implementation that provides high-performance software.
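As background for why a 3-D FFT library can be built on top of optimized 1-D kernels, the sketch below illustrates a standard property of the multi-dimensional discrete Fourier transform: it is separable, so a 3-D transform can be computed as batches of 1-D transforms along each axis in turn. This is a minimal illustration in pure Python, not the FFT-ECP implementation. All function names (`dft_1d`, `dft_3d_by_axes`, `dft_3d_direct`, `max_abs_diff`) are hypothetical, and a naive O(n^2) DFT stands in for an optimized FFT kernel. In a distributed setting the per-axis passes would typically be separated by data exchanges between processes, which this single-process sketch does not model.

```python
import cmath


def dft_1d(x):
    """Naive O(n^2) 1-D DFT; a stand-in for an optimized 1-D FFT kernel."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def dft_3d_by_axes(a):
    """3-D DFT computed as 1-D transforms along axis 2, then 1, then 0."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[complex(a[i][j][k]) for k in range(nz)]
            for j in range(ny)] for i in range(nx)]
    # Pass 1: transform every line along the last axis.
    for i in range(nx):
        for j in range(ny):
            out[i][j] = dft_1d(out[i][j])
    # Pass 2: transform every line along the middle axis.
    for i in range(nx):
        for k in range(nz):
            line = dft_1d([out[i][j][k] for j in range(ny)])
            for j in range(ny):
                out[i][j][k] = line[j]
    # Pass 3: transform every line along the first axis.
    for j in range(ny):
        for k in range(nz):
            line = dft_1d([out[i][j][k] for i in range(nx)])
            for i in range(nx):
                out[i][j][k] = line[i]
    return out


def dft_3d_direct(a):
    """Reference result from the triple-sum definition of the 3-D DFT."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[0j] * nz for _ in range(ny)] for _ in range(nx)]
    for u in range(nx):
        for v in range(ny):
            for w in range(nz):
                total = 0j
                for i in range(nx):
                    for j in range(ny):
                        for k in range(nz):
                            phase = u * i / nx + v * j / ny + w * k / nz
                            total += a[i][j][k] * cmath.exp(-2j * cmath.pi * phase)
                out[u][v][w] = total
    return out


def max_abs_diff(a, b):
    """Largest element-wise magnitude difference between two 3-D arrays."""
    return max(
        abs(a[i][j][k] - b[i][j][k])
        for i in range(len(a))
        for j in range(len(a[0]))
        for k in range(len(a[0][0]))
    )


if __name__ == "__main__":
    # Small 2 x 3 x 4 test volume with distinct entries.
    vol = [[[float(i * 12 + j * 4 + k) for k in range(4)]
            for j in range(3)] for i in range(2)]
    diff = max_abs_diff(dft_3d_by_axes(vol), dft_3d_direct(vol))
    print("max difference between axis-wise and direct 3-D DFT:", diff)
```

The axis-wise version performs only 1-D transforms, which is why the quality of the underlying 1-D kernels and the cost of moving data between passes both matter for overall performance.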

Sponsored By
Exascale Computing Project
National Nuclear Security Administration
The United States Department of Energy

Papers

Tomov, S., A. Ayala, A. Haidar, and J. Dongarra, “FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs,” no. FFT-ECP STML13-27: Innovative Computing Laboratory, University of Tennessee, January 2020.
Ayala, A., S. Tomov, X. Luo, H. Shaiek, A. Haidar, G. Bosilca, and J. Dongarra, “Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
Tomov, S., A. Haidar, A. Ayala, H. Shaiek, and J. Dongarra, “FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.
Shaiek, H., S. Tomov, A. Ayala, A. Haidar, and J. Dongarra, “GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. icl-ut-19-06: ICL, September 2019.
Tomov, S., A. Haidar, A. Ayala, D. Schultz, and J. Dongarra, “Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019.
Tomov, S., A. Haidar, A. Ayala, D. Schultz, and J. Dongarra, “FFT-ECP Fast Fourier Transform,” Houston, TX, 2019 ECP Annual Meeting (Research Poster), January 2019.
Cheng, X., A. Soma, E. D'Azevedo, K. Wong, and S. Tomov, “Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision,” Dallas, TX, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster, November 2018.
Tomov, S., A. Haidar, D. Schultz, and J. Dongarra, “Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018.

Acknowledgments

Funding

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.

Cycles

This research uses resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

Exascale Computing Project

FFT is part of ICL's involvement in the Exascale Computing Project (ECP). The ECP was established with the goals of maximizing the benefits of high-performance computing (HPC) for the United States and accelerating the development of a capable exascale computing ecosystem. Exascale refers to computing systems at least 50 times faster than the nation’s most powerful supercomputers in use today.

The ECP is a collaborative effort of two U.S. Department of Energy organizations – the Office of Science (DOE-SC) and the National Nuclear Security Administration (NNSA).