ADAPT: An Event-Based Adaptive Collective Communication Framework

Title: ADAPT: An Event-Based Adaptive Collective Communication Framework
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Luo, X., W. Wu, G. Bosilca, T. Patinyasakdikul, L. Wang, and J. Dongarra
Conference Name: The 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '18)
Date Published: 2018-06
Publisher: ACM Press
Conference Location: Tempe, Arizona
ISBN Number: 9781450357852

The increasing scale and heterogeneity of high-performance computing (HPC) systems leave the performance of Message Passing Interface (MPI) collective communications susceptible to noise and require these collectives to adapt to a complex mix of hardware capabilities. The designs of state-of-the-art MPI collectives rely heavily on synchronizations; these designs magnify noise across the participating processes, resulting in significant performance slowdown. This design philosophy must therefore be reconsidered if collectives are to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree. This removes the boundaries between different hardware topologies while maximizing the speed of data movements. We evaluate our framework with two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and strong resistance to noise compared with other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using ADAPT event-based broadcast and reduce operations.
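The idea of relaxing synchronizations while keeping only the data dependencies can be illustrated with a toy cost model. The sketch below is not ADAPT's implementation and does not use MPI. It assumes a segmented broadcast along a simple chain of processes, a fixed per-hop cost, and a hypothetical `noise` table of extra delays keyed by `(process, segment)`. The function names `event_driven_time` and `synchronized_time` are illustrative only. The first function lets a process forward a segment as soon as that segment has arrived and its previous send has finished; the second inserts a global synchronization after every step, so each step lasts as long as its slowest participant.

```python
def event_driven_time(num_procs, num_segs, noise, hop=1):
    """Completion time of a chain broadcast with only data dependencies.

    Process p forwards segment s once (a) segment s has arrived at p and
    (b) p has finished sending segment s-1. No global synchronization.
    """
    arrival = [0] * num_segs  # the root holds every segment at time 0
    for p in range(num_procs - 1):  # the last process only receives
        link_free = 0  # time at which p's outgoing link becomes idle
        next_arrival = []
        for s in range(num_segs):
            start = max(arrival[s], link_free)
            link_free = start + noise.get((p, s), 0) + hop
            next_arrival.append(link_free)
        arrival = next_arrival
    return max(arrival)


def synchronized_time(num_procs, num_segs, noise, hop=1):
    """Completion time of the same pipeline with a barrier after each step.

    In step r, process p forwards segment r - p (if it exists). The step
    ends only when its slowest active sender is done.
    """
    total = 0
    last_step = (num_procs - 2) + (num_segs - 1)
    for r in range(last_step + 1):
        active = [(p, r - p) for p in range(num_procs - 1)
                  if 0 <= r - p < num_segs]
        total += hop + max(noise.get(key, 0) for key in active)
    return total


# Hypothetical noise: process 1 is delayed on segment 0, and the root is
# delayed on segment 2. All other sends are undisturbed.
example_noise = {(1, 0): 5, (0, 2): 5}
print("event-driven:", event_driven_time(4, 3, example_noise))
print("synchronized:", synchronized_time(4, 3, example_noise))
```

In this model the two delays overlap in the event-driven schedule, because each one stalls only the processes that actually depend on the delayed segment, whereas the synchronized schedule charges every delay to all processes. Without noise the two schedules finish at the same time, which is consistent with the abstract's point that synchronization is what magnifies noise.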
