%0 Conference Paper
%B 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
%D 2019
%T Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training
%A Jiali Li
%A Bogdan Nicolae
%A Justin M. Wozniak
%A George Bosilca
%X In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and in scientific applications. With the increasing complexity of learning models and growing amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular. Despite the fact that high-performance computing (HPC) technologies have been designed to address such patterns efficiently, the behavior of data-parallel approaches on HPC platforms is not well understood. To address this issue, in this paper we study the behavior of Horovod, a popular data-parallel approach that relies on MPI, on Theta, a pre-Exascale machine at Argonne National Laboratory. Using two representative applications, we explore two aspects: (1) how performance and scalability are affected by important parameters such as the number of nodes, number of workers, threads per node, and batch size; (2) how computational phases are interleaved with all-reduce communication phases at fine granularity, and what consequences this interleaving has in terms of potential bottlenecks. Our findings show that pipelining of back-propagation, gradient reduction, and weight updates only partially mitigates the effects of stragglers during all-reduce. Furthermore, there can be significant delays between weight updates, which can be leveraged to mask the overhead of additional background operations coupled with the training.
%I IEEE
%C Denver, CO
%8 2019-11
%G eng
%R https://doi.org/10.1109/MLHPC49564.2019.00006