Predicting MPI Collective Communication Performance Using Machine Learning

Title: Predicting MPI Collective Communication Performance Using Machine Learning
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Hunold, S., A. Bhatele, G. Bosilca, and P. Knees
Conference Name: 2020 IEEE International Conference on Cluster Computing (CLUSTER)
Date Published: 2020-09
Publisher: IEEE
Conference Location: Kobe, Japan
Keywords: Auto-tuning, GAM, KNN, Machine Learning, message passing interface, Performance Prediction, XGBoost
Abstract

The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when being called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to select the best possible algorithm for a specific case automatically. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.
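
The three steps described in the abstract (benchmark every algorithm, fit one regression model per algorithm, pick the algorithm with the lowest predicted runtime) can be illustrated with a minimal sketch. It is not the authors' framework: the algorithm names, the benchmark numbers, the feature choice (log2 of process count and message size), and the helper names `fit_models` and `select_algorithm` are all assumptions made for illustration. A small hand-written k-nearest-neighbour regressor stands in for the learners named in the keywords.

```python
import math

# Hypothetical benchmark records: (algorithm, processes, message bytes, runtime in us).
# All values are invented for illustration; they are not measurements.
BENCHMARKS = [
    ("binomial_tree", 16, 1024, 10.0),
    ("binomial_tree", 16, 1048576, 900.0),
    ("binomial_tree", 32, 1024, 14.0),
    ("binomial_tree", 32, 1048576, 1200.0),
    ("ring", 16, 1024, 40.0),
    ("ring", 16, 1048576, 500.0),
    ("ring", 32, 1024, 80.0),
    ("ring", 32, 1048576, 650.0),
]


def features(processes, message_bytes):
    """Assumed feature vector: log2 of process count and of message size."""
    return (math.log2(processes), math.log2(message_bytes))


class KNNRegressor:
    """Tiny k-nearest-neighbour regressor (mean runtime of the k closest samples)."""

    def __init__(self, k=2):
        self.k = k
        self.samples = []

    def fit(self, X, y):
        self.samples = list(zip(X, y))
        return self

    def predict(self, x):
        nearest = sorted(self.samples, key=lambda s: math.dist(s[0], x))[: self.k]
        return sum(runtime for _, runtime in nearest) / len(nearest)


def fit_models(records, k=2):
    """Step 2: fit one performance model per algorithm from benchmark records."""
    grouped = {}
    for algo, processes, message_bytes, runtime in records:
        X, y = grouped.setdefault(algo, ([], []))
        X.append(features(processes, message_bytes))
        y.append(runtime)
    return {algo: KNNRegressor(k).fit(X, y) for algo, (X, y) in grouped.items()}


def select_algorithm(models, processes, message_bytes):
    """Step 3: predict each algorithm's runtime and return the fastest one."""
    x = features(processes, message_bytes)
    predictions = {algo: model.predict(x) for algo, model in models.items()}
    return min(predictions, key=predictions.get), predictions


if __name__ == "__main__":
    models = fit_models(BENCHMARKS)
    # Unseen communication problems: 24 processes, small and large messages.
    for size in (2048, 524288):
        best, predicted = select_algorithm(models, 24, size)
        print(size, best, predicted)
```

In practice the chosen algorithm would then be passed to the MPI library through whatever override mechanism that library offers; the sketch stops at the selection step because those mechanisms are library-specific.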

DOI: 10.1109/CLUSTER49012.2020.00036