Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives

Title: Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives
Publication Type: Conference Proceedings
Year of Publication: 2017
Authors: Yamazaki, I., M. Hoemmen, P. Luszczek, and J. Dongarra
Conference Name: Proceedings of The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award
Date Published: 2017-06
Conference Location: Orlando, FL

We compare the performance of pipelined and s-step GMRES, referred to as l-GMRES and s-GMRES respectively, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only performs fewer global all-reduces than standard GMRES but also overlaps those all-reduces with other work. We implemented thread parallelism and communication overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES outperformed l-GMRES by a factor of up to 1.67×. In addition, although we used only 50 nodes, once the latency cost became significant our variant performed up to 1.22× better than s-GMRES by hiding the all-reduces.
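The overlap idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use MPI: a background thread stands in for a nonblocking all-reduce, and all names here (`fake_allreduce`, `local_work`, `blocking_step`, `overlapped_step`) are hypothetical. The blocking version waits for the reduction before doing independent work, while the overlapped version starts the reduction, does the independent work, and waits only when the result is needed.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_allreduce(partial_sums, latency=0.05):
    """Hypothetical stand-in for a global sum; the sleep mimics network latency."""
    time.sleep(latency)
    return sum(partial_sums)


def local_work(x, n_steps=3):
    """Hypothetical stand-in for work that does not depend on the reduction."""
    for _ in range(n_steps):
        x = [2.0 * v for v in x]
    return x


def blocking_step(partial_sums, x):
    # Standard pattern: wait for the reduction, then do the independent work.
    total = fake_allreduce(partial_sums)
    y = local_work(x)
    return total, y


def overlapped_step(partial_sums, x, pool):
    # Pipelined pattern: start the reduction, work while it is in flight,
    # and block only at the point where the reduced value is required.
    pending = pool.submit(fake_allreduce, partial_sums)
    y = local_work(x)
    total = pending.result()
    return total, y


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(blocking_step([1.0, 2.0, 3.0], [1.0, 1.0]))
        print(overlapped_step([1.0, 2.0, 3.0], [1.0, 1.0], pool))
```

Both functions return the same values; only the order of waiting changes. In a distributed setting the same structure would presumably be expressed with a nonblocking collective and a later wait, as the abstract describes for its first implementation.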
