Title | Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives |
Publication Type | Conference Proceedings |
Year of Publication | 2017 |
Authors | Yamazaki, I., M. Hoemmen, P. Luszczek, and J. Dongarra |
Conference Name | Proceedings of the 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017), Best Paper Award |
Date Published | 2017-06 |
Conference Location | Orlando, FL |
Abstract | We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only performs fewer global all-reduces than standard GMRES but also overlaps those all-reduces with other work. We implemented the thread parallelism and communication overlap in two different ways. The first uses nonblocking MPI collectives with thread-parallel computational kernels. The second relies on a shared-memory task scheduler. In our experiments, (l, t)-GMRES performed better than l-GMRES by factors of up to 1.67×. In addition, though we used only 50 nodes, our variant performed up to 1.22× better than s-GMRES when the latency cost became significant, by hiding the all-reduces. |
DOI | 10.1109/IPDPSW.2017.65 |
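The overlap of global all-reduces with computation that the abstract attributes to l-GMRES and (l, t)-GMRES maps directly onto MPI-3 nonblocking collectives, the first of the paper's two implementation strategies. The following is a minimal sketch of that pattern only, not the paper's implementation: it starts a dot-product reduction with MPI_Iallreduce and hides its latency behind a stand-in for independent local work. The helper local_spmv, the vector length n, and the dummy data are all assumptions made for the example.

```c
/* Sketch (not the paper's code): overlap a global all-reduce with local
 * computation using a nonblocking MPI collective, as in pipelined GMRES. */
#include <mpi.h>
#include <stdlib.h>

/* Illustrative placeholder for the local work (e.g. a sparse
 * matrix-vector product) overlapped with the in-flight all-reduce. */
static void local_spmv(const double *x, double *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int n = 1000;                 /* local vector length (assumed) */
    double *v = malloc(n * sizeof *v);
    double *w = malloc(n * sizeof *w);
    for (int i = 0; i < n; ++i) v[i] = 1.0;

    /* Local contribution to a dot product needed for orthogonalization. */
    double local_dot = 0.0, global_dot = 0.0;
    for (int i = 0; i < n; ++i) local_dot += v[i] * v[i];

    /* Start the global reduction without blocking... */
    MPI_Request req;
    MPI_Iallreduce(&local_dot, &global_dot, 1, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD, &req);

    /* ...and hide its latency behind independent local computation. */
    local_spmv(v, w, n);

    /* Block only at the point where the reduced value is required. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(v);
    free(w);
    MPI_Finalize();
    return 0;
}
```

In pipelined GMRES the hidden work would be the next matrix-vector product and local orthogonalization steps; the sketch compresses that to a single placeholder loop to keep the overlap structure visible.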