Title | Using Arm Scalable Vector Extension to Optimize Open MPI |
Publication Type | Conference Paper |
Year of Publication | 2020 |
Authors | Zhong, D., P. Shamis, Q. Cao, G. Bosilca, and J. Dongarra |
Conference Name | 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID 2020) |
Date Published | 2020-05 |
Publisher | IEEE/ACM |
Conference Location | Melbourne, Australia |
Keywords | ARMIE, datatype pack and unpack, local reduction, non-contiguous accesses, SVE, Vector Length Agnostic |
Abstract | As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be employed to achieve optimal performance. As recent processors support wide vector extensions, vectorization becomes much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE vector Instruction Set Architecture (ISA) and use those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node but also achieve a more efficient communication scheme for message exchange. The resulting efforts have been implemented in the context of Open MPI, providing efficient and scalable capabilities for SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. An evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu’s A64FX processor, demonstrates that the solution is at the same time generic and efficient. |
DOI | 10.1109/CCGrid49817.2020.00-71 |