Title | Mixed-precision Block Gram Schmidt Orthogonalization |
Publication Type | Conference Paper |
Year of Publication | 2015 |
Authors | Yamazaki, I., S. Tomov, J. Kurzak, J. Dongarra, and J. Barlow |
Conference Name | 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems |
Date Published | 2015-11 |
Publisher | ACM |
Conference Location | Austin, TX |
Abstract | The mixed-precision Cholesky QR (CholQR) can orthogonalize the columns of a dense matrix with the minimum communication cost. Moreover, its orthogonality error depends only linearly on the condition number of the input matrix. However, when the desired higher precision is not supported by the hardware, software-emulated arithmetic is needed, which can significantly increase the computational cost. When there are a large number of columns to be orthogonalized, this computational overhead can have a significant impact on the orthogonalization time, and the mixed-precision CholQR can be much slower than the standard CholQR. In this paper, we examine several block variants of the algorithm, which reduce the computational overhead associated with the software-emulated arithmetic while maintaining the same orthogonality error bound as the mixed-precision CholQR. Our numerical and performance results on multicore CPUs with a GPU, as well as on a hybrid CPU/GPU cluster, demonstrate that, compared to the mixed-precision CholQR, such a block variant can obtain speedups of up to 7:1 while maintaining about the same order of numerical error. |