Batched LU performance
Posted: Tue Dec 15, 2015 9:03 pm
I'm using the batched LU routine in MAGMA, but performance is lower than I expected. I am using 32x32 matrices with a batch size of 25,000, and running on an M6000 I get <20 GFLOPS. This compares to 120 GFLOPS for matrix inversion (magma_cgetri_outofplace_batched). Is this performance expected? For such a large batch size I would have expected better.
I also tried the no-pivot variant, which seems marginally faster (about 10%), but since there is no no-pivot variant of batched cgetri, I can't use it anyway.
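For reference, here is a rough sketch of how a GFLOPS figure for batched LU can be derived from a measured time. It assumes the leading-order count of (2/3)n^3 real flops per factorization, multiplied by 4 for complex data, and ignores lower-order terms, so it may not match the exact count a given tester uses. The function name and the 0.1 s timing are hypothetical, chosen only to show the arithmetic.

```python
def batched_lu_gflops(n, batch, seconds, is_complex=True):
    """Estimate GFLOPS for a batch of n x n LU factorizations.

    Assumption: leading-order count of (2/3) n^3 real flops per real LU,
    and 4x that for complex data (6 flops per multiply, 2 per add).
    Lower-order terms are ignored, which is not negligible for n = 32.
    """
    flops_per_matrix = (2.0 / 3.0) * n**3
    if is_complex:
        flops_per_matrix *= 4.0
    return flops_per_matrix * batch / seconds / 1e9


# Hypothetical timing (0.1 s), used only to illustrate the calculation.
print(batched_lu_gflops(32, 25000, 0.1))
```

With this estimate, a 32x32 complex batch of 25,000 is only about 2.2 GFLOP of total work, so the reported rate is sensitive to small changes in the measured time.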
Thanks