I'm using the batched LU routine in MAGMA, but performance is lower than I expected. I am using 32x32 matrices with a batch size of 25,000, and running on an M6000 I am getting less than 20 GFLOPS. This compares to 120 GFLOPS for batched matrix inversion (magma_cgetri_outofplace_batched). Is this performance expected? For such a large batch size I would have expected better performance.
I also tried the no-pivot variant, which seems marginally faster (about 10%), but since there is no no-pivot variant of batched cgetri, I can't use it anyway.
Thanks
Batched LU performance
Re: Batched LU performance
Sorry for the late answer.
What precision did you use for the LU (cgetrf)?
What is the peak of your machine for this precision?
If you are really interested in this size, I will check whether we can provide a version specifically designed for 32x32 matrices.
Thanks
Azzam
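For comparing numbers like the ones in this thread, here is a minimal sketch of how a GFLOP/s figure for batched LU could be derived from a measured time. It is not MAGMA code. It assumes the leading-order count of 2/3 n^3 flops for a real LU and the common convention of counting complex arithmetic as 4x the real flop count; the function names and the 0.11 s timing are hypothetical and only for illustration.

```python
# Rough flop model for batched LU (getrf) on square n x n matrices.
# Assumption: leading-order count 2/3 * n^3 for real arithmetic; for small n
# such as 32 the lower-order terms are not negligible, so treat this as an
# approximation only.

def getrf_flops(n, complex_arith=False):
    """Approximate flop count of one n x n LU factorization."""
    flops = (2.0 / 3.0) * n ** 3
    if complex_arith:
        # Assumed convention: complex multiply = 6 flops, complex add = 2,
        # which averages to 4x the real flop count.
        flops *= 4.0
    return flops


def batched_gflops(n, batch, seconds, complex_arith=False):
    """Achieved GFLOP/s for `batch` factorizations finished in `seconds`."""
    return getrf_flops(n, complex_arith) * batch / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: 25,000 complex 32x32 matrices in 0.11 s.
    rate = batched_gflops(32, 25000, 0.11, complex_arith=True)
    print("achieved: %.1f GFLOP/s" % rate)
```

Dividing the result by the theoretical peak of the GPU for the same precision gives the fraction of peak asked about above. The peak value has to come from the hardware specification and is deliberately not hard-coded here.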