Hi all,
there are placeholders in the Doxygen documentation for a batched Cholesky inverse as well as for some of the required helper functions. Unfortunately, it appears that no actual implementation exists. Are there any plans to add these in the near future?
I need the Cholesky inverses of several million small (5x5) SPD matrices. I'm currently using magma_dposv_batched() with an identity matrix for B, but I suspect this is much less efficient than an explicit dpotri would be, if only because of the extra memory reads.
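For context, my understanding of the difference between the two approaches (standard identities, nothing MAGMA-specific):

```latex
% dpotri: after the Cholesky factorization A = L L^T,
% invert the triangular factor and form the product:
A^{-1} = (L L^T)^{-1} = L^{-T} L^{-1}
% posv with B = I instead solves the full system
A X = I \quad\Rightarrow\quad X = A^{-1},
% i.e. n forward/backward triangular solves per matrix,
% plus reading and writing the identity right-hand side.
```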
Cheers,
Rene
Batched dpotri?
Re: Batched dpotri?
Hi again,
I've been using a custom implementation based on LAPACK's dpotri translated into C++ with my own fixed-size matrix library for the past couple of weeks now. I'm using a naive CUDA kernel that processes one 5x5 matrix per thread in registers. I've experimented with shared memory prefetching but this wasn't faster. I've also applied some generic CUDA tricks to several MAGMA routines to speed things up. On two million random 5x5 matrices magma_dpotrf_batched() now takes about 12ms plus about 10ms for my own dpotri kernel on my test hardware (Titan X Pascal). Not bad compared with the previous 90-100ms when using magma_dposv_batched() to do the same thing.
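In case it helps the discussion, here is a sketch of the per-matrix computation my kernel does, written as plain host-side C++ (the function name, row-major layout, and std::array storage are my own choices for illustration, not MAGMA API; inside a CUDA kernel the same loops run per thread with everything in registers):

```cpp
#include <array>
#include <cmath>

// Invert one N x N SPD matrix in place via Cholesky factorization.
// N is a compile-time constant so the compiler can fully unroll the
// loops and keep the matrix in registers when this runs on the GPU.
template <int N>
void cholesky_inverse(std::array<double, N * N>& a) {
    // a is row-major and symmetric positive definite.
    // 1) Cholesky: A = L * L^T, L overwrites the lower triangle.
    for (int j = 0; j < N; ++j) {
        double d = a[j * N + j];
        for (int k = 0; k < j; ++k) d -= a[j * N + k] * a[j * N + k];
        d = std::sqrt(d);
        a[j * N + j] = d;
        for (int i = j + 1; i < N; ++i) {
            double s = a[i * N + j];
            for (int k = 0; k < j; ++k) s -= a[i * N + k] * a[j * N + k];
            a[i * N + j] = s / d;
        }
    }
    // 2) Invert L in place (lower-triangular inverse).
    for (int j = 0; j < N; ++j) {
        a[j * N + j] = 1.0 / a[j * N + j];
        for (int i = j + 1; i < N; ++i) {
            double s = 0.0;
            for (int k = j; k < i; ++k) s += a[i * N + k] * a[k * N + j];
            a[i * N + j] = -s / a[i * N + i];
        }
    }
    // 3) A^{-1} = L^{-T} * L^{-1}; the result is symmetric,
    //    so fill both triangles.
    for (int j = 0; j < N; ++j) {
        for (int i = j; i < N; ++i) {
            double s = 0.0;
            for (int k = i; k < N; ++k) s += a[k * N + i] * a[k * N + j];
            a[i * N + j] = s;
            a[j * N + i] = s;
        }
    }
}
```

With one thread per matrix there is no inter-thread communication at all, which is why shared-memory prefetching bought me nothing at this size.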
Is this something the MAGMA authors would be interested in? The dpotri implementation is probably only useful for people using C++ as the matrix dimensions are template parameters that need to be known in the calling code at compile time.
The more generic CUDA tricks, on the other hand, could lead to speedups across the board. Is that something you think could be turned into an academic publication despite being in the technical rather than algorithmic domain? I typically publish on real-time 3D computer vision for robotics and would appreciate any guidance you might have or any opportunity to collaborate on a paper. If you think it's realistic to get this published I could devote more time to it, run benchmarks and submit patches.
Cheers,
Rene
Stan Tomov
- Posts: 283
- Joined: Fri Aug 21, 2009 10:39 pm
Re: Batched dpotri?
Hi Rene,
This sounds very good! We are interested in any improvements to these kernels. I will contact you with the procedure for contributing it; we have some software engineering requirements, and we need to see whether the code can be generalized (and tuned easily for other sizes and precisions).
Stan