papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
Hello everyone,
I'm doing some research on the QR factorization and its implementation on GPUs for my thesis. I want to consider MAGMA's dgeqrf_gpu / dgeqrf routine from the latest version 2.0.1 and apply one of these routines to factorize big dense matrices with more rows than columns.
My first question is: What is the difference between these two routines? Does dgeqrf_gpu run entirely on the GPU?
Furthermore, and even more important, I would like to understand how the QR factorization is implemented and what the GPU-CPU communication looks like. Is there any detailed documentation, or are there papers, on this topic? So far I have only found explanations of the dgeqrf routine from version 1.0.0 or 1.1.0. Does anybody know what has changed since then?
Maybe there is a paper proposing how to improve an earlier version which is now implemented in the current version?
I hope that you can help me and I'm looking forward to trying out some calculations on the GPU using MAGMA.
Thanks,
nahla
Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
For magma_*geqrf_gpu, the matrix dA is already in GPU memory, so the routine does not need to allocate GPU memory for the matrix or copy it over from the CPU.
For magma_*geqrf, the matrix A is in CPU memory.
Both are hybrid routines, using the CPU and GPU together. In both cases, the panel is factored on the CPU by calling LAPACK geqrf, and the trailing matrix is updated on the GPU using gemm (inside larfb_gpu).
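To make the panel / trailing-matrix split concrete, here is a minimal sketch in plain Python. It is not MAGMA code: everything runs on the CPU, the function names (factor_panel, update_trailing, blocked_qr_r) are made up for illustration, and the reflectors are applied one at a time instead of through the blocked larfb/gemm update. The comments only mark which part would run where in the hybrid scheme.

```python
import math

def householder(x):
    """Return (v, beta) such that (I - beta*v*v^T) x = alpha*e1."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0
    alpha = -math.copysign(norm, x[0])   # sign chosen to avoid cancellation
    v = list(x)
    v[0] -= alpha
    return v, 2.0 / sum(t * t for t in v)

def apply_reflector(A, v, beta, row0, cols):
    """A[row0:, j] -= beta * v * (v^T A[row0:, j]) for each column j in cols."""
    for j in cols:
        dot = sum(v[i] * A[row0 + i][j] for i in range(len(v)))
        for i in range(len(v)):
            A[row0 + i][j] -= beta * dot * v[i]

def factor_panel(A, k, kb):
    """Unblocked QR of columns k..k+kb-1 (the CPU's role in the hybrid scheme)."""
    m = len(A)
    reflectors = []
    for j in range(k, k + kb):
        v, beta = householder([A[i][j] for i in range(j, m)])
        apply_reflector(A, v, beta, j, range(j, k + kb))
        reflectors.append((j, v, beta))
    return reflectors

def update_trailing(A, reflectors, cols):
    """Apply the panel's reflectors to the trailing columns (the GPU's role)."""
    for row0, v, beta in reflectors:
        apply_reflector(A, v, beta, row0, cols)

def blocked_qr_r(A, nb):
    """Return the n x n factor R of an m x n matrix A (m >= n), panel width nb."""
    A = [row[:] for row in A]            # work on a copy
    n = len(A[0])
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        reflectors = factor_panel(A, k, kb)
        update_trailing(A, reflectors, range(k + kb, n))
    return [row[:] for row in A[:n]]
```

Calling blocked_qr_r(A, 2) on a tall 5 x 3 matrix returns a 3 x 3 upper-triangular R (up to rounding) whose product R^T R matches A^T A.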
-mark
Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
Hi mark! How about the LU decomposition routines magma_dgetrf and magma_dgetrf_gpu (magma_dgetrf_mgpu)?
Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
Hi mark,
thanks a lot so far. Do you also know some papers describing the latest magma_dgeqrf version?
nahla
Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
hey mark,
yeah I have seen them already of course. They were not exactly what I was looking for.
-nahla
Re: papers to dgeqrf_gpu/dgeqrf algorithms, explanations?
Hi Nahla,
All the current MAGMA routines are hybrid, meaning they use both the CPU and the GPU.
Overall, Cholesky, LU and QR follow the LAPACK style of factorization: a panel factorization followed by an update of the trailing matrix.
A general overview is given in:
https://www.google.com/url?sa=t&rct=j&q ... 8183,d.dmo
or
http://ieeexplore.ieee.org/xpl/articleD ... er=6877282
A brief description:
The CPU factorizes the panel while the GPU performs the update.
In MAGMA we implement what we call a lookahead panel, meaning that the trailing matrix update is split into two portions:
updating the next panel (portion 1), so that it can be sent to the CPU to be factorized, while the update of the remaining trailing matrix continues on the GPU (portion 2).
This way, while the GPU is updating portion 2, the CPU is performing the factorization of the next panel and sending the result back to the GPU, which hides the cost of the panel factorization.
As a consequence, the performance of LU/QR will be close to the performance of the update (which is mostly some kind of GEMM kernel).
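A rough cost model shows why the lookahead hides the panel. This is only a sketch: the timings are made-up numbers in arbitrary units, every step is assumed to cost the same (in reality the trailing matrix shrinks), and the very first panel, which cannot be overlapped, is ignored.

```python
def total_time(steps, t_panel, t_transfer, t_next_panel, t_rest, lookahead):
    """Total time over `steps` identical steps of the factorization.

    t_panel      : CPU time to factorize one panel
    t_transfer   : time for one panel transfer (GPU->CPU or CPU->GPU)
    t_next_panel : GPU time to update the next panel (portion 1)
    t_rest       : GPU time to update the remaining trailing matrix (portion 2)
    """
    cpu_side = t_transfer + t_panel + t_transfer   # send, factorize, send back
    if lookahead:
        # The CPU side runs concurrently with the portion 2 update.
        per_step = t_next_panel + max(cpu_side, t_rest)
    else:
        # Everything runs one after the other.
        per_step = cpu_side + t_next_panel + t_rest
    return steps * per_step

# Hypothetical timings, chosen only for illustration.
without = total_time(10, 2.0, 0.5, 1.0, 8.0, lookahead=False)
with_la = total_time(10, 2.0, 0.5, 1.0, 8.0, lookahead=True)
update_only = 10 * (1.0 + 8.0)
```

With these numbers, `without` is 120.0 while `with_la` and `update_only` are both 90.0: as long as the panel work on the CPU takes less time than the portion 2 update, the total is set by the update alone.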
The algorithm will look like this:
for step = 1; step < N; step += nb
    0 - send panel of step to the CPU
    1 - factorize panel of step (on the CPU)
    2 - send the factorized panel to the GPU
    3 - update panel of step+1 (on the GPU)
    4 - update the remaining trailing matrix (on the GPU)
Note that 0, 1, 2 and 3 go to one stream while 4 goes to another stream, so that they can run in parallel and overlap. There are also dependencies that have to be satisfied.
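The ordering and the main dependency can be sketched in plain Python, with a single worker thread standing in for the second stream. This is an illustration only, not how MAGMA is written: the name run_lookahead is made up, no numerical work is done, and each operation just records itself in a log. The dependency modeled here is that panel step+1 can only be updated (3) after the previous step's remaining update (4) has finished.

```python
from concurrent.futures import ThreadPoolExecutor

def run_lookahead(num_panels):
    """Record the order of operations for num_panels panels."""
    log = []
    # One worker thread plays the role of the second stream (operation 4).
    with ThreadPoolExecutor(max_workers=1) as update_stream:
        pending = None   # remaining-matrix update launched in the previous step
        for k in range(num_panels):
            # 0, 1, 2 may overlap with the previous step's operation 4.
            log.append(("send_to_cpu", k))      # 0
            log.append(("factor_on_cpu", k))    # 1
            log.append(("send_to_gpu", k))      # 2
            if pending is not None:
                pending.result()                # wait for step k-1's operation 4
            if k + 1 < num_panels:
                log.append(("update_next_panel", k))   # 3
            if k + 2 < num_panels:
                # 4: launched asynchronously, runs while panel k+1 is factorized
                pending = update_stream.submit(log.append, ("update_rest", k))
    return log

log = run_lookahead(4)
```

In the resulting log, ("update_rest", k) always appears after ("send_to_gpu", k) and before ("update_next_panel", k + 1), while its position relative to the CPU-side operations of panel k + 1 is left free, which is exactly the overlap the two streams are meant to allow.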
What is your interest?
We have native code that runs only on the GPU, but it is not released yet.
Thanks
Azzam