MAGMA SVD implementation on GPUs?
edmundberry
Hello MAGMA experts,
What is the status of MAGMA's SVD implementation regarding GPUs?
- Does Magma use GPUs to implement SVD? It looks like MAGMA's dgesvd function only accepts CPU pointers for arguments (or maybe I have misinterpreted the documentation):
http://icl.cs.utk.edu/projectsfiles/mag ... river.html
- If I have a device pointer (pointer to a matrix on a GPU device), do I have to copy it back to the host (CPU) before I can use MAGMA's SVD functionality?
- Regardless, is there example code for dgesvd? I couldn't find any example code online.
Thank you!
Best,
Edmund
Re: MAGMA SVD implementation on GPUs?
Yes, it is GPU accelerated. The input matrix is given in CPU memory, but internally gets copied to the GPU for computation.
Unfortunately, this means currently if your matrix is on the GPU, you have to copy it to the CPU before calling magma_dgesvd.
There are examples in magma/testing/testing_dgesvd.cpp and testing_dgesdd.cpp. I recommend using dgesdd (divide and conquer) instead of dgesvd (QR iteration), as dgesdd is faster in both MAGMA and LAPACK when computing singular vectors.
-mark
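For readers who already have the matrix on the GPU, the copy-back-then-call workflow described above might look roughly like the following. This is a minimal sketch, not code from the MAGMA distribution: it assumes the MAGMA 2.x C interface (magma_v2.h, with an explicit queue), the helper name svd_values_from_device is hypothetical, and the iwork size of 8*min(m,n) follows the LAPACK dgesdd convention. The signatures are written from memory of the MAGMA documentation, so check them against magma/testing/testing_dgesdd.cpp before relying on them.

```c
/* Sketch only; assumes the MAGMA 2.x C interface (magma_v2.h). */
#include <stdio.h>
#include "magma_v2.h"

/* Hypothetical helper, not part of MAGMA: singular values of the m-by-n
 * matrix stored on the GPU at dA (leading dimension ldda >= m).
 * s must hold min(m,n) doubles in CPU memory. Returns 0 on success. */
static magma_int_t svd_values_from_device(magma_int_t m, magma_int_t n,
                                          const double *dA, magma_int_t ldda,
                                          double *s, magma_queue_t queue)
{
    magma_int_t info = 0, lda = m, lwork, min_mn = (m < n) ? m : n;
    magma_int_t *iwork = NULL;
    double *A = NULL, *work = NULL, query[1];
    double u[1], vt[1];  /* placeholders; no vectors requested with MagmaNoVec */

    /* Workspace query: lwork = -1 asks for the optimal size in query[0]. */
    magma_dgesdd(MagmaNoVec, m, n, NULL, lda, NULL, u, 1, vt, 1,
                 query, -1, NULL, &info);
    if (info != 0)
        return info;
    lwork = (magma_int_t) query[0];

    /* Every magma_dgesdd argument lives in CPU memory. */
    if (magma_dmalloc_cpu(&A, (size_t) lda * n) != MAGMA_SUCCESS ||
        magma_dmalloc_cpu(&work, (size_t) lwork) != MAGMA_SUCCESS ||
        magma_imalloc_cpu(&iwork, (size_t) 8 * min_mn) != MAGMA_SUCCESS) {
        info = MAGMA_ERR_HOST_ALLOC;
    } else {
        magma_dgetmatrix(m, n, dA, ldda, A, lda, queue);  /* GPU -> CPU copy */
        magma_dgesdd(MagmaNoVec, m, n, A, lda, s, u, 1, vt, 1,
                     work, lwork, iwork, &info);
    }
    if (iwork) magma_free_cpu(iwork);
    if (work)  magma_free_cpu(work);
    if (A)     magma_free_cpu(A);
    return info;
}

int main(void)
{
    const magma_int_t m = 4, n = 3, ldda = 4;
    double hA[4 * 3], s[3], *dA = NULL;
    magma_device_t dev;
    magma_queue_t queue;
    magma_int_t info;

    magma_init();
    magma_getdevice(&dev);
    magma_queue_create(dev, &queue);

    /* Small column-major test matrix, uploaded so that it "starts" on the GPU. */
    for (int j = 0; j < 3; ++j)
        for (int i = 0; i < 4; ++i)
            hA[i + j * 4] = (i == j) ? (double) (j + 1) : 0.0;
    if (magma_dmalloc(&dA, (size_t) ldda * n) != MAGMA_SUCCESS)
        return 1;
    magma_dsetmatrix(m, n, hA, m, dA, ldda, queue);

    info = svd_values_from_device(m, n, dA, ldda, s, queue);
    if (info == 0)
        for (int i = 0; i < 3; ++i)
            printf("s[%d] = %g\n", i, s[i]);
    else
        printf("magma_dgesdd returned info = %lld\n", (long long) info);

    magma_free(dA);
    magma_queue_destroy(queue);
    magma_finalize();
    return info == 0 ? 0 : 1;
}
```

MagmaNoVec computes singular values only; to get singular vectors, pass MagmaSomeVec or MagmaAllVec together with real U and VT arrays of the appropriate size, as the testing program does.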
Re: MAGMA SVD implementation on GPUs?
Hello,
I found it more appropriate to continue the discussion here rather than opening a new topic. In my case, I am using magma_dgesdd and it works when I "magma_malloc_cpu" all the arguments, but it fails if I have them on the GPU. My question is which arguments (if any) should be passed from the GPU in order to make the running time optimal.
Originally my A is on the CPU, but I was trying to send it to the GPU, along with the workspace variable "work", before calling magma_dgesdd.
Otherwise it is hard to believe that the calculations are done on the GPU if the workspace is on the CPU (maybe you can give a heuristic explanation of how this works?)
Thanks a lot!
Re: MAGMA SVD implementation on GPUs?
magma_dgesdd takes all its arguments on the CPU. It simply replaces lapack's dgesdd.
MAGMA is a hybrid CPU + GPU library. Some of its calculations are done on the CPU, so it needs workspace there. It also relies on some routines from LAPACK, such as dbdsdc (divide-and-conquer), which need CPU workspace. MAGMA internally allocates additional memory on the GPU.
Generally, MAGMA routines with no suffix take their input arguments in CPU memory, while routines with a _gpu suffix take (at least some of) their arguments in GPU memory. Generally, variables prefixed with "d" are on the GPU device, such as "dA" (on GPU) vs. "A" (on CPU). See the documentation:
http://icl.cs.utk.edu/projectsfiles/mag ... tines.html
http://icl.cs.utk.edu/projectsfiles/mag ... ables.html
-mark
Re: MAGMA SVD implementation on GPUs?
OK, I guess I was misinterpreting the documentation!
Thank you
Re: MAGMA SVD implementation on GPUs?
mgates3 wrote:
Yes, it is GPU accelerated. The input matrix is given in CPU memory, but internally gets copied to the GPU for computation.
Unfortunately, this means currently if your matrix is on the GPU, you have to copy it to the CPU before calling magma_dgesvd.
-mark
If the input matrix is internally copied to the GPU for computation, wouldn't it be relatively simple to create an additional function with a _gpu suffix that takes an existing GPU matrix and omits the copy? Or are the internals complex enough that they require the matrix to be on the CPU at different times?
Re: MAGMA SVD implementation on GPUs?
The SVD is a rather complex code. It doesn't just allocate dA on the GPU, copy A to dA, then do work. It makes extensive use of other routines like geqrf, gebrd, gelqf, unmqr, ungqr, bdsdc, etc. We would need to replace all of those with _gpu variants. Some already exist; some do not. Some do not even have GPU-accelerated implementations yet (like bdsdc).
So it's a good goal to have, and we may eventually get there, but it isn't trivial.
-mark
Re: MAGMA SVD implementation on GPUs?
Yes, it's highly non-trivial. Right now I'm running tests for MAGMA dgesdd, and it's surprising to find that it is slower than a regular Maple (LAPACK-based) computation. I wonder if I'm doing something wrong or if the SVD code is not really optimized for the GPU yet.
~/magma-2.0.2/testing$ ./testing_dgesdd
% MAGMA 2.0.2 compiled for CUDA capability >= 2.0, 32-bit magma_int_t, 64-bit pointer.
% CUDA runtime 7050, driver 7050. OpenMP threads 8. MKL 11.2.3, MKL threads 4.
% device 0: GeForce GTS 450, 1764.0 MHz clock, 1023.2 MB memory, capability 2.1
% Thu Aug 25 12:38:13 2016
% Usage: ./testing_dgesdd [options] [-h|--help]
% jobz M N CPU time (sec) GPU time (sec) |S1-S2| |A-USV^H| |I-UU^H|/M |I-VV^H|/N S sorted
%==========================================================================================================
N 1088 1088 --- 0.49 ---
N 2112 2112 --- 2.10 ---
N 3136 3136 --- 6.45 ---
N 4160 4160 --- 14.64 ---
N 5184 5184 --- 27.89 ---
N 6208 6208 --- 47.25 ---
N 7232 7232 --- 75.09 ---
While Maple times (not even GPU time!) are:
Maple time: n= 1100 0.285000
Maple time: n= 2100 2.770000
Maple time: n= 3100 2.959000
Maple time: n= 4100 6.757000
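The MAGMA and Maple runs above use slightly different matrix sizes (1088 vs. 1100, 2112 vs. 2100, and so on), so the raw times are not directly comparable. One rough way to line them up is to divide each time by n^3. The sketch below does that under the assumption that dense SVD cost scales as n^3. The function names are hypothetical, and the comparison is only indicative: the MAGMA run used jobz = N (singular values only), and the post does not say what the Maple run computed.

```c
#include <assert.h>

/* Seconds per n^3: a rough size-independent cost figure,
 * assuming the SVD cost grows like n^3. */
double seconds_per_n3(double seconds, double n)
{
    assert(n > 0.0);
    return seconds / (n * n * n);
}

/* Ratio of normalized costs. A value above 1 means run A is slower
 * than run B once the difference in matrix size is accounted for. */
double normalized_ratio(double t_a, double n_a, double t_b, double n_b)
{
    return seconds_per_n3(t_a, n_a) / seconds_per_n3(t_b, n_b);
}
```

For example, normalized_ratio(0.49, 1088, 0.285, 1100) compares the first MAGMA row with the first Maple row and comes out at roughly 1.8, while normalized_ratio(2.10, 2112, 2.77, 2100) for the second pair is below 1, so on these numbers MAGMA is not slower at every size.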
Re: MAGMA SVD implementation on GPUs?
GeForce cards are designed for graphics and gaming, which primarily use single-precision. Their support for double-precision math is slow -- perhaps 8x slower than single-precision -- while with a high-end Tesla card, double would only be 2x slower than single (same as CPU). You may have better results with single-precision.
BTW, you can add -l or --lapack flag to get LAPACK CPU times from testing_dgesdd or testing_sgesdd.
-mark