Problem about magma_zgeqrf_gpu (A possible bug)
-
boruoshihao
- Posts: 6
- Joined: Thu May 21, 2015 11:48 am
Problem about magma_zgeqrf_gpu (A possible bug)
Hi guys,
I'm trying to do a QR decomposition for a two dimensional matrix a, I also need the determinant of the R matrix.
In Lapack, it can be done by:
1. Call zgeqrf ( m , n , a , lda , tau , work , lwork , info )
2. Determinant of R can be calculated by: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
3. Get the Q matrix by: call zungqr ( m , n , k , a , lda , tau , work , lwork , info )
In Magma, I have a few choices:
A. Use CPU interface:
1. Call magma_zgeqrf(m, n, a, lda, tau, work, lwork, info);
2. Get the determinant R: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
3. Get the Q matrix by magma_zungqr2( m , n , k , a , lda , tau , info )
B. Use GPU interface
0. Pass a to GPU memory da
1. Call magma_zgeqrf_gpu(m,n,da,lda,tau,dT,&info);
2. Pass da to CPU a
3. Get the determinant R: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
4. Get the Q matrix by: call magma_zungqr_gpu(m,n,n,da,lda,tau,dT,nb,&info);
5. Pass da to cpu a
Both A and B methods, gives same Q compare with original lapack.
While the determinant of R is different:
In small lattice size, A and B has the same determinant with original lapack.
In large lattice size, only A has the same determinant with original lapack, B has a totally different result.
Please see the following results (the matrix a is random generated.)
m , n, det form original lapack, det from A, det from B
8 8 29.1222 29.1222 29.1222
16 16 99287.8 99287.8 99287.8
32 32 2.92035e+14 2.92035e+14 2.92035e+14
64 64 2.81437e+38 2.81437e+38 2.53932e+14
128 128 1.23295e+96 1.23295e+96 7.51771e+14
When m>64, we start to see difference. I guess it might be a bug, has anyone looked into it before? I'm using cuda/5.5 and magma/1.6.1 built on acml.
Thanks,
Hao
I'm trying to do a QR decomposition for a two dimensional matrix a, I also need the determinant of the R matrix.
In Lapack, it can be done by:
1. Call zgeqrf ( m , n , a , lda , tau , work , lwork , info )
2. Determinant of R can be calculated by: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
3. Get the Q matrix by: call zungqr ( m , n , k , a , lda , tau , work , lwork , info )
In Magma, I have a few choices:
A. Use CPU interface:
1. Call magma_zgeqrf(m, n, a, lda, tau, work, lwork, info);
2. Get the determinant R: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
3. Get the Q matrix by magma_zungqr2( m , n , k , a , lda , tau , info )
B. Use GPU interface
0. Pass a to GPU memory da
1. Call magma_zgeqrf_gpu(m,n,da,lda,tau,dT,&info);
2. Pass da to CPU a
3. Get the determinant R: det=1.0; for(size_t i=0; i<m; i++) det*=a(i,i);
4. Get the Q matrix by: call magma_zungqr_gpu(m,n,n,da,lda,tau,dT,nb,&info);
5. Pass da to cpu a
Both A and B methods, gives same Q compare with original lapack.
While the determinant of R is different:
In small lattice size, A and B has the same determinant with original lapack.
In large lattice size, only A has the same determinant with original lapack, B has a totally different result.
Please see the following results (the matrix a is random generated.)
m , n, det form original lapack, det from A, det from B
8 8 29.1222 29.1222 29.1222
16 16 99287.8 99287.8 99287.8
32 32 2.92035e+14 2.92035e+14 2.92035e+14
64 64 2.81437e+38 2.81437e+38 2.53932e+14
128 128 1.23295e+96 1.23295e+96 7.51771e+14
When m>64, we start to see difference. I guess it might be a bug, has anyone looked into it before? I'm using cuda/5.5 and magma/1.6.1 built on acml.
Thanks,
Hao
Re: Problem about magma_zgeqrf_gpu (A possible bug)
For the GPU interface, use magma_zgeqrf2_gpu, which has LAPACK complaint arguments.
magma_zgeqrf3_gpu stores diagonal blocks of R separately in dT.
magma_zgeqrf_gpu also inverts the diagonal blocks of R in dT, for faster application when solving a linear system.
-mark
magma_zgeqrf3_gpu stores diagonal blocks of R separately in dT.
magma_zgeqrf_gpu also inverts the diagonal blocks of R in dT, for faster application when solving a linear system.
-mark
-
boruoshihao
- Posts: 6
- Joined: Thu May 21, 2015 11:48 am
Re: Problem about magma_zgeqrf_gpu (A possible bug)
Thank you Mark,
I guess I did not make my question clear enough before. I want A=Q*R, where Q is orthogonal and normalized, then mutiply the diagonal element of R will be the det(R).
If I use magma_zgeqrf2_gpu, which zungqr should be called to get final Q? magma_zungqr2 and magma_zungqr is cpu interface, magma_zungqr_gpu ask for dT, which is not in magma_zgeqrf2_gpu.
I just tried to get diagonal blocks of R from dT in both magma_zgeqrf3_gpu and magma_zgeqrf_gpu. While the det(R) is not correct, it looks like the normalization information is stored in A, not in R after we call zgeqrf.
Hao
I guess I did not make my question clear enough before. I want A=Q*R, where Q is orthogonal and normalized, then mutiply the diagonal element of R will be the det(R).
If I use magma_zgeqrf2_gpu, which zungqr should be called to get final Q? magma_zungqr2 and magma_zungqr is cpu interface, magma_zungqr_gpu ask for dT, which is not in magma_zgeqrf2_gpu.
I just tried to get diagonal blocks of R from dT in both magma_zgeqrf3_gpu and magma_zgeqrf_gpu. While the det(R) is not correct, it looks like the normalization information is stored in A, not in R after we call zgeqrf.
Hao
Re: Problem about magma_zgeqrf_gpu (A possible bug)
I'm not sure what you mean by normalized. Q is orthogonal, meaning Q^T * Q = I. An "orthogonal" matrix has orthonormal columns. (I don't know why it's not called an orthonormal matrix, but orthogonal is the standard terminology. See
http://en.wikipedia.org/wiki/Orthogonal_matrix)
From magma_zgeqrf2_gpu, if you only need R, then it is immediately available as the upper triangle of the matrix. There is no need to call zungqr. If you need Q, unfortunately there does not appear to be a corresponding zungqr2_gpu routine. I'll do some investigation to see if there is an alternative. You should be able to copy the dA matrix to the CPU and use zungqr2.
From magma_zgeqrf_gpu, the diagonal blocks of R are inverted in dT, so it requires extra work to get the original (non-inverted) diagonal of R.
From magma_zgeqrf3_gpu, it should be possible to get the original R diagonal blocks, but you would have to look in detail about how they are stored. I would not recommend this path.
-mark
http://en.wikipedia.org/wiki/Orthogonal_matrix)
From magma_zgeqrf2_gpu, if you only need R, then it is immediately available as the upper triangle of the matrix. There is no need to call zungqr. If you need Q, unfortunately there does not appear to be a corresponding zungqr2_gpu routine. I'll do some investigation to see if there is an alternative. You should be able to copy the dA matrix to the CPU and use zungqr2.
From magma_zgeqrf_gpu, the diagonal blocks of R are inverted in dT, so it requires extra work to get the original (non-inverted) diagonal of R.
From magma_zgeqrf3_gpu, it should be possible to get the original R diagonal blocks, but you would have to look in detail about how they are stored. I would not recommend this path.
-mark
Re: Problem about magma_zgeqrf_gpu (A possible bug)
Also, why are you computing the determinant? It's useful for theory, but not so often in numerical computations.
-mark
-mark
-
boruoshihao
- Posts: 6
- Joined: Thu May 21, 2015 11:48 am
Re: Problem about magma_zgeqrf_gpu (A possible bug)
Dear Mark,
Thank you for the explanation. In my simulation, a Matrix A is a physical wave function. When we apply QR to A, Q is a normalized wave function, the Determinant of R is just the norm of the original wave function. This why I need the Determinant of R.
I can always use CPU interface, it works fine. However it's better to use GPU interface, since we need to call both zgeqrf and zungqr for a complete QR, use CPU interface will cause additional pass between GPU and CPU. (My main code is wrote in CPU, only call magma for QR.)
1. For magma_zgeqrf2_gpu, if I need to copy dA matrix to the CPU and use zungqr2, I would rather to use magma_zgeqrf and magma_zungqr2, since the cost will be same for my code.
2. For magma_zgeqrf_gpu, I have tried to transfer dT to CPU. In the documentation, I can find: " It starts with MIN(M,N)*NB block that store the triangular T matrices, followed by the MIN(M,N)*NB block of the diagonal inverses for the R matrix. " Then I can calculate the det ( Inverse R) from the second MIN(M,N)*NB.
Suppose N<M, we pass dT to T in CPU, T is a one dimensional array has the same size with dT, determinant of R can be calculated:
det=1.0;
for(int i=0;i<N;i++)
{
det*=T(i+i*NB+NB*N);
}
det=1.0/det;
However the det is not correct compared with the one come from zgeqrf.
3. For magma_zgeqrf3_gpu, I tried the to calculate determinant of R similar to magma_zgeqrf_gpu, since dT store the original R: " It starts with MIN(M,N)*NB block that store the triangular T matrices, followed by the MIN(M,N)*NB block of the diagonal matrices for the R matrix. The rest of the array is used as workspace."
I just pass dT to T in cpu, T is a one dimensional array has the same size with dT, then calculate determinant from:
det=1.0;
for(int i=0;i<N;i++)
{
det*=T(i+i*NB+NB*N);
}
det is not correct neither.
Hope I did not bother you too much, the CPU interface works fine for me. I'm just curious about the GPU interface, it looks like the det is not correct in GPU interface. Thanks~~
Hao
Thank you for the explanation. In my simulation, a Matrix A is a physical wave function. When we apply QR to A, Q is a normalized wave function, the Determinant of R is just the norm of the original wave function. This why I need the Determinant of R.
I can always use CPU interface, it works fine. However it's better to use GPU interface, since we need to call both zgeqrf and zungqr for a complete QR, use CPU interface will cause additional pass between GPU and CPU. (My main code is wrote in CPU, only call magma for QR.)
1. For magma_zgeqrf2_gpu, if I need to copy dA matrix to the CPU and use zungqr2, I would rather to use magma_zgeqrf and magma_zungqr2, since the cost will be same for my code.
2. For magma_zgeqrf_gpu, I have tried to transfer dT to CPU. In the documentation, I can find: " It starts with MIN(M,N)*NB block that store the triangular T matrices, followed by the MIN(M,N)*NB block of the diagonal inverses for the R matrix. " Then I can calculate the det ( Inverse R) from the second MIN(M,N)*NB.
Suppose N<M, we pass dT to T in CPU, T is a one dimensional array has the same size with dT, determinant of R can be calculated:
det=1.0;
for(int i=0;i<N;i++)
{
det*=T(i+i*NB+NB*N);
}
det=1.0/det;
However the det is not correct compared with the one come from zgeqrf.
3. For magma_zgeqrf3_gpu, I tried the to calculate determinant of R similar to magma_zgeqrf_gpu, since dT store the original R: " It starts with MIN(M,N)*NB block that store the triangular T matrices, followed by the MIN(M,N)*NB block of the diagonal matrices for the R matrix. The rest of the array is used as workspace."
I just pass dT to T in cpu, T is a one dimensional array has the same size with dT, then calculate determinant from:
det=1.0;
for(int i=0;i<N;i++)
{
det*=T(i+i*NB+NB*N);
}
det is not correct neither.
Hope I did not bother you too much, the CPU interface works fine for me. I'm just curious about the GPU interface, it looks like the det is not correct in GPU interface. Thanks~~
Hao
Re: Problem about magma_zgeqrf_gpu (A possible bug)
Hi Hao,
The R and T matrix used in output of either qrf routine are special.
The R is not the whole R, it contains only N/NB diagonal block of size NBxNB each.
When is it inverted it is because it simplify the solve when it need to be solve for example Ax =b ==> QRx=b ==> Rx = Q'b
When it is stored, it is because it original position in A is zeroed to accelerate the application or generation of Q bu unmqr/ungqr.
So I think that, in order to compute the determinant you have to use the upper portion of A combined with the block diagonal.
Thanks
Azzam
The R and T matrix used in output of either qrf routine are special.
The R is not the whole R, it contains only N/NB diagonal block of size NBxNB each.
When is it inverted it is because it simplify the solve when it need to be solve for example Ax =b ==> QRx=b ==> Rx = Q'b
When it is stored, it is because it original position in A is zeroed to accelerate the application or generation of Q bu unmqr/ungqr.
So I think that, in order to compute the determinant you have to use the upper portion of A combined with the block diagonal.
Thanks
Azzam