testing_zgesvd segmentation fault in V1.0 RC3 with mkl10.3

Open discussion for MAGMA library (Matrix Algebra on GPU and Multicore Architectures)
addee
Posts: 3
Joined: Wed Jan 26, 2011 9:10 pm

testing_zgesvd segmentation fault in V1.0 RC3 with mkl10.3

Post by addee » Wed Jan 26, 2011 9:29 pm

Hi,

I compiled MAGMA V1.0 RC3 with make.inc.shared (with the MKL path adjusted accordingly). Other test drivers such as testing_sgeqrf work fine. However, testing_zgesvd crashes with a segmentation fault:
$>./testing_zgesvd
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
Usage:
testing_zgesvd -M 1024 -N 1024

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================
Segmentation fault
The backtrace from gdb is:
#0 0x00007ffff4ad8d2b in mkl_lapack_zlacpy ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so
#1 0x00007ffff5e8caf8 in zlacpy_ ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so
#2 0x00000000004011a0 in main ()
testing_zgesvd is commented out in testing/Makefile by default. Is it a working version?
Does anyone have any insight into this? Thanks!


PS: my environment is
Ubuntu 10.10 with MKL 10.3, GCC 4.4.5
Dual-socket Intel i7 Xeon
Two GTX 470

Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by Stan Tomov » Thu Jan 27, 2011 9:47 pm

Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check somewhere the result of GPU memory allocation). Can you check if it would work for a fixed smaller size problem, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan

brom
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by brom » Fri Jan 28, 2011 12:43 pm

This segfaults for me too using ATLAS (I still can't compile with MKL).

addee
Posts: 3
Joined: Wed Jan 26, 2011 9:10 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by addee » Sun Jan 30, 2011 4:39 pm

Stan Tomov wrote:Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check somewhere the result of GPU memory allocation). Can you check if it would work for a fixed smaller size problem, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan
Thanks Stan. Yes, it seems to be a memory limitation, as a smaller matrix (for example 1024*1024) works.
What's the rule of thumb for how large a matrix magma zgesvd can handle in the current implementation? Is the entire matrix shipped to device memory? And how much extra workspace is needed on the device?
For example, the 8064*8064 double complex matrix in testing_zgesvd is 992 MB, which fails on my GTX 470 with 1280 MB of GPU memory.
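
The 992 MB figure can be reproduced with a quick calculation. This is only a rough sketch: `matrix_mib` is a hypothetical helper (not part of MAGMA), it counts a single copy of the matrix with no padding, and it says nothing about the extra device workspace zgesvd needs, which is the open question here. It also assumes the memory figure printed by the tester is in the same unit (MiB).

```python
# Back-of-the-envelope size of one dense matrix in device memory.
# Element sizes follow the usual s/d/c/z precisions:
# float, double, single complex, double complex.
BYTES_PER_ELEMENT = {"s": 4, "d": 8, "c": 8, "z": 16}


def matrix_mib(m, n, precision="z"):
    """Size in MiB of one dense m-by-n matrix (no padding, no workspace)."""
    return m * n * BYTES_PER_ELEMENT[precision] / 2**20


if __name__ == "__main__":
    size = matrix_mib(8064, 8064, "z")
    reported = 1279.7  # taken from the "device 0" line printed by the tester
    print(f"8064 x 8064 double complex: {size:.2f} MiB")
    print(f"remaining on the card:      {reported - size:.2f} MiB")
```

One copy of the matrix alone already takes most of the card, so any additional device buffers of comparable size would not fit.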


Another question: is sgesvd a working version?
I also tried sgesvd by adapting testing_zgesvd: I changed the variables to float and replaced "z" with "s" in the relevant LAPACK function names. sgesvd fails differently for different matrix sizes. I inserted a printf checkpoint after every major function call.

10*10: segfaults when releasing the memory. Also, the reported error of 1.0 is too large.
$>./testing_sgesvd -M 10 -N 10
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
testing_sgesvd -M 10 -N 10

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
10 0.00 0.00 1.000000e+00
Segmentation fault
100*100: segfaults when calling magma_sgesvd
$>./testing_sgesvd -M 100 -N 100
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 100 -N 100

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.
Segmentation fault

1000*1000: the first call to magma_sgesvd looks fine, but a "can not bind to texture" error is printed repeatedly during the second call to magma_sgesvd

$>./testing_sgesvd -M 1000 -N 1000
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 1000 -N 1000

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.
can not bind to texture
can not bind to texture
..........(thousands of lines of "can not bind to texture")
can not bind to texture
can not bind to texture

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
1000 4.36 6.70 nan
Segmentation fault
Thank you.

fletchjp
Posts: 203
Joined: Mon Dec 27, 2010 7:29 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by fletchjp » Mon Jan 31, 2011 6:42 am

addee

The single precision equivalent of zgesvd would be cgesvd. Have you tried that?
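
For context, the leading letter of a LAPACK routine name encodes the data type. The sketch below just spells out that convention; `same_routine_in` is a hypothetical helper for illustration and is not part of LAPACK or MAGMA.

```python
# LAPACK naming convention: the first letter of a routine name is the data type.
PREFIXES = {
    "s": "real, single precision",
    "d": "real, double precision",
    "c": "complex, single precision",
    "z": "complex, double precision",
}


def same_routine_in(name, prefix):
    """Swap the leading precision letter of a LAPACK-style routine name."""
    if name[0] not in PREFIXES or prefix not in PREFIXES:
        raise ValueError(f"unknown precision prefix in {name!r} or {prefix!r}")
    return prefix + name[1:]


if __name__ == "__main__":
    for p in "sdcz":
        print(same_routine_in("zgesvd", p), "->", PREFIXES[p])
```

So cgesvd keeps the complex data type of zgesvd and halves the precision, while sgesvd is the routine for real single-precision matrices.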

I am interested in the 'can not bind to texture' messages. I tried researching it on google but only found references to my own messages on this list!!

If anyone knows more about it please post something.

Best wishes

John

addee
Posts: 3
Joined: Wed Jan 26, 2011 9:10 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by addee » Mon Jan 31, 2011 4:33 pm

fletchjp wrote:The single precision equivalent of zgesvd would be cgesvd. Have you tried that?
Hi John, I didn't try cgesvd because I need to handle real-valued matrices, so I used sgesvd.

The "can not bind to texture" message does not seem to come from the MAGMA code, as I grepped the MAGMA src folder for that sentence without finding it. I think the error is more likely reported by cuBLAS.

Has anyone had success with magma_sgesvd? The usage comment in sgesvd.cpp is the same as in zgesvd and says the input matrix A is a COMPLEX*16 array. I wonder if it's auto-generated code. (No offense ;) )

brom
Posts: 18
Joined: Tue Jan 25, 2011 8:20 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by brom » Mon Jan 31, 2011 5:10 pm

"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound, usually due to hardware limitations.

fletchjp
Posts: 203
Joined: Mon Dec 27, 2010 7:29 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by fletchjp » Mon Jan 31, 2011 6:25 pm

brom wrote:"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound, usually due to hardware limitations.
Is there any CUDA or other NVIDIA documentation on this error message and the context which might cause it? I have cases which sometimes give it and sometimes not, and I suspect that memory on the CPU or GPU is getting into an inconsistent state.

Does anyone know of any NVIDIA tools to help with this sort of problem? I have been using cuda-memcheck but it does not seem to be finding the problems.

I am working on Ubuntu Linux 10.04 (64 bit).

Thanks

John

Stan Tomov
Posts: 283
Joined: Fri Aug 21, 2009 10:39 pm

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Post by Stan Tomov » Mon Jan 31, 2011 6:40 pm

Yes, the code is generated for the different precisions starting from double complex. We are still fixing this routine in real arithmetic.
Stan
