testing_zgesvd segmentation fault in V1.0 RC3 with mkl10.3

Posted: Wed Jan 26, 2011 9:29 pm
by addee
Hi,

I compiled MAGMA V1.0 RC3 with make.inc.shared (with the MKL path fixed accordingly). Other test drivers such as testing_sgeqrf work fine. However, testing_zgesvd crashes with a segmentation fault:
$>./testing_zgesvd
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
Usage:
testing_zgesvd -M 1024 -N 1024

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================
Segmentation fault
The backtrace from gdb is:
#0 0x00007ffff4ad8d2b in mkl_lapack_zlacpy ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_core.so
#1 0x00007ffff5e8caf8 in zlacpy_ ()
from /opt/intel/composerxe-2011.1.107/mkl/lib/intel64/libmkl_intel_lp64.so
#2 0x00000000004011a0 in main ()
The testing_zgesvd is commented out in the testing/Makefile by default. Is it a working version?
Do you guys have any insight about this? Thanks!


PS: my environment is
Ubuntu 10.10 with MKL 10.3, GCC 4.4.5
Dual-socket Intel i7 Xeon
Two GTX 470

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Thu Jan 27, 2011 9:47 pm
by Stan Tomov
Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check somewhere the result of GPU memory allocation). Can you check if it would work for a fixed smaller size problem, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Fri Jan 28, 2011 12:43 pm
by brom
This seg faults for me too using Atlas (still can't compile with MKL).

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Sun Jan 30, 2011 4:39 pm
by addee
Stan Tomov wrote:Hi,
I was wondering if the problem is due to a memory limitation (in which case we have forgotten to check somewhere the result of GPU memory allocation). Can you check if it would work for a fixed smaller size problem, e.g.,
./testing_zgesvd -M 1024 -N 1024
Thanks,
Stan
Thanks Stan. Yes, it seems to be a memory limitation, as a smaller matrix (for example 1024*1024) works.
What's the rule of thumb for how large a matrix magma zgesvd can handle in the current MAGMA implementation? Is the entire matrix shipped to device memory? And how much extra workspace is needed on the device?
For example, the 8064*8064 double complex matrix in testing_zgesvd is 992MB, which fails in magma on my GTX 470 with 1280MB of GPU memory.
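
The 992MB figure can be checked with a quick calculation. This is a minimal sketch in Python, assuming dense storage with no padding and 16 bytes per double complex element; the helper name matrix_bytes is made up for illustration and is not part of magma.

```python
def matrix_bytes(rows, cols, elem_size):
    """Bytes for a dense rows x cols matrix, assuming no padding."""
    return rows * cols * elem_size


Z_SIZE = 16  # double complex: two 8-byte doubles

n = 8064
size = matrix_bytes(n, n, Z_SIZE)
mib = size / 2**20  # bytes -> MiB

print(f"{n} x {n} double complex: {size} bytes = {mib:.2f} MiB")
```

This only counts one copy of the input matrix. Any workspace or second copy held on the device at the same time would have to fit in what is left of the 1279.7 MB the card reports.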


Another question: Is the sgesvd a working version?
I also tried sgesvd by adapting testing_zgesvd: I changed the variables to float and replaced "z" with "s" in the relevant LAPACK function names. The sgesvd version produces different errors for different matrix sizes. I inserted printf checkpoints after every major function call.

10*10: Segfaults when releasing the memory. Also, the reported error of 1.0 is too large.
$>./testing_sgesvd -M 10 -N 10
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
testing_sgesvd -M 10 -N 10

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
10 0.00 0.00 1.000000e+00
Segmentation fault
100*100: Segfaults when calling magma_sgesvd
$>./testing_sgesvd -M 100 -N 100
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 100 -N 100

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.
Segmentation fault

1000*1000: The first call to magma_sgesvd looks good, but a "can not bind to texture" error is printed repeatedly during the second call to magma_sgesvd

$>./testing_sgesvd -M 1000 -N 1000
device 0: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory
device 1: GeForce GTX 470, 1215.0 MHz clock, 1279.7 MB memory

testing_sgesvd -M 1000 -N 1000

N CPU Time(s) GPU Time(s) ||R||_F / ||A||_F
==========================================================

check point: passed lapackf77_slarnv.

check point: passed lapackf77_slacpy.

check point: passed first magma_sgesvd.

check point: passed h_R=h_A.
can not bind to texture
can not bind to texture
..........(thousands of lines of "can not bind to texture")
can not bind to texture
can not bind to texture

check point: passed second magma_sgesvd.

check point: passed lapackf77_sgesvd.
1000 4.36 6.70 nan
Segmentation fault
Thank you.

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Mon Jan 31, 2011 6:42 am
by fletchjp
addee

The single precision equivalent of zgesvd would be cgesvd. Have you tried that?
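
For reference, the standard LAPACK prefixes are s (single real), d (double real), c (single complex) and z (double complex). A small sketch in Python of that naming convention follows; the table and the function single_precision_name are illustrative helpers, not anything shipped with LAPACK or magma.

```python
# Standard LAPACK precision prefixes and element sizes in bytes.
PREFIXES = {
    "s": ("single real", 4),
    "d": ("double real", 8),
    "c": ("single complex", 8),
    "z": ("double complex", 16),
}

# Single-precision counterpart within the same field (real or complex).
SINGLE_OF = {"d": "s", "z": "c"}


def single_precision_name(routine):
    """Return the single-precision name of a d- or z-prefixed routine."""
    prefix, rest = routine[0], routine[1:]
    return SINGLE_OF[prefix] + rest


print(single_precision_name("zgesvd"))  # cgesvd
print(single_precision_name("dgesvd"))  # sgesvd
```

So going from zgesvd to sgesvd changes two things at once: the precision (double to single) and the field (complex to real).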

I am interested in the 'can not bind to texture' messages. I tried researching it on Google but only found references to my own messages on this list!

If anyone knows more about it please post something.

Best wishes

John

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Mon Jan 31, 2011 4:33 pm
by addee
fletchjp wrote:The single precision equivalent of zgesvd would be cgesvd. Have you tried that?
Hi John, I didn't try cgesvd because my matrices have real entries, so I used sgesvd.

The "can not bind to texture" message does not come from the magma code: I grepped for the sentence in the magma src folder and found nothing. I think the error is more likely reported by cuBLAS.

Has anyone used magma_sgesvd successfully? The usage comment in sgesvd.cpp is the same as in zgesvd and says the input matrix A is a COMPLEX*16 array. I wonder if it's auto-generated code. (No offense ;) )

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Mon Jan 31, 2011 5:10 pm
by brom
"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound. Usually this is due to hardware limitations.

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Mon Jan 31, 2011 6:25 pm
by fletchjp
brom wrote:"can not bind to texture" is a CUDA error that happens when, well, a texture can't be bound. Usually this is due to hardware limitations.
Is there any CUDA or other NVIDIA documentation on this error message and the context which might cause it? I have cases which sometimes give it and sometimes not, and I suspect that memory on the CPU or GPU is getting into an inconsistent state.

Does anyone know of any NVIDIA tools to help with this sort of problem? I have been using cuda-memcheck but it does not seem to be finding the problems.

I am working on Ubuntu Linux 10.04 (64 bit).

Thanks

John

Re: testing_zgesvd segmentation fault in V1.0 RC3 with mkl10

Posted: Mon Jan 31, 2011 6:40 pm
by Stan Tomov
Yes, the code is generated for the different precisions starting from double complex. We are still fixing this routine in real arithmetic.
Stan
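
As a rough illustration of what precision generation by substitution can look like, here is a toy sketch in Python. It is not MAGMA's actual generator: the rule table Z_TO_S, the function generate, and the shortened declaration in z_source are all made up for this example. It shows one plausible way a generated sgesvd could end up with a comment that still says COMPLEX*16: identifier rules rewrite the code, while prose in comments is left alone unless it has its own rule.

```python
import re

# Hypothetical substitution rules for double complex -> single real.
# Illustrative only; not MAGMA's real rule set.
Z_TO_S = [
    (r"\bcuDoubleComplex\b", "float"),
    (r"\bmagma_zgesvd\b", "magma_sgesvd"),
    (r"\blapackf77_zlacpy\b", "lapackf77_slacpy"),
]


def generate(source, rules):
    """Apply each (pattern, replacement) rule to the source text in order."""
    for pattern, replacement in rules:
        source = re.sub(pattern, replacement, source)
    return source


# Shortened, made-up declaration used only as input for the demo.
z_source = """\
/* A  (input/output) COMPLEX*16 array, dimension (LDA,N) */
int magma_zgesvd(char jobu, char jobvt, int m, int n, cuDoubleComplex *A);
"""

s_source = generate(z_source, Z_TO_S)
print(s_source)
```

In the output the declaration becomes magma_sgesvd with a float pointer, while the comment line is unchanged because no rule matches it.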