LAPACK/ScaLAPACK Development

by **CyLith** » Tue Aug 03, 2010 2:24 pm

I have a program which computes eigensystems in separate (POSIX) threads using zgeev. I have noticed incorrect/inconsistent results when I do this. If I put a mutex around the call the zgeev, then the results are consistently correct, but when I don't, each time I run it, the results are different. For example, the output is

Code: Select all: $ uname -a Linux hostname 2.6.32-22-server #36-Ubuntu SMP Thu Jun 3 20:38:33 UTC 2010 x86_64 GNU/Linux $ make good g++ -DGOOD main.cpp /usr/lib/liblapack.a /usr/lib/libblas.a -lgfortran -z muldefs -pthread $ ./a.out Info1,2 = 0,0 -7050.245589,-8479.237956 -8580.257527,-9209.117716 -6301.087406,-6399.428666 $ ./a.out Info1,2 = 0,0 -7050.245589,-8479.237956 -8580.257527,-9209.117716 -6301.087406,-6399.428666 $ make bad g++ main.cpp /usr/lib/liblapack.a /usr/lib/libblas.a -lgfortran -z muldefs -pthread $ ./a.out Info1,2 = 0,0 -7050.355400,-8478.864949 -8457.115555,-8501.374698 1703.121368,-3150.274644 $ ./a.out Info1,2 = 0,0 -7106.927073,-8501.507914 -8580.257527,-9209.117716 -6299.486160,-6398.683343

The compiler flag

Code: Select all: -z muldefs

is to suppress the multiply defined xerbla symbol.

I can run the good scenario many times, and it always produces the right output, while every time I run the bad scenario (without a mutex around the call), it always produces different results.

I have attached the simplest skeleton code with a test matrix that shows this behavior. The makefile and example output is embedded in the comments in the code. As you can see from the above output, I am running this on a 64-bit Ubuntu server. I have double checked the integer sizes (int is 32-bit, long int is 64-bit, and it actually doesn't make a difference which integer type I use). I ran a simple test code to call ilaver, and it returned 3.2.1. I believe the libraries are the default ones packaged by Ubuntu. I am wondering if anyone can verify this or else I hope someone can point what I am doing wrong, since the Lapack 3.2 release notes say that everything should be thread-safe (and I initialize dlamch/slamch before threads start).

Edit: I should add that if I don't do a workspace query and use the minimal workspace, then I always get consistently good results. It appears that the problem is in the blocked code. I found this problem running valgrind's DRD tool, which reported (vague) errors that indicate the problem is in zlahr2.

Edit: It appears that the local array T is probably allocated statically (it exceeds the maximum gfortran stack variable size). Is it possible to make T part of the workspace?

by **Julien Langou** » Tue Aug 03, 2010 4:40 pm

Hello,
1) I can reproduce the bug with your codes, it works great,
2) yes I found as well that adding the line lwork = 2*n, after the lwork computation was fixing everything thing, so I confirm your second edits,
3) looking at the third edit now,
... really interesting problems, thanks a lot for the comprehensive code ...
Julien.

(Please answer without editing your previous post.)

by **Julien Langou** » Tue Aug 03, 2010 4:54 pm

So I believe you are blaming the static allocation of the array T online 121 of DGEHRD?

Code: Select all: INTEGER NBMAX, LDT PARAMETER ( NBMAX = 64, LDT = NBMAX+1 ) [ ... ] DOUBLE PRECISION T( LDT, NBMAX )

Yes we can make T part of the workspace ...
Not that much of an easy change ... (change the interface in some sense, changes the minimal workspace for the routine)
Let us first try to see if this "works".

Actually they are other good reasons to keep the "T".
(We recompute it quite often and we could save flops by keeping them.)

It would be nice if you can confirm that your doubt are on this static allocation, then it's "easy" to check that this is the problem.
I'll try to work some more on this tomorrow.

Cheers,
Julien.

by **CyLith** » Tue Aug 03, 2010 6:54 pm

I will try to recompile my MingW lapack-3.2.2 with -frecursive and -fmax-stack-var-size to see if that fixes it. From what I understand from the documentation, it should.

I don't understand why this would change the minimal workspace. The unblocked code does not use T, and you could simply change the workspace query to add on the extra space for T. This means of course that there will be a range of workspace sizes between the current minimum blocked size and the new size for which the unblocked code would be used. However, I expect that most uses of the blocked code would simply perform a workspace query so this is largely not an issue.

It would be nice if there were a collected list of all known threading gotchas; I know that in terms of pure code, Lapack is thread safe, but I was not aware of issues like local arrays, which have compiler-dependent behavior. For example, the dlamch/slamch initialization is not well documented (I first found the solution on an Apple forum), etc.

by **Julien Langou** » Tue Aug 03, 2010 7:14 pm

I will try to recompile my MingW lapack-3.2.2 with -frecursive and -fmax-stack-var-size to see if that fixes it. From what I understand from the documentation, it should.

Great, I tried to play this afternoon with 'ulimit -s' and 'fmax-stack-var-size' with not much success, but please try, also note that there is static allocation in zgehrd but it's fairly small, and there is no static allocation in zlahr2. So I am not sure this comes from here. In any case, your help is much appreciated and any suggestions, experiments, reports, bug fixes (!) as well.

I don't understand why this would change the minimal workspace.

Right. I messed up. It will not change the interface. That'll work just fine.

It would be nice if there were a collected list of all known threading gotchas; I know that in terms of pure code, Lapack is thread safe, but I was not aware of issues like local arrays, which have compiler-dependent behavior. For example, the dlamch/slamch initialization is not well documented (I first found the solution on an Apple forum), etc.

(1) Well, you are the first bug report on this issue, and we have no idea where the bug at this point. There is an Errata webpage with all the bug reports at http://www.netlib.org/lapack/Errata/index.html where we collect all bug reports and current fixes in the svn. You have a good chance to end up there! We also just released a bug fix (3.2.2), see: http://www.netlib.org/lapack/lapack-3.2.2.html.

(2) We have fixed xLAMCH in the svn repository on July 02nd. (We should report it in the Errata webpage ...) A priori your use of LAPACK-3.2.1 is thread safe. We have no idea why your code does not work at this point. Your code seems correct. So there is probably a bug somewhere. If this is a static array allocation on the stack from LAPACK, we'll be happy to remove this asap.

So to repeat the only stack allocation I see at this point is in ZGEHRD, it's "small" so I do not see why that should be the problem. Since we do not have anymore hints at this point, tomorrow, I'll probably try to change this static allocation with a dynamic Fortran 90 allocation and see if that changes something.

Thanks for your help.
Julien.

by **CyLith** » Tue Aug 03, 2010 8:58 pm

I have tried recompiling just the file zgehrd.f with -frecursive as well as -fmax-stack-var-size=66560 and -fmax-stack-var-size=66559. With The first two flags, the test runs correctly, while the last one produces bad results since the size of T is 64*65*16 bytes. This was done on my Windows 7 machine with Mingw32 and gfortran 4.4.1. I hope this shows the problem clearly.

I have also modified zgehrd to expand the workspace query in the way I indicated and the patch is attached. I haven't thought through it completely to see if every case is handled correctly however. I only believe it to be correct. Essentially, I tack on the space for T at the end of the work array. I'm not sure the patch is really necessary since it seems like we should rely more on compiler flags for the right behavior. Unfortunately, I would guess that most package maintainers don't bother fiddling with the flags or are unaware of these issues.

Also, thanks for pointing out the new dlamch in svn. I didn't realize there were intrinsics for those things and I was about to hard code the constants in (as I probably should have in the first place). This should greatly reduce the number of (spurious) problems reported by valgrind!

by **Julien Langou** » Wed Aug 04, 2010 1:06 pm

Awesome. So I guess that's it.

(I am using gfortran on MAC OS and I can not use frecursive with fmax-stack-var-size so I only used fmax-stack-var-size.)

I confirm what you wrote.
main.cpp is a correct multithreaded code that spawns two simultaneous execution of zgeev (on two distinct threads)
1) Compiling main.cpp as is => NO
2) Compiling main.cpp with -DGOOD option (=mutex protection of each zgeev call) => OK
3) Compiling zgehrd with -fmax-stack-var-size=66560 then linking with main.cpp => OK
4) Compiling zgehrd with -fmax-stack-var-size=66559 then linking with main.cpp => NO
5) Using your pacth on zgehrd then linking with main.cpp => OK
(NO: means that the code runs in two threads returns an unpredectiable uncorrect answer. Very fun to observe.)

Guilty lines of code in dgehrd.f:

Code: Select all: INTEGER NBMAX, LDT PARAMETER ( NBMAX = 64, LDT = NBMAX+1 ) [...] * .. * .. Local Arrays .. DOUBLE PRECISION T( LDT, NBMAX )

which is an allocation of an array on the stack of size 64*65*16 = 66560 bytes which is greater than 32768. (The current (4.5.0) default value for gfortran max-stack-var-size.) The LAPACK code is overflowing the maximum
size in bytes of the largest array that can be put on the stack with gfortran. Great.

So I will contact other lapackers on what best to do with the issue. We have several option to fix this: dynamic allocation, going with the workspace as you did, requesting a stack bigger than 66560 bytes ... (other options?). My preference would be to go with the workspace as you did.

I have just added the bug as bug#0061 on the LAPACK Errata webpage.

The problem is potentially not only zgehrd.f. I have checked the LAPACK codes (only ./SRC/), the potential problematic subroutines (allocation on the stack) can be identified by the string of characters: "Local Arrays".

Code: Select all: grep -H -A 4 -B 0 "Local Arrays" ./SRC/*f

This gives 311 routines. A brief look at them reveal that a lot of them allocate very small arrays on the stack (e.g., 4 DOUBLE PRECISION). With a threshold at 64 bytes, I found that they are 60 subroutines left. They are given below. It sounds that we are going to have a lots of fun.

A question is: what is the maximum size in bytes of the largest array that can be put on the stack with standard fortran compilers?

Cheers
--julien.

Code: Select all: ======================================================================================= COMPLEX PRECISION ROUTINES ======================================================================================= cgbtrf.f:* .. Local Arrays .. cgbtrf.f- COMPLEX WORK13( LDWORK, NBMAX ), cgbtrf.f- $ WORK31( LDWORK, NBMAX ) ======================================================================================= cgehrd.f:* .. Local Arrays .. cgehrd.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= chseqr.f:* .. Local Arrays .. chseqr.f- COMPLEX HL( NL, NL ), WORKL( NL ) ======================================================================================= clarnv.f:* .. Local Arrays .. clarnv.f- REAL U( LV ) ======================================================================================= clatdf.f:* .. Local Arrays .. clatdf.f- REAL RWORK( MAXDIM ) clatdf.f- COMPLEX WORK( 4*MAXDIM ), XM( MAXDIM ), XP( MAXDIM ) ======================================================================================= cpbtrf.f:* .. Local Arrays .. cpbtrf.f- COMPLEX WORK( LDWORK, NBMAX ) ======================================================================================= ctgex2.f:* .. Local Arrays .. ctgex2.f- COMPLEX S( LDST, LDST ), T( LDST, LDST ), WORK( 8 ) ======================================================================================= ctgsy2.f:* .. Local Arrays .. ctgsy2.f- INTEGER IPIV( LDZ ), JPIV( LDZ ) ctgsy2.f- COMPLEX RHS( LDZ ), Z( LDZ, LDZ ) ======================================================================================= cunmlq.f:* .. Local Arrays .. cunmlq.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmql.f:* .. Local Arrays .. cunmql.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmqr.f:* .. Local Arrays .. cunmqr.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmrq.f:* .. Local Arrays .. cunmrq.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmrz.f:* .. Local Arrays .. cunmrz.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= ======================================================================================= REAL PRECISION ROUTINES ======================================================================================= dgbtrf.f:* .. Local Arrays .. dgbtrf.f- DOUBLE PRECISION WORK13( LDWORK, NBMAX ), dgbtrf.f- $ WORK31( LDWORK, NBMAX ) ======================================================================================= dgehrd.f:* .. Local Arrays .. dgehrd.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dhseqr.f:* .. Local Arrays .. dhseqr.f- DOUBLE PRECISION HL( NL, NL ), WORKL( NL ) ======================================================================================= dlaexc.f:* .. Local Arrays .. dlaexc.f- DOUBLE PRECISION D( LDD, 4 ), U( 3 ), U1( 3 ), U2( 3 ), dlaexc.f- $ X( LDX, 2 ) ======================================================================================= dlaln2.f:* .. Local Arrays .. dlaln2.f- LOGICAL RSWAP( 4 ), ZSWAP( 4 ) dlaln2.f- INTEGER IPIVOT( 4, 4 ) dlaln2.f- DOUBLE PRECISION CI( 2, 2 ), CIV( 4 ), CR( 2, 2 ), CRV( 4 ) ======================================================================================= dlarnv.f:* .. Local Arrays .. dlarnv.f- DOUBLE PRECISION U( LV ) ======================================================================================= dlaruv.f:* .. Local Arrays .. dlaruv.f- INTEGER MM( LV, 4 ) ======================================================================================= dlasrt.f:* .. Local Arrays .. dlasrt.f- INTEGER STACK( 2, 32 ) ======================================================================================= dlatdf.f:* .. Local Arrays .. dlatdf.f- INTEGER IWORK( MAXDIM ) dlatdf.f- DOUBLE PRECISION WORK( 4*MAXDIM ), XM( MAXDIM ), XP( MAXDIM ) ======================================================================================= dormlq.f:* .. Local Arrays .. dormlq.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormql.f:* .. Local Arrays .. dormql.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormqr.f:* .. Local Arrays .. dormqr.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormrq.f:* .. Local Arrays .. dormrq.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormrz.f:* .. Local Arrays .. dormrz.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dpbtrf.f:* .. Local Arrays .. dpbtrf.f- DOUBLE PRECISION WORK( LDWORK, NBMAX ) ======================================================================================= dtgex2.f:* .. Local Arrays .. dtgex2.f- INTEGER IWORK( LDST ) dtgex2.f- DOUBLE PRECISION AI( 2 ), AR( 2 ), BE( 2 ), IR( LDST, LDST ), dtgex2.f- $ IRCOP( LDST, LDST ), LI( LDST, LDST ), dtgex2.f- $ LICOP( LDST, LDST ), S( LDST, LDST ), ======================================================================================= dtgsy2.f:* .. Local Arrays .. dtgsy2.f- INTEGER IPIV( LDZ ), JPIV( LDZ ) dtgsy2.f- DOUBLE PRECISION RHS( LDZ ), Z( LDZ, LDZ ) =======================================================================================

by **CyLith** » Tue Nov 13, 2012 4:41 am

I keep running into this problem, making parallelization with OpenMP or any other threading method completely impossible. Since there has been no apparent action on this errata item, I thought I would try to get the ball rolling. I'm posting the full function list here. I did a similar grep and manually weeded out the trivially small local arrays and also arrived at a figure of almost 60 affected functions. I did not include things like xerbla_array which hold a fixed size string, but I did include functions that keep a small stack of length 32. My list is:

Code: Select all: *cgbtrf (c 64*65) *cgehrd (c 64*65) chseqr (c 49*49) clarnv (s 128) clatdf (c 4*2) cpbtrf (c 32*33) ctgex2 (c 2*2) ctgsy2 (c 2*2) *cunmlq (c 64*65) *cunmql (c 64*65) *cunmqr (c 64*65) *cunmrq (c 64*65) *cunmrz (c 64*65) *dgbtrf (d 64*65) *dgehrd (d 64*65) dhseqr (d 49*49) dlaexc (d 4*4) dlarnv (d 128) dlaruv (d 128*4) dlasrt (i 2*32) dlatdf (d 4*8) *dormlq (d 64*65) *dormql (d 64*65) *dormqr (d 64*65) *dormrq (d 64*65) *dormrz (d 64*65) dpbtrf (d 32*33) dtgex2 (d 4*4) dtgsy2 (d 8*8) sgbtrf (s 64*65) sgehrd (s 64*65) shseqr (s 49*49) slaexc (s 4*4) slarnv (s 128) slaruv (s 128*4) slasrt (i 2*32) slatdf (s 4*8) sormlq (s 64*65) sormql (s 64*65) sormqr (s 64*65) sormrq (s 64*65) sormrz (s 64*65) spbtrf (s 32*33) stgex2 (s 4*4) stgsy2 (s 8*8) *zgbtrf (z 64*65) *zgehrd (z 64*65) *zhseqr (z 49*49) zlatdf (z 2*4) zpbtrf (z 32*33) ztgex2 (z 2*2) ztgsy2 (z 2*2) *zunmlq (z 64*65) *zunmql (z 64*65) *zunmqr (z 64*65) *zunmrq (z 64*65) *zunmrz (z 64*65)

Next to each function is a description of the largest array declared. The names with an asterisk in front are the functions which would be affected with the current GCC default of 32768 bytes. Since all the orthogonal/unitary QR-like routines all have similar sizes, a simple modification of any one of those should carry over well to all the others, and eliminate most of the problems. The other major culprits are the hessenberg reduction routines (xGEHRD), the hessenberg QR driver (xHSEQR, due to the odd limited subdiagonal space problem), and the general banded triangular factorizations (xGBTRF). In fact just dealing with these 4 classes of routines would solve all existing GCC issues.

I have brought the issue up with some package maintainers, namely the need to compile with -frecursive, but I have not received any responses, and all the default repository packages seem to still have this problem. Therefore, I see fixing the problem at the source the only way to get things fixed.

by **CyLith** » Tue Nov 13, 2012 7:00 am

I have gone through all the orthogonal/unitary LQ, QL, QR, RQ, and generation routines, as well as the Hessenberg reduction routines and fixed the bug by requring a larger workspace for blocking. Old code which makes assumptions about the blocking workspace size will simply degrade to using unblocked versions. Specifying the minimum workspace size should still work fine, and proper workspace queries will return the size needed for full blocking. I have attached the diff against the SVN trunk as of the time of this post.

The remaining 3 classes of routines present interesting challenges. First, near and dear to me is the Hessenberg QR driver (xHSEQR). The local array is only used for small matrices that fail with xLAHQR, which are then copied into this temporary array. One way we can go about this is to simply bail on small matrices which don't provide enough workspace, but this could break old code. The comments suggest that the local array can be made much smaller than it currently is, so that is another route. Alternatively, this can just stay unfixed, with the caveat that thread safety cannot be guaranteed for small matrices; they must be embedded in a bigger matrix (which I'm fine with now that I know).

The general and positive banded triangular factorization routines have no workspace parameter to begin with, so there is not much that can be done.

These problems could be solved by just making a new set of driver routines so that interface compatibility can be broken. The banded routines can just demand a workspace. xHSEQR can demand a larger bare minimum workspace size, but it will kind of be dependent on an ILAENV parameter, which is not great. Even if these remaining issues aren't fixed, I would at least like to see the fixes I have already described get commited.

by **nicholas3972** » Tue Dec 11, 2012 6:31 am

I ran a simple test code to call ilaver, and it returned 3.2.1. I believe the libraries are the default ones packaged by Ubuntu. I am wondering if anyone can verify this or else I hope someone can point what I am doing wrong, since the Lapack 3.2 release notes say that everything should be thread-safe (and I initialize dlamch/slamch before threads start).

------------------------------------------------------
estetik

by **CyLith** » Tue Dec 11, 2012 4:50 pm

Bevore version 3.3, the xLAMCH functions were not thread safe due to using statically allocated variables to cache their results. Since then, this has been fixed, and there are no explicitly statically allocated variables in lapack, so it should in theory be thread safe. The problem I have described is that some variables become statically allocated because of the compiler, due to their enormous size.

I am using a fresh install of Linux Mint 14, with Lapack 3.4.1 as returned by ilaver, and running the test code below still shows inconsistent results between runs.

by **mbanck** » Thu May 02, 2013 7:02 pm

I rebuilt lapack with -fopenmp and the problem went away at least for another OpenMP-enabled application. Probably not the right fix, though?

by **mbanck** » Thu May 02, 2013 7:28 pm

I recompiled lapack with -fopenmp and the problem went away, at least for a OpenMP-enabled application I saw it with. Not sure whether this is a proper fix, though?

by **rashaeg** » Mon Sep 30, 2013 12:37 pm

Julien Langou wrote:Awesome. So I guess that's it.

(I am using gfortran on MAC OS and I can not use frecursive with fmax-stack-var-size so I only used fmax-stack-var-size.)

I confirm what you wrote.
main.cpp is a correct multithreaded code that spawns two simultaneous execution of zgeev (on two distinct threads)
1) Compiling main.cpp as is => NO
2) Compiling main.cpp with -DGOOD option (=mutex protection of each zgeev call) => OK
3) Compiling zgehrd with -fmax-stack-var-size=66560 then linking with main.cpp => OK
4) Compiling zgehrd with -fmax-stack-var-size=66559 then linking with main.cpp => NO
5) Using your pacth on zgehrd then linking with main.cpp => OK
(NO: means that the code runs in two threads returns an unpredectiable uncorrect answer. Very fun to observe.)

Guilty lines of code in dgehrd.f:
Code: Select all
INTEGER NBMAX, LDT PARAMETER ( NBMAX = 64, LDT = NBMAX+1 ) [...] * .. * .. Local Arrays .. DOUBLE PRECISION T( LDT, NBMAX )

which is an allocation of an array on the stack of size 64*65*16 = 66560 bytes which is greater than 32768. (The current (4.5.0) default value for gfortran max-stack-var-size.) The LAPACK code is overflowing the maximum
size in bytes of the largest array that can be put on the stack with gfortran. Great.

So I will contact other lapackers on what best to do with the issue. We have several option to fix this: dynamic allocation, going with the workspace as you did, requesting a stack bigger than 66560 bytes ... (other options?). My preference would be to go with the workspace as you did.

I have just added the bug as bug#0061 on the LAPACK Errata webpage.

The problem is potentially not only zgehrd.f. I have checked the LAPACK codes (only ./SRC/), the potential problematic subroutines (allocation on the stack) can be identified by the string of characters: "Local Arrays".
Code: Select all
grep -H -A 4 -B 0 "Local Arrays" ./SRC/*f

This gives 311 routines. A brief look at them reveal that a lot of them allocate very small arrays on the stack (e.g., 4 DOUBLE PRECISION). With a threshold at 64 bytes, I found that they are 60 subroutines left. They are given below. It sounds that we are going to have a lots of fun.

A question is: what is the maximum size in bytes of the largest array that can be put on the stack with standard fortran compilers?

Cheers
--julien.

Code: Select all
======================================================================================= COMPLEX PRECISION ROUTINES ======================================================================================= cgbtrf.f:* .. Local Arrays .. cgbtrf.f- COMPLEX WORK13( LDWORK, NBMAX ), cgbtrf.f- $ WORK31( LDWORK, NBMAX ) ======================================================================================= cgehrd.f:* .. Local Arrays .. cgehrd.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= chseqr.f:* .. Local Arrays .. chseqr.f- COMPLEX HL( NL, NL ), WORKL( NL ) ======================================================================================= clarnv.f:* .. Local Arrays .. clarnv.f- REAL U( LV ) ======================================================================================= clatdf.f:* .. Local Arrays .. clatdf.f- REAL RWORK( MAXDIM ) clatdf.f- COMPLEX WORK( 4*MAXDIM ), XM( MAXDIM ), XP( MAXDIM ) ======================================================================================= cpbtrf.f:* .. Local Arrays .. cpbtrf.f- COMPLEX WORK( LDWORK, NBMAX ) ======================================================================================= ctgex2.f:* .. Local Arrays .. ctgex2.f- COMPLEX S( LDST, LDST ), T( LDST, LDST ), WORK( 8 ) ======================================================================================= ctgsy2.f:* .. Local Arrays .. ctgsy2.f- INTEGER IPIV( LDZ ), JPIV( LDZ ) ctgsy2.f- COMPLEX RHS( LDZ ), Z( LDZ, LDZ ) ======================================================================================= cunmlq.f:* .. Local Arrays .. cunmlq.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmql.f:* .. Local Arrays .. cunmql.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmqr.f:* .. Local Arrays .. cunmqr.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmrq.f:* .. Local Arrays .. cunmrq.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= cunmrz.f:* .. Local Arrays .. cunmrz.f- COMPLEX T( LDT, NBMAX ) ======================================================================================= ======================================================================================= REAL PRECISION ROUTINES ======================================================================================= dgbtrf.f:* .. Local Arrays .. dgbtrf.f- DOUBLE PRECISION WORK13( LDWORK, NBMAX ), dgbtrf.f- $ WORK31( LDWORK, NBMAX ) ======================================================================================= dgehrd.f:* .. Local Arrays .. dgehrd.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dhseqr.f:* .. Local Arrays .. dhseqr.f- DOUBLE PRECISION HL( NL, NL ), WORKL( NL ) ======================================================================================= dlaexc.f:* .. Local Arrays .. dlaexc.f- DOUBLE PRECISION D( LDD, 4 ), U( 3 ), U1( 3 ), U2( 3 ), dlaexc.f- $ X( LDX, 2 ) ======================================================================================= dlaln2.f:* .. Local Arrays .. dlaln2.f- LOGICAL RSWAP( 4 ), ZSWAP( 4 ) dlaln2.f- INTEGER IPIVOT( 4, 4 ) dlaln2.f- DOUBLE PRECISION CI( 2, 2 ), CIV( 4 ), CR( 2, 2 ), CRV( 4 ) ======================================================================================= dlarnv.f:* .. Local Arrays .. dlarnv.f- DOUBLE PRECISION U( LV ) ======================================================================================= dlaruv.f:* .. Local Arrays .. dlaruv.f- INTEGER MM( LV, 4 ) ======================================================================================= dlasrt.f:* .. Local Arrays .. dlasrt.f- INTEGER STACK( 2, 32 ) ======================================================================================= dlatdf.f:* .. Local Arrays .. dlatdf.f- INTEGER IWORK( MAXDIM ) dlatdf.f- DOUBLE PRECISION WORK( 4*MAXDIM ), XM( MAXDIM ), XP( MAXDIM ) ======================================================================================= dormlq.f:* .. Local Arrays .. dormlq.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormql.f:* .. Local Arrays .. dormql.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormqr.f:* .. Local Arrays .. dormqr.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormrq.f:* .. Local Arrays .. dormrq.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dormrz.f:* .. Local Arrays .. dormrz.f- DOUBLE PRECISION T( LDT, NBMAX ) ======================================================================================= dpbtrf.f:* .. Local Arrays .. dpbtrf.f- DOUBLE PRECISION WORK( LDWORK, NBMAX ) ======================================================================================= dtgex2.f:* .. Local Arrays .. dtgex2.f- INTEGER IWORK( LDST ) dtgex2.f- DOUBLE PRECISION AI( 2 ), AR( 2 ), BE( 2 ), IR( LDST, LDST ), dtgex2.f- $ IRCOP( LDST, LDST ), LI( LDST, LDST ), dtgex2.f- $ LICOP( LDST, LDST ), S( LDST, LDST ), ======================================================================================= dtgsy2.f:* .. Local Arrays .. dtgsy2.f- INTEGER IPIV( LDZ ), JPIV( LDZ ) dtgsy2.f- DOUBLE PRECISION RHS( LDZ ), Z( LDZ, LDZ ) =======================================================================================

that's amazing Julien , great info , thx alot

by **asiakon** » Mon Nov 11, 2013 11:43 am

CyLith wrote:I have tried recompiling just the file zgehrd.f with -frecursive as well as -fmax-stack-var-size=66560 and -fmax-stack-var-size=66559. With The first two flags, the test runs correctly, while the last one produces bad results since the size of T is 64*65*16 bytes. This was done on my Windows 7 machine with Mingw32 and gfortran 4.4.1. I hope this shows the problem clearly.

I have also modified zgehrd to expand the workspace query in the way I indicated and the patch is attached. I haven't thought through it completely to see if every case is handled correctly however. I only believe it to be correct. Essentially, I tack on the space for T at the end of the work array. I'm not sure the patch is really necessary since it seems like we should rely more on compiler flags for the right behavior. Unfortunately, I would guess that most package maintainers don't bother fiddling with the flags or are unaware of these issues.

Also, thanks for pointing out the new dlamch in svn. I didn't realize there were intrinsics for those things and I was about to hard code the constants in (as I probably should have in the first place). This should greatly reduce the number of (spurious) problems reported by valgrind!

LAPACK/ScaLAPACK Development

inconsistent results with zgeev+pthreads

inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Estetik

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Re: inconsistent results with zgeev+pthreads

Who is online