
Potential deadlock (RECV/BCAST mismatch) in PZGETRF/PZAMAX?


Post by jakobus » Thu Jun 12, 2008 5:03 am

Hi,

we have a problem where, for large runs (typically more than 10
parallel processes and a full dense matrix of size typically larger
than about 50 000 unknowns), we get errors on certain clusters inside
the PZGETRF() and PCGETRF() calls of ScaLAPACK. The runs then just
stop after a few hours with this message:

[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(179)..............................: MPI_Recv(buf=0x2b0ac70386b8,
count=1, dtype=USER<vector>, src=6, tag=9976, comm=0x84000001,
status=0x2431a70) failed
MPIDI_CH3_Progress(904)....................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(1016):
MPIDU_Socki_handle_pollhup(418)............: connection closed by peer
(set=0,sock=11)
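
For context, the following is a minimal toy driver showing roughly how
we call PZGETRF. This is a self-contained sketch put together for this
post, not our actual code: the matrix size, block size, grid choice and
random fill are made up for illustration, and it assumes the usual
Fortran ScaLAPACK/BLACS symbols with trailing underscores plus the
Cblacs_* C wrappers (link against ScaLAPACK, BLACS, BLAS/LAPACK and MPI).

/* Minimal toy PZGETRF driver (sketch only; toy values -- the failures
   we see only show up for much larger matrices and process counts). */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                      int *icsrc, int *ictxt, int *lld, int *info);
extern void pzgetrf_(int *m, int *n, double complex *a, int *ia, int *ja,
                     int *desca, int *ipiv, int *info);

int main(int argc, char **argv)
{
    int n = 10000, nb = 64;              /* global order and block size (toy values) */
    int izero = 0, ione = 1, info;
    int iam, nprocs, ictxt, nprow, npcol, myrow, mycol;

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&iam, &nprocs);

    /* as square a process grid as possible */
    nprow = (int)sqrt((double)nprocs);
    while (nprocs % nprow != 0) nprow--;
    npcol = nprocs / nprow;

    Cblacs_get(-1, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row-major", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* local extents of the block-cyclically distributed N x N matrix */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = mloc > 1 ? mloc : 1;

    int desca[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    double complex *a   = malloc((size_t)mloc * nloc * sizeof *a);
    int            *ipiv = malloc((size_t)(mloc + nb) * sizeof *ipiv);

    /* arbitrary pseudo-random fill, just to get an (almost surely)
       non-singular dense complex matrix to factorise */
    srand(1234 + iam);
    for (long k = 0; k < (long)mloc * nloc; k++)
        a[k] = (rand() / (double)RAND_MAX - 0.5)
             + (rand() / (double)RAND_MAX - 0.5) * I;

    /* in-place LU factorisation with partial pivoting */
    pzgetrf_(&n, &n, a, &ione, &ione, desca, ipiv, &info);
    if (iam == 0) printf("pzgetrf info = %d\n", info);

    free(a); free(ipiv);
    Cblacs_gridexit(ictxt);
    MPI_Finalize();
    return 0;
}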

Several observations in this regard:

- This error is not deterministically reproducible. Sometimes runs
finish fine, sometimes the error occurs early in the solution,
sometimes later. Typically, the larger the problem or the more
parallel processes are used, the higher the chance of hitting this
error.

- This error has so far been regularly observed on three clusters,
one IA32/Linux and two EM64T/Linux, running different Linux
distributions; all of them use Gigabit Ethernet, though.

- The error occurs both with MPICH2 and with Intel MPI, although the
chance of triggering it is much higher with Intel MPI than with
MPICH2 (but with MPICH2, too, when using say 92 parallel processes
we get this error for a large matrix, here 120 000 unknowns,
basically all the time; only the exact time at which it occurs
differs).

- The error is present both in ScaLAPACK 1.7 and in ScaLAPACK 1.8,
and we also get it if we use the ScaLAPACK and PBLACS shipped with
the Intel ClusterMKL library (i.e. we don't compile our own). MKL
10.0 was used here.

- Not sure if this is related, but if we run the Intel MPI Message
Checker on PCGETRF, it reports the following error for essentially
every call (even when just obtaining the LU decomposition of a small
10x10 matrix with only 2 parallel processes):

[0] ERROR: no progress observed in any process for over 1:00 minutes, aborting application
[0] WARNING: starting premature shutdown
[0] WARNING: proceeding although 1 out of 2 processes are in an unknown state

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR: Application aborted because no progress was observed for over 1:00 minutes,
[0] ERROR: check for real deadlock (cycle of processes waiting for data) or
[0] ERROR: potential deadlock (processes sending data to each other and getting blocked
[0] ERROR: because the MPI might wait for the corresponding receive).
[0] ERROR: [0] no progress observed for over 1:00 minutes, process is currently in MPI call:
[0] ERROR: MPI_Recv(*buf=0xbfffdea0, count=1, datatype=0xcc010000, source=1, tag=9976, comm=0xffffffff84000004 DUP CREATE COMM_WORLD [0:1], *status=0x9c10030)
[0] ERROR: BI_Srecv
[0] ERROR: Ccgerv2d
[0] ERROR: pcamax_
[0] ERROR: pcgetf2_
[0] ERROR: pcgetrf_
[0] ERROR: scalagau_
[0] ERROR: bestx_
[0] ERROR: MAIN__
[0] ERROR: main
[0] ERROR: __libc_start_main (/lib/i686/libc.so.6)
[0] ERROR: [1] no progress observed for over 1:00 minutes, process is currently in MPI call:
[0] ERROR: MPI_Bcast(*buffer=0xbfffde90, count=1, datatype=0xcc000000, root=0, comm=0xffffffffc4000000 SPLIT CREATE COMM_WORLD [0:1])
[0] ERROR: Ccgebr2d
[0] ERROR: pcamax_
[0] ERROR: pcgetf2_
[0] ERROR: pcgetrf_
[0] ERROR: scalagau_
[0] ERROR: bestx_
[0] ERROR: MAIN__
[0] ERROR: main
[0] ERROR: __libc_start_main (/lib/i686/libc.so.6)

[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

- This output of the Intel Message Checker suggests that something
inside PCAMAX may not be right. It works for small matrices and
small numbers of parallel processes, but when trying to solve
matrices larger than, say, 50 000 unknowns on big clusters there
might be timing issues which then trigger this deadlock? (A minimal
illustration of the kind of mismatch we suspect is sketched after
these observations.)

- Also interesting is that on Myrinet / InfiniBand clusters as well as
on parallel machines like the SGI Altix (using NumaFlex) we have
never seen this error, only on Gigabit Ethernet. This might also
point to some timing-related problem inside PCAMAX?

- For reference, the BLACS tester works fine.

- Also, we ran the ScaLAPACK Level 1 PBLAS tester xcpblas1tst (which
includes PCAMAX) inside the Intel MPI Message Checker. This,
however, does not report any possible deadlock.
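
To make the kind of mismatch we suspect more concrete, here is a tiny,
purely illustrative MPI example of the pattern the checker reports as a
hard deadlock: one rank blocks in a point-to-point MPI_Recv while its
peer blocks in an MPI_Bcast that cannot complete until the first rank,
the broadcast root, enters it. This is deliberately NOT taken from the
PCAMAX/BLACS internals; it only reproduces the shape of the Recv-vs-Bcast
cycle shown in the report above.

/* Illustrative only: run with exactly 2 processes (mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* blocks waiting for a message rank 1 only sends after its Bcast ... */
        MPI_Recv(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ... and only afterwards enters the broadcast as root */
        MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks in the broadcast, which cannot finish before the root joins ... */
        MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
        /* ... so this send is never posted: a cycle, i.e. a hard deadlock */
        MPI_Send(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

In the real trace the two calls are of course on different BLACS-scoped
communicators, and presumably the ordering is intended to be correct;
whether PCAMAX can actually end up in such a cycle under some timing
condition, or whether the checker report is a false positive, is exactly
what we are unsure about.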

Any help / suggestions would be highly appreciated.

Thanks,

Ulrich
