MPI+MAGMA
Posted: Fri Mar 20, 2020 7:03 am
Dear all,
I'm currently trying to build an MPI/MAGMA code for some numerical computations. I have 2 GPUs per node and am currently testing on three nodes. The following problem occurs at matrix size 3126:
I call magma_dsyevd_m with MagmaNoVec as argument. The returned eigenvalues are always NaNs. The matrix can be diagonalised, and the algorithm works if executed on just one node without MPI. I suspect the error is due to wrongly set visible devices; this would also explain why it works only up to a certain matrix size. CUDA_VISIBLE_DEVICES is set in the environment and reads back as 0,1 on all nodes. How are the GPUs counted? Is this system-wide or per node?
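For reference, here is a minimal sketch of how I invoke the routine, using the usual two-phase pattern (workspace query, then computation). The helper name eig_values and the choice of MagmaLower are my own; the magma_dsyevd_m arguments follow MAGMA's CPU-interface multi-GPU driver:

```c
#include <stdlib.h>
#include "magma_v2.h"

/* Sketch: eigenvalues only (MagmaNoVec) of an n x n symmetric matrix A
 * via the multi-GPU driver magma_dsyevd_m.  ngpu is the number of GPUs
 * visible to this process (i.e. after CUDA_VISIBLE_DEVICES filtering),
 * which I assume is counted per node, not cluster-wide. */
magma_int_t eig_values(magma_int_t ngpu, magma_int_t n, double *A, double *w)
{
    magma_int_t info, lda = n;
    double      lwork_query;
    magma_int_t liwork_query;

    /* Workspace query: lwork = liwork = -1 returns optimal sizes. */
    magma_dsyevd_m(ngpu, MagmaNoVec, MagmaLower, n, A, lda, w,
                   &lwork_query, -1, &liwork_query, -1, &info);

    magma_int_t lwork  = (magma_int_t) lwork_query;
    magma_int_t liwork = liwork_query;
    double      *work  = malloc(lwork  * sizeof(double));
    magma_int_t *iwork = malloc(liwork * sizeof(magma_int_t));

    /* Actual computation; eigenvalues are returned in w. */
    magma_dsyevd_m(ngpu, MagmaNoVec, MagmaLower, n, A, lda, w,
                   work, lwork, iwork, liwork, &info);

    free(work);
    free(iwork);
    return info;  /* info != 0 would signal a failure before any NaN check */
}
```

In my runs info comes back 0, yet w is filled with NaNs once n reaches 3126.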
Greetings,
Jonas