
Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 3:20 pm
by sgildea
I have written a simple program to factor and solve a banded matrix in ScaLAPACK using PDGBSV and PDGBTRS. Basically, I invoke PDGBSV, and then use the output of its internal call to PDGBTRF for additional calls to PDGBTRS. I have run the program with 1, 2, 4, and 8 processes (everything regarding the matrix and rhs kept the same) and find that the solve (PDGBTRS) portion does get faster as I use more processors (it scales approximately as P^-0.63), but the factorization step (PDGBTRF, called from PDGBSV) scales as P^3.0! I would like to run with more processors, but I am concerned about the factorization scaling... With one process, for this problem, the PDGBSV call takes about 2.6 seconds, but for 8 processes on the same matrix/rhs, I did the factorization once and it took nearly 40 minutes!

My sample problem has N = 40000 unknowns, with BWL and BWU both equal to 200 (this is emulating part of a much larger code, in which I am considering using PDGBSV to obtain the factorization and then repeatedly applying PDGBTRS to get an important quantity, on the order of several million times). I can live with a costly factorization because I only do that part once, but other parts of the code need to use a larger number of processors.
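For concreteness, here is a minimal sketch of this factor-once / solve-many pattern (illustrative C only, not my actual program). It assumes the BLACS process grid, the banded array descriptors, the local band storage, the pivot and auxiliary fill-in arrays, and the workspaces have already been set up as described in the PDGBTRF/PDGBTRS documentation; all names below are placeholders.

Code:
extern void pdgbtrf_(int *n, int *bwl, int *bwu, double *a, int *ja,
                     int *desca, int *ipiv, double *af, int *laf,
                     double *work, int *lwork, int *info);
extern void pdgbtrs_(char *trans, int *n, int *bwl, int *bwu, int *nrhs,
                     double *a, int *ja, int *desca, int *ipiv,
                     double *b, int *ib, int *descb, double *af, int *laf,
                     double *work, int *lwork, int *info);
/* Depending on the Fortran compiler, a hidden string-length argument may be
   expected for the character argument `trans`. */

void factor_once_solve_many(int n, int bwl, int bwu,
                            double *a, int ja, int *desca, int *ipiv,
                            double *af, int laf, double *work, int lwork,
                            double *b, int ib, int *descb, int nsolves)
{
    int info, nrhs = 1;
    char trans = 'N';

    /* Factor once: this is the expensive step whose scaling worries me. */
    pdgbtrf_(&n, &bwl, &bwu, a, &ja, desca, ipiv, af, &laf,
             work, &lwork, &info);
    if (info != 0) return;             /* real code would report the error */

    /* Reuse the factors (a, ipiv, af) for every subsequent right-hand side. */
    for (int k = 0; k < nsolves; ++k) {
        /* ... load the k-th right-hand side into the distributed vector b ... */
        pdgbtrs_(&trans, &n, &bwl, &bwu, &nrhs, a, &ja, desca, ipiv,
                 b, &ib, descb, af, &laf, work, &lwork, &info);
        if (info != 0) return;
        /* ... the solution now overwrites b ... */
    }
}

The point is simply that PDGBTRF runs once while PDGBTRS runs millions of times, which is why I can tolerate a slow factorization but not one that gets dramatically slower with more processors.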

My actual question is:
Is the scaling I'm seeing inherent to PDGBTRF, or can I fix it by changing how I compile my code, or by something else I haven't thought of? If you need more information about the environment I am using the code in, or the code itself, I'd be happy to provide it. What have others with more experience found when using PDGBTRF?

Thank you,

Steve

Re: Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 6:59 pm
by sgildea
After discussing things with a friend of mine, he suggested I try setting the environment variable OMP_NUM_THREADS to 1. With this in place, the time for the factorization (PDGBTRF) with one processor increased by about a factor of ten, and it doubled when run with 2 processors, but it then decreased as I used more processors, all the way up to 80 (scaling as P^-0.5). However, the matrix I was using as a test case probably wasn't big enough to see strong benefits from more processors at that point, because N/P was approaching the same order as my matrix bandwidth.

With OMP_NUM_THREADS at 1, the solve calls (PDGBTRS) scaled well enough (~P^-0.6 or so), just as before.

In summary, setting OMP_NUM_THREADS to 1 completely solved the performance issue I was seeing. If I am able to see a benefit from OpenMP with different settings, I will be sure to update.

Re: Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 7:14 pm
by Julien Langou
Great! Sorry for not spotting this on our end. It's a classic. Yes, you absolutely need to set OMP_NUM_THREADS to 1 for ScaLAPACK if you run your application with one MPI process per core. Otherwise the BLAS will (in general) spawn as many threads as there are cores available on a node. So if you have a sixteen-core node, your application spawns sixteen MPI processes (that's you), and each MPI process spawns sixteen threads (that's the BLAS). This is a recipe for performance disaster.

(Note: it would also make sense to have four MPI processes per node, each spawning four threads at the BLAS level. You would do this with the appropriate mpirun -np xxxx and then set OMP_NUM_THREADS to 4. In general, OMP_NUM_THREADS set to 1 is close to best, so it is not worth worrying about.) (Note: OMP_NUM_THREADS takes care of most of the BLAS libraries, but not all; some have their own environment variable to control the number of threads they run with.)
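To make the oversubscription visible, a tiny stand-alone check like the following can help (illustrative only; it assumes an MPI + OpenMP build and a BLAS that honors OMP_NUM_THREADS). Launch it with the same mpirun line and the same environment as the real application.

Code:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* omp_get_max_threads() reflects OMP_NUM_THREADS; a threaded BLAS that
       honors that variable will typically use this many threads per rank. */
    int nthreads = omp_get_max_threads();
    printf("rank %d of %d: up to %d OpenMP/BLAS threads\n",
           rank, nprocs, nthreads);

    MPI_Finalize();
    return 0;
}

With sixteen ranks on a sixteen-core node and OMP_NUM_THREADS left unset, you would typically see sixteen threads per rank, i.e. 16 x 16 = 256 threads competing for 16 cores; with OMP_NUM_THREADS=1 it drops back to one thread per core.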
Cheers, Julien.