
Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 3:20 pm
by sgildea
I have written a simple program to factor and solve a banded matrix in ScaLAPACK using PDGBSV and PDGBTRS. Basically, I invoke PDGBSV, and then use the output of its internal call to PDGBTRF for additional calls to PDGBTRS. I have run the program with 1, 2, 4, and 8 processes (everything regarding the matrix and rhs kept the same) and find that the solve (PDGBTRS) portion does get faster as I use more processors (it scales approximately as P^-0.63), but the factorization step (PDGBTRF, called from PDGBSV) scales as P^3.0! I would like to run with more processors, but I am concerned about the factorization scaling... With one process, for this problem, the PDGBSV call takes about 2.6 seconds, but for 8 processes on the same matrix/rhs, I did the factorization once and it took nearly 40 minutes!

My sample problem has N = 40000 unknowns, with BWL and BWU both equal to 200 (this is emulating part of a much larger code, in which I am considering using PDGBSV to obtain the factorization and then repeatedly applying PDGBTRS to get an important quantity, on the order of several million times). I can live with a costly factorization because I only do that part once, but other parts of the code need to use a larger number of processors.
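For concreteness, here is a minimal sketch of this factor-once / solve-many pattern (illustrative C only, not my actual program). It assumes the BLACS process grid, the banded array descriptors, the local band storage, the pivot and auxiliary fill-in arrays, and the workspaces have already been set up as described in the PDGBTRF/PDGBTRS documentation; all names below are placeholders.

Code:
extern void pdgbtrf_(int *n, int *bwl, int *bwu, double *a, int *ja,
                     int *desca, int *ipiv, double *af, int *laf,
                     double *work, int *lwork, int *info);
extern void pdgbtrs_(char *trans, int *n, int *bwl, int *bwu, int *nrhs,
                     double *a, int *ja, int *desca, int *ipiv,
                     double *b, int *ib, int *descb, double *af, int *laf,
                     double *work, int *lwork, int *info);
/* Depending on the Fortran compiler, a hidden string-length argument may be
   expected for the character argument `trans`. */

void factor_once_solve_many(int n, int bwl, int bwu,
                            double *a, int ja, int *desca, int *ipiv,
                            double *af, int laf, double *work, int lwork,
                            double *b, int ib, int *descb, int nsolves)
{
    int info, nrhs = 1;
    char trans = 'N';

    /* Factor once: this is the expensive step whose scaling worries me. */
    pdgbtrf_(&n, &bwl, &bwu, a, &ja, desca, ipiv, af, &laf,
             work, &lwork, &info);
    if (info != 0) return;             /* real code would report the error */

    /* Reuse the factors (a, ipiv, af) for every subsequent right-hand side. */
    for (int k = 0; k < nsolves; ++k) {
        /* ... load the k-th right-hand side into the distributed vector b ... */
        pdgbtrs_(&trans, &n, &bwl, &bwu, &nrhs, a, &ja, desca, ipiv,
                 b, &ib, descb, af, &laf, work, &lwork, &info);
        if (info != 0) return;
        /* ... the solution now overwrites b ... */
    }
}

The point is simply that PDGBTRF runs once while PDGBTRS runs millions of times, which is why I can tolerate a slow factorization but not one that gets dramatically slower with more processors.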

My actual question is:
Is the scaling I'm seeing inherent to PDGBTRF, or can I fix it by changing how I compile my code, or by something else I haven't thought of? If you need more information about the environment I am using the code in, or the code itself, I'd be happy to provide it. What have others with more experience found when using PDGBTRF?

Thank you,

Steve

Re: Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 6:59 pm
by sgildea
After discussing things with a friend of mine, he suggested I try setting the environment variable OMP_NUM_THREADS to 1. With this in place, the time for the factorization (PDGBTRF) with one processor increased by about a factor of ten, and it doubled when run with 2 processors, but it then decreased as I used more processors, all the way up to 80 (scaling as P^-0.5). However, the matrix I was using as a test case probably wasn't big enough to see strong benefits from more processors at that point, because N/P was approaching the same order as my matrix bandwidth.

With OMP_NUM_THREADS at 1, the solve calls (PDGBTRS) scaled well enough (~P^-0.6 or so), just as before.

In summary, setting OMP_NUM_THREADS to 1 completely solved the performance issue I was seeing. If I am able to see a benefit from OpenMP with different settings, I will be sure to update.

Re: Scaling parallel, banded LU decomposition and solve

Posted: Tue May 29, 2012 7:14 pm
by Julien Langou
Great! Sorry for not spotting this on our end. It's a classic. Yes, you absolutely need to set OMP_NUM_THREADS to 1 for ScaLAPACK if you run your application with one MPI process per core. Otherwise the BLAS will (in general) spawn as many threads as there are cores available on a node. So if you have a sixteen-core node, your application spawns sixteen MPI processes (that's you), and each MPI process spawns sixteen threads (that's the BLAS). This is a recipe for performance disaster.

(Note: it would also make sense to have four MPI processes per node, each spawning four threads at the BLAS level. You would do this with the appropriate mpirun -np xxxx and then set OMP_NUM_THREADS to 4. In general, OMP_NUM_THREADS set to 1 is close to best, so it is not worth worrying about.) (Note: OMP_NUM_THREADS takes care of most of the BLAS libraries, but not all; some have their own environment variable to control the number of threads they run with.)
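To make the oversubscription visible, a tiny stand-alone check like the following can help (illustrative only; it assumes an MPI + OpenMP build and a BLAS that honors OMP_NUM_THREADS). Launch it with the same mpirun line and the same environment as the real application.

Code:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* omp_get_max_threads() reflects OMP_NUM_THREADS; a threaded BLAS that
       honors that variable will typically use this many threads per rank. */
    int nthreads = omp_get_max_threads();
    printf("rank %d of %d: up to %d OpenMP/BLAS threads\n",
           rank, nprocs, nthreads);

    MPI_Finalize();
    return 0;
}

With sixteen ranks on a sixteen-core node and OMP_NUM_THREADS left unset, you would typically see sixteen threads per rank, i.e. 16 x 16 = 256 threads competing for 16 cores; with OMP_NUM_THREADS=1 it drops back to one thread per core.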
Cheers, Julien.