Scaling parallel, banded LU decomposition and solve
I have written a simple program to factor and solve a banded matrix in ScaLAPACK using PDGBTSV and PDGBTRS. Basically, I invoke PDGBTSV, and then use the output from the internal call of PDGBTRF for additional calls to PDGBTRS. I have run the program with 1, 2, 4, and 8 processes (everything regarding the matrix and rhs kept the same) and find that the solve (PDGBTRS) portion does get faster as I increase use more processors(scales approx. as P^-.63), but the factorization step (PGBTRF called from PDGBSV) scales as P^3.0 ! I would like to run with more processors, but I am concerned about the factorization scaling...With one process, for this problem, the PDGBSV call takes about 2.6 seconds, but for 8 processes on the same matrix/rhs, I did the factorization once and it took nearly 40 minutes!
My sample problem as N=40000 unknowns, with BWL and BWU both equal to 200 (this is emulating part of a much larger code, and I am considering PDGBSV to obtain the factorization and then repeatedly apply PDGBTRS to get an important quantity, on the order of several million times). I can live with a costly factorization b/c I only do that part once, but other parts of the code need to use a larger number of processors.
My actual question is:
Is the scaling I'm seeing inherent to PDGBTRF or can I fix this by changing how I compile my code, or something else I haven't thought of? If you need more information about the environment I am using the code in, or the code itself, I'd be happy to provide them. What have others with more experience found when using PDGBTRF?
Thank you,
Steve
My sample problem as N=40000 unknowns, with BWL and BWU both equal to 200 (this is emulating part of a much larger code, and I am considering PDGBSV to obtain the factorization and then repeatedly apply PDGBTRS to get an important quantity, on the order of several million times). I can live with a costly factorization b/c I only do that part once, but other parts of the code need to use a larger number of processors.
My actual question is:
Is the scaling I'm seeing inherent to PDGBTRF or can I fix this by changing how I compile my code, or something else I haven't thought of? If you need more information about the environment I am using the code in, or the code itself, I'd be happy to provide them. What have others with more experience found when using PDGBTRF?
Thank you,
Steve