I'm currently working on the optimization of the assembly algorithm for using scalapack on a large number of processors (About 1000-10000 or more) (And more than 400 000 degrees of freedom for the problem).
The actual algorithm we are using is not scalable, the time will even increase with the number of processors after a certain size.
A simplified version of this algorithm is :
- Code: Select all
/* ne is the total number of elements */
DO first_element = 1, ne, step
DO ie = first_element, first_element + step
DO je = ie, ne
mat_elem = Build_the_elementary_matrix(ie,je)
DO ddlie = 1, nb_ddl(ie)
I = ddls(ddlie, ie)
DO ddlje = 1, nb_ddl(je)
J = ddls(ddlje, je)
ie_lines(compute_idx(I,J)) = aCoef * mat_elem(I,J)
ie_columns(compute_idx(J,I)) = anothercoef * mat_elem(I,J)
END DO
END DO
END DO
END DO
DO ie = first_element, first_element + step
DO p = 1, nb_proc
SEND_DATA_TO(P)
RECV_DATA_FROM(nb_proc+1-P)
ADD_DATA_TO_LOCAL_MTX
END DO
END DO
END DO
I have been working on an other algorithm using no communication.
- Code: Select all
! We construct 2 arrays which will say if an element impact local lines or local columns
islocal_lig, islocal_col = computations
DO ie = 1, ne
IF (.not. islocal_lig(ie)) CYCLE
Do je 1, ne
IF (.not. islocal_col(je)) CYCLE
IF (je > ie .OR. (.NOT. (isloca_lig(je) .AND. islocal_col(ie))) THEN
mat_elem = Build_the_elementary_matrix(ie,je)
DO ddlie = 1, nb_ddl(ie)
I = ddls(ddlie, ie)
DO ddlje = 1, nb_ddl(je)
J = ddls(ddlje, je)
if (local(I,J)) matrix(compute_idx(I,J)) = aCoef * mat_elem(I,J)
if (local(J,I) matrix(compute_idx(J,I)) = anothercoef * mat_elem(I,J)
END DO
END DO
END IF
END DO
END DO
This one is scalable but the problem is that it computes about 4 times more (3x3) elementary matrix, and so it cost about 4 times much on little number of processors...
The solution could be to switch between the two algorithm in function of the size of the problem and the number of processors but do you know if I could found somewhere documentation on a existing scalable assembly for scalapack.
An other solution i'm thinking of is to choose for each elementary matrix, one of the processors which need it to compute it and communicate to the others.
Thanks for any idea or comments.
Xavier Lacoste.

