LAPACK/ScaLAPACK Development

by **dontpullajeff** » Tue Nov 13, 2012 2:11 pm

Hello, first time poster here.

My work includes finding the QR decomposition of a p-cyclic matrix M. It is square, L blocks high, L blocks wide, and each block is N by N. Each block column has two nonzero blocks on and directly below the diagonal, like so

Code: Select all: M11 M1L M21 M22 M32 M33 ... ... MLL-1 MLL

I am using a Block Orthogonal Factorization method to find the QR but want to compare it to DGEQRF in terms of time and speed. The following code shows this:

Code: Select all: double * getBlockStart(double * A, int N, int L, int i, int j) { return A + N * L * N * (j-1) + N * (i-1); } void bruteQR(double * M, int N) { double * tau = new double[N]; double ini; int lwork = -1, info; dgeqrf_(&N, &N, M, &N, tau, &ini, &lwork, &info); lwork = ini; double * work = new double[lwork]; dgeqrf_(&N, &N, M, &N, tau, work, &lwork, &info); free(work); free(tau); } void bsof(double * M, int N, int L, double * taus) { int NL = N*L, N2 = 2 * N; double ini; int lwork = -1, info; //Obtain the optimal value for lwork, does no work dgeqrf_(&N2, &N, getBlockStart(M, N, L, 1, 1), &NL, taus, &ini, &lwork, &info); lwork = ini; double * work = new double[lwork]; for(int i = 1; i <= L-2; i++) { dgeqrf_(&N2, &N, getBlockStart(M, N, L, i, i), &NL, taus+((i-1)*N), work, &lwork, &info); dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,i+1),&NL,work,&lwork,&info); dormqr_("L","T",&N2,&N,&N,getBlockStart(M,N,L,i,i),&NL,taus+((i-1)*N),getBlockStart(M,N,L,i,L),&NL,work,&lwork,&info); } free(work); lwork = -1; dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), &ini, &lwork, &info); lwork = ini; double * work2 = new double[lwork]; dgeqrf_(&N2, &N2, getBlockStart(M, N, L, L-1, L-1), &NL, taus+(NL-2*N), work2, &lwork, &info); free(work2); }

Now my problem is that although BSOF always has a better timing than DGEQRF, as L increases and the amount of zeroes in M increases, DGEQRF gets much much faster in terms of GFlop/s. Below are results from a test where the size of M is constant at 10,000 and L grows as N decreases. If DGEQRF were unaffected by structure, its speed and execution time would be the same for each test, but it is not. So my question is why is DGEQRF going so fast? My theory is that there is some heuristic which skips some amount of flops when it sees some formation of zeroes, making my flop count incorrect and leading to a bad GFlop rate. But I have no idea how/where this is being done.

Code: Select all: N L QRF t QRF GF BSOF t BSOF GF 2500 4 5.97 223.53 5.93 108.95 1250 8 3.24 411.16 2.11 94.88 1000 10 2.82 473.48 1.45 92.05 500 20 2.06 647.74 0.48 74.51 400 25 1.89 704.28 0.37 63.74 250 40 1.69 787.52 0.17 53.88 200 50 1.6 832.28 0.13 44.94 125 80 1.51 884.11 0.09 26.06 100 100 1.45 917.95 0.08 20.12 80 125 1.38 967.95 0.07 14.78 50 200 1.35 989.34 0.05 8.08 40 250 1.31 1014.52 0.04 6.48 25 400 1.3 1022.98 0.03 3.16 20 500 1.29 1035.39 0.03 2.05 10 1000 1.28 1044.39 0.01 1.23 8 1250 1.27 1046.45 0.01 0.92 4 2500 1.25 1064.49 0.01 0.28

Thanks!

by **CyLith** » Thu Nov 15, 2012 2:32 am

Due to the large size of your matrices, DGEQRF is using the blocked code path, where instead of computing the QR factorization one column at a time, it computes a block of columns at a time. The steps in the block factorization can be seen in http://www.netlib.org/lapack/double/dgeqrf.f, and there are 3 main steps: (1) compute a QR factorization of the block of columns (tall skinny matrix), with Q represented as single column Householder reflections, (2) Combine the individual reflectors into a block reflector with a diagonal block T (in routine DLARFT). The details of the algorithm are not at all obvious from the code, and there does not seem to be much literature on the subject (I myself am interested in knowing more about it, but I have found this paper: http://www.sciencedirect.com/science/article/pii/0045782576900323. Finally (3), the block reflector is applied.

Throughout this process, there are many optimizations performed which exploit the zero structure of the intermediate factors. In DLARFT, the code checks to see if any Tau's are zero (the scale factor of the Householder reflector), and if so, saves some time on filling in T. This will occur when the diagonal blocks are low rank, but only saves a small amount of time since the size of the diagonal block is relatively small. In DLARFB, the code checks for reflectors with trailing zeros, and saves time there. This happens if your matrix was roughly upper triangular to begin with so that the panel QR in step (1) has trailing zeros, which is true for you, thus saving a lot of time. So in this sense, what DGEQRF is doing is not a whole lot different from your BSOF routine, except it has to explicitly check for zero structure. In the limit of large block sizes, the trailing zero estimate returned by ILADLR becomes exactly correct and makes the amount of work between the two methods roughly the same.

by **ptomulik** » Wed Mar 20, 2013 1:17 pm

Hi,

here is a link to (publicly available) paper by Schreiber and Van Loan ('87)
regarding the theory of blocked I - V*T*V**T representation of QR factorization:

http://ecommons.library.cornell.edu/handle/1813/6704

Hope this helps.

Regards!
--
Pawel Tomulik

LAPACK/ScaLAPACK Development

DGEQRF performance on block p-cyclic matrix

DGEQRF performance on block p-cyclic matrix

Re: DGEQRF performance on block p-cyclic matrix

Re: DGEQRF performance on block p-cyclic matrix

Who is online