
  -- Innovative Computing Laboratory
  -- Computer Science Department
  -- University of Tennessee
  -- (C) Copyright 2006

  Purpose:

    This code demonstrates the performance advantages of the mixed-precision
    iterative refinement approach to the solution of linear systems of
    equations. The algorithm takes advantage of the higher performance of
    single precision calculations over double precision. This approach is
    motivated by the fact that on many modern architectures single precision
    operations are twice as fast as double precision, and on the Cell Broadband
    Engine they are an order of magnitude faster.

  Method:

    The code solves an SPD (symmetric positive definite) system of linear
    equations Ax=b using Cholesky factorization in single precision arithmetic
    and iteratively refines the solution to achieve full 64-bit double
    precision accuracy. More information is available at the URL:
    http://icl.cs.utk.edu/iter-ref/.

    The input of the algorithm is the coefficient matrix A in double precision
    in column major format and the right hand side vector b in double precision.
    The output of the algorithm is the solution vector x in double precision.
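
    The method described above can be sketched in a few lines. The following
    is a minimal illustration, not the Cell implementation: it uses plain
    nested lists instead of the block layout, emulates single precision by
    rounding every intermediate result with struct, and assumes a lower
    triangular factor L with A = L L^T. All function names (f32, cholesky_f32,
    solve_f32, residual, refine) are hypothetical.

```python
import math
import struct


def f32(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cholesky_f32(a):
    """Lower Cholesky factor of an SPD matrix, every operation rounded to single."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = f32(a[j][j])
        for k in range(j):
            s = f32(s - f32(l[j][k] * l[j][k]))
        l[j][j] = f32(math.sqrt(s))
        for i in range(j + 1, n):
            s = f32(a[i][j])
            for k in range(j):
                s = f32(s - f32(l[i][k] * l[j][k]))
            l[i][j] = f32(s / l[j][j])
    return l


def solve_f32(l, r):
    """Solve L L^T z = r by forward and back substitution in single precision."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        s = f32(r[i])
        for k in range(i):
            s = f32(s - f32(l[i][k] * y[k]))
        y[i] = f32(s / l[i][i])
    z = [0.0] * n
    for i in reversed(range(n)):
        s = y[i]
        for k in range(i + 1, n):
            s = f32(s - f32(l[k][i] * z[k]))
        z[i] = f32(s / l[i][i])
    return z


def residual(a, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, iterations):
    """Single precision Cholesky solve followed by double precision refinement."""
    l = cholesky_f32(a)          # factorization done once, in single precision
    x = solve_f32(l, b)          # initial solution, single precision accuracy
    for _ in range(iterations):
        r = residual(a, x, b)    # residual in double precision
        z = solve_f32(l, r)      # correction reuses the single precision factor
        x = [xi + zi for xi, zi in zip(x, z)]  # update in double precision
    return x
```

    The expensive factorization is done once in single precision; each
    refinement step only needs a residual and two triangular solves.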

    The algorithm requires a copy of the coefficient matrix in both single and
    double precision. To achieve the desired performance, row major block data
    layout is used, with 64 x 64 blocks in single precision, and 32 x 32 blocks
    in double precision. The algorithm is performed entirely on the block data
    representation.
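
    The translation from column major to block layout can be illustrated as
    follows. This is a sketch under an assumption the text does not spell out:
    the blocks are stored one after another in row major order, and the
    elements inside each block are also in row major order. The function name
    and its parameters are hypothetical.

```python
def col_major_to_block_row_major(a, n, nb):
    """Convert an n x n column major array to a block layout with nb x nb blocks.

    Assumed layout: blocks in row major order, elements inside each block
    in row major order as well.
    """
    if n % nb != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    blocks = n // nb
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            bi, ii = divmod(i, nb)  # block row, row inside the block
            bj, jj = divmod(j, nb)  # block column, column inside the block
            dst = (bi * blocks + bj) * nb * nb + ii * nb + jj
            out[dst] = a[j * n + i]  # column major: element (i, j) is at j*n + i
    return out
```

    With nb = 64 this corresponds to the single precision layout, and with
    nb = 32 to the double precision layout.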

  Restrictions:

    In general the algorithm can be used to solve systems with many right hand
    sides. This particular implementation for the Cell Broadband Engine solves
    a system with a single right hand side. Due to the representation of the
    coefficient matrix in block layout, the algorithm only allows for systems
    whose size is a multiple of 64. The code requires the use of huge pages
    (16MB). In some situations this method will fail where performing the
    solve entirely in double precision would succeed. This may happen if the
    condition number of the coefficient matrix is too high.

  Usage:

    iter_ref_cholesky <dimension in blocks> <number of SPUs>
        <log_10 of the condition number> <number of refinement iterations>

    For instance the call:

        iter_ref_cholesky 10 8 2 3

    solves Ax = b on 8 SPEs, where A is a 640 x 640 SPD matrix. The condition
    number of A is 10^2. Three iterations are used to refine the solution.
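
    The mapping from the four command line arguments to the problem
    parameters can be sketched as below. This is only an illustration of the
    arithmetic in the example above; parse_args and the dictionary keys are
    hypothetical and do not describe the program's actual argument handling.

```python
BLOCK = 64  # single precision block size stated in the text


def parse_args(argv):
    """Map the four positional arguments to named problem parameters."""
    if len(argv) != 4:
        raise ValueError("expected: <blocks> <SPUs> <log10 cond> <iterations>")
    blocks = int(argv[0])
    spus = int(argv[1])
    log_cond = float(argv[2])
    iterations = int(argv[3])
    return {
        "n": blocks * BLOCK,                   # matrix dimension
        "spus": spus,                          # number of SPUs to use
        "condition_number": 10.0 ** log_cond,  # argument is log_10 of cond(A)
        "iterations": iterations,              # refinement iterations
    }
```

    For the arguments 10 8 2 3 this gives a dimension of 10 * 64 = 640 and a
    condition number of 100.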

  Output:

    The program prints the time for all the constituent routines in
    microseconds, as well as performance in Gflop/s for the routines where it
    is appropriate. The program also prints the norm-wise backward error of the
    final solution, as well as the component-wise backward error. Most routines
    are reported under the names used in the LAPACK and BLAS libraries.
    The others are:

        d2s   - translation of the coefficient matrix A from double precision to
                single precision,

        l2b   - translation of the coefficient matrix A in single precision from
                standard column major data layout to row major block data layout
                with blocks of 64 x 64,

        l2b_d - translation of the coefficient matrix A in double precision from
                standard column major data layout to row major block data layout
                with blocks of 32 x 32.
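
    The effect of the d2s step on individual values can be illustrated with
    the standard array module, whose 'f' type code stores single precision
    floats. This is only a sketch of the rounding involved; the name d2s is
    reused here for a hypothetical helper working on a flat list.

```python
from array import array


def d2s(values):
    """Copy double precision values into a single precision array (rounds each)."""
    return array("f", values)


single = d2s([0.5, 0.1])
# 0.5 is exactly representable in single precision, 0.1 is not,
# so the second entry differs slightly from the double precision 0.1.
rounding_error = abs(single[1] - 0.1)
```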

    Also, an execution trace is written to an SVG file.
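
    The two backward errors mentioned above can be sketched as follows. The
    text does not state which formulas or norms the program uses; this
    example assumes the common definitions based on the infinity norm,
    ||b - Ax|| / (||A|| ||x|| + ||b||) for the norm-wise error and
    max_i |b - Ax|_i / (|A||x| + |b|)_i for the component-wise error, and
    assumes all denominators are nonzero. The function name is hypothetical.

```python
def backward_errors(a, x, b):
    """Return (norm-wise, component-wise) backward error of x for A x = b."""
    n = len(a)
    r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm_a = max(sum(abs(v) for v in row) for row in a)  # infinity norm of A
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    norm_r = max(abs(v) for v in r)
    normwise = norm_r / (norm_a * norm_x + norm_b)
    componentwise = max(
        abs(r[i]) / (sum(abs(a[i][j]) * abs(x[j]) for j in range(n)) + abs(b[i]))
        for i in range(n)
    )
    return normwise, componentwise
```

    A solution refined to double precision accuracy would be expected to give
    values close to the double precision unit roundoff.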

