We have to finish the release by Wednesday. This is where we will list
the tasks that remain to be done. The 'Who' column shows who is working
on a task: if it is empty and you start on the task, put your name there;
if your name is already there, it is suggested that you take it. As you
discover problems and other tasks, add them to the list.
Here is what I have started; I will keep adding to it. Thanks.


To-do list for the MAGMA 1.2 release:

Task  Done      Who       What
=================================================================================
1.                         Find another way to set all the nb values, e.g., inside
                           MAGMA_init or another such routine, and provide the user
                           a way to change them.
                           Partially done: get_nb has been moved to src and it's OK
                           for tomorrow.
                           But we should create a set of functions:
                                  magma_zgetib( int func, int M  )
                                  magma_zsetib( int func, int ib )
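A minimal sketch of how the proposed pair could work. It assumes a table of per-routine user overrides that falls back to a size-based default. Only the two function names and their argument lists come from the note above. The int return type, the routine IDs (FUNC_ZGETRF, FUNC_ZGEQRF, FUNC_ZPOTRF) and the default values 64/128 are hypothetical placeholders, not MAGMA's tuned block sizes.

```c
#include <assert.h>

/* Hypothetical routine identifiers for the func argument; a real list
 * would cover every routine that currently calls get_nb. */
enum { FUNC_ZGETRF = 0, FUNC_ZGEQRF, FUNC_ZPOTRF, FUNC_COUNT };

/* Per-routine user overrides; 0 means "no override, use the default".
 * Static storage, so every entry starts at 0. */
static int user_ib[FUNC_COUNT];

/* Block size to use for routine func on a problem with M rows.
 * Returns -1 for an unknown routine. */
int magma_zgetib(int func, int M)
{
    assert(M >= 0);
    if (func < 0 || func >= FUNC_COUNT)
        return -1;
    if (user_ib[func] > 0)
        return user_ib[func];          /* the user's choice wins */
    return (M <= 2048) ? 64 : 128;     /* placeholder defaults, not tuned values */
}

/* Set the block size for routine func; ib <= 0 restores the default.
 * Returns 0 on success, -1 for an unknown routine. */
int magma_zsetib(int func, int ib)
{
    if (func < 0 || func >= FUNC_COUNT)
        return -1;
    user_ib[func] = (ib > 0) ? ib : 0;
    return 0;
}
```

With this design, a call such as magma_zsetib(FUNC_ZGEQRF, 96) changes the block size for that routine only. Passing ib <= 0 restores the default. The static table is not thread-safe, which may matter if several host threads drive different GPUs.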
---------------------------------------------------------------------------------
2.                         Add zlansy.cu functions (Relies on task 20)
---------------------------------------------------------------------------------
3.                         zclaswp interface should be:
                           ZCLASWP( N, A, LDA, SA, LDSA, K1, K2, IPIV, INCX )
---------------------------------------------------------------------------------
4.                         Add a MAGMA module describing the interfaces for use
                           from Fortran.
---------------------------------------------------------------------------------
5.                         Update the interfaces of:
                                latrd
                                unmtr
---------------------------------------------------------------------------------
6.                         Possible bug in zcgeqrsv_gpu at least on Tesla for
                           sizes not divisible by 32, e.g.,

[tomov@cumin testing]$ ./testing_dsgeqrsv_gpu -N 1025 -nrhs 1
device 0: GeForce GTX 280, 1296.0 MHz clock, 1023.8 MB memory
device 1: Quadro NVS 290, 918.0 MHz clock, 255.3 MB memory
Epsilon(double): 1.110223e-16
Epsilon(single): 5.960464e-08

        CPU GFlop/s         G P U  GFlop/s   
  N         DP          DP       SP       MP    ||b-Ax||/||A||  NumIter
=======================================================================
 1025     20.80       20.31    35.93    28.18    0.000000e+00     0

   * This is a problem on Tesla. It apparently comes from the sgemm calls used
     in slarfb. I cannot reproduce the problem with just the QR factorization
     or the solver. Even here it sometimes runs correctly, but sometimes the
     card goes into a "weird" state and gives these results.

     Any ideas on this are welcome. Obviously it requires more testing.

---------------------------------------------------------------------------------
7.             Stan         Bug in zcgesv_gpu when N is not a multiple of 32,
                            nrhs > 1, and the leading dimensions of X and RHS are > N.
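The failing case (leading dimensions > N) suggests that some element is addressed with N instead of the leading dimension. In column-major storage, element (i,j) of X lives at X[i + j*ldx], so the two agree only when ldx == N. Below is a minimal sketch of a copy that honors separate leading dimensions for source and destination. copy_block is a hypothetical helper, not a MAGMA routine.

```c
#include <assert.h>
#include <stddef.h>

/* Copy an m-by-n column-major block from src (leading dimension lds)
 * to dst (leading dimension ldd). Element (i,j) is at i + j*ld, so using
 * m (or N) in place of ld is only correct when ld == m. */
void copy_block(int m, int n, const double *src, int lds,
                double *dst, int ldd)
{
    assert(m >= 0 && n >= 0 && lds >= m && ldd >= m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            dst[i + (size_t)j * ldd] = src[i + (size_t)j * lds];
}
```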
---------------------------------------------------------------------------------
8.             Peng         Bug in dtrsm on Tesla, e.g.,
     
[tomov@cumin testing]$ ./testing_dgetrf_gpu -M 287 -N 579
device 0: GeForce GTX 280, 1296.0 MHz clock, 1023.8 MB memory
device 1: Quadro NVS 290, 918.0 MHz clock, 255.3 MB memory
  testing_dgetrf -M 287 -N 579



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
  287   579    1.91         4418.97         3.925370e-04

If I use cublasDtrsm in dgetrf_gpu, I don't get the error.
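One way to isolate the dtrsm problem is to compare the GPU result against a plain host solve on the same inputs (e.g., the 287 x 579 case above). Below is a minimal sketch of the left, lower, no-transpose, unit-diagonal case that a right-looking LU applies to the block row to the right of the panel. Which trsm variant dgetrf_gpu actually calls has not been checked here, and ref_dtrsm_llnu is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Host reference for B := inv(L) * B, where L is the m-by-m unit lower
 * triangular matrix stored in the strictly lower part of A (the diagonal
 * of A is not read and is taken as 1). B is m-by-n. Column-major. */
void ref_dtrsm_llnu(int m, int n, const double *A, int lda,
                    double *B, int ldb)
{
    assert(m >= 0 && n >= 0 && lda >= m && ldb >= m);
    for (int j = 0; j < n; ++j)                  /* each column of B */
        for (int k = 0; k < m; ++k) {            /* forward substitution */
            double bk = B[k + (size_t)j * ldb];  /* x_k is final here */
            for (int i = k + 1; i < m; ++i)
                B[i + (size_t)j * ldb] -= A[i + (size_t)k * lda] * bk;
        }
}
```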
---------------------------------------------------------------------------------
9.             Peng         Bug in strsm on Fermi, e.g.,

tomov:disco /mnt/scratch/tomov/sc_release/testing> ./testing_sgetrf_gpu -M 1024 -N 2048
device 0: Tesla C2050, 1147.0 MHz clock, 3071.7 MB memory
device 1: Quadro NVS 290, 918.0 MHz clock, 255.7 MB memory
  testing_sgetrf -M 1024 -N 2048



  M     N   CPU GFlop/s    GPU GFlop/s   ||PA-LU||/(||A||*N)
============================================================
 1024  2048   31.81          97.18         1.544789e-04
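The residual above reads as a bug once it is scaled by the unit roundoff that the testers print. 1.544789e-04 / 5.960464e-08 is about 2,600, whereas a correct LU on a random matrix normally gives a small number. Below is a minimal sketch of such a pass/fail check. The tolerance is a hypothetical choice. The eps values are taken as FLT_EPSILON/2 and DBL_EPSILON/2, which match the "Epsilon(single)" and "Epsilon(double)" lines printed by the dsgeqrsv tester in task 6.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Unit roundoff as printed by the testers: 2^-24 and 2^-53,
 * i.e. half of FLT_EPSILON and DBL_EPSILON. */
static const double EPS_SINGLE = FLT_EPSILON / 2.0;
static const double EPS_DOUBLE = DBL_EPSILON / 2.0;

/* Accept a normalized residual only if it is at most tol times eps. */
bool residual_ok(double residual, double eps, double tol)
{
    assert(eps > 0.0 && tol > 0.0);
    return residual / eps <= tol;
}
```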
 

---------------------------------------------------------------------------------
10.            Peng         Check whether this bug is related to the dtrsm bug:
 
http://icl.cs.utk.edu/magma/forum/viewtopic.php?f=2&t=127&sid=d2479bd1fcd4f88d39111f9611810a8e

*** Also, looking at the code, dX (and probably dB) is sometimes accessed with
the leading dimension N and sometimes with lddx. In the tester lddx is passed
as N, so we cannot detect the bug. I (Stan) am not fixing it now, as the
problem may involve some other kernels (I remember there was an issue somewhere).
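Because the tester passes lddx = N, a wrong leading dimension is invisible. Below is a minimal sketch of one way to expose it. It assumes the tester is changed to allocate dX with lddx > N, for instance N rounded up to a multiple of 32. Fill the padding rows with NaN before the call, then check afterwards that they are untouched. A wrong read shows up instead as NaN in the solution. poison_padding and padding_intact are hypothetical helpers. On the GPU, the padded buffer would be copied back to the host before the check.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Fill rows n..ld-1 of each of the ncols columns of a column-major buffer
 * with NaN. A routine that wrongly uses n as the leading dimension will
 * then read NaN (poisoning its output) or overwrite some of the padding. */
void poison_padding(int n, int ld, int ncols, double *X)
{
    assert(n >= 0 && ld >= n && ncols >= 0);
    for (int j = 0; j < ncols; ++j)
        for (int i = n; i < ld; ++i)
            X[i + (size_t)j * ld] = NAN;
}

/* True if every padding entry is still NaN after the call under test. */
bool padding_intact(int n, int ld, int ncols, const double *X)
{
    for (int j = 0; j < ncols; ++j)
        for (int i = n; i < ld; ++i)
            if (!isnan(X[i + (size_t)j * ld]))
                return false;
    return true;
}
```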
---------------------------------------------------------------------------------
11.                         Remove DA from the description of zhetrd once we
                            pass all tests (the workspace is now allocated in
                            the routine itself).
---------------------------------------------------------------------------------
12.                         There could be a bug in zgeqrf2_gpu. There is a call to
                            zlarfb that uses V with 0s in the upper triangular
                            part, and immediately after this we asynchronously
                            launch (in another stream) a kernel that restores V
                            to the LAPACK layout. Unless the two streams are
                            synchronized, the kernel may modify V while zlarfb
                            is still using it.

                            I don't see errors now; this is just a note on where
                            to look in case we start seeing them.
---------------------------------------------------------------------------------
13.              Rajib ?   When I run testing_sgeev in error-checking mode, I
                           sometimes get an error (e.g., when running at sizes
                           other than the specified ones). If I block the use of
                           magmablas_sgemm in slahru, the error disappears. This
                           happens on Tesla (cumin in particular).
                           I have blocked the use of magma_sgemm for this routine
                           (on both Fermi and Tesla) pending further investigation.
                           
                           Stan
---------------------------------------------------------------------------------
14.              Rajib?    There seems to be a loss of accuracy when ssyr2k is
                 Tim?      used in ssytrd on Fermi, e.g.,
                           using CUBLAS I get
tomov:disco /mnt/scratch/tomov/magma_1.0.0/testing> ./testing_ssytrd -N 2048               
device 0: Tesla C2050 / C2070, 1147.0 MHz clock, 3071.7 MB memory
device 1: Quadro NVS 290, 918.0 MHz clock, 255.7 MB memory
  testing_ssytrd -L|U -N 2048

  N    CPU GFlop/s    GPU GFlop/s   |A-QHQ'|/N|A|  |I-QQ'|/N 
=============================================================
 2048    12.11         24.69       1.092290e-08 2.556555e-08

while using MAGMA BLAS I get
  N    CPU GFlop/s    GPU GFlop/s   |A-QHQ'|/N|A|  |I-QQ'|/N 
=============================================================
 2048    12.00         24.97       1.978159e-06 2.506212e-08

                           I am taking it out for now until it is fixed.

                           Stan
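To decide whether the 1.978159e-06 error comes from ssyr2k itself, the GPU kernel could be compared against a host reference whose sums are accumulated in double. Below is a minimal sketch for the lower-triangle, no-transpose case. Whether ssytrd uses this uplo/trans combination is an assumption, and ref_ssyr2k_lower is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Host reference for the lower triangle of
 *     C := alpha*(A*B^T + B*A^T) + beta*C,
 * with A and B n-by-k and C n-by-n, all column-major. Dot products are
 * accumulated in double to give a more accurate baseline for the
 * single-precision GPU kernel. As in the reference BLAS, C is not read
 * when beta == 0. */
void ref_ssyr2k_lower(int n, int k, float alpha,
                      const float *A, int lda, const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    assert(n >= 0 && k >= 0 && lda >= n && ldb >= n && ldc >= n);
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) {                 /* lower triangle */
            double s = 0.0;
            for (int l = 0; l < k; ++l)
                s += (double)A[i + (size_t)l * lda] * B[j + (size_t)l * ldb]
                   + (double)B[i + (size_t)l * ldb] * A[j + (size_t)l * lda];
            float *c = &C[i + (size_t)j * ldc];
            double old = (beta == 0.0f) ? 0.0 : (double)beta * *c;
            *c = (float)(alpha * s + old);
        }
}
```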

---------------------------------------------------------------------------------
15.              Rajib ?   There is another problem with magmablas sgemm (see also
                 Tim ?     the tasks above). For example, testing
                           ./testing_ssyevd -N 3072
                           (and other sizes) gives wrong results, while using
                           cublasSgemm is fine. I had to comment out the use of
                           MAGMABLAS in slarfb_gpu.
---------------------------------------------------------------------------------
16.               Stan    Add zungtr_gpu. Add the upper case to zungtr.
                          Use the routine in the SVD. Make sure that the
                          matrix is prepared and passed to magma_zungqr in the
                          desired format. The same for T: do we have it from
                          the bidiagonalization?
---------------------------------------------------------------------------------
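17.                       cublasStatus_t is not defined before CUDA 4.0, where
                          CUBLAS only provides cublasStatus. It should be possible
                          to detect the version (CUDA_VERSION from cuda.h, e.g.,
                          4000 for CUDA 4.0) and, if needed, define it, e.g.,
                          #if CUDA_VERSION < 4000
                          #define cublasStatus_t cublasStatus
                          #endif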
---------------------------------------------------------------------------------



