PLASMA RELEASE NOTES
__________________________________________________________________

Summary of Current Features

 * Solution of the standard and generalized dense symmetric Eigenvalue Problem, in real and complex arithmetic, via a two-stage tridiagonal reduction that has been shown to be up to 10 times faster than the standard tridiagonalization. Both eigenvalues and eigenvectors are supported. The overall speedup of the two-stage dense symmetric eigenvalue algorithm varies between two (when eigenpairs are needed) and ten (when only eigenvalues are needed). The new routines are: plasma_zheev, plasma_zheevd, plasma_zheevr, plasma_zhegvd, plasma_zhegv, plasma_zhetrd. More details about the technique can be found in:
   + H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra. Solving the Generalized Symmetric Eigenvalue Problem using Tile Algorithms on Multicore Architectures. Advances in Parallel Computing, Volume 22, 2012.
   + A. Haidar, H. Ltaief and J. Dongarra. Parallel Memory-Aware Fine-Grained Reduction to Condensed Forms for Symmetric Eigenvalue Problems. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2011.
 * Solution of the dense Singular Value Decomposition, in real and complex arithmetic, via a two-stage bidiagonal reduction that has been shown to be up to 10 times faster than the standard bidiagonalization. Both singular values and singular vectors are supported. The overall speedup of the two-stage dense Singular Value Decomposition algorithm varies between two (when singular vectors are needed) and ten (when only singular values are needed). The new routines are: plasma_zgesvd, plasma_zgesdd, plasma_zgebrd. More details about the technique can be found in:
   + A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra. A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction. IEEE IPDPS 2012.
   + A. Haidar, P. Luszczek, J. Kurzak and J. Dongarra. An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013.
 * Solution of dense systems of linear equations and least squares problems, in real and complex arithmetic, in single and double precision, via the Cholesky, LU, QR and LQ factorizations (an illustrative call sketch appears below)
 * Solution of dense linear systems of equations, in real and complex arithmetic, using mixed-precision algorithms based on the Cholesky, LU, QR and LQ factorizations
 * Multiple implementations of the LU factorization algorithm: partial pivoting based on a recursive parallel panel, tournament pivoting built on LU with partial pivoting, and incremental pivoting
 * Tree-based QR and LQ factorizations and Q matrix generation and application (“tall and skinny”)
 * Tree-based bidiagonal reduction (“tall and skinny”)
 * Explicit matrix inversion based on the Cholesky factorization (symmetric positive definite)
 * Parallel and cache-efficient in-place layout translations (Gustavson et al.)
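As a rough illustration of the simple LAPACK-like interface used by the solver routines above, the sketch below solves a symmetric positive definite system with PLASMA_dposv. It is a minimal, non-authoritative example: the prototypes shown (PLASMA_Init, PLASMA_dposv, PLASMA_Finalize) are assumed from PLASMA 2.x conventions and should be checked against plasma.h and the Users' Guide.

    /* Hedged sketch: solve A*X = B via Cholesky with the LAPACK-like interface.
       Prototypes assumed from PLASMA 2.x conventions; verify against plasma.h. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <plasma.h>

    int main(void) {
        int N = 1000, NRHS = 1, LDA = N, LDB = N, i, info;
        double *A = malloc((size_t)LDA * N * sizeof(double));
        double *B = malloc((size_t)LDB * NRHS * sizeof(double));

        /* Build a symmetric, diagonally dominant (hence SPD) matrix and a RHS. */
        for (i = 0; i < N * N; i++)
            A[i] = 0.5 / (1.0 + (i % N) + (i / N));
        for (i = 0; i < N; i++)
            A[i * LDA + i] += N;
        for (i = 0; i < N * NRHS; i++)
            B[i] = 1.0;

        PLASMA_Init(4);                                 /* start PLASMA with 4 threads */
        info = PLASMA_dposv(PlasmaUpper, N, NRHS, A, LDA, B, LDB);
        PLASMA_Finalize();

        printf("PLASMA_dposv returned %d\n", info);
        free(A); free(B);
        return info;
    }

The other LAPACK-like drivers listed above follow the same initialize/call/finalize pattern.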
 * Complete set of Level 3 BLAS routines for matrices stored in tile layout
 * Simple LAPACK-like interface for greater productivity and an advanced (tile) interface for full control and maximum performance; routines for conversion between the LAPACK matrix layout and PLASMA’s tile layout
 * Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)
 * Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode; sequence and request constructs for controlling progress and checking errors (an illustrative sketch combining the tile and asynchronous interfaces appears further below)
 * Automatic handling of workspace allocation; a set of auxiliary functions to assist the user with workspace allocation
 * A simple set of "sanity" tests for all numerical routines, including the Level 3 BLAS routines for matrices in tile layout
 * An advanced testing suite for exhaustive numerical testing of all routines in all precisions (based on the testing suite of the LAPACK library)
 * Basic timing suite for most of the routines in all precisions
 * Thread safety
 * Support for Make and CMake build systems
 * LAPACK-style comments in the source code using the Doxygen system
 * Native support for Microsoft Windows using WinThreads through a thin OS interaction layer
 * Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACKE C API)
 * Extensive documentation including an Installation Guide, Users' Guide, Reference Manual, an HTML code browser, a guide on running PLASMA with the TAU package, a Contributors' Guide, a README and these Release Notes
 * A comprehensive set of usage examples
__________________________________________________________________

New Features by Release

2.8.0, November, 2015

 * Fix a synchronization problem in the STEDC functions.
 * Reduce the amount of computation performed in the UNGQR/UNGLQ family of routines by taking advantage of the identity structure of Q. It is no longer required to initialize Q to the identity before calling these functions.
 * New routines PLASMA_[sdcz]lascal[_Tile[_Async]], similar to ScaLAPACK p[sdcz]lascal, to scale a matrix by a constant factor. Unlike zlascl from LAPACK, this function does not handle numerical overflow/underflow.
 * New routines PLASMA_[sdcz]geadd[_Tile[_Async]] and PLASMA_[sdcz]tradd[_Tile[_Async]], similar to ScaLAPACK p[sdcz]geadd and p[sdcz]tradd, to add two general, or trapezoidal, matrices together.
 * Add functions to the API that allow users to mix submissions to the QUARK runtime system of asynchronous PLASMA calls and of their own kernels.
 * Update the LAPACKE interface to 3.6.0.
 * Bug fix in the Frobenius norm.
 * Add a missing check on the alignment of descriptors with tiles; its absence could cause unreported trouble for users of sub-descriptors, especially with recursive algorithms.

2.7.1, April, 2015

 * Bug fix for an infinite loop in the LU recursive panel kernels.
 * Update the EZTrace module to be compliant with EZTrace 1.0.6.
 * Fix the F77 interface to handle the tile descriptor correctly.
 * Update the Lapack_to_Tile/Tile_to_Lapack routine family to support both in-place and out-of-place layout translation.

2.7.0, March, 2015

 * Parallel tridiagonal divide-and-conquer solver for eigenvalue problems.

2.6.0, November, 2013

 * libcoreblas has been made fully independent. All dependencies on libplasma and libquark have been removed. A pkg-config file has been added to ease compilation of projects using the stand-alone coreblas library.
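Stepping back from the per-release notes for a moment, the tile and asynchronous interfaces listed in the feature summary can be combined roughly as in the following hedged sketch: a LAPACK-layout matrix is converted to tile layout, a Cholesky factorization is submitted in non-blocking mode with a sequence/request pair, and the result is converted back. The prototypes (PLASMA_Desc_Create, PLASMA_Lapack_to_Tile, PLASMA_dpotrf_Tile_Async, PLASMA_Sequence_Create/Wait/Destroy), the PLASMA_TILE_SIZE option name, the PLASMA_REQUEST_INITIALIZER macro, the sequence status field and the descriptor arguments are all assumed from PLASMA 2.x conventions and should be verified against the headers.

    /* Hedged sketch of the tile + asynchronous interface; PLASMA 2.x
       conventions assumed, verify prototypes against plasma.h. */
    #include <stdlib.h>
    #include <plasma.h>

    int main(void) {
        int N = 2000, LDA = N, NB = 200, i, status;

        /* SPD matrix in column-major LAPACK layout. */
        double *A = malloc((size_t)LDA * N * sizeof(double));
        for (i = 0; i < N * N; i++) A[i] = 0.5 / (1.0 + (i % N) + (i / N));
        for (i = 0; i < N; i++)     A[i * LDA + i] += N;

        /* Separate buffer that will hold the same matrix in tile layout. */
        double *Atile = malloc((size_t)N * N * sizeof(double));

        PLASMA_Init(8);
        PLASMA_Set(PLASMA_TILE_SIZE, NB);          /* option name assumed */

        /* Descriptor for an N x N matrix stored as NB x NB tiles. */
        PLASMA_desc *descA;
        PLASMA_Desc_Create(&descA, Atile, PlasmaRealDouble,
                           NB, NB, NB * NB, N, N, 0, 0, N, N);

        PLASMA_Lapack_to_Tile(A, LDA, descA);      /* LAPACK -> tile layout */

        /* Non-blocking submission; progress and errors are tracked through
           the sequence/request pair. */
        PLASMA_sequence *sequence;
        PLASMA_request request = PLASMA_REQUEST_INITIALIZER;
        PLASMA_Sequence_Create(&sequence);
        PLASMA_dpotrf_Tile_Async(PlasmaLower, descA, sequence, &request);
        /* ...further tile routines could be queued on the same sequence... */
        PLASMA_Sequence_Wait(sequence);            /* wait for the whole DAG */

        PLASMA_Tile_to_Lapack(descA, A, LDA);      /* tile -> LAPACK layout */

        status = sequence->status;                 /* field name assumed; PLASMA_SUCCESS on success */
        PLASMA_Sequence_Destroy(sequence);
        PLASMA_Desc_Destroy(&descA);
        PLASMA_Finalize();

        free(A); free(Atile);
        return status;
    }

Keeping the data in tile layout between successive calls avoids repeated layout translations, which is the main reason to prefer the tile interface when chaining several routines.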
 * New routines PLASMA_[sdcz]pltmg[_Tile[_Async]], for PLASMA Test Matrices Generation, have been added to create special test matrices from the MATLAB gallery. This includes Cauchy, Circulant, Fiedler, Foster, Hadamard, Hankel, Householder and many other matrices.
 * Add norm computations for triangular matrices, PLASMA_[sdcz]lantr[_Tile[_Async]], and the dependent kernels.
 * The Doxygen documentation of the coreblas kernels has been updated.
 * Fix a problem reported by J. Dobson from NAG concerning the modification of thread settings made in the singular value and eigenvalue routines when MKL is used.

2.5.2, September, 2013

 * Add -m and -n options to the timing routines to define the matrix size without using ranges.
 * Fix a minor bug that appears when combining multi-threaded tasks with thread masks in QUARK. Previously, the thread mask was not respected when the tasks of the multi-threaded task were being assigned to threads.
 * Fix an illegal division by 0 that occurred when the matrix size was smaller than the tile size during in-place layout translation. See [1]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
 * Fix the QUARK_REGION bug that was limiting the performance of the QR/LQ factorizations in the last release.
 * Fix an illegal division by 0 when the first NUMA node detected by hwloc is empty. Thanks to Jim for these two bug reports, see [2]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680.
 * Fix an integer size that was creating an overflow in the tile pointer computation. Thanks to SGI for the bug report.

2.5.1, July, 2013

 * Add LU factorization with tournament pivoting: PLASMA_[sdcz]getrf_tntpiv[_Tile[_Async]]. Each tournament is based on the classical partial pivoting algorithm. The size of each subdomain involved in the tournament can be set through the call "PLASMA_Set( PLASMA_TNTPIVOTING_SIZE, nt );". The default is 4. See LAWN 226.
 * Add LU factorization with no pivoting: PLASMA_[sdcz]getrf_nopiv[_Tile[_Async]]. WARNING: your matrix has to be diagonally dominant to use it, or the result might be wrong.
 * Add a QR with rank-revealing routine: PLASMA_[sdcz]geqp3[_Tile[_Async]].
 * Fix many comments in the Doxygen documentation.
 * Complete the documentation on DAG and execution trace generation.
 * Add the dense Hermitian eigenvalue problem routines listed below. Note that these routines require a multithreaded BLAS. For that, the user is required to tell PLASMA that a multithreaded BLAS library is being used, and to specify which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported libraries are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more; please contact the PLASMA team if you require additional libraries to be supported.
   1- PLASMA_[sdcz]hetrd: computes the tridiagonal reduction of a dense Hermitian matrix using the 2-stage algorithm A = QTQ^H. It can also generate the complex matrix Q with orthonormal columns used to reduce the matrix A to tridiagonal form. This function is similar to the ZHETRD routine of LAPACK combined with the ZUNGQR routine (when Q is generated).
   2- PLASMA_[sdcz]heev: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. This function is similar to the ZHEEV routine of LAPACK.
   3- PLASMA_[sdcz]heevd: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. If eigenvectors are desired, it uses a divide-and-conquer algorithm. This function is similar to the ZHEEVD routine of LAPACK.
   4- PLASMA_[sdcz]heevr: computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Whenever possible, ZHEEVR calls ZSTEMR to compute the eigenspectrum using Relatively Robust Representations (MRRR). This function is similar to the ZHEEVR routine of LAPACK.
   5- PLASMA_[sdcz]hegv: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. This function uses the QR algorithm and is similar to the ZHEGV routine of LAPACK.
   6- PLASMA_[sdcz]hegvd: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. If eigenvectors are desired, it uses a divide-and-conquer algorithm, and is similar to the ZHEGVD routine of LAPACK.
 * Add the singular value decomposition (SVD) routines listed below. Note that these routines require a multithreaded BLAS. For that, the user is required to tell PLASMA that a multithreaded BLAS library is being used, and to specify which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported libraries are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more; please contact the PLASMA team if you require additional libraries to be supported.
   1- PLASMA_[sdcz]gebrd: computes the bidiagonal reduction of a dense general matrix using the 2-stage algorithm A = QBP^H. It can also generate the complex matrices Q and P^H with orthonormal columns used to reduce the matrix A to bidiagonal form. This function is similar to the ZGEBRD routine of LAPACK combined with the ZUNGBR routine (when Q is generated).
   2- PLASMA_[sdcz]gesvd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the implicit zero-shift QR algorithm and is similar to the ZGESVD routine of LAPACK.
   3- PLASMA_[sdcz]gesdd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the divide-and-conquer algorithm and is similar to the ZGESDD routine of LAPACK.

2.5.0, November, 2012

 * Introduce condition number estimators for the general and positive definite cases (xGECON, xPOCON)
 * Fix a recurring issue with the LAPACK release number in the plasma-installer
 * Fix out-of-order computations in the QR/LQ factorizations that were causing numerical issues with dynamic scheduling
 * Fix many comments in the Doxygen documentation
 * Correct some performance issues with in-place layout translation

2.4.6, August 20th, 2012

 * Add eigenvector support in the eigensolvers for symmetric/Hermitian problems and generalized problems.
 * Add support for the Frobenius norm.
 * Release the precision generation script used to generate the precisions s, d and c from z, as well as ds from zc.
 * Add all Fortran90 wrappers for the mixed-precision routines.
 * Add all Fortran90 wrappers to the tile interface and the asynchronous interface. Thanks to NAG for providing those wrappers.
 * Add 4 examples with the Fortran90 interface.
 * Add support for all computational functions in the F77 wrappers.
 * Fix memory leaks related to fake dependencies in dynamically scheduled algorithms.
 * Fix interface issues in the eigensolver routines.
 * Fix the returned info in the PLASMA_zgetrf function.
 * Fix a bug with matrices of size 0.
 * WARNING: all LAPACK interfaces having a T or L argument for the QR or LU factorization have been changed to take a descriptor. The workspace allocation has been changed to match those requirements, and all PLASMA_Alloc_Workspace_XXXXX_Tile functions are now deprecated; users are encouraged to move to the PLASMA_Alloc_Workspace_XXXXX version.

2.4.5, November 22nd, 2011

 * Add LU inversion functions PLASMA_zgetri, PLASMA_zgetri_Tile and PLASMA_zgetri_Tile_Async, using the recursive parallel panel implementation of the LU factorization.
 * The Householder reduction trees for the QR and LQ factorizations can now work on general cases and not only on matrices with M a multiple of MB.
 * Matrix generation has been changed in every timing, testing and example file to use a parallel initialization that generates a better distribution of the data on the architecture, especially for the tile interface. “numactl” is not required anymore.
 * Timing routines can now generate DAGs with the --dag option, and traces with the --trace option if EZTrace is present.

2.4.2, September 14th, 2011

 * New version of QUARK removing active waiting and allowing the user to bind tasks to a set of cores.
 * Installer: fix compatibility issues between the plasma-installer and the PGI compiler, reported on Kraken by A. Bouteiller.
 * Fix one memory leak with hwloc.
 * Introduce a new kernel for the recursive LU operation on tile layout which reduces cache misses.
 * Fix several bugs and introduce new features thanks to people from Fujitsu and NAG:
   + The new LU factorization with partial pivoting introduced in release 2.4 now works on rectangular matrices.
   + Add missing functions to the Fortran 77 interface.
   + Add a new Fortran 90 interface to all LAPACK and tile interfaces. The asynchronous interface and mixed-precision routines are not available yet.
   + Fix the argument order in header files to match the implementation.

2.4.1, July 8th, 2011

 * Fix a bug with the Fujitsu compiler reported on the forum ([3]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108)
 * Unbind threads in PLASMA_Finalize to avoid binding problems in OpenMP sections following PLASMA calls (still possible on Mac and AIX without hwloc). A better fix is to create the OpenMP threads in the user code, via a fake parallel section, before any call to PLASMA.

2.4.0, June 6th, 2011

 * Tree-based QR and LQ factorizations: routines for application of the Q matrix support all combinations of input parameters: Left/Right, NoTrans/Trans/ConjTrans
 * Symmetric Eigenvalue Problem using: tile reduction to band-tridiagonal form, reduction to "standard" tridiagonal form by bulge chasing, and finding eigenvalues using the QR algorithm (eigenvectors currently not supported)
 * Singular Value Decomposition using: tile reduction to band-bidiagonal form, reduction to “standard” bidiagonal form by bulge chasing, and finding singular values using the QR algorithm (singular vectors currently not supported)
 * Gaussian elimination with partial pivoting (as opposed to the incremental pivoting in the tile LU factorization) and a parallel panel (using QUARK extensions for nested parallelism). WARNING: Following the integration of this new feature, the interface used to call the LU factorization has changed.
   Now, PLASMA_zgetrf follows the LAPACK interface and corresponds to the new partial pivoting. The old interface, related to the LU factorization with incremental pivoting, is now named PLASMA_zgetrf_incpiv.

2.3.1, November 30th, 2010

 * Add functions to generate random matrices (plrnt, plghe and plgsy); this fixes the problem with time_zpotri_tile.c reported by Katayama on the forum ([4]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59)
 * Fix a deadlock in norm computations with static scheduling
 * Installer: fix the LAPACK version when libtmg is the only library to be installed. Thanks to Henc. ([5]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60)

2.3.0, November 15th, 2010

 * Parallel and cache-efficient in-place layout translations (Gustavson et al.)
 * Tree-based QR factorization and Q matrix generation (“tall and skinny”)
 * Explicit matrix inversion based on Cholesky factorization (symmetric positive definite)
 * Replacement of the LAPACK C Wrapper with the LAPACKE C API by Intel

2.2.0, July 9th, 2010

 * Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)
 * Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode. Sequence and request constructs for controlling progress and checking errors
 * Removal of CBLAS and pieces of LAPACK from PLASMA’s source tree. BLAS, CBLAS, LAPACK and the Netlib LAPACK C Wrapper become PLASMA’s software dependencies, required prior to the installation of PLASMA
 * Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACK C Wrapper)
 * Complete set of Level 3 BLAS routines for matrices stored in tile layout

2.1.0, November 15th, 2009

 * Native support for Microsoft Windows using WinThreads
 * Support for Make and CMake build systems
 * Performance-optimized mixed-precision routine for the solution of linear systems of equations using the LU factorization
 * Initial timing code (PLASMA_dgesv only)
 * Release notes

2.0.0, July 4th, 2009

 * Support for real and complex arithmetic in single and double precision
 * Generation and application of the Q matrix from the QR and LQ factorizations
 * Prototype of a mixed-precision routine for the solution of linear systems of equations using the LU factorization (not optimized for performance)
 * Simple interface and native interface
 * Major code cleanup and restructuring
 * Redesigned workspace allocation
 * LAPACK testing
 * Examples
 * Thread safety
 * Python installer
 * Documentation: Installation Guide, Users' Guide with routine reference and an HTML code browser, a guide on running PLASMA with the TAU package, an initial draft of the Contributors' Guide, a README file and a LICENSE file

1.0.0, November 15th, 2008

 * Double precision routines for the solution of linear systems of equations and least squares problems using the Cholesky, LU, QR and LQ factorizations
__________________________________________________________________

Last updated 2015-12-03 10:08:41 MST

References

 1. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
 2. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680
 3. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108
 4. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59
 5. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60