PLASMA RELEASE NOTES
__________________________________________________________________

Summary of Current Features

 * Solution of the standard and generalized dense symmetric Eigenvalue Problem, in real and complex arithmetic, via a two-stage tridiagonal reduction that has been shown to be up to 10 times faster than the standard tridiagonalization. Both eigenvalues and eigenvectors are supported. The overall speedup of the two-stage dense symmetric eigenvalue algorithm varies between two (when eigenpairs are needed) and ten (when only eigenvalues are needed). The new routines are: plasma_zheev, plasma_zheevd, plasma_zheevr, plasma_zhegvd, plasma_zhegv, plasma_zhetrd. More details about the technique can be found in:
   + H. Ltaief, P. Luszczek, A. Haidar and J. Dongarra. Solving the Generalized Symmetric Eigenvalue Problem using Tile Algorithms on Multicore Architectures. Advances in Parallel Computing, Volume 22, 2012.
   + A. Haidar, H. Ltaief and J. Dongarra. Parallel Memory-Aware Fine-Grained Reduction to Condensed Forms for Symmetric Eigenvalue Problems. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2011.
 * Solution of the dense Singular Value Decomposition, in real and complex arithmetic, via a two-stage bidiagonal reduction that has been shown to be up to 10 times faster than the standard bidiagonalization. Both singular values and singular vectors are supported. The overall speedup of the two-stage dense Singular Value Decomposition algorithm varies between two (when singular vectors are needed) and ten (when only singular values are needed). The new routines are: plasma_zgesvd, plasma_zgesdd, plasma_zgebrd. More details about the technique can be found in:
   + A. Haidar, H. Ltaief, P. Luszczek and J. Dongarra. A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction. IEEE IPDPS 2012.
   + A. Haidar, P. Luszczek, J. Kurzak and J. Dongarra. An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware. International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE-SC 2013.
 * Solution of dense systems of linear equations and least squares problems, in real and complex arithmetic, in single and double precision, via the Cholesky, LU, QR and LQ factorizations (an illustrative call sketch appears below)
 * Solution of dense linear systems of equations, in real and complex arithmetic, using mixed-precision algorithms based on the Cholesky, LU, QR and LQ factorizations
 * Multiple implementations of the LU factorization algorithm: partial pivoting based on a recursive parallel panel, tournament pivoting built on LU with partial pivoting, and incremental pivoting
 * Tree-based QR and LQ factorizations and Q matrix generation and application (“tall and skinny”)
 * Tree-based bidiagonal reduction (“tall and skinny”)
 * Explicit matrix inversion based on the Cholesky factorization (symmetric positive definite)
 * Parallel and cache-efficient in-place layout translations (Gustavson et al.)
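As a rough illustration of the simple LAPACK-like interface used by the solver routines above, the sketch below solves a symmetric positive definite system with PLASMA_dposv. It is a minimal, non-authoritative example: the prototypes shown (PLASMA_Init, PLASMA_dposv, PLASMA_Finalize) are assumed from PLASMA 2.x conventions and should be checked against plasma.h and the Users' Guide.

    /* Hedged sketch: solve A*X = B via Cholesky with the LAPACK-like interface.
       Prototypes assumed from PLASMA 2.x conventions; verify against plasma.h. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <plasma.h>

    int main(void) {
        int N = 1000, NRHS = 1, LDA = N, LDB = N, i, info;
        double *A = malloc((size_t)LDA * N * sizeof(double));
        double *B = malloc((size_t)LDB * NRHS * sizeof(double));

        /* Build a symmetric, diagonally dominant (hence SPD) matrix and a RHS. */
        for (i = 0; i < N * N; i++)
            A[i] = 0.5 / (1.0 + (i % N) + (i / N));
        for (i = 0; i < N; i++)
            A[i * LDA + i] += N;
        for (i = 0; i < N * NRHS; i++)
            B[i] = 1.0;

        PLASMA_Init(4);                                 /* start PLASMA with 4 threads */
        info = PLASMA_dposv(PlasmaUpper, N, NRHS, A, LDA, B, LDB);
        PLASMA_Finalize();

        printf("PLASMA_dposv returned %d\n", info);
        free(A); free(B);
        return info;
    }

The other LAPACK-like drivers listed above follow the same initialize/call/finalize pattern.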
 * Complete set of Level 3 BLAS routines for matrices stored in tile layout
 * Simple LAPACK-like interface for greater productivity and an advanced (tile) interface for full control and maximum performance; routines for conversion between the LAPACK matrix layout and PLASMA’s tile layout
 * Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)
 * Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode; sequence and request constructs for controlling progress and checking errors (an illustrative sketch combining the tile and asynchronous interfaces appears further below)
 * Automatic handling of workspace allocation; a set of auxiliary functions to assist the user with workspace allocation
 * A simple set of "sanity" tests for all numerical routines, including the Level 3 BLAS routines for matrices in tile layout
 * An advanced testing suite for exhaustive numerical testing of all routines in all precisions (based on the testing suite of the LAPACK library)
 * Basic timing suite for most of the routines in all precisions
 * Thread safety
 * Support for Make and CMake build systems
 * LAPACK-style comments in the source code using the Doxygen system
 * Native support for Microsoft Windows using WinThreads through a thin OS interaction layer
 * Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACKE C API)
 * Extensive documentation including an Installation Guide, Users' Guide, Reference Manual, an HTML code browser, a guide on running PLASMA with the TAU package, a Contributors' Guide, a README and these Release Notes
 * A comprehensive set of usage examples
__________________________________________________________________

New Features by Release

2.8.0, November, 2015

 * Fix a synchronization problem in the STEDC functions.
 * Reduce the amount of computation performed in the UNGQR/UNGLQ family of routines by taking advantage of the identity structure of Q. It is no longer required to initialize Q to the identity before calling these functions.
 * New routines PLASMA_[sdcz]lascal[_Tile[_Async]], similar to ScaLAPACK p[sdcz]lascal, to scale a matrix by a constant factor. Unlike zlascl from LAPACK, this function does not handle numerical overflow/underflow.
 * New routines PLASMA_[sdcz]geadd[_Tile[_Async]] and PLASMA_[sdcz]tradd[_Tile[_Async]], similar to ScaLAPACK p[sdcz]geadd and p[sdcz]tradd, to add two general, or trapezoidal, matrices together.
 * Add functions to the API that allow users to mix submissions to the QUARK runtime system of asynchronous PLASMA calls and of their own kernels.
 * Update the LAPACKE interface to 3.6.0.
 * Bug fix in the Frobenius norm.
 * Add a missing check on the alignment of descriptors with tiles; its absence could cause unreported trouble for users of sub-descriptors, especially with recursive algorithms.

2.7.1, April, 2015

 * Bug fix for an infinite loop in the LU recursive panel kernels.
 * Update the EZTrace module to be compliant with EZTrace 1.0.6.
 * Fix the F77 interface to handle the tile descriptor correctly.
 * Update the Lapack_to_Tile/Tile_to_Lapack routine family to support both in-place and out-of-place layout translation.

2.7.0, March, 2015

 * Parallel tridiagonal divide-and-conquer solver for eigenvalue problems.

2.6.0, November, 2013

 * libcoreblas has been made fully independent. All dependencies on libplasma and libquark have been removed. A pkg-config file has been added to ease compilation of projects using the stand-alone coreblas library.
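Stepping back from the per-release notes for a moment, the tile and asynchronous interfaces listed in the feature summary can be combined roughly as in the following hedged sketch: a LAPACK-layout matrix is converted to tile layout, a Cholesky factorization is submitted in non-blocking mode with a sequence/request pair, and the result is converted back. The prototypes (PLASMA_Desc_Create, PLASMA_Lapack_to_Tile, PLASMA_dpotrf_Tile_Async, PLASMA_Sequence_Create/Wait/Destroy), the PLASMA_TILE_SIZE option name, the PLASMA_REQUEST_INITIALIZER macro, the sequence status field and the descriptor arguments are all assumed from PLASMA 2.x conventions and should be verified against the headers.

    /* Hedged sketch of the tile + asynchronous interface; PLASMA 2.x
       conventions assumed, verify prototypes against plasma.h. */
    #include <stdlib.h>
    #include <plasma.h>

    int main(void) {
        int N = 2000, LDA = N, NB = 200, i, status;

        /* SPD matrix in column-major LAPACK layout. */
        double *A = malloc((size_t)LDA * N * sizeof(double));
        for (i = 0; i < N * N; i++) A[i] = 0.5 / (1.0 + (i % N) + (i / N));
        for (i = 0; i < N; i++)     A[i * LDA + i] += N;

        /* Separate buffer that will hold the same matrix in tile layout. */
        double *Atile = malloc((size_t)N * N * sizeof(double));

        PLASMA_Init(8);
        PLASMA_Set(PLASMA_TILE_SIZE, NB);          /* option name assumed */

        /* Descriptor for an N x N matrix stored as NB x NB tiles. */
        PLASMA_desc *descA;
        PLASMA_Desc_Create(&descA, Atile, PlasmaRealDouble,
                           NB, NB, NB * NB, N, N, 0, 0, N, N);

        PLASMA_Lapack_to_Tile(A, LDA, descA);      /* LAPACK -> tile layout */

        /* Non-blocking submission; progress and errors are tracked through
           the sequence/request pair. */
        PLASMA_sequence *sequence;
        PLASMA_request request = PLASMA_REQUEST_INITIALIZER;
        PLASMA_Sequence_Create(&sequence);
        PLASMA_dpotrf_Tile_Async(PlasmaLower, descA, sequence, &request);
        /* ...further tile routines could be queued on the same sequence... */
        PLASMA_Sequence_Wait(sequence);            /* wait for the whole DAG */

        PLASMA_Tile_to_Lapack(descA, A, LDA);      /* tile -> LAPACK layout */

        status = sequence->status;                 /* field name assumed; PLASMA_SUCCESS on success */
        PLASMA_Sequence_Destroy(sequence);
        PLASMA_Desc_Destroy(&descA);
        PLASMA_Finalize();

        free(A); free(Atile);
        return status;
    }

Keeping the data in tile layout between successive calls avoids repeated layout translations, which is the main reason to prefer the tile interface when chaining several routines.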
 * New routines PLASMA_[sdcz]pltmg[_Tile[_Async]], for PLASMA Test Matrices Generation, have been added to create special test matrices from the MATLAB gallery. This includes Cauchy, Circulant, Fiedler, Foster, Hadamard, Hankel, Householder and many other matrices.
 * Add norm computations for triangular matrices, PLASMA_[sdcz]lantr[_Tile[_Async]], and the dependent kernels.
 * The Doxygen documentation of the coreblas kernels has been updated.
 * Fix a problem reported by J. Dobson from NAG concerning the modification of thread settings made in the singular value and eigenvalue routines when MKL is used.

2.5.2, September, 2013

 * Add -m and -n options to the timing routines to define the matrix size without using ranges.
 * Fix a minor bug that appears when combining multi-threaded tasks with thread masks in QUARK. Previously, the thread mask was not respected when the tasks of the multi-threaded task were being assigned to threads.
 * Fix an illegal division by 0 that occurred when the matrix size was smaller than the tile size during in-place layout translation. See [1]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
 * Fix the QUARK_REGION bug that was limiting the performance of the QR/LQ factorizations in the last release.
 * Fix an illegal division by 0 when the first NUMA node detected by hwloc is empty. Thanks to Jim for these two bug reports, see [2]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680.
 * Fix an integer size that was creating an overflow in the tile pointer computation. Thanks to SGI for the bug report.

2.5.1, July, 2013

 * Add LU factorization with tournament pivoting: PLASMA_[sdcz]getrf_tntpiv[_Tile[_Async]]. Each tournament is based on the classical partial pivoting algorithm. The size of each subdomain involved in the tournament can be set through the call "PLASMA_Set( PLASMA_TNTPIVOTING_SIZE, nt );". The default is 4. See LAWN 226.
 * Add LU factorization with no pivoting: PLASMA_[sdcz]getrf_nopiv[_Tile[_Async]]. WARNING: your matrix has to be diagonally dominant to use it, or the result might be wrong.
 * Add a QR with rank-revealing routine: PLASMA_[sdcz]geqp3[_Tile[_Async]].
 * Fix many comments in the Doxygen documentation.
 * Complete the documentation on DAG and execution trace generation.
 * Add the dense Hermitian eigenvalue problem routines listed below. Note that these routines require a multithreaded BLAS. For that, the user is required to tell PLASMA that a multithreaded BLAS library is being used, and to specify which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported libraries are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more; please contact the PLASMA team if you require additional libraries to be supported.
   1- PLASMA_[sdcz]hetrd: computes the tridiagonal reduction of a dense Hermitian matrix using the 2-stage algorithm A = QTQ^H. It can also generate the complex matrix Q with orthonormal columns used to reduce the matrix A to tridiagonal form. This function is similar to the ZHETRD routine of LAPACK combined with the ZUNGQR routine (when Q is generated).
   2- PLASMA_[sdcz]heev: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. This function is similar to the ZHEEV routine of LAPACK.
   3- PLASMA_[sdcz]heevd: computes all eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. If eigenvectors are desired, it uses a divide-and-conquer algorithm. This function is similar to the ZHEEVD routine of LAPACK.
   4- PLASMA_[sdcz]heevr: computes selected eigenvalues and, optionally, eigenvectors of a complex Hermitian matrix A. Eigenvalues and eigenvectors can be selected by specifying either a range of values or a range of indices for the desired eigenvalues. Whenever possible, ZHEEVR calls ZSTEMR to compute the eigenspectrum using Relatively Robust Representations (MRRR). This function is similar to the ZHEEVR routine of LAPACK.
   5- PLASMA_[sdcz]hegv: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. This function uses the QR algorithm and is similar to the ZHEGV routine of LAPACK.
   6- PLASMA_[sdcz]hegvd: computes all the eigenvalues and, optionally, the eigenvectors of a complex generalized Hermitian-definite eigenproblem of the form A*x=(lambda)*B*x, A*B*x=(lambda)*x, or B*A*x=(lambda)*x. Here A and B are assumed to be Hermitian and B is also positive definite. If eigenvectors are desired, it uses a divide-and-conquer algorithm, and is similar to the ZHEGVD routine of LAPACK.
 * Add the singular value decomposition (SVD) routines listed below. Note that these routines require a multithreaded BLAS. For that, the user is required to tell PLASMA that a multithreaded BLAS library is being used, and to specify which one, by adding -DPLASMA_WITH_XXX to the compilation flags. The currently supported libraries are -DPLASMA_WITH_MKL and -DPLASMA_WITH_ACML, but it is easy to add more; please contact the PLASMA team if you require additional libraries to be supported.
   1- PLASMA_[sdcz]gebrd: computes the bidiagonal reduction of a dense general matrix using the 2-stage algorithm A = QBP^H. It can also generate the complex matrices Q and P^H with orthonormal columns used to reduce the matrix A to bidiagonal form. This function is similar to the ZGEBRD routine of LAPACK combined with the ZUNGBR routine (when Q is generated).
   2- PLASMA_[sdcz]gesvd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the implicit zero-shift QR algorithm and is similar to the ZGESVD routine of LAPACK.
   3- PLASMA_[sdcz]gesdd: computes the singular value decomposition (SVD) of a complex matrix A, optionally computing the left and/or right singular vectors. The SVD is written A = U * SIGMA * conjugate-transpose(V). This routine uses the divide-and-conquer algorithm and is similar to the ZGESDD routine of LAPACK.

2.5.0, November, 2012

 * Introduce condition number estimators for the general and positive definite cases (xGECON, xPOCON)
 * Fix a recurring issue with the LAPACK release number in the plasma-installer
 * Fix out-of-order computations in the QR/LQ factorizations that were causing numerical issues with dynamic scheduling
 * Fix many comments in the Doxygen documentation
 * Correct some performance issues with in-place layout translation

2.4.6, August 20th, 2012

 * Add eigenvector support in the eigensolvers for symmetric/Hermitian problems and generalized problems.
 * Add support for the Frobenius norm.
 * Release the precision generation script used to generate the precisions s, d and c from z, as well as ds from zc.
 * Add all Fortran90 wrappers for the mixed-precision routines.
 * Add all Fortran90 wrappers to the tile interface and the asynchronous interface. Thanks to NAG for providing those wrappers.
 * Add 4 examples with the Fortran90 interface.
 * Add support for all computational functions in the F77 wrappers.
 * Fix memory leaks related to fake dependencies in dynamically scheduled algorithms.
 * Fix interface issues in the eigensolver routines.
 * Fix the returned info in the PLASMA_zgetrf function.
 * Fix a bug with matrices of size 0.
 * WARNING: all LAPACK interfaces having a T or L argument for the QR or LU factorization have been changed to take a descriptor. The workspace allocation has been changed to match those requirements, and all PLASMA_Alloc_Workspace_XXXXX_Tile functions are now deprecated; users are encouraged to move to the PLASMA_Alloc_Workspace_XXXXX version.

2.4.5, November 22nd, 2011

 * Add LU inversion functions PLASMA_zgetri, PLASMA_zgetri_Tile and PLASMA_zgetri_Tile_Async, using the recursive parallel panel implementation of the LU factorization.
 * The Householder reduction trees for the QR and LQ factorizations can now work on general cases and not only on matrices with M a multiple of MB.
 * Matrix generation has been changed in every timing, testing and example file to use a parallel initialization that generates a better distribution of the data on the architecture, especially for the tile interface. “numactl” is not required anymore.
 * Timing routines can now generate DAGs with the --dag option, and traces with the --trace option if EZTrace is present.

2.4.2, September 14th, 2011

 * New version of QUARK removing active waiting and allowing the user to bind tasks to a set of cores.
 * Installer: fix compatibility issues between the plasma-installer and the PGI compiler, reported on Kraken by A. Bouteiller.
 * Fix one memory leak with hwloc.
 * Introduce a new kernel for the recursive LU operation on tile layout which reduces cache misses.
 * Fix several bugs and introduce new features thanks to people from Fujitsu and NAG:
   + The new LU factorization with partial pivoting introduced in release 2.4 now works on rectangular matrices.
   + Add missing functions to the Fortran 77 interface.
   + Add a new Fortran 90 interface to all LAPACK and tile interfaces. The asynchronous interface and mixed-precision routines are not available yet.
   + Fix the argument order in header files to match the implementation.

2.4.1, July 8th, 2011

 * Fix a bug with the Fujitsu compiler reported on the forum ([3]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108)
 * Unbind threads in PLASMA_Finalize to avoid binding problems in OpenMP sections following PLASMA calls (still possible on Mac and AIX without hwloc). A better fix is to create the OpenMP threads in the user code, via a fake parallel section, before any call to PLASMA.

2.4.0, June 6th, 2011

 * Tree-based QR and LQ factorizations: routines for application of the Q matrix support all combinations of input parameters: Left/Right, NoTrans/Trans/ConjTrans
 * Symmetric Eigenvalue Problem using: tile reduction to band-tridiagonal form, reduction to "standard" tridiagonal form by bulge chasing, and finding eigenvalues using the QR algorithm (eigenvectors currently not supported)
 * Singular Value Decomposition using: tile reduction to band-bidiagonal form, reduction to “standard” bidiagonal form by bulge chasing, and finding singular values using the QR algorithm (singular vectors currently not supported)
 * Gaussian elimination with partial pivoting (as opposed to the incremental pivoting in the tile LU factorization) and a parallel panel (using QUARK extensions for nested parallelism). WARNING: Following the integration of this new feature, the interface used to call the LU factorization has changed.
   Now, PLASMA_zgetrf follows the LAPACK interface and corresponds to the new partial pivoting. The old interface, related to the LU factorization with incremental pivoting, is now named PLASMA_zgetrf_incpiv.

2.3.1, November 30th, 2010

 * Add functions to generate random matrices (plrnt, plghe and plgsy); this fixes the problem with time_zpotri_tile.c reported by Katayama on the forum ([4]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59)
 * Fix a deadlock in norm computations with static scheduling
 * Installer: fix the LAPACK version when libtmg is the only library to be installed. Thanks to Henc. ([5]http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60)

2.3.0, November 15th, 2010

 * Parallel and cache-efficient in-place layout translations (Gustavson et al.)
 * Tree-based QR factorization and Q matrix generation (“tall and skinny”)
 * Explicit matrix inversion based on Cholesky factorization (symmetric positive definite)
 * Replacement of the LAPACK C Wrapper with the LAPACKE C API by Intel

2.2.0, July 9th, 2010

 * Dynamic scheduler QUARK (QUeuing And Runtime for Kernels) and dynamically scheduled versions of all computational routines (alongside statically scheduled ones)
 * Asynchronous interface for launching dynamically scheduled routines in a non-blocking mode. Sequence and request constructs for controlling progress and checking errors
 * Removal of CBLAS and pieces of LAPACK from PLASMA’s source tree. BLAS, CBLAS, LAPACK and the Netlib LAPACK C Wrapper become PLASMA’s software dependencies, required prior to the installation of PLASMA
 * Installer capable of downloading from Netlib and installing missing components of PLASMA’s software stack (BLAS, CBLAS, LAPACK, LAPACK C Wrapper)
 * Complete set of Level 3 BLAS routines for matrices stored in tile layout

2.1.0, November 15th, 2009

 * Native support for Microsoft Windows using WinThreads
 * Support for Make and CMake build systems
 * Performance-optimized mixed-precision routine for the solution of linear systems of equations using the LU factorization
 * Initial timing code (PLASMA_dgesv only)
 * Release notes

2.0.0, July 4th, 2009

 * Support for real and complex arithmetic in single and double precision
 * Generation and application of the Q matrix from the QR and LQ factorizations
 * Prototype of a mixed-precision routine for the solution of linear systems of equations using the LU factorization (not optimized for performance)
 * Simple interface and native interface
 * Major code cleanup and restructuring
 * Redesigned workspace allocation
 * LAPACK testing
 * Examples
 * Thread safety
 * Python installer
 * Documentation: Installation Guide, Users' Guide with routine reference and an HTML code browser, a guide on running PLASMA with the TAU package, an initial draft of the Contributors' Guide, a README file and a LICENSE file

1.0.0, November 15th, 2008

 * Double precision routines for the solution of linear systems of equations and least squares problems using the Cholesky, LU, QR and LQ factorizations
__________________________________________________________________

Last updated 2015-12-03 10:08:41 MST

References

 1. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1684&p=2374#p2374
 2. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=1680
 3. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=108
 4. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=59
 5. http://icl.cs.utk.edu/plasma/forum/viewtopic.php?f=2&t=60