News and Announcements

BDEC in HPCwire

The Big Data and Extreme Scale Computing (BDEC) effort was recently featured in HPCwire, with BDEC co-PI Pete Beckman weighing in. The BDEC workshop series, organized by ICL and our partners, is premised on the need to systematically map out the ways in which the major issues associated with Big Data intersect and interact with plans for achieving Exascale computing.

The most recent BDEC activities were a BoF at SC15 and a planning meeting at ISC15. In HPCwire, Pete Beckman talks about the BDEC efforts moving forward, how big data and extreme scale computing have similar needs, and how the convergence of extreme scale computing and big data poses huge challenges while also providing great opportunities for application scientists, engineers, and the HPC community at large. Read the entire HPCwire feature here.

ICL Winter Grads

[Photo: winter graduates Yulu Jia (left) and Tingxing Dong (right) with ICL Director Jack Dongarra]

Yulu Jia (left) earned his PhD in Computer Science with a minor in Computational Science last fall, and the University formally presented his diploma on December 10, 2015. Yulu is now working for Intel. Tingxing Dong (right) also earned his PhD in Computer Science with a minor in Computational Science and graduated on December 10th. Tingxing is now working at AMD in Austin, TX. Congratulations to Yulu and Tingxing!

Recent Releases

2016 ICL Annual Report

For fifteen years, ICL has produced an annual report to provide a concise profile of our research, including information about the people and external organizations who make it all happen. Please download a copy and check it out.

MAGMA 2.0 Beta 3 Released

MAGMA 2.0 Beta 3 is now available.  MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards, e.g., LAPACK and BLAS, to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs/coprocessors to deliver the fastest possible time to accurate solution within given energy constraints.

MAGMA 2.0 Beta 3 includes a major interface change for all MAGMA BLAS functions; most higher-level functions, such as magma_zgetrf, have not changed their interfaces.

Significant changes include:

  • Added a queue argument to magmablas routines and deprecated magmablas{Set,Get}KernelStream. This resolves a thread-safety issue caused by the global magmablas{Set,Get}KernelStream (see the sketch after this list).
  • Fixed bugs related to relying on CUDA NULL stream implicit synchronization.
  • Fixed memory leaks (zunmqr_m, zheevdx_2stage, etc.). Added -DDEBUG_MEMORY option to catch leaks.
  • Fixed geqrf*_gpu bugs for m == nb, n >> m (ex: -N 64,10000); and m >> n, n == nb+i (ex: -N 10000,129).
  • Fixed zunmql2_gpu for rectangular sizes.
  • Fixed zhegvdx_m itype 3.
  • Added zunglq, zungbr, zgeadd2 (which takes both alpha and beta).
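
A minimal sketch of the new queue-based calling convention, assuming the MAGMA 2.x C API (exact names and signatures may differ slightly; consult the MAGMA documentation). The explicit queue replaces the old global magmablasSetKernelStream call:

    #include "magma_v2.h"   /* MAGMA 2.x header */

    /* Sketch: C = alpha*A*B + beta*C on the GPU; dA, dB, dC are device
     * pointers that have already been allocated and filled. */
    void gemm_with_queue( magma_int_t m, magma_int_t n, magma_int_t k,
                          double alpha, magmaDouble_ptr dA, magma_int_t ldda,
                                        magmaDouble_ptr dB, magma_int_t lddb,
                          double beta,  magmaDouble_ptr dC, magma_int_t lddc )
    {
        magma_queue_t queue;
        magma_queue_create( 0, &queue );    /* 0 = device id; replaces the old
                                               global magmablasSetKernelStream */

        magmablas_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                         alpha, dA, ldda, dB, lddb,
                         beta,  dC, lddc, queue );   /* queue passed explicitly */

        magma_queue_sync( queue );
        magma_queue_destroy( queue );
    }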

MAGMA sparse changes:

  • Added QMR, TFQMR, preconditioned TFQMR.
  • Added CGS, preconditioned CGS.
  • Added kernel-fused versions of CGS/PCGS, QMR, and TFQMR/PTFQMR.
  • Changed the relative stopping criterion to be relative to the RHS (see the sketch after this list).
  • Fixed a bug in the complex version of CG.
  • Added an accelerated version of Jacobi-CG.
  • Added a very efficient IDR implementation.
  • Performance tuning for SELLP SpMV.
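
To illustrate the new stopping criterion: the sparse iterative solvers now stop when the residual is small relative to the right-hand side b, rather than relative to the initial residual. A minimal sketch of the test (the helper names are illustrative, not part of the MAGMA API):

    #include <math.h>

    /* Euclidean norm of a length-n vector (illustrative helper). */
    static double norm2( const double *v, int n )
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += v[i] * v[i];
        return sqrt( s );
    }

    /* Stop when ||r_k|| = ||b - A*x_k|| <= tol * ||b||. */
    static int converged( const double *r, const double *b, int n, double tol )
    {
        return norm2( r, n ) <= tol * norm2( b, n );
    }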

Note: the Windows CMake build has not yet been ported to the new Makefile structure. It will be available in the final release.

Visit the MAGMA software page to download the tarball.

PaRSEC / DPLASMA 2.0.0 RC2

PaRSEC / DPLASMA 2.0.0 RC2 is now available. PaRSEC (Parallel Runtime Scheduling and Execution Controller) is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider are expressed as a Directed Acyclic Graph (DAG) of tasks, with edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried to discover data dependencies in a totally distributed fashion, a drastic shift from today's programming models, which are based on a sequential flow of execution.
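
PaRSEC has its own compact, problem-size-independent task-graph representation, but the underlying idea of tasks linked by data dependencies can be illustrated with standard OpenMP depend clauses. This is a conceptual sketch only, not PaRSEC's interface:

    #include <stdio.h>

    /* Conceptual sketch (OpenMP, not PaRSEC): three tasks forming a small DAG.
     * The depend clauses are the "edges designating data dependencies". */
    int main( void )
    {
        double a = 0.0, b = 0.0, c = 0.0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)                  /* produces a     */
            a = 1.0;
            #pragma omp task depend(out: b)                  /* produces b     */
            b = 2.0;
            #pragma omp task depend(in: a, b) depend(out: c) /* consumes a, b  */
            c = a + b;
            #pragma omp taskwait
        }
        printf( "c = %g\n", c );
        return 0;
    }

Unlike a shared-memory tasking runtime, PaRSEC discovers such dependencies in a totally distributed fashion and schedules the resulting tasks across nodes and accelerators.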

RC2 includes many new additions. Some of the most notable are a new profiling interface with accessors for Python and R, tight integration with PAPI to extract all types of hardware counters, and the addition of a tasklet insertion interface.

Visit the PaRSEC software page to download the tarball.

PLASMA 2.8.0 Released

PLASMA 2.8.0 is now available! PLASMA (Parallel Linear Algebra Software for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing, designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. Currently, PLASMA offers a collection of routines for solving linear systems of equations, least squares problems, eigenvalue problems, and singular value problems.

PLASMA 2.8.0 integrates the following changes:

  • Fixed a synchronization problem in the STEDC functions.
  • Reduced the amount of computation performed in the UNGQR/UNGLQ family of routines by taking advantage of the identity structure; it is no longer required to initialize Q to the identity before calling these functions.
  • Added a new routine, PLASMA_[sdcz]lascal[_Tile[_Async]] (similar to ScaLAPACK's p[sdcz]lascal), to scale a matrix by a constant factor. Unlike LAPACK's zlascl, this function does not handle numerical overflow/underflow.
  • Added new routines, PLASMA_[sdcz]geadd[_Tile[_Async]] and PLASMA_[sdcz]tradd[_Tile[_Async]] (similar to ScaLAPACK's p[sdcz]geadd and p[sdcz]tradd), to add two general or trapezoidal matrices (see the sketch after this list).
  • Added functions to the API that allow users to mix submission of asynchronous PLASMA calls and their own kernels to the QUARK runtime system.
  • Updated the LAPACKE interface to 3.6.0.
  • Fixed a bug in the Frobenius norm.
  • Added a missing check on the alignment of descriptors with tiles; misalignment could cause unreported problems for users of sub-descriptors, especially with recursive algorithms.
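
A sketch of how the new matrix-add routine might be called from C. The argument order shown here is an assumption patterned after ScaLAPACK's p[sdcz]geadd (B := alpha*op(A) + beta*B); check the PLASMA 2.8.0 headers for the exact signature:

    #include <plasma.h>

    /* Hypothetical usage sketch of PLASMA_dgeadd: B := alpha*A + beta*B.
     * The argument list is assumed, not taken from the PLASMA headers. */
    void scale_and_add( int m, int n, double alpha, double *A, int lda,
                        double beta, double *B, int ldb )
    {
        PLASMA_Init( 4 );                        /* initialize PLASMA on 4 cores */
        PLASMA_dgeadd( PlasmaNoTrans, m, n,
                       alpha, A, lda,
                       beta,  B, ldb );          /* assumed argument order */
        PLASMA_Finalize();
    }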

Visit the PLASMA software page to download the tarballs.

PAPI 5.4.3 Released

PAPI 5.4.3 is now available. PAPI (the Performance API) provides simultaneous access to performance counters on CPUs, GPUs, and other components of interest (e.g., network and I/O systems). Provided as a linkable library or shared object, PAPI can be called directly in a user program or used transparently through a variety of third-party tools, making it a de facto standard for hardware counter analysis. Industry liaisons with Bull, Cray, Intel, IBM, NVIDIA, and others ensure seamless integration of PAPI with new architectures at or near their release.
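
A minimal sketch of calling PAPI directly from a user program: count total instructions and cycles around a region of interest. Error checking is abbreviated, and available events vary by CPU (they can be listed with papi_avail):

    #include <stdio.h>
    #include <papi.h>

    int main( void )
    {
        int eventset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init( PAPI_VER_CURRENT ) != PAPI_VER_CURRENT)
            return 1;                            /* header/library mismatch */
        PAPI_create_eventset( &eventset );
        PAPI_add_event( eventset, PAPI_TOT_INS ); /* total instructions */
        PAPI_add_event( eventset, PAPI_TOT_CYC ); /* total cycles       */

        PAPI_start( eventset );
        volatile double x = 0.0;                  /* region to measure  */
        for (int i = 0; i < 1000000; i++) x += i;
        PAPI_stop( eventset, counts );

        printf( "instructions: %lld  cycles: %lld\n", counts[0], counts[1] );
        return 0;
    }

Link the example with -lpapi.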

PAPI 5.4.3 includes some new implementations of components and tools, some general enhancements, and a number of bug fixes.

New Implementations:

  • libmsr component: uses LLNL's libmsr library to access Intel RAPL (Running Average Power Limit), adding power-capping capabilities to PAPI.
  • CUDA PC sampling: a new standalone CUDA sampling tool (papi_cuda_sampling) has been added to the CUDA component (components/cuda/sampling/) and can be used as a preloader to perform PC sampling on NVIDIA GPUs that support the CUPTI sampling interface (e.g., Maxwell).
  • ARM Cortex A53 support: Event definitions added.

Enhancements:

  • Added Haswell-EP uncore support.
  • Initial Broadwell, Skylake support.
  • Added a general CUDA example (components/cuda/test) that uses LD_PRELOAD to attach to a running CUcontext.
  • Added “-check” flag to papi_avail and papi_native_avail to test counter availability/validity, unifying previous flags.

Information about the bug fixes can be found in the release notes and the ChangeLogP543.txt documents.

Last but not least, the PAPI team would like to extend a special thank you to all of their collaborators and contributors! PAPI would not be successful without your help!

Visit the PAPI software page to download the tarball.

Interview

Vince Weaver Then

Vince Weaver

Where are you from, originally?

I grew up in Joppatowne, Maryland, which is a suburb just northeast of Baltimore.

Can you summarize your educational background?

I obtained a BS in Electrical Engineering in 2000 from the University of Maryland, College Park. After a brief stint in industry that was cut short by the dot-com implosion, I started grad school at Cornell University (in the beautiful Finger Lakes region of New York), where I obtained my MS and PhD in Computer Engineering. My PhD research started out studying ways to address the memory wall problem in modern computer designs, but ended up being about using hardware performance counters to validate the (sadly fairly awful) accuracy and methodology used in academic cycle-accurate simulators.

How did you get introduced to ICL?

I don’t think it’s possible to do any research in the HPC arena without coming across one of the projects run by ICL. Besides the TOP500 list, I think I very quickly learned about PAPI once I started using hardware performance counters.

What did you work on during your time at ICL?

I mostly worked with the performance group developing PAPI. We had some major milestones, including the transition to the perf_event interface and libpfm4, as well as work on RAPL energy measurements and VM/cloud concerns.

What are some of your favorite memories from your time at ICL?

I always thought it was funny in these interviews how much people missed the Friday lunch talks, but I have to admit that in retrospect that was a big highlight.

Tell us where you are and what you’re doing now.

I’m currently an Assistant Professor of Electrical and Computer Engineering at the University of Maine. Assistant Professor means I’m tenure track so I have two years left to make the case that they should keep me around.

I teach classes in Embedded Systems, Operating Systems, Computer Architecture, and Cluster Computing. My research includes work I began at ICL (I still contribute to the PAPI project, and I am on an NSF grant along with ICL to extend PAPI). Other research includes trying to keep the Linux perf_event interface documented (I am the author of the perf_event_open() Linux manpage), keeping the Linux perf_event interface bug-free (I have written a system call fuzzer, perf_fuzzer, that has found numerous Linux bugs), as well as various projects involving power and performance analysis of both embedded systems and high performance computers.

Maine is a pretty place to live. The University is in Orono, just outside of Bangor, and only about an hour's drive from Acadia National Park. On a clear day you can see Mount Katahdin, the northern terminus of the Appalachian Trail, from near my house. Last winter was record-breaking in both bitter cold and snowfall, which has made this winter's much milder weather seem pleasant by comparison. There are some things you have to get used to, such as the elementary schools sending the kids outdoors for recess in temperatures down to 10 degrees F (-12 C); the kids bring their sleds to school and go sledding at recess.

Even in Maine it’s hard to get away from Tennessee. The professor whose office is next door to mine grew up in Oak Ridge and earned his PhD at UT. Also, you can have interesting experiences, such as dropping off your child at kindergarten on a freezing February morning and having someone yell “Go Vols!” at you from across the parking lot because you’re wearing an orange UT hat.

In what ways did working at ICL prepare you for what you do now, if at all?

Working at ICL was a great way to make connections with all of the other researchers in the HPC world.

Tell us something about yourself that might surprise some people.

In my office I’ve built a replica of the time circuits and flux capacitor from the Back to the Future movies, powered by a Raspberry Pi computer.

http://www.deater.net/weave/vmwprod/hardware/time_circuit/

Recent Papers

  1. Dongarra, J., M. A. Heroux, and P. Luszczek, "A New Metric for Ranking High-Performance Computing Systems," National Science Review, vol. 3, issue 1, pp. 30-35, January 2016. DOI: 10.1093/nsr/nwv084
  2. Bosilca, G., T. Herault, and J. Dongarra, "Context Identifier Allocation in Open MPI," University of Tennessee Computer Science Technical Report, no. ICL-UT-16-01, Innovative Computing Laboratory, University of Tennessee, January 2016.
  3. Abdelfattah, A., M. Baboulin, V. Dobrev, J. Dongarra, C. Earl, J. Falcou, A. Haidar, I. Karlin, T. Kolev, I. Masliah, et al., "High-Performance Tensor Contractions for GPUs," University of Tennessee Computer Science Technical Report, no. UT-EECS-16-738, University of Tennessee, January 2016.
  4. Herrmann, J., G. Bosilca, T. Herault, L. Marchal, Y. Robert, and J. Dongarra, "Assessing the Cost of Redistribution followed by a Computational Kernel: Complexity and Performance Results," Parallel Computing, vol. 52, pp. 22-41, February 2016. DOI: 10.1016/j.parco.2015.09.005
  5. Dongarra, J., M. A. Heroux, and P. Luszczek, "High Performance Conjugate Gradient Benchmark: A New Metric for Ranking High Performance Computing Systems," International Journal of High Performance Computing Applications, vol. 30, issue 1, pp. 3-10, February 2016. DOI: 10.1177/1094342015593158
  6. Abdelfattah, A., A. Haidar, S. Tomov, and J. Dongarra, "Performance, Design, and Autotuning of Batched GEMM for GPUs," University of Tennessee Computer Science Technical Report, no. UT-EECS-16-739, University of Tennessee, February 2016.
  7. Anzt, H., E. Chow, J. Saak, and J. Dongarra, "Updating Incomplete Factorization Preconditioners for Model Order Reduction," Numerical Algorithms, vol. 73, issue 3, pp. 611-630, February 2016. DOI: 10.1007/s11075-016-0110-2

Recent Conferences

  1. FEB
     NESUS Winter School, Timisoara, Romania
     George Bosilca
  2. FEB
     NSF SI2 PI Meeting, Arlington, Virginia
     Anthony Danalis
  3. FEB
     George Bosilca
  4. FEB
     MPI Forum, Chicago, Illinois
     Aurelien Bouteiller, George Bosilca

Upcoming Conferences

  1. MAR
     Thomas Herault
  2. MAR
     Copper Mountain Conference, Denver, Colorado
     Ichitaro Yamazaki
  3. MAR
     PAPI-EX PI Meeting, Hilton, Tennessee
     Anthony Danalis, Asim YarKhan, Heike Jagode, Phil Mucci
  4. APR
     Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov
  5. APR
     SIAM PP, Paris, France
     Aurelien Bouteiller, George Bosilca, Ichitaro Yamazaki

Recent Lunch Talks

  1. JAN 8
     Yves Robert (INRIA)
     Which Verification for Silent Error Detection?
  2. JAN 14
     David Keffer (UTK Department of Materials Science and Engineering)
     Algorithms for 3D-3D Registration with Known and Unknown References: Applications to Materials Science
  3. JAN 22
     Aurelien Bouteiller
     Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery
  4. JAN 29
     Joe Dorris
     PLASMA OpenMP on Xeon Phi and A Case Study with Cholesky Decomposition
  5. FEB 5
     Mathieu Faverge (INRIA)
     Massively Parallel Cartesian Discrete Ordinates Method for Neutron Transport Simulation
  6. FEB 12
     Hartwig Anzt
     A New Parallel Threshold ILU
  7. FEB 19
     Thomas Herault
     Practical Scalable Consensus for Pseudo Synchronous Distributed Systems
  8. FEB 26
     Peter Liaw (UTK Department of Materials Science and Engineering)

Upcoming Lunch Talks

  1. MAR 4
     Hartwig Anzt
     Solving Sparse Linear Systems on GPUs - The Good, the Bad, and the Ugly
  2. MAR 11
     Ichitaro Yamazaki
     Preconditioning a Communication-avoiding Krylov solver
  3. MAR 18
     Tim Davis (Texas A&M University)
     Sparse Matrix Algorithms: Combinatorics + Numerical Methods + Applications
  4. MAR 18
     Sanjay Ranka (University of Florida)
     A Genetic Algorithm Based Approach for Multi-objective Hardware/Software Co-optimization
  5. MAR 24
     Phil Mucci (Minimal Metrics)
     Systems Performance @ Sandia
  6. APR 1
     Ahmad Abdelfattah
     On the Development of Variable-Size Batched Computation for Heterogeneous Parallel Architectures
  7. APR 8
     Piotr Luszczek
     Search Space Description, Generation, and Pruning System for Autotuners
  8. APR 15
     Miro Stoyanov (ORNL)
     Resilient Solvers for Partial Differential Equations
  9. APR 22
     Chongxiao Cao
     Fault Tolerant Design for a Task-based Runtime
  10. APR 29
     Wei Wu
     Accelerator Integration with Programming Models

People

  1. Mathieu Faverge
    ICL Alum and frequent visitor Mathieu Faverge made his way back to ICL in January and will be staying with us through June. Mathieu will be working with the DisCo and Linear Algebra teams. Welcome back, Mathieu!

Dates to Remember

ICL Winter Reception

The 2016 ICL Winter Reception will be held at Calhoun's on the River on Friday, February 19, from 5:30 to 7:30/8:00 pm.