News and Announcements
BDEC in HPCwire

The Big Data and Extreme Scale Computing (BDEC) effort was recently mentioned in HPCwire as BDEC co-PI Pete Beckman weighed in. The BDEC workshop series, organized by ICL and our partners, is premised on the need to systematically map out the ways in which the major issues associated with Big Data intersect and interact with plans for achieving Exascale computing.
The most recent BDEC activities were a BoF at SC15 and a planning meeting at ISC15. In HPCwire, Pete Beckman talks about the BDEC effort moving forward, how big data and extreme scale computing have similar needs, and how the convergence of extreme scale computing and big data poses huge challenges while also providing great opportunities for application scientists, engineers, and the HPC community at large. Read the entire HPCwire feature here.
ICL Winter Grads

Yulu Jia (left) earned his PhD in Computer Science with a minor in Computational Science last fall, and the University formally presented his diploma on December 10, 2015. Yulu is now working for Intel. Tingxing Dong (right) also earned his PhD in Computer Science with a minor in Computational Science and graduated on December 10th. Tingxing is now working at AMD in Austin, TX. Congratulations to Yulu and Tingxing!
Recent Releases
2016 ICL Annual Report
For fifteen years, ICL has produced an annual report to provide a concise profile of our research, including information about the people and external organizations who make it all happen. Please download a copy and check it out.
MAGMA 2.0 Beta 3 Released
MAGMA 2.0 Beta 3 is now available. MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards, e.g., LAPACK and BLAS, to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs/coprocessors to deliver the fastest possible time to accurate solution within given energy constraints.
MAGMA 2.0 Beta 3 includes a major interface change for all MAGMA BLAS functions; most higher-level functions, such as magma_zgetrf, retain their existing interfaces.
Significant changes include:
- Added a queue argument to magmablas routines and deprecated magmablas{Set,Get}KernelStream, resolving a thread-safety issue caused by the global kernel stream (see the sketch after this list).
- Fixed bugs related to relying on CUDA NULL stream implicit synchronization.
- Fixed memory leaks (zunmqr_m, zheevdx_2stage, etc.). Added -DDEBUG_MEMORY option to catch leaks.
- Fixed geqrf*_gpu bugs for m == nb, n >> m (ex: -N 64,10000); and m >> n, n == nb+i (ex: -N 10000,129).
- Fixed zunmql2_gpu for rectangular sizes.
- Fixed zhegvdx_m itype 3.
- Added zunglq, zungbr, zgeadd2 (which takes both alpha and beta).
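To illustrate the interface change, here is a minimal sketch of the new calling convention. It is not taken from the release notes: the header name, the device argument to magma_queue_create, and the trailing queue argument of magmablas_dgemm are assumptions based on the 2.x interface and should be checked against the MAGMA documentation.

```c
/* Minimal sketch of passing an explicit queue to a magmablas routine,
 * replacing the deprecated global magmablas{Set,Get}KernelStream.
 * Assumptions: header name and exact signatures follow the 2.x interface;
 * magma_init() is assumed to have been called already. */
#include "magma_v2.h"

void gemm_on_queue( magma_int_t m, magma_int_t n, magma_int_t k,
                    magmaDouble_ptr dA, magma_int_t ldda,
                    magmaDouble_ptr dB, magma_int_t lddb,
                    magmaDouble_ptr dC, magma_int_t lddc )
{
    magma_queue_t queue;
    magma_queue_create( 0, &queue );     /* create a queue bound to device 0 */

    /* the queue is now an explicit, final argument of the magmablas call */
    magmablas_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                     1.0, dA, ldda,
                          dB, lddb,
                     0.0, dC, lddc, queue );

    magma_queue_sync( queue );           /* wait for the GEMM to complete */
    magma_queue_destroy( queue );
}
```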
MAGMA sparse changes:
- Added QMR, TFQMR, preconditioned TFQMR.
- Added CGS, preconditioned CGS.
- Added kernel-fused versions of CGS/PCGS, QMR, and TFQMR/PTFQMR.
- Changed relative stopping criterion to be relative to RHS.
- Fixed bug in complex version of CG.
- Added an accelerated version of Jacobi-CG.
- Added a very efficient IDR solver.
- Performance tuning for SELLP SpMV.
Note: the Windows CMake port has not yet been updated for the new Makefile structure. It will be available in the final release.
Visit the MAGMA software page to download the tarball.
PaRSEC / DPLASMA 2.0.0 RC2
PaRSEC / DPLASMA 2.0.0 RC2 is now available. PaRSEC (Parallel Runtime Scheduling and Execution Controller) is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider are expressed as a Directed Acyclic Graph (DAG) of tasks, with edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried to discover data dependencies in a totally distributed fashion, a drastic shift from today’s programming models, which are based on a sequential flow of execution.
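To make the idea of a compact, problem-size-independent DAG concrete, here is a purely hypothetical C sketch (it is not PaRSEC’s API or JDF notation): each task’s dependencies are computed on demand from its index, so the graph can be queried anywhere without ever being materialized.

```c
/* Hypothetical illustration only: a chain DAG of n tasks in which task i
 * depends on task i-1. The dependency is a formula over the task index,
 * so no explicit graph of n nodes is ever built or communicated. */
#include <stdio.h>

/* Return the predecessor of task i, or -1 if it has none. */
static int predecessor( int i )
{
    return ( i > 0 ) ? i - 1 : -1;
}

int main( void )
{
    int i = 42;  /* any process can query any task's dependencies locally */
    printf( "task %d depends on task %d\n", i, predecessor( i ) );
    return 0;
}
```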
RC2 includes many new additions. Some of the most notable are the new profiling interface with accessors for Python and R, tight integration with PAPI to extract all types of hardware counters, and the addition of a tasklet insertion interface.
Visit the PaRSEC software page to download the tarball.
PLASMA 2.8.0 Released
PLASMA 2.8.0 is now available! PLASMA (Parallel Linear Algebra Software for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing, designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. Currently, PLASMA offers a collection of routines for solving linear systems of equations, least squares problems, eigenvalue problems, and singular value problems.
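As a quick illustration of the LAPACK-style interface the 2.x series exposes, here is a minimal sketch that solves a symmetric positive definite system; it is not an excerpt from the release, and the initialization call and PLASMA_dposv signature are assumptions that should be verified against the PLASMA documentation.

```c
/* Minimal sketch of PLASMA's LAPACK-style interface (2.x series).
 * Assumptions: PLASMA_Init/PLASMA_Finalize and the PLASMA_dposv signature
 * shown here; error handling is omitted for brevity. */
#include <plasma.h>

int solve_spd( int n, int nrhs, double *A, int lda, double *B, int ldb )
{
    PLASMA_Init( 4 );                              /* start 4 worker threads     */
    int info = PLASMA_dposv( PlasmaUpper, n, nrhs, /* Cholesky-based solve of    */
                             A, lda, B, ldb );     /* A * X = B, with A SPD      */
    PLASMA_Finalize();
    return info;                                   /* 0 on success, as in LAPACK */
}
```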
PLASMA 2.8.0 integrates the following changes:
- Fix a synchronization problem in STEDC functions.
- Reduce the amount of computation performed in the UNGQR/UNGLQ family of routines by taking advantage of the identity structure; it is no longer necessary to initialize Q to the identity before calling these functions.
- New routine, PLASMA_[sdcz]lascal[_Tile[_Async]], similar to ScaLAPACK’s p[sdcz]lascal, to scale a matrix by a constant factor. Unlike LAPACK’s zlascl, this function does not guard against numerical overflow/underflow.
- New routines, PLASMA_[sdcz]geadd[_Tile[_Async]] and PLASMA_[sdcz]tradd[_Tile[_Async]], similar to ScaLAPACK’s p[sdcz]geadd and p[sdcz]tradd, to add two general or trapezoidal matrices (see the note after this list).
- Add functions to the API that allow users to mix asynchronous PLASMA calls with their own kernels submitted to the QUARK runtime system.
- Update the LAPACKE interface to 3.6.0.
- Fix a bug in the Frobenius norm.
- Add a missing check on the alignment of descriptors with tiles; misalignment could cause unreported problems for users of sub-descriptors, especially with recursive algorithms.
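For reference, the ScaLAPACK analogues cited above perform the following operations, which the new PLASMA routines mirror; the exact scalar and transpose conventions here are assumptions rather than excerpts from the release notes.

```latex
% p[sdcz]lascal: scale a matrix by a constant factor (no overflow/underflow guard)
A \leftarrow \alpha A
% p[sdcz]geadd / p[sdcz]tradd: scaled addition of general / trapezoidal matrices
C \leftarrow \beta C + \alpha \operatorname{op}(A), \qquad
\operatorname{op}(A) \in \{ A,\; A^{T},\; A^{H} \}
```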
Visit the PLASMA software page to download the tarballs.
PAPI 5.4.3 Released
PAPI 5.4.3 is now available. PAPI (the Performance API) provides simultaneous access to performance counters on CPUs, GPUs, and other components of interest (e.g., network and I/O systems). Provided as a linkable library or shared object, PAPI can be called directly in a user program, or used transparently through a variety of third-party tools, making it a de facto standard for hardware counter analysis. Industry liaisons with Bull, Cray, Intel, IBM, NVIDIA, and others ensure seamless integration of PAPI with new architectures at or near their release.
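As noted above, PAPI can be linked into and called directly from a user program. The following minimal sketch reads two preset counters around a region of interest; it is not part of the release, the presets used may not be available on every CPU, and error checking is omitted.

```c
/* Minimal sketch of calling PAPI directly from a user program.
 * The preset events used here (PAPI_TOT_INS, PAPI_TOT_CYC) may not exist
 * on every CPU; all return codes should be checked in real code. */
#include <stdio.h>
#include <papi.h>

int main( void )
{
    int evset = PAPI_NULL;
    long long counts[2];

    PAPI_library_init( PAPI_VER_CURRENT );
    PAPI_create_eventset( &evset );
    PAPI_add_named_event( evset, "PAPI_TOT_INS" );  /* instructions completed */
    PAPI_add_named_event( evset, "PAPI_TOT_CYC" );  /* total cycles           */

    PAPI_start( evset );
    /* ... region of interest ... */
    PAPI_stop( evset, counts );

    printf( "instructions: %lld  cycles: %lld\n", counts[0], counts[1] );
    return 0;
}
```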
PAPI 5.4.3 includes some new implementations of components and tools, some general enhancements, and a number of bug fixes.
New Implementations:
- libmsr component: A new component uses LLNL’s libmsr library to access Intel’s RAPL (Running Average Power Limit) interface, adding power-capping capabilities to PAPI.
- CUDA PC sampling: A new standalone CUDA sampling tool (papi_cuda_sampling) has been added to the CUDA component (components/cuda/sampling/) and can be used as a preloader to perform PC sampling on NVIDIA GPUs which support the CUPTI sampling interface (e.g., Maxwell).
- ARM Cortex A53 support: Event definitions added.
Enhancements:
- Added Haswell-EP uncore support.
- Initial Broadwell, Skylake support.
- Added a general CUDA example (components/cuda/test) that uses LD_PRELOAD to attach to a running CUcontext.
- Added a “-check” flag to papi_avail and papi_native_avail to test counter availability and validity, unifying previous flags.
Information about the bug fixes can be found in the release notes and the ChangeLogP543.txt document.
Last but not least, the PAPI team would like to extend a special thank you to all of their collaborators and contributors! PAPI would not be successful without your help!
Visit the PAPI software page to download the tarball.
Interview

Vince Weaver
Where are you from, originally?
I grew up in Joppatowne, Maryland, which is a suburb just northeast of Baltimore.
Can you summarize your educational background?
I obtained a BS in Electrical Engineering in 2000 from the University of Maryland, College Park. After a brief stint in industry that was cut short by the dot-com implosion, I started grad school at Cornell University (in the beautiful Finger Lakes region of New York), where I obtained my MS and PhD in Computer Engineering. My PhD research started out studying ways to address the memory wall problem in modern computer designs, but ended up being about using hardware performance counters to validate the (sadly fairly awful) accuracy and methodology used in academic cycle-accurate simulators.
How did you get introduced to ICL?
I don’t think it’s possible to do any research in the HPC arena without coming across one of the projects run by ICL. Besides the TOP500 list, I think I very quickly learned about PAPI once I started using hardware performance counters.
What did you work on during your time at ICL?
I mostly worked with the performance group developing PAPI. We had some major milestones, including the transition to the perf_event interface and libpfm4, as well as work on RAPL energy measurements and VM/cloud concerns.
What are some of your favorite memories from your time at ICL?
I always thought it was funny in these interviews how much people missed the Friday lunch talks, but I have to admit that in retrospect that was a big highlight.
Tell us where you are and what you’re doing now.
I’m currently an Assistant Professor of Electrical and Computer Engineering at the University of Maine. Assistant Professor means I’m tenure track, so I have two years left to make the case that they should keep me around.
I teach classes in Embedded Systems, Operating Systems, Computer Architecture, and Cluster Computing. My research includes work I started at ICL (I still contribute to the PAPI project, and I am on an NSF grant along with ICL to extend PAPI). Other research includes trying to keep the Linux perf_event interface documented (I am the author of the perf_event_open() Linux manpage), keeping the Linux perf_event interface bug-free (I have written a system call fuzzer, perf_fuzzer, that has found numerous Linux bugs), and various projects involving power and performance analysis of both embedded systems and high-performance computers.
Maine is a pretty place to live. The University is in Orono, just outside of Bangor, and only about an hour’s drive from Acadia National Park. On a clear day you can see Mount Katahdin, the northern end of the Appalachian Trail, from near my house. Last year’s winter was record-breaking in both bitter cold and snowfall, which has made this winter’s much milder weather seem pleasant by comparison. There are some things you have to get used to, such as the elementary schools sending kids outdoors for recess in temperatures down to 10 degrees F (-12 C), and the kids bringing their sleds to school to go sledding at recess.
Even in Maine it’s hard to get away from Tennessee. The professor whose office is next door to mine grew up in Oak Ridge and earned his PhD at UT. Also, you can have interesting experiences, such as dropping off your child at kindergarten on a freezing February morning and having someone yell “Go Vols!” at you from across the parking lot because you’re wearing an orange UT hat.
In what ways did working at ICL prepare you for what you do now, if at all?
Working at ICL was a great way to make connections with all of the other researchers in the HPC world.
Tell us something about yourself that might surprise some people.
In my office I’ve built a replica of the time circuits and flux capacitor from the Back to the Future movies, powered by a Raspberry Pi computer.