News and Announcements
BDEC in HPCwire

The Big Data and Extreme Scale Computing (BDEC) effort was recently mentioned in HPCwire as BDEC co-PI Pete Beckman weighed in. The BDEC workshop series, organized by ICL and our partners, is premised on the need to systematically map out the ways in which the major issues associated with Big Data intersect and interact with plans for achieving Exascale computing.
The most recent BDEC activities were a BoF at SC15 and a planning meeting at ISC15. In HPCwire, Pete Beckman talks about the BDEC effort moving forward, how big data and extreme scale computing have similar needs, and how the convergence of extreme scale computing and big data poses huge challenges while also providing great opportunities for application scientists, engineers, and the HPC community at large. Read the entire HPCwire feature here.
ICL Winter Grads

Yulu Jia (left) earned his PhD in Computer Science with a minor in Computational Science last fall, and the University formally presented his diploma on December 10, 2015. Yulu is now working for Intel. Tingxing Dong (right) also earned his PhD in Computer Science with a minor in Computational Science and graduated on December 10th. Tingxing is now working at AMD in Austin, TX. Congratulations to Yulu and Tingxing!
Recent Releases
2016 ICL Annual Report
For fifteen years, ICL has produced an annual report to provide a concise profile of our research, including information about the people and external organizations who make it all happen. Please download a copy and check it out.
MAGMA 2.0 Beta 3 Released
MAGMA 2.0 Beta 3 is now available. MAGMA (Matrix Algebra on GPU and Multicore Architectures) is a collection of next generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards, e.g., LAPACK and BLAS, to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures. MAGMA allows applications to fully exploit the power of current heterogeneous systems of multi/many-core CPUs and multi-GPUs/coprocessors to deliver the fastest possible time to accurate solution within given energy constraints.
MAGMA 2.0 Beta 3 includes a major interface change for all MAGMA BLAS functions; most higher-level functions, such as magma_zgetrf, retain their existing interfaces.
Significant changes include:
- Added a queue argument to magmablas routines and deprecated magmablas{Set,Get}KernelStream, resolving a thread-safety issue caused by the global kernel stream (see the sketch after this list).
- Fixed bugs related to relying on CUDA NULL stream implicit synchronization.
- Fixed memory leaks (zunmqr_m, zheevdx_2stage, etc.). Added -DDEBUG_MEMORY option to catch leaks.
- Fixed geqrf*_gpu bugs for m == nb, n >> m (ex: -N 64,10000); and m >> n, n == nb+i (ex: -N 10000,129).
- Fixed zunmql2_gpu for rectangular sizes.
- Fixed zhegvdx_m itype 3.
- Added zunglq, zungbr, zgeadd2 (which takes both alpha and beta).
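To illustrate the interface change, here is a minimal sketch of the new calling convention. It is not taken from the release notes: the header name, the device argument to magma_queue_create, and the trailing queue argument of magmablas_dgemm are assumptions based on the 2.x interface and should be checked against the MAGMA documentation.

```c
/* Minimal sketch of passing an explicit queue to a magmablas routine,
 * replacing the deprecated global magmablas{Set,Get}KernelStream.
 * Assumptions: header name and exact signatures follow the 2.x interface;
 * magma_init() is assumed to have been called already. */
#include "magma_v2.h"

void gemm_on_queue( magma_int_t m, magma_int_t n, magma_int_t k,
                    magmaDouble_ptr dA, magma_int_t ldda,
                    magmaDouble_ptr dB, magma_int_t lddb,
                    magmaDouble_ptr dC, magma_int_t lddc )
{
    magma_queue_t queue;
    magma_queue_create( 0, &queue );     /* create a queue bound to device 0 */

    /* the queue is now an explicit, final argument of the magmablas call */
    magmablas_dgemm( MagmaNoTrans, MagmaNoTrans, m, n, k,
                     1.0, dA, ldda,
                          dB, lddb,
                     0.0, dC, lddc, queue );

    magma_queue_sync( queue );           /* wait for the GEMM to complete */
    magma_queue_destroy( queue );
}
```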
MAGMA sparse changes:
- Added QMR, TFQMR, preconditioned TFQMR.
- Added CGS, preconditioned CGS.
- Added kernel-fused versions of CGS/PCGS, QMR, and TFQMR/PTFQMR.
- Changed relative stopping criterion to be relative to RHS.
- Fixed bug in complex version of CG.
- Added an accelerated version of Jacobi-CG.
- Added a very efficient IDR solver.
- Performance tuning for SELLP SpMV.
Note: the Windows CMake port has not yet been updated for the new Makefile structure. It will be available in the final release.
Visit the MAGMA software page to download the tarball.
PaRSEC / DPLASMA 2.0.0 RC2
PaRSEC / DPLASMA 2.0.0 RC2 is now available. PaRSEC (Parallel Runtime Scheduling and Execution Controller) is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider are expressed as a Directed Acyclic Graph (DAG) of tasks, with edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried to discover data dependencies in a totally distributed fashion, a drastic shift from today’s programming models, which are based on a sequential flow of execution.
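To make the idea of a compact, problem-size-independent DAG concrete, here is a purely hypothetical C sketch (it is not PaRSEC’s API or JDF notation): each task’s dependencies are computed on demand from its index, so the graph can be queried anywhere without ever being materialized.

```c
/* Hypothetical illustration only: a chain DAG of n tasks in which task i
 * depends on task i-1. The dependency is a formula over the task index,
 * so no explicit graph of n nodes is ever built or communicated. */
#include <stdio.h>

/* Return the predecessor of task i, or -1 if it has none. */
static int predecessor( int i )
{
    return ( i > 0 ) ? i - 1 : -1;
}

int main( void )
{
    int i = 42;  /* any process can query any task's dependencies locally */
    printf( "task %d depends on task %d\n", i, predecessor( i ) );
    return 0;
}
```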
RC2 includes many new additions. Some of the most notable are the new profiling interface with accessors for Python and R, tight integration with PAPI to extract all types of hardware counters, and the addition of a tasklet insertion interface.
Visit the PaRSEC software page to download the tarball.
PLASMA 2.8.0 Released
PLASMA 2.8.0 is now available! PLASMA (Parallel Linear Algebra Software for Multicore Architectures) is a dense linear algebra package at the forefront of multicore computing, designed to deliver the highest possible performance from a system with multiple sockets of multicore processors. PLASMA achieves this objective by combining state-of-the-art solutions in parallel algorithms, scheduling, and software engineering. Currently, PLASMA offers a collection of routines for solving linear systems of equations, least squares problems, eigenvalue problems, and singular value problems.
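As a quick illustration of the LAPACK-style interface the 2.x series exposes, here is a minimal sketch that solves a symmetric positive definite system; it is not an excerpt from the release, and the initialization call and PLASMA_dposv signature are assumptions that should be verified against the PLASMA documentation.

```c
/* Minimal sketch of PLASMA's LAPACK-style interface (2.x series).
 * Assumptions: PLASMA_Init/PLASMA_Finalize and the PLASMA_dposv signature
 * shown here; error handling is omitted for brevity. */
#include <plasma.h>

int solve_spd( int n, int nrhs, double *A, int lda, double *B, int ldb )
{
    PLASMA_Init( 4 );                              /* start 4 worker threads     */
    int info = PLASMA_dposv( PlasmaUpper, n, nrhs, /* Cholesky-based solve of    */
                             A, lda, B, ldb );     /* A * X = B, with A SPD      */
    PLASMA_Finalize();
    return info;                                   /* 0 on success, as in LAPACK */
}
```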
PLASMA 2.8.0 integrates the following changes:
- Fix a synchronization problem in STEDC functions.
- Reduce the amount of computation performed in the UNGQR/UNGLQ family of routines by taking advantage of the identity structure; it is no longer necessary to initialize Q to the identity before calling these functions.
- New routine, PLASMA_[sdcz]lascal[_Tile[_Async]], similar to ScaLAPACK’s p[sdcz]lascal, to scale a matrix by a constant factor. Unlike LAPACK’s zlascl, this function does not guard against numerical overflow/underflow.
- New routines, PLASMA_[sdcz]geadd[_Tile[_Async]] and PLASMA_[sdcz]tradd[_Tile[_Async]], similar to ScaLAPACK’s p[sdcz]geadd and p[sdcz]tradd, to add two general or trapezoidal matrices (see the note after this list).
- Add functions to the API that allow users to mix asynchronous PLASMA calls with their own kernels submitted to the QUARK runtime system.
- Update the LAPACKE interface to 3.6.0.
- Fix a bug in the Frobenius norm.
- Add a missing check on the alignment of descriptors with tiles; misalignment could cause unreported problems for users of sub-descriptors, especially with recursive algorithms.
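For reference, the ScaLAPACK analogues cited above perform the following operations, which the new PLASMA routines mirror; the exact scalar and transpose conventions here are assumptions rather than excerpts from the release notes.

```latex
% p[sdcz]lascal: scale a matrix by a constant factor (no overflow/underflow guard)
A \leftarrow \alpha A
% p[sdcz]geadd / p[sdcz]tradd: scaled addition of general / trapezoidal matrices
C \leftarrow \beta C + \alpha \operatorname{op}(A), \qquad
\operatorname{op}(A) \in \{ A,\; A^{T},\; A^{H} \}
```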
Visit the PLASMA software page to download the tarballs.
PAPI 5.4.3 Released
PAPI 5.4.3 is now available. PAPI (the Performance API) provides simultaneous access to performance counters on CPUs, GPUs, and other components of interest (e.g., network and I/O systems). Provided as a linkable library or shared object, PAPI can be called directly in a user program, or used transparently through a variety of third-party tools, making it a de facto standard for hardware counter analysis. Industry liaisons with Bull, Cray, Intel, IBM, NVIDIA, and others ensure seamless integration of PAPI with new architectures at or near their release.
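As noted above, PAPI can be linked into and called directly from a user program. The following minimal sketch reads two preset counters around a region of interest; it is not part of the release, the presets used may not be available on every CPU, and error checking is omitted.

```c
/* Minimal sketch of calling PAPI directly from a user program.
 * The preset events used here (PAPI_TOT_INS, PAPI_TOT_CYC) may not exist
 * on every CPU; all return codes should be checked in real code. */
#include <stdio.h>
#include <papi.h>

int main( void )
{
    int evset = PAPI_NULL;
    long long counts[2];

    PAPI_library_init( PAPI_VER_CURRENT );
    PAPI_create_eventset( &evset );
    PAPI_add_named_event( evset, "PAPI_TOT_INS" );  /* instructions completed */
    PAPI_add_named_event( evset, "PAPI_TOT_CYC" );  /* total cycles           */

    PAPI_start( evset );
    /* ... region of interest ... */
    PAPI_stop( evset, counts );

    printf( "instructions: %lld  cycles: %lld\n", counts[0], counts[1] );
    return 0;
}
```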
PAPI 5.4.3 includes some new implementations of components and tools, some general enhancements, and a number of bug fixes.
New Implementations:
- libmsr component: A new component uses LLNL’s libmsr library to access Intel’s RAPL (Running Average Power Limit) interface, adding power-capping capabilities to PAPI.
- CUDA PC sampling: A new standalone CUDA sampling tool (papi_cuda_sampling) has been added to the CUDA component (components/cuda/sampling/) and can be used as a preloader to perform PC sampling on NVIDIA GPUs which support the CUPTI sampling interface (e.g., Maxwell).
- ARM Cortex A53 support: Event definitions added.
Enhancements:
- Added Haswell-EP uncore support.
- Initial Broadwell, Skylake support.
- Added a general CUDA example (components/cuda/test) that uses LD_PRELOAD to attach to a running CUcontext.
- Added a “-check” flag to papi_avail and papi_native_avail to test counter availability and validity, unifying previous flags.
Information about the bug fixes can be found in the release notes and the ChangeLogP543.txt document.
Last but not least, the PAPI team would like to extend a special thank you to all of their collaborators and contributors! PAPI would not be successful without your help!
Visit the PAPI software page to download the tarball.
Interview

Vince Weaver
Where are you from, originally?
I grew up in Joppatowne, Maryland, which is a suburb just northeast of Baltimore.
Can you summarize your educational background?
I obtained a BS in Electrical Engineering in 2000 from the University of Maryland, College Park. After a brief stint in industry that was cut short by the dot-com implosion, I started grad school at Cornell University (in the beautiful Finger Lakes region of New York), where I obtained my MS and PhD in Computer Engineering. My PhD research started out studying ways to address the memory wall problem in modern computer designs, but ended up being about using hardware performance counters to validate the (sadly fairly awful) accuracy and methodology used in academic cycle-accurate simulators.
How did you get introduced to ICL?
I don’t think it’s possible to do any research in the HPC arena without coming across one of the projects run by ICL. Besides the TOP500 list, I think I very quickly learned about PAPI once I started using hardware performance counters.
What did you work on during your time at ICL?
I mostly worked with the performance group developing PAPI. We had some major milestones, including the transition to the perf_event interface and libpfm4, as well as work on RAPL energy measurements and VM/cloud concerns.
What are some of your favorite memories from your time at ICL?
I always thought it was funny in these interviews how much people missed the Friday lunch talks, but I have to admit that in retrospect that was a big highlight.
Tell us where you are and what you’re doing now.
I’m currently an Assistant Professor of Electrical and Computer Engineering at the University of Maine. Assistant Professor means I’m tenure track, so I have two years left to make the case that they should keep me around.
I teach classes in Embedded Systems, Operating Systems, Computer Architecture, and Cluster Computing. My research includes work I started at ICL (I still contribute to the PAPI project, and I am on an NSF grant along with ICL to extend PAPI). Other research includes trying to keep the Linux perf_event interface documented (I am the author of the perf_event_open() Linux manpage), keeping the Linux perf_event interface bug-free (I have written a system call fuzzer, perf_fuzzer, that has found numerous Linux bugs), and various projects involving power and performance analysis of both embedded systems and high-performance computers.
Maine is a pretty place to live. The University is in Orono, just outside of Bangor, and only about an hour’s drive from Acadia National Park. On a clear day you can see Mount Katahdin, the northern end of the Appalachian Trail, from near my house. Last year’s winter was record-breaking in both bitter cold and snowfall, which has made this winter’s much milder weather seem pleasant by comparison. There are some things you have to get used to, such as the elementary schools sending kids outdoors for recess in temperatures down to 10 degrees F (-12 C), and the kids bringing their sleds to school to go sledding at recess.
Even in Maine it’s hard to get away from Tennessee. The professor whose office is next door to mine grew up in Oak Ridge and earned his PhD at UT. Also, you can have interesting experiences, such as dropping off your child at kindergarten on a freezing February morning and having someone yell “Go Vols!” at you from across the parking lot because you’re wearing an orange UT hat.
In what ways did working at ICL prepare you for what you do now, if at all?
Working at ICL was a great way to make connections with all of the other researchers in the HPC world.
Tell us something about yourself that might surprise some people.
In my office I’ve built a replica of the time circuits and flux capacitor from the Back to the Future movies, powered by a Raspberry Pi computer.