News and Announcements
DOE Seeks Dongarra’s Input for Exascale

In a recently released DOE report, ICL’s Jack Dongarra—a member of the Department of Energy’s ASCAC Subcommittee on Exascale Computing—worked with other research scientists in HPC and Big Data to study the economic viability of ExaFLOP/s computing within the next ten years, and the changes required in both the hardware and software infrastructure to reach it. As the report points out, many of the technical challenges lie in energy consumption, memory performance, resilience, the extreme number of simultaneous calculations, and the use of big data.
“Exascale is just the next major milestone in a process of exponential improvement that has continued for over half a century,” said Dongarra. “The need to advance our understanding of the universe is without bounds, as is the need for modeling and computing the phenomena around us. For example, everyone is concerned about climate change and we need computers to help in modeling the climate.”
“We will probably reach Exascale computing in the United States around 2022,” continued Dongarra. “The computational challenge for doing oceanic clouds, ice, and topography are all tremendously important. And today we need at least two orders of magnitude improvement on that problem alone.”
Destination Imagination
As many ICLers are probably aware, the Destination Imagination Global Finals descended upon the UTK campus on May 21-24. Among the many elementary and middle school participants was an aspiring programmer and HPC enthusiast from British Columbia named Max Hynes. Max and his dad, Steve, reached out to ICL director Jack Dongarra when they realized that their interests aligned. Max and Steve visited with Jack that Friday afternoon, discussing LINPACK, parallel computing, and the TOP500, among other things.
Afterwards, Max and Steve were given a tour of ICL by Terry Moore and Sam Crawford and took home some cool ICL swag and over an hour of GoPro video as keepsakes. Thanks for stopping by the lab, guys!
Fast Data Analysis with SVD
As part of the Intel Science and Technology Center for Big Data (ISTC), ICL director Jack Dongarra provided his thoughts on Fast Data Analysis with SVD for the ISTC blog. You can read the full entry below.
By Jack Dongarra, University of Tennessee Knoxville
The GenBase benchmark was developed as a collaboration with the Intel Parallel Computing Lab, the Broad Institute and Novartis, and the MIT Database Group. Among the many challenging tests the benchmark includes is the computation of the Singular Value Decomposition (SVD) of a large matrix (tens of thousands of rows and/or columns). The reason for this particular computation is that we can now obtain gene correlation matrices from the lab, but we don’t know how much information is contained in a correlation matrix of size N, so we turn to the SVD, which, among other things, answers this question from the linear algebra standpoint.
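To make that question concrete, here is a small illustrative sketch (not part of GenBase itself) that computes the singular values of a toy correlation matrix with LAPACKE’s dgesvd and counts how many exceed a relative tolerance, i.e., the numerical rank; the 4×4 matrix and the cutoff are stand-ins chosen purely for illustration.

    /* Numerical rank via singular values. Link with -llapacke -llapack -lblas. */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        /* Toy "correlation" matrix: rows 2 and 3 duplicate rows 0 and 1,
         * so only two directions carry independent information. */
        double A[4*4] = { 1.0, 0.8, 1.0, 0.8,
                          0.8, 1.0, 0.8, 1.0,
                          1.0, 0.8, 1.0, 0.8,
                          0.8, 1.0, 0.8, 1.0 };
        double s[4], superb[3], u[1], vt[1];
        lapack_int n = 4;

        /* Singular values only (jobu = jobvt = 'N'); U and V^T are not formed. */
        lapack_int info = LAPACKE_dgesvd(LAPACK_COL_MAJOR, 'N', 'N', n, n,
                                         A, n, s, u, 1, vt, 1, superb);
        if (info != 0) { fprintf(stderr, "dgesvd failed: info = %d\n", (int)info); return 1; }

        int rank = 0;
        double tol = 1e-10 * s[0];          /* relative cutoff; s[0] is the largest value */
        for (int i = 0; i < n; i++)
            if (s[i] > tol) rank++;
        printf("numerical rank = %d out of %d\n", rank, (int)n);
        return 0;
    }

For a real gene correlation matrix, the same count over tens of thousands of singular values tells us how much genuinely independent information the matrix contains.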
The problem is that SVD is a computationally intensive process. To be exact, we need 8/3 N³ operations to compute it. N gets large quickly, especially when working with the human genome, which has some 23,000 genes. Even for N=100,000, there are over 2.5×10¹⁵ floating-point operations to perform, and for double the matrix size, which is not uncommon, there is 8 times as much work (the cubic complexity is at fault here). This of course poses problems because it requires a long computing time.
How long? A modern Intel processor core delivers about 50 Gflop/s (50×10⁹ floating-point operations per second). This is only true for the newest Intel architectures (code-named Haswell and Ivy Town) that feature the AVX2 instruction set. A quick calculation of the time-to-solution reveals that it would take over 14 hours to complete the SVD of a single matrix at this approximate rate. The recent high-end edition of the server-grade Intel Xeon processor (the E5 v2 series) has 15 cores, so we can scale the computation time down to about one hour.
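The arithmetic behind these estimates is easy to reproduce; the back-of-the-envelope sketch below simply restates the numbers already given in the text (8/3 N³ operations, 50 Gflop/s per core, 15 cores).

    /* Back-of-the-envelope time-to-solution for the SVD flop count above. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double n         = 1e5;                       /* matrix dimension N = 100,000  */
        double flops     = 8.0 / 3.0 * pow(n, 3.0);   /* ~2.67e15 operations           */
        double core_rate = 50e9;                      /* 50 Gflop/s per core           */
        int    cores     = 15;                        /* Xeon E5 v2, 15 cores          */

        printf("operations : %.2e\n", flops);
        printf("1 core     : %.1f hours\n", flops / core_rate / 3600.0);
        printf("%d cores   : %.1f hours\n", cores, flops / (core_rate * cores) / 3600.0);
        /* Doubling N multiplies the work by 2^3 = 8 (cubic complexity). */
        return 0;
    }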
Unfortunately, modern linear algebra libraries, commercial or open source, do not compute the SVD at the rate of 50 Gflop/s per core. The number is closer to 20 Gflop/s, and it does not scale with the number of cores. This limitation stems from the algorithm chosen for the implementation: the computation is bound by memory bandwidth, and bandwidth becomes an ever scarcer resource as computational capacity grows with the advances dictated by Moore’s Law. Newer algorithms exist that suffer much less from the bandwidth problem, but they increase the computational burden even further without sufficiently benefiting the speed of the calculation beyond the bandwidth limit.
At this point the linear algebra research at the University of Tennessee Knoxville comes into play. The following reasoning led us to a substantial improvement in the state of the art, in terms of both algorithmic development and implementation effort [1]. The classic algorithms suffer from the bandwidth problem because of their memory-bound operations. Since these operations are necessary to obtain the SVD, we limit their effect by judiciously using ideas from the more modern algorithms and adding new algorithms that remove the bandwidth problem and use the CPU cores with much greater efficiency.
At the implementation level, we rely on matrix-matrix operations, which reuse data in cache and ease the burden on the memory bus. Much more bandwidth-demanding are matrix-vector operations, which are necessary to obtain the final SVD form. We still use them but on a reduced portion of the matrix rather than on the original one. The reduced portion fits in cache and that increases the execution rate because cache memories offer much higher bandwidth and are duplicated for exclusive use in multicore processors. There is a penalty for the use of a different algorithm: an increase in the number of operations. Fortunately, the increased execution rate and scalability with the number of cores more than make up for it.
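The gap between the two kinds of operations is easy to observe on any multicore machine. The rough sketch below (purely illustrative, not the SVD code itself) times a matrix-matrix multiply (dgemm) against repeated matrix-vector multiplies (dgemv) over the same amount of arithmetic; the cache-friendly dgemm typically sustains several times the flop rate of the bandwidth-bound dgemv, which is the effect described above.

    /* Compare BLAS 3 (dgemm) and BLAS 2 (dgemv) flop rates. Link with any CBLAS. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    static double now(void) {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + 1e-9 * t.tv_nsec;
    }

    int main(void) {
        int n = 2000;
        double *A = malloc((size_t)n*n*sizeof *A), *B = malloc((size_t)n*n*sizeof *B);
        double *C = malloc((size_t)n*n*sizeof *C);
        double *x = malloc(n*sizeof *x), *y = malloc(n*sizeof *y);
        for (int i = 0; i < n*n; i++) { A[i] = 1e-3*i; B[i] = 1.0; C[i] = 0.0; }
        for (int i = 0; i < n; i++)   { x[i] = 1.0; y[i] = 0.0; }

        double t = now();                                 /* one dgemm: 2n^3 flops      */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        double gemm_gflops = 2.0*n*(double)n*n / (now() - t) / 1e9;

        t = now();                                        /* n dgemv calls: ~2n^3 flops */
        for (int rep = 0; rep < n; rep++)
            cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);
        double gemv_gflops = 2.0*n*(double)n*n / (now() - t) / 1e9;

        printf("dgemm: %6.1f Gflop/s   dgemv: %6.1f Gflop/s\n", gemm_gflops, gemv_gflops);
        free(A); free(B); free(C); free(x); free(y);
        return 0;
    }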
References
[1] Azzam Haidar, Piotr Luszczek, Jakub Kurzak, Jack Dongarra, An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware, Proceedings of SC13, November 17-21, 2013, Denver, CO, USA.
Conference Reports
Uppsala University
ICL’s Jakub Kurzak visited Uppsala University and their Department of Information Technology on May 22-23. Jakub gave an invited talk on Thursday, May 22nd, on the PULSAR project. The following day, Jakub participated as a committee member at the PhD defense of Martin Tillenius—at the invitation of Martin’s PhD advisor, Elizabeth Larsson.
At the actual defense, Jakub gave another talk, where he discussed how the PhD candidate’s efforts fit within the broader context of dataflow scheduling. Martin’s PhD work explored implementing a superscalar scheduler like QUARK-D within distributed memory systems. The shared memory software is called SuperGlue, and the distributed memory layer is called DuctTEiP.
Jakub took the role of committee member very seriously but refrained from “grilling” Martin for longer than a half hour, which is admirable. Other well-known members of the defense committee included Bo Kågström and Sverker Holmgren.
International Parallel and Distributed Processing Symposium (IPDPS)
On May 19-23, five members of the ICL team found their way to Phoenix, Arizona for the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS). IPDPS is an international forum for engineers and scientists from around the world to present their latest research findings in all aspects of parallel computation. In addition to technical sessions of submitted paper presentations, the meeting offers workshops, tutorials, and commercial presentations and exhibits.
ICL’s Hartwig Anzt, George Bosilca, Tim Dong, Piotr Luszczek, and Ichitaro Yamazaki all presented papers and gave talks at various workshops at the symposium. Hartwig gave a talk on Optimizing Krylov Subspace Solvers on Graphics Processing Units at the Fourth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) and presented the paper Hybrid Multi-Elimination ILU Preconditioners on GPUs at the International Heterogeneity in Computing Workshop (HCW).
George gave a talk on Assessing the Impact of ABFT & Checkpoint Composite Strategies at the 16th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), and Tim presented A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU in the main track of the IPDPS symposium.
Ichi gave a talk called Improving the Performance of CA-GMRES on Multicores with Multiple GPUs in the main track, and gave a talk on PULSAR, titled Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime, at the Workshop on Large-Scale Parallel Processing (LSPP).
The busiest member of the ICL team was Piotr, who gave four talks: Linear Algebra Software on Accelerated Multicore Systems (keynote at PLC), Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs (AsHES), Unified Development for Mixed Multi-GPU and Multi-Coprocessor Environments using a Lightweight Runtime Environment (IPDPS), and New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem (best paper at PDSEC). Piotr also chaired the Performance Characterization and Optimization session at IPDPS and the Algorithms session at PDSEC.
Recent Releases
MAGMA MIC 1.2 Beta Released
MAGMA MIC 1.2 Beta is now available. This release provides implementations of MAGMA’s one-sided (LU, QR, and Cholesky) and two-sided (Hessenberg, bidiagonal, and tridiagonal reductions) dense matrix factorizations, as well as linear and eigenproblem solvers for Intel Xeon Phi coprocessors. More information on the approach is given in this presentation.
The MAGMA MIC 1.2 Beta release adds usage and performance improvements.
Visit the software page to download the tarball.
MAGMA 1.5.0 Beta 2 Released
MAGMA 1.5.0 Beta 2 is now available. This release provides performance improvements for the SVD and eigenvector routines, as well as the first beta of the sparse routines. More information is given in the presentation MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures and in the MAGMA Quick Reference Guide. The MAGMA 1.5.0 Beta 2 release adds the following new functionality:
- SVD using Divide and Conquer (gesdd);
- Nonsymmetric eigenvector computation is multi-threaded (trevc3_mt); and
- Sparse functions.
Parameters (trans, uplo, etc.) now use symbolic constants (MagmaNoTrans, MagmaLower, etc.) instead of characters (‘N’, ‘L’, etc.). Converters are provided to translate these to LAPACK, CUBLAS, and CBLAS constants.
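As a rough illustration of the new constants (a minimal sketch assuming MAGMA 1.5’s magma.h header and its CPU-interface Cholesky routine magma_dpotrf; this example is not taken from the release notes), a call now looks like this:

    /* Assumed API: magma_init/magma_finalize and the CPU-interface magma_dpotrf.
     * Older code passed the character 'L' where MagmaLower now appears. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <magma.h>

    int main(void) {
        magma_init();

        magma_int_t n = 1000, lda = n, info = 0;
        double *A = malloc((size_t)n * n * sizeof *A);

        /* Fill a diagonally dominant symmetric matrix so it is positive definite. */
        for (magma_int_t j = 0; j < n; j++)
            for (magma_int_t i = 0; i < n; i++)
                A[i + j*lda] = (i == j) ? (double)n : 1.0;

        /* New style: the typed constant MagmaLower replaces the character 'L'. */
        magma_dpotrf(MagmaLower, n, A, lda, &info);
        printf("magma_dpotrf info = %lld\n", (long long)info);

        free(A);
        magma_finalize();
        return 0;
    }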
Visit the software page to download the tarball.
Interview

Greg Henry
Where are you from, originally?
I was born in San Francisco, California, and grew up in the Bay Area, but spent time in southern California before a brief haunt on the east coast.
Can you summarize your educational background?
I earned a bachelor’s degree in Mathematics/Computer Science at UCSD, before earning a PhD at Cornell’s Center for Applied Mathematics. I studied numerical methods in eigenvalue problems.
How did you get introduced to ICL?
At a conference, probably SIAM, Jack spoke to me about joining him in Tennessee. This would have been in the early 90s, so I don’t believe the lab was called ICL at the time. Nevertheless, he was enormously convincing and I decided to join the research group.
What did you work on during your time at ICL?
I worked on some of the eigenvalue routines for ScaLAPACK. I also talked with Antoine Petitet about how one might do things like solve a system of linear equations faster—the guy was such a master that he coded it up that very night and brought it to Jack’s attention. Unfortunately, the technique broke some of the ScaLAPACK coding guidelines, but Antoine and Clint Whaley were able to make use of similar tricks later in HPL 1.0.
What are some of your favorite memories from your time at ICL?
The people and the opportunity. My stay in ICL (in 1996) was agonizingly short due to some family issues, but I feel truly blessed to have gotten even the short period I received. I’m forever grateful to Jack, and I absolutely loved working with the local team I shared an office with—Susan, Clint, Antoine, Andy, Henri (and briefly, J. Choi).
Tell us where you are and what you’re doing now.
I’m at Intel Corporation in Hillsboro, Oregon, working on the development of the Intel® Math Kernel Library. We now have an excellent local team, led by Shane Story, that includes Vamsi Sripathi, Murat Efe Guney, Sarah Knepper, Kazushige Goto, and myself. At the moment, we are focused on BLAS optimizations and LINPACK tuning.
In what ways did working at ICL prepare you for what you do now, if at all?
I believe every step along the path helps prepare us for the next challenge. My stay at ICL was one of the highlights of my life. I still fondly recall showing up in Knoxville with a suitcase and no idea where to stay that night. I was ambling through the streets, lost and confused, and Antoine drove by. He saw me, pulled over, and let me stay at his place while I worked out my housing situation.
Tell us something about yourself that might surprise some people.
I’m a hobby novelist. I’ve written eleven (unpublished) novels and spend time taking writing classes, participating in writing conferences, and working with writing groups. Mostly I write young adult fantasy or paranormal. Learning how to write fiction well is a skill, like getting a PhD in a subject, and takes years of dedication and effort, but perhaps someday people will see one of my novels on the New York Times bestseller list.