News and Announcements

TOP500: November 2019

The 54th TOP500 list was just unveiled at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19) in Denver, CO. The United States kept the top two spots with the Department of Energy’s Summit (at Oak Ridge National Laboratory) and Sierra (at Lawrence Livermore National Laboratory). In fact, the top 10 machines on the list remain unchanged since June 2019.

The most powerful supercomputer that is new to the November 2019 list is Rensselaer Polytechnic Institute’s AiMOS, which achieved 8.045 PFLOP/s on the HPL benchmark and landed in the #24 spot. Installed at the Center for Computational Innovations, AiMOS pairs IBM POWER9 CPUs with NVIDIA V100 GPUs, a combination that is becoming increasingly common in the TOP500.

The top five systems on the November 2019 list are:

  1. Summit – IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR InfiniBand, IBM. DOE/SC/Oak Ridge National Laboratory, United States. Cores: 2,414,592; Rmax: 148,600.0 TFLOP/s; Rpeak: 200,794.9 TFLOP/s; Power: 10,096 kW.
  2. Sierra – IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR InfiniBand, IBM / NVIDIA / Mellanox. DOE/NNSA/LLNL, United States. Cores: 1,572,480; Rmax: 94,640.0 TFLOP/s; Rpeak: 125,712.0 TFLOP/s; Power: 7,438 kW.
  3. Sunway TaihuLight – Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC. National Supercomputing Center in Wuxi, China. Cores: 10,649,600; Rmax: 93,014.6 TFLOP/s; Rpeak: 125,435.9 TFLOP/s; Power: 15,371 kW.
  4. Tianhe-2A – TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT. National Super Computer Center in Guangzhou, China. Cores: 4,981,760; Rmax: 61,444.5 TFLOP/s; Rpeak: 100,678.7 TFLOP/s; Power: 18,482 kW.
  5. Frontera – Dell C6420, Xeon Platinum 8280 28C 2.7GHz, Mellanox InfiniBand HDR, Dell EMC. Texas Advanced Computing Center, United States. Cores: 448,448; Rmax: 23,516.4 TFLOP/s; Rpeak: 38,745.9 TFLOP/s.

HPCG: November 2019

The latest results for the High Performance Conjugate Gradient (HPCG) benchmark were also released at SC19. A joint effort between ICL and Sandia National Laboratories, HPCG is designed to measure performance that is representative of modern HPC capability by simulating the compute and communication patterns of the sparse iterative solvers commonly found in science and engineering applications.
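To make the benchmark’s kernel concrete, the sketch below shows the kind of sparse conjugate-gradient loop whose compute and communication pattern HPCG mimics: a sparse matrix-vector product, a couple of dot-product reductions, and a few vector updates per iteration, all of them memory-bandwidth bound. This is an illustrative, unpreconditioned, single-node version, not the benchmark’s code; the real HPCG adds a symmetric Gauss-Seidel preconditioner and distributes the problem across MPI ranks.

```cpp
// Minimal sketch (not HPCG itself) of the sparse CG pattern the benchmark exercises.
#include <cmath>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage for a sparse matrix.
struct CSRMatrix {
    std::size_t n = 0;                          // dimension
    std::vector<std::size_t> row_ptr, col_idx;  // row offsets and column indices
    std::vector<double> val;                    // nonzero values
};

// y = A * x  (the dominant, bandwidth-bound kernel)
void spmv(const CSRMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            s += A.val[k] * x[A.col_idx[k]];
        y[i] = s;
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;  // in a distributed run this reduction also triggers an MPI_Allreduce
}

// Unpreconditioned conjugate gradient for a symmetric positive-definite A.
std::vector<double> cg(const CSRMatrix& A, const std::vector<double>& b,
                       int max_iter = 50, double tol = 1e-8) {
    std::vector<double> x(A.n, 0.0), r = b, p = b, Ap(A.n);
    double rr = dot(r, r);
    for (int it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
        spmv(A, p, Ap);
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (std::size_t i = 0; i < A.n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```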

HPCG results are released twice per year alongside the TOP500 rankings to show how real-world applications might fare on a given machine. One notable change is that the K Computer, a stalwart entry in HPCG’s top five, was decommissioned, and so the lower-ranking systems moved up a place. The full list of HPCG rankings is available here.

The top five HPCG results are:

  1. Summit – IBM, POWER9, NVIDIA Volta V100. DOE/SC/ORNL, USA. HPL: 148.6 PFLOP/s; TOP500 rank: 1; HPCG: 2.926 PFLOP/s; 1.5% of peak.
  2. Sierra – IBM, POWER9, NVIDIA Tesla V100. DOE/NNSA/LLNL, USA. HPL: 94.64 PFLOP/s; TOP500 rank: 2; HPCG: 1.796 PFLOP/s; 1.4% of peak.
  3. Trinity – Cray XC40, Intel Xeon E5-2698 v3, Xeon Phi 7250. DOE/NNSA/LANL/SNL, USA. HPL: 20.159 PFLOP/s; TOP500 rank: 7; HPCG: 0.546 PFLOP/s; 1.3% of peak.
  4. AI Bridging Cloud Infrastructure – PRIMERGY CX2570 M4, Xeon Gold 6148 20C 2.4GHz, NVIDIA Tesla V100. AIST, Japan. HPL: 19.880 PFLOP/s; TOP500 rank: 8; HPCG: 0.509 PFLOP/s; 1.6% of peak.
  5. Piz Daint – Cray XC50, Xeon E5-2690v3 12C 2.6GHz, NVIDIA Tesla P100. Swiss National Supercomputing Centre, Switzerland. HPL: 21.230 PFLOP/s; TOP500 rank: 6; HPCG: 0.497 PFLOP/s; 1.8% of peak.

HPL-AI

SC19 also saw the official launch of HPL-AI, a benchmark that seeks to highlight the emerging convergence of HPC and artificial intelligence (AI) workloads. While traditional HPC focuses on simulation runs for modeling phenomena in physics, chemistry, biology, and so on, the mathematical models that drive these computations require, for the most part, 64-bit (double-precision) accuracy. On the other hand, the machine-learning methods that fuel advances in AI can achieve the desired results at 32-bit (single-precision) or even lower floating-point precision formats.

This reduced demand for accuracy has fueled a resurgence of interest in new hardware platforms that deliver unprecedented performance and energy savings while still achieving the classification and recognition fidelity afforded by higher-accuracy formats.

HPL-AI strives to unite these two realms by delivering a blend of modern algorithms and contemporary hardware while simultaneously connecting to the solver formulation of the decades-old High-Performance Linpack (HPL) framework of benchmarking the largest supercomputing installations in the world.

So far, Oak Ridge National Laboratory’s Summit is the only machine to be benchmarked with HPL-AI, and it achieved 445 PFLOP/s in mixed precision. This is nearly triple the 148 PFLOP/s that Summit achieved on the standard (double-precision) HPL benchmark used for the TOP500.
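To make the connection to HPL’s solver formulation concrete, the sketch below shows the basic recipe behind the mixed-precision approach: factor the matrix in a cheap, low precision, then recover a double-precision-quality solution with a few steps of iterative refinement driven by double-precision residuals. This is an illustration under simplifying assumptions (no pivoting, and 32-bit float standing in for the FP16/Tensor Core arithmetic used on Summit), not the benchmark implementation.

```cpp
// Illustrative sketch of the mixed-precision idea behind HPL-AI (not the benchmark code).
#include <cstddef>
#include <vector>

using MatF = std::vector<std::vector<float>>;  // low-precision working copy
using VecD = std::vector<double>;

// In-place LU factorization in float, without pivoting for brevity; this stands in
// for the FP16/Tensor-Core factorization a real HPL-AI run would use.
void lu_factor(MatF& A) {
    std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
}

// Forward/back substitution with the float factors (unit lower L, upper U stored in LU).
VecD lu_solve(const MatF& LU, const VecD& b) {
    std::size_t n = LU.size();
    VecD y(n), x(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (std::size_t j = 0; j < i; ++j) s -= LU[i][j] * y[j];
        y[i] = s;
    }
    for (std::size_t i = n; i-- > 0; ) {
        double s = y[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= LU[i][j] * x[j];
        x[i] = s / LU[i][i];
    }
    return x;
}

// Solve Ax = b: cheap low-precision factorization plus double-precision residual
// corrections, which is the refinement loop that lets the solution meet the same
// accuracy test as a standard HPL run.
VecD mixed_precision_solve(const std::vector<VecD>& A, const VecD& b, int refine_steps = 5) {
    std::size_t n = A.size();
    MatF LU(n, std::vector<float>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) LU[i][j] = static_cast<float>(A[i][j]);
    lu_factor(LU);
    VecD x = lu_solve(LU, b);
    for (int s = 0; s < refine_steps; ++s) {
        VecD r(n);
        for (std::size_t i = 0; i < n; ++i) {           // residual in double precision
            double ri = b[i];
            for (std::size_t j = 0; j < n; ++j) ri -= A[i][j] * x[j];
            r[i] = ri;
        }
        VecD d = lu_solve(LU, r);                        // correction from the low-precision factors
        for (std::size_t i = 0; i < n; ++i) x[i] += d[i];
    }
    return x;
}
```

Because the O(n³) factorization runs in the cheap precision and only the O(n²) refinement work runs in double precision, the achievable mixed-precision rate can far exceed the FP64 HPL rate, which is how a result like Summit’s 445 PFLOP/s comes about.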

Read more about HPL-AI here: https://icl.bitbucket.io/hpl-ai/.

Employment Opportunities at ICL

ICL is seeking full-time Research Scientists (MS or PhD) to participate in the design, development, and maintenance of numerical software libraries for solving linear algebra problems on large, distributed-memory machines with multi-core processors and hardware accelerators, as well as performance monitoring capabilities for new and advanced hardware and software technologies.

The prospective researcher will coauthor papers to document research findings, present the team’s work at conferences and workshops, and help lead students and other team members in their research endeavors in ongoing and future projects. Given the nature of the work, there will be opportunities for publication, travel, and high-profile professional networking and collaboration across academia, labs, and industry.

An MS or PhD in computer science, computational sciences, or math is preferred. Background in at least one of the following areas is also preferred: numerical linear algebra, HPC, performance monitoring, machine learning, or data analytics.

For more information check out ICL’s jobs page: http://www.icl.utk.edu/jobs.

Conference Reports

SC19

This year’s International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19) was held in Denver, CO, on November 17–22.

Five computational science research centers from the University of Tennessee—the Bredesen Center, the Global Computing Laboratory, the Innovative Computing Laboratory, the Joint Institute for Computational Sciences, and the SimCenter—represented the university by anchoring the University of Tennessee booth. As usual, ICL had a significant presence at SC, with faculty, research staff, and students giving talks, presenting papers, and leading “Birds of a Feather” sessions.

ICL once again ran a dedicated ICL@SC webpage, where interested parties could keep tabs on ICL-related events—including a list of attendees, a detailed schedule of talks, and the latest project handouts. In addition, ICL’s Daniel Barry did a bang-up job running the ICL Twitter account (@ICL_UTK), where he provided up-to-the-minute information about what was happening on the ground.

The editor would like to thank Jack Dongarra, Daniel Barry, and Fengguang Song for their contributions to this article.

Recent Releases

MAGMA 2.5.2 Released

MAGMA 2.5.2 is now available. Matrix Algebra on GPU and Multicore Architectures (MAGMA) is a collection of next-generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards (e.g., LAPACK and BLAS) to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures.

Changes for MAGMA 2.5.2 include:

  • New routine: magmablas_hgemm_batched for fixed-size, batched matrix multiplication in FP16 using the Tensor Cores (see the call sketch after this list).
    • The routine does not currently support pre-Volta GPUs.
    • The routine outperforms cuBLAS for sizes below 100 × 100 and for sizes that are not multiples of 8.
    • The kernel is tuned for the notrans-notrans case only; comprehensive tuning is planned for future releases.
  • Fixed magmablas_?gemm_vbatched routines to correctly handle batch sizes over 65,535. The same fix is applied to vbatched syrk, herk, syr2k, her2k, symm, hemm, and trmm.
  • Fixed a bug in the FP32 <-> FP16 conversion routines (magmablas_hlag2s and magmablas_slag2h). The bug caused a launch failure for very large matrices.
  • Fixed a bug in the batched LU factorization to avoid NaNs when singularity is encountered.
  • Fixed a bug in the batched LU factorization to ensure that the first pivot is always returned—even when multiple pivots with the same absolute value are found.
  • Added Frobenius norm for general matrices (supported as an option to magmablas_Xlange for X = 's', 'd', 'c', or 'z').
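As a quick illustration of how the new FP16 routine is meant to be invoked, here is a hedged calling sketch. It assumes the argument order mirrors MAGMA’s existing magmablas_?gemm_batched routines (device-resident arrays of matrix pointers, one fixed problem size for the whole batch, and a MAGMA queue) and that the half-precision types are the magmaHalf typedefs from the MAGMA headers; the authoritative prototype is in the 2.5.2 release headers.

```cpp
// Hedged sketch: batched FP16 GEMM via the new MAGMA 2.5.2 routine.
// Assumption: the argument order mirrors magmablas_sgemm_batched.
#include <magma_v2.h>

// Computes C_i = alpha * A_i * B_i + beta * C_i for i = 0 .. batch_count-1,
// where every problem in the batch has the same dimensions (m, n, k).
void batched_fp16_gemm(magma_int_t m, magma_int_t n, magma_int_t k,
                       magmaHalf alpha,
                       magmaHalf const* const* dA_array, magma_int_t ldda,
                       magmaHalf const* const* dB_array, magma_int_t lddb,
                       magmaHalf beta,
                       magmaHalf** dC_array, magma_int_t lddc,
                       magma_int_t batch_count, magma_queue_t queue)
{
    // The 2.5.2 kernel is tuned for the notrans-notrans case and requires
    // a Volta-class or newer GPU for its Tensor Core path.
    magmablas_hgemm_batched(MagmaNoTrans, MagmaNoTrans,
                            m, n, k,
                            alpha, dA_array, ldda,
                                   dB_array, lddb,
                            beta,  dC_array, lddc,
                            batch_count, queue);
}
```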

Click here to download the tarball.

MagmaDNN 1.1 Released

MagmaDNN 1.1 is now available. MagmaDNN is a C++ neural network library that aims to provide a simple, modular framework for deep learning, accelerated by heterogeneous architectures and using MAGMA as its computational backend.

Changes for MagmaDNN 1.1 include:

  • bug fixes and performance improvements;
  • distributed training;
  • hyperparameter optimization framework improvements;
  • benchmarks using MagmaDNN; and
  • performance comparisons, accuracy validations, and more with TensorFlow, Theano, and PyTorch.

More information on MagmaDNN 1.1 is provided in this paper and in this presentation.

Check out the MagmaDNN repository on Bitbucket: https://bitbucket.org/icl/magmadnn.

LAPACK 3.9.0 Released

LAPACK 3.9.0 is now available. LAPACK (the Linear Algebra PACKage) is a widely used library for efficiently solving dense linear algebra problems, and ICL has been a major contributor to the development and maintenance of LAPACK since its inception. LAPACK itself is sequential; it relies on the BLAS library and benefits from a multi-threaded BLAS on multi-core machines.

Released at SC19, LAPACK 3.9.0 adds a QR-preconditioned QR SVD method and a Householder reconstruction routine.
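As a sketch of the idea behind the QR-preconditioned method (a textbook outline, not the routine’s documentation): first compute a column-pivoted QR factorization of A, then take the SVD of the small triangular factor,

$$ A P = Q R, \qquad R = U_R \Sigma V_R^T \;\Longrightarrow\; A = (Q U_R)\,\Sigma\,(P V_R)^T, $$

so the singular values of A are those of R, and the singular vectors are assembled from Q, the permutation P, and the SVD of R.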

Visit the LAPACK website to download the tarball.

SC19 Project Handouts

The new project handouts from SC19 are available for download in PDF format.

Interview

Dalal Sukkari

Where are you from, originally?
I was born in Kuwait and grew up in Jordan.

Can you summarize your educational background?
I earned a BS in Mathematics from Hashemite University of Jordan, Amman, in 2008, and I earned my MSc and PhD in Applied Mathematics and Computational Science at the King Abdullah University of Science and Technology in Saudi Arabia, where I studied from 2013 to 2019.

My research focuses on presenting a new high-performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on (1) multi-core architectures enhanced with GPUs and on (2) distributed-memory platforms based on the state-of-the-art, vendor-optimized ScaLAPACK library.
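[Editor’s note: for readers unfamiliar with the method, QDWH (as described in the literature; the specifics of Dalal’s implementation may differ) computes the polar decomposition A = U_p H through the dynamically weighted Halley iteration

$$ X_0 = A/\alpha, \qquad X_{k+1} = X_k \left( a_k I + b_k X_k^* X_k \right) \left( I + c_k X_k^* X_k \right)^{-1}, $$

where \alpha estimates the largest singular value of A and the weights a_k, b_k, c_k are recomputed at each step from bounds on the singular values; the SVD then follows from an eigendecomposition of the Hermitian factor H = U_p^* A.]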

Where did you work before joining ICL?
I was a PhD student in Applied Mathematics and Computational Science at King Abdullah University of Science and Technology, Saudi Arabia.

How did you first hear about the lab, and what made you want to work here?
I first heard about ICL through the use of the LAPACK/PLASMA libraries during one of my summer research courses. Moreover, ICL is a shining star at many conferences, and I have attended many talks and tutorials presented by ICL researchers.

Based on my previous work experience, I thought joining such a productive lab would be a great opportunity for me. So, once I heard about potential ICL job openings at the 2019 SIAM Computational Science and Engineering conference, I applied, and here I am.

What is your focus here at ICL? What are you working on?
I am involved in two interesting projects: SLATE and AsyncIS. In SLATE, we are working on adding more functionality, such as SVD and EVD, to the SLATE library. In AsyncIS, I will be working on a high-performance implementation of the stochastic gradient descent (SGD) algorithm to improve the performance of training deep neural networks.

What are your interests/hobbies outside of work?
I love swimming and walking. Indoors, I like to watch movies, play chess, and I enjoy cooking, too.

Tell us something about yourself that might surprise people.
During my Bachelor’s in Mathematics, I used to skip some difficult math classes (e.g., real analysis) to attend fashion classes at a nearby college.

If you weren’t working at ICL, where would you like to be working and why?
I have a huge appreciation for research in applied mathematics and computational science, so if I were to work elsewhere, I might pursue research with some other research group.

Recent Papers

  1. Altintas, I., K. Marcus, V. Vural, S. Purawat, D. Crawl, G. Antoniu, A. Costan, O. Marcu, P. Balaprakash, R. Cao, et al., “A Collection of White Papers from the BDEC2 Workshop in San Diego, CA,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-13: University of Tennessee, October 2019.
  2. Tomov, S., A. Abdelfattah, V. Barra, N. Beams, J. Brown, J-S. Camier, V. Dobrev, J. Dongarra, Y. Dudouit, P. Fischer, et al., “CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps,” Zenodo, October 2019. DOI: 10.5281/zenodo.3477618
  3. Tomov, S., A. Haidar, A. Ayala, H. Shaiek, and J. Dongarra, “FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.
  4. Bujanovic, Z., and Z. Drmac, “New Robust ScaLAPACK Routine for Computing the QR Factorization with Column Pivoting,” LAPACK Working Note, no. LAWN 296, ICL-UT-19-14: University of Tennessee, October 2019.
  5. Han, L., V. Le Fèvre, L-C. Canon, Y. Robert, and F. Vivien, “A Generic Approach to Scheduling and Checkpointing Workflows,” International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1255-1274, November 2019. DOI: 10.1177/1094342019866891
  6. Losada, N., A. Bouteiller, and G. Bosilca, “Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,” Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
  7. Aupy, G., A. Benoit, B. Goglin, L. Pottier, and Y. Robert, “Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,” International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019. DOI: 10.1177/1094342019846956
  8. Pei, Y., G. Bosilca, I. Yamazaki, A. Ida, and J. Dongarra, “Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization,” PAW-ATM Workshop at SC19, Denver, CO, ACM, November 2019.
  9. Herault, T., Y. Robert, G. Bosilca, and J. Dongarra, “Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC,” ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
  10. Ayala, A., S. Tomov, X. Luo, H. Shaiek, A. Haidar, G. Bosilca, and J. Dongarra, “Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
  11. Jagode, H., A. Danalis, H. Anzt, and J. Dongarra, “PAPI Software-Defined Events for in-Depth Performance Analysis,” The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1113-1127, November 2019.
  12. Cao, Q., Y. Pei, T. Herault, K. Akbudak, A. Mikhalev, G. Bosilca, H. Ltaief, D. Keyes, and J. Dongarra, “Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools,” Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19, Denver, CO, ACM, November 2019.
  13. Benoit, A., T. Herault, V. Le Fèvre, and Y. Robert, “Replication is More Efficient Than You Think,” The IEEE/ACM Conference on High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
  14. Gates, M., J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,” International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM, November 2019. DOI: 10.1145/3295500.3356223
  15. Gates, M., J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,” Denver, CO, International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), November 2019.
  16. Anzt, H., G. Flegar, T. Gruetzmacher, and E. S. Quintana-Orti, “Toward a Modular Precision Ecosystem for High-Performance Computing,” The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1069-1078, November 2019. DOI: 10.1177/1094342019846547
  17. Anzt, H., T. Cojean, and E. Kuhn, “Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility,” Proceedings in Applied Mathematics and Mechanics, vol. 19, issue 1, November 2019. DOI: 10.1002/pamm.201900490
  18. Abdelfattah, A., S. Tomov, and J. Dongarra, “Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs,” ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
  19. Li, J., B. Nicolae, J. M. Wozniak, and G. Bosilca, “Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training,” 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), Denver, CO, IEEE, November 2019. DOI: 10.1109/MLHPC49564.2019.00006

Recent Conferences

  1. OCT – Heike Jagode
  2. OCT – BDEC San Diego, San Diego, California – Jack Dongarra, Joan Snoderly, Terry Moore
  3. NOV – SC19, Denver, Colorado – Ahmad Abdelfattah, Alan Ayala, Aurelien Bouteiller, Daniel Barry, David Eberius, George Bosilca, Gerald Ragghianti, Hartwig Anzt, Jack Dongarra, Jiali Li, Joan Snoderly, Mark Gates, Piotr Luszczek, Qinglei Cao, Terry Moore, Thomas Herault, Yaohung Tsai, Yu Pei

Upcoming Conferences

  1. DEC – CnC 2019, Salt Lake City, Utah – George Bosilca
  2. DEC – NSF EPEXA kickoff meeting, New York, New York – George Bosilca, Thomas Herault

Recent Lunch Talks

  1. OCT 4 – Yves Robert (ENS-Lyon): Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms
  2. OCT 11 – Axel Huebl (Lawrence Berkeley National Laboratory): Scalable, Performance-Portable Particle-in-Cell Simulations and PByte-Scale Data-Challenges
  3. OCT 18 – Alan Ayala: heFFTe: Highly Efficient FFT for Exascale
  4. OCT 25 – Yaohung Tsai: Autotuning in Deep Learning Kernels
  5. NOV 1 – David Eberius: A Flexible MPI Benchmark For Fast Assessment of Multithreaded Communication Performance
  6. NOV 8 – Sticks Mabakane: Effective Callgraph Visualisations for Optimisation of Parallel Programs
  7. NOV 15 – Anthony Danalis: Questions about hardware events? CAT has the answers.

Upcoming Lunch Talks

  1. DEC 6 – Damien Genet: PAPI Updates on Power9
  2. DEC 13 – Piotr Luszczek: Building Community through xSDK Software Policies

People

  1. Jakub Kurzak
    After 13 years at ICL, Jakub Kurzak has taken a job as an STSM Software Engineer with AMD. Jakub will be at the frontier, so to speak, of AMD's software efforts as they build out their exascale system at Oak Ridge National Laboratory. Good luck, Jakub!
  2. Aaron Welch
    Aaron Welch joined ICL this fall as a Student Assistant working in the DisCo group. Welcome, Aaron!
  3. Rephael Congmon
    Rephael Congmon joined ICL this fall as a Student Assistant working in the DisCo group. Welcome, Rephael!