News and Announcements

TOP500: November 2019

The 54th TOP500 list was just unveiled at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19) in Denver, CO. The United States kept the top two spots with the Department of Energy’s Summit (at Oak Ridge National Laboratory) and Sierra (at Lawrence Livermore National Laboratory). In fact, the top 10 machines on the list remain unchanged since June 2019.

The most powerful supercomputer that is new to the November 2019 list is Rensselaer Polytechnic Institute’s AiMOS, which achieved 8.045 PFLOP/s on the HPL benchmark and landed in the #24 spot. Installed at the Center for Computational Innovations, AiMOS pairs IBM POWER9 CPUs with NVIDIA V100 GPUs, a combination that is becoming increasingly common in the TOP500.

The top five systems on the November 2019 list are:

  1. Summit – IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR InfiniBand, IBM. DOE/SC/Oak Ridge National Laboratory, United States. Cores: 2,414,592; Rmax: 148,600.0 TFLOP/s; Rpeak: 200,794.9 TFLOP/s; Power: 10,096 kW.
  2. Sierra – IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR InfiniBand, IBM / NVIDIA / Mellanox. DOE/NNSA/LLNL, United States. Cores: 1,572,480; Rmax: 94,640.0 TFLOP/s; Rpeak: 125,712.0 TFLOP/s; Power: 7,438 kW.
  3. Sunway TaihuLight – Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway, NRCPC. National Supercomputing Center in Wuxi, China. Cores: 10,649,600; Rmax: 93,014.6 TFLOP/s; Rpeak: 125,435.9 TFLOP/s; Power: 15,371 kW.
  4. Tianhe-2A – TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT. National Super Computer Center in Guangzhou, China. Cores: 4,981,760; Rmax: 61,444.5 TFLOP/s; Rpeak: 100,678.7 TFLOP/s; Power: 18,482 kW.
  5. Frontera – Dell C6420, Xeon Platinum 8280 28C 2.7GHz, Mellanox InfiniBand HDR, Dell EMC. Texas Advanced Computing Center, United States. Cores: 448,448; Rmax: 23,516.4 TFLOP/s; Rpeak: 38,745.9 TFLOP/s.

HPCG: November 2019

The latest results for the High Performance Conjugate Gradient (HPCG) benchmark were also released at SC19. A joint effort between ICL and Sandia National Laboratories, HPCG is designed to measure performance that is representative of modern HPC capability by simulating the compute and communication patterns of the sparse iterative solvers commonly found in science and engineering applications.
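To make the benchmark’s kernel concrete, the sketch below shows the kind of sparse conjugate-gradient loop whose compute and communication pattern HPCG mimics: a sparse matrix-vector product, a couple of dot-product reductions, and a few vector updates per iteration, all of them memory-bandwidth bound. This is an illustrative, unpreconditioned, single-node version, not the benchmark’s code; the real HPCG adds a symmetric Gauss-Seidel preconditioner and distributes the problem across MPI ranks.

```cpp
// Minimal sketch (not HPCG itself) of the sparse CG pattern the benchmark exercises.
#include <cmath>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage for a sparse matrix.
struct CSRMatrix {
    std::size_t n = 0;                          // dimension
    std::vector<std::size_t> row_ptr, col_idx;  // row offsets and column indices
    std::vector<double> val;                    // nonzero values
};

// y = A * x  (the dominant, bandwidth-bound kernel)
void spmv(const CSRMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            s += A.val[k] * x[A.col_idx[k]];
        y[i] = s;
    }
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;  // in a distributed run this reduction also triggers an MPI_Allreduce
}

// Unpreconditioned conjugate gradient for a symmetric positive-definite A.
std::vector<double> cg(const CSRMatrix& A, const std::vector<double>& b,
                       int max_iter = 50, double tol = 1e-8) {
    std::vector<double> x(A.n, 0.0), r = b, p = b, Ap(A.n);
    double rr = dot(r, r);
    for (int it = 0; it < max_iter && std::sqrt(rr) > tol; ++it) {
        spmv(A, p, Ap);
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < A.n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (std::size_t i = 0; i < A.n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```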

HPCG results are released twice per year alongside the TOP500 rankings to show how real-world applications might fare on a given machine. One notable change is that the K Computer, a stalwart entry in HPCG’s top five, was decommissioned, and so the lower-ranking systems moved up a place. The full list of HPCG rankings is available here.

The top five HPCG results are:

  1. Summit – IBM, POWER9, NVIDIA Volta V100. DOE/SC/ORNL, USA. HPL: 148.6 PFLOP/s; TOP500 rank: 1; HPCG: 2.926 PFLOP/s; 1.5% of peak.
  2. Sierra – IBM, POWER9, NVIDIA Tesla V100. DOE/NNSA/LLNL, USA. HPL: 94.64 PFLOP/s; TOP500 rank: 2; HPCG: 1.796 PFLOP/s; 1.4% of peak.
  3. Trinity – Cray XC40, Intel Xeon E5-2698 v3, Xeon Phi 7250. DOE/NNSA/LANL/SNL, USA. HPL: 20.159 PFLOP/s; TOP500 rank: 7; HPCG: 0.546 PFLOP/s; 1.3% of peak.
  4. AI Bridging Cloud Infrastructure – PRIMERGY CX2570 M4, Xeon Gold 6148 20C 2.4GHz, NVIDIA Tesla V100. AIST, Japan. HPL: 19.880 PFLOP/s; TOP500 rank: 8; HPCG: 0.509 PFLOP/s; 1.6% of peak.
  5. Piz Daint – Cray XC50, Xeon E5-2690v3 12C 2.6GHz, NVIDIA Tesla P100. Swiss National Supercomputing Centre, Switzerland. HPL: 21.230 PFLOP/s; TOP500 rank: 6; HPCG: 0.497 PFLOP/s; 1.8% of peak.

HPL-AI

SC19 also saw the official launch of HPL-AI, a benchmark that seeks to highlight the emerging convergence of HPC and artificial intelligence (AI) workloads. While traditional HPC focuses on simulation runs for modeling phenomena in physics, chemistry, biology, and so on, the mathematical models that drive these computations require, for the most part, 64-bit (double-precision) accuracy. On the other hand, the machine-learning methods that fuel advances in AI can achieve the desired results at 32-bit (single-precision) or even lower floating-point precision formats.

This reduced demand for accuracy has fueled a resurgence of interest in new hardware platforms that deliver unprecedented performance and energy savings while still achieving the classification and recognition fidelity afforded by higher-accuracy formats.

HPL-AI strives to unite these two realms by delivering a blend of modern algorithms and contemporary hardware while simultaneously connecting to the solver formulation of the decades-old High-Performance Linpack (HPL) framework of benchmarking the largest supercomputing installations in the world.

So far, Oak Ridge National Laboratory’s Summit is the only machine to be benchmarked with HPL-AI, and it achieved 445 PFLOP/s in mixed precision. This is nearly triple the 148 PFLOP/s that Summit achieved on the standard (double-precision) HPL benchmark used for the TOP500.
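To make the connection to HPL’s solver formulation concrete, the sketch below shows the basic recipe behind the mixed-precision approach: factor the matrix in a cheap, low precision, then recover a double-precision-quality solution with a few steps of iterative refinement driven by double-precision residuals. This is an illustration under simplifying assumptions (no pivoting, and 32-bit float standing in for the FP16/Tensor Core arithmetic used on Summit), not the benchmark implementation.

```cpp
// Illustrative sketch of the mixed-precision idea behind HPL-AI (not the benchmark code).
#include <cstddef>
#include <vector>

using MatF = std::vector<std::vector<float>>;  // low-precision working copy
using VecD = std::vector<double>;

// In-place LU factorization in float, without pivoting for brevity; this stands in
// for the FP16/Tensor-Core factorization a real HPL-AI run would use.
void lu_factor(MatF& A) {
    std::size_t n = A.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];
            for (std::size_t j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
}

// Forward/back substitution with the float factors (unit lower L, upper U stored in LU).
VecD lu_solve(const MatF& LU, const VecD& b) {
    std::size_t n = LU.size();
    VecD y(n), x(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (std::size_t j = 0; j < i; ++j) s -= LU[i][j] * y[j];
        y[i] = s;
    }
    for (std::size_t i = n; i-- > 0; ) {
        double s = y[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= LU[i][j] * x[j];
        x[i] = s / LU[i][i];
    }
    return x;
}

// Solve Ax = b: cheap low-precision factorization plus double-precision residual
// corrections, which is the refinement loop that lets the solution meet the same
// accuracy test as a standard HPL run.
VecD mixed_precision_solve(const std::vector<VecD>& A, const VecD& b, int refine_steps = 5) {
    std::size_t n = A.size();
    MatF LU(n, std::vector<float>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) LU[i][j] = static_cast<float>(A[i][j]);
    lu_factor(LU);
    VecD x = lu_solve(LU, b);
    for (int s = 0; s < refine_steps; ++s) {
        VecD r(n);
        for (std::size_t i = 0; i < n; ++i) {           // residual in double precision
            double ri = b[i];
            for (std::size_t j = 0; j < n; ++j) ri -= A[i][j] * x[j];
            r[i] = ri;
        }
        VecD d = lu_solve(LU, r);                        // correction from the low-precision factors
        for (std::size_t i = 0; i < n; ++i) x[i] += d[i];
    }
    return x;
}
```

Because the O(n³) factorization runs in the cheap precision and only the O(n²) refinement work runs in double precision, the achievable mixed-precision rate can far exceed the FP64 HPL rate, which is how a result like Summit’s 445 PFLOP/s comes about.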

Read more about HPL-AI here: https://icl.bitbucket.io/hpl-ai/.

Employment Opportunities at ICL

ICL is seeking full-time Research Scientists (MS or PhD) to participate in the design, development, and maintenance of numerical software libraries for solving linear algebra problems on large, distributed-memory machines with multi-core processors and hardware accelerators, as well as performance monitoring capabilities for new and advanced hardware and software technologies.

The prospective researcher will coauthor papers to document research findings, present the team’s work at conferences and workshops, and help lead students and other team members in their research endeavors in ongoing and future projects. Given the nature of the work, there will be opportunities for publication, travel, and high-profile professional networking and collaboration across academia, labs, and industry.

An MS or PhD in computer science, computational sciences, or math is preferred. Background in at least one of the following areas is also preferred: numerical linear algebra, HPC, performance monitoring, machine learning, or data analytics.

For more information check out ICL’s jobs page: http://www.icl.utk.edu/jobs.

Conference Reports

SC19

This year’s International Conference for High Performance Computing, Networking, Storage, and Analysis (SC19) was held in Denver, CO, on November 17–22.

Five computational science research centers from the University of Tennessee—the Bredesen Center, the Global Computing Laboratory, the Innovative Computing Laboratory, the Joint Institute for Computational Sciences, and the SimCenter—represented the university by anchoring the University of Tennessee booth. As usual, ICL had a significant presence at SC, with faculty, research staff, and students giving talks, presenting papers, and leading “Birds of a Feather” sessions.

ICL once again ran a dedicated ICL@SC webpage, where interested parties could keep tabs on ICL-related events—including a list of attendees, a detailed schedule of talks, and the latest project handouts. In addition, ICL’s Daniel Barry did a bang-up job running the ICL Twitter account (@ICL_UTK), where he provided up-to-the-minute information about what was happening on the ground.

The editor would like to thank Jack Dongarra, Daniel Barry, and Fengguang Song for their contributions to this article.

Recent Releases

MAGMA 2.5.2 Released

MAGMA 2.5.2 is now available. Matrix Algebra on GPU and Multicore Architectures (MAGMA) is a collection of next-generation linear algebra (LA) libraries for heterogeneous architectures. The MAGMA package supports interfaces for current LA packages and standards (e.g., LAPACK and BLAS) to allow computational scientists to easily port any LA-reliant software components to heterogeneous architectures.

Changes for MAGMA 2.5.2 include:

  • New routine: magmablas_hgemm_batched for fixed-size, batched matrix multiplication in FP16 using the Tensor Cores (see the call sketch after this list).
    • The routine does not currently support pre-Volta GPUs.
    • The routine outperforms cuBLAS for sizes below 100 × 100 and for sizes that are not multiples of 8.
    • The kernel is tuned for the notrans-notrans case only; comprehensive tuning is planned for future releases.
  • Fixed magmablas_?gemm_vbatched routines to correctly handle batch sizes over 65,535. The same fix is applied to vbatched syrk, herk, syr2k, her2k, symm, hemm, and trmm.
  • Fixed a bug in the FP32 <-> FP16 conversion routines (magmablas_hlag2s and magmablas_slag2h). The bug caused a launch failure for very large matrices.
  • Fixed a bug in the batched LU factorization to avoid NaNs when singularity is encountered.
  • Fixed a bug in the batched LU factorization to ensure that the first pivot is always returned—even when multiple pivots with the same absolute value are found.
  • Added Frobenius norm for general matrices (supported as an option to magmablas_Xlange for X = 's', 'd', 'c', or 'z').
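As a quick illustration of how the new FP16 routine is meant to be invoked, here is a hedged calling sketch. It assumes the argument order mirrors MAGMA’s existing magmablas_?gemm_batched routines (device-resident arrays of matrix pointers, one fixed problem size for the whole batch, and a MAGMA queue) and that the half-precision types are the magmaHalf typedefs from the MAGMA headers; the authoritative prototype is in the 2.5.2 release headers.

```cpp
// Hedged sketch: batched FP16 GEMM via the new MAGMA 2.5.2 routine.
// Assumption: the argument order mirrors magmablas_sgemm_batched.
#include <magma_v2.h>

// Computes C_i = alpha * A_i * B_i + beta * C_i for i = 0 .. batch_count-1,
// where every problem in the batch has the same dimensions (m, n, k).
void batched_fp16_gemm(magma_int_t m, magma_int_t n, magma_int_t k,
                       magmaHalf alpha,
                       magmaHalf const* const* dA_array, magma_int_t ldda,
                       magmaHalf const* const* dB_array, magma_int_t lddb,
                       magmaHalf beta,
                       magmaHalf** dC_array, magma_int_t lddc,
                       magma_int_t batch_count, magma_queue_t queue)
{
    // The 2.5.2 kernel is tuned for the notrans-notrans case and requires
    // a Volta-class or newer GPU for its Tensor Core path.
    magmablas_hgemm_batched(MagmaNoTrans, MagmaNoTrans,
                            m, n, k,
                            alpha, dA_array, ldda,
                                   dB_array, lddb,
                            beta,  dC_array, lddc,
                            batch_count, queue);
}
```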

Click here to download the tarball.

MagmaDNN 1.1 Released

MagmaDNN 1.1 is now available. MagmaDNN is a C++ neural network library that aims to provide a simple, modular framework for deep learning, accelerated by heterogeneous architectures and using MAGMA as its computational backend.

Changes for MagmaDNN 1.1 include:

  • bug fixes and performance improvements;
  • distributed training;
  • hyperparameter optimization framework improvements;
  • benchmarks using MagmaDNN; and
  • performance comparisons, accuracy validations, and more with TensorFlow, Theano, and PyTorch.

More information on MagmaDNN 1.1 is provided in this paper and in this presentation.

Check out the MagmaDNN repository on Bitbucket: https://bitbucket.org/icl/magmadnn.

LAPACK 3.9.0 Released

LAPACK 3.9.0 is now available. LAPACK (the Linear Algebra PACKage) is a widely used library for efficiently solving dense linear algebra problems, and ICL has been a major contributor to the development and maintenance of LAPACK since its inception. LAPACK itself is sequential; it relies on the BLAS library and benefits from a multi-threaded BLAS on multi-core machines.

Released at SC19, LAPACK 3.9.0 adds a QR-preconditioned QR SVD method and a Householder reconstruction routine.
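As a sketch of the idea behind the QR-preconditioned method (a textbook outline, not the routine’s documentation): first compute a column-pivoted QR factorization of A, then take the SVD of the small triangular factor,

$$ A P = Q R, \qquad R = U_R \Sigma V_R^T \;\Longrightarrow\; A = (Q U_R)\,\Sigma\,(P V_R)^T, $$

so the singular values of A are those of R, and the singular vectors are assembled from Q, the permutation P, and the SVD of R.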

Visit the LAPACK website to download the tarball.

SC19 Project Handouts

The new project handouts from SC19 are available for download in PDF format.

Interview

Dalal Sukkari

Where are you from, originally?
I was born in Kuwait and grew up in Jordan.

Can you summarize your educational background?
I earned a BS in Mathematics from Hashemite University of Jordan, Amman, in 2008, and I earned my MSc and PhD in Applied Mathematics and Computational Science at the King Abdullah University of Science and Technology in Saudi Arabia, where I studied from 2013 to 2019.

My research focuses on presenting a new high-performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on (1) multi-core architectures enhanced with GPUs and on (2) distributed-memory platforms based on the state-of-the-art, vendor-optimized ScaLAPACK library.
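[Editor’s note: for readers unfamiliar with the method, QDWH (as described in the literature; the specifics of Dalal’s implementation may differ) computes the polar decomposition A = U_p H through the dynamically weighted Halley iteration

$$ X_0 = A/\alpha, \qquad X_{k+1} = X_k \left( a_k I + b_k X_k^* X_k \right) \left( I + c_k X_k^* X_k \right)^{-1}, $$

where \alpha estimates the largest singular value of A and the weights a_k, b_k, c_k are recomputed at each step from bounds on the singular values; the SVD then follows from an eigendecomposition of the Hermitian factor H = U_p^* A.]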

Where did you work before joining ICL?
I was a PhD student in Applied Mathematics and Computational Science at King Abdullah University of Science and Technology, Saudi Arabia.

How did you first hear about the lab, and what made you want to work here?
I first heard about ICL through the use of the LAPACK/PLASMA libraries during one of my summer research courses. Moreover, ICL is a shining star at many conferences, and I have attended many talks and tutorials presented by ICL researchers.

Based on my previous work experience, I thought joining such a productive lab would be a great opportunity for me. So, once I heard about potential ICL job openings at the 2019 SIAM Computational Science and Engineering conference, I applied, and here I am.

What is your focus here at ICL? What are you working on?
I am involved in two interesting projects: SLATE and AsyncIS. In SLATE, we are working on adding more functionality, such as SVD and EVD, to the SLATE library. In AsyncIS, I will be working on a high-performance implementation of the stochastic gradient descent (SGD) algorithm to improve the performance of training deep neural networks.

What are your interests/hobbies outside of work?
I love swimming and walking. Indoors, I like to watch movies, play chess, and I enjoy cooking, too.

Tell us something about yourself that might surprise people.
During my Bachelor’s in Mathematics, I used to skip some difficult math classes (e.g., real analysis) to attend fashion classes at a nearby college.

If you weren’t working at ICL, where would you like to be working and why?
I have a huge appreciation for research in applied mathematics and computational science, so if I were to work elsewhere, I might pursue research with some other research group.

Recent Papers

  1. Altintas, I., K. Marcus, V. Vural, S. Purawat, D. Crawl, G. Antoniu, A. Costan, O. Marcu, P. Balaprakash, R. Cao, et al., “A Collection of White Papers from the BDEC2 Workshop in San Diego, CA,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-13: University of Tennessee, October 2019.
  2. Tomov, S., A. Abdelfattah, V. Barra, N. Beams, J. Brown, J-S. Camier, V. Dobrev, J. Dongarra, Y. Dudouit, P. Fischer, et al., “CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps,” Zenodo, October 2019. DOI: 10.5281/zenodo.3477618
  3. Tomov, S., A. Haidar, A. Ayala, H. Shaiek, and J. Dongarra, “FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019.
  4. Bujanovic, Z., and Z. Drmac, “New Robust ScaLAPACK Routine for Computing the QR Factorization with Column Pivoting,” LAPACK Working Note, no. LAWN 296, ICL-UT-19-14: University of Tennessee, October 2019.
  5. Han, L., V. Le Fèvre, L-C. Canon, Y. Robert, and F. Vivien, “A Generic Approach to Scheduling and Checkpointing Workflows,” International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1255-1274, November 2019. DOI: 10.1177/1094342019866891
  6. Losada, N., A. Bouteiller, and G. Bosilca, “Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,” Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
  7. Aupy, G., A. Benoit, B. Goglin, L. Pottier, and Y. Robert, “Co-Scheduling HPC Workloads on Cache-Partitioned CMP Platforms,” International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1221-1239, November 2019. DOI: 10.1177/1094342019846956
  8. Pei, Y., G. Bosilca, I. Yamazaki, A. Ida, and J. Dongarra, “Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization,” PAW-ATM Workshop at SC19, Denver, CO, ACM, November 2019.
  9. Herault, T., Y. Robert, G. Bosilca, and J. Dongarra, “Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC,” ScalA'19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
  10. Ayala, A., S. Tomov, X. Luo, H. Shaiek, A. Haidar, G. Bosilca, and J. Dongarra, “Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019.
  11. Jagode, H., A. Danalis, H. Anzt, and J. Dongarra, “PAPI Software-Defined Events for in-Depth Performance Analysis,” The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1113-1127, November 2019.
  12. Cao, Q., Y. Pei, T. Herault, K. Akbudak, A. Mikhalev, G. Bosilca, H. Ltaief, D. Keyes, and J. Dongarra, “Performance Analysis of Tile Low-Rank Cholesky Factorization Using PaRSEC Instrumentation Tools,” Workshop on Programming and Performance Visualization Tools (ProTools 19) at SC19, Denver, CO, ACM, November 2019.
  13. Benoit, A., T. Herault, V. Le Fèvre, and Y. Robert, “Replication is More Efficient Than You Think,” The IEEE/ACM Conference on High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM Press, November 2019.
  14. Gates, M., J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,” International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), Denver, CO, ACM, November 2019. DOI: 10.1145/3295500.3356223
  15. Gates, M., J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library,” Denver, CO, International Conference for High Performance Computing, Networking, Storage and Analysis (SC19), November 2019.
  16. Anzt, H., G. Flegar, T. Gruetzmacher, and E. S. Quintana-Orti, “Toward a Modular Precision Ecosystem for High-Performance Computing,” The International Journal of High Performance Computing Applications, vol. 33, issue 6, pp. 1069-1078, November 2019. DOI: 10.1177/1094342019846547
  17. Anzt, H., T. Cojean, and E. Kuhn, “Towards a New Peer Review Concept for Scientific Computing ensuring Technical Quality, Software Sustainability, and Result Reproducibility,” Proceedings in Applied Mathematics and Mechanics, vol. 19, issue 1, November 2019. DOI: 10.1002/pamm.201900490
  18. Abdelfattah, A., S. Tomov, and J. Dongarra, “Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs,” ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019.
  19. Li, J., B. Nicolae, J. M. Wozniak, and G. Bosilca, “Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training,” 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), Denver, CO, IEEE, November 2019. DOI: 10.1109/MLHPC49564.2019.00006

Recent Conferences

  1. OCT – Heike Jagode
  2. OCT – BDEC San Diego, San Diego, California – Jack Dongarra, Joan Snoderly, Terry Moore
  3. NOV – SC19, Denver, Colorado – Ahmad Abdelfattah, Alan Ayala, Aurelien Bouteiller, Daniel Barry, David Eberius, George Bosilca, Gerald Ragghianti, Hartwig Anzt, Jack Dongarra, Jiali Li, Joan Snoderly, Mark Gates, Piotr Luszczek, Qinglei Cao, Terry Moore, Thomas Herault, Yaohung Tsai, Yu Pei

Upcoming Conferences

  1. DEC – CnC 2019, Salt Lake City, Utah – George Bosilca
  2. DEC – NSF EPEXA kickoff meeting, New York, New York – George Bosilca, Thomas Herault

Recent Lunch Talks

  1. OCT 4 – Yves Robert (ENS-Lyon): Scheduling Independent Stochastic Tasks on Heterogeneous Cloud Platforms
  2. OCT 11 – Axel Huebl (Lawrence Berkeley National Laboratory): Scalable, Performance-Portable Particle-in-Cell Simulations and PByte-Scale Data-Challenges
  3. OCT 18 – Alan Ayala: heFFTe: Highly Efficient FFT for Exascale
  4. OCT 25 – Yaohung Tsai: Autotuning in Deep Learning Kernels
  5. NOV 1 – David Eberius: A Flexible MPI Benchmark For Fast Assessment of Multithreaded Communication Performance
  6. NOV 8 – Sticks Mabakane: Effective Callgraph Visualisations for Optimisation of Parallel Programs
  7. NOV 15 – Anthony Danalis: Questions about hardware events? CAT has the answers.

Upcoming Lunch Talks

  1. DEC 6 – Damien Genet: PAPI Updates on Power9
  2. DEC 13 – Piotr Luszczek: Building Community through xSDK Software Policies

People

  1. Jakub Kurzak
    After 13 years at ICL, Jakub Kurzak has taken a job as an STSM Software Engineer with AMD. Jakub will be at the frontier, so to speak, of AMD's software efforts as they build out their exascale system at Oak Ridge National Laboratory. Good luck, Jakub!
  2. Aaron Welch
    Aaron Welch joined ICL this fall as a Student Assistant working in the DisCo group. Welcome, Aaron!
  3. Rephael Congmon
    Rephael Congmon joined ICL this fall as a Student Assistant working in the DisCo group. Welcome, Rephael!