News and Announcements

TOP500 – Tianhe-2 is Number 1

Presented at the 2013 International Supercomputing Conference (ISC13) on June 17 in Leipzig, Germany, the 41st edition of the TOP500 list confirmed that China’s Tianhe-2 had muscled its way to the number 1 spot on the TOP500 with an astounding 33.86 petaflop/s on the LINPACK benchmark.

Tianhe-2, or Milky Way-2, will be deployed at the National Supercomputer Center in Guangzhou, China, by the end of the year. The surprise appearance of Tianhe-2, two years ahead of its expected deployment, marks China’s first return to the number 1 position since November 2010, when Tianhe-1A was the top system. Tianhe-2 has 16,000 nodes, each with two Intel Xeon Ivy Bridge processors and three Xeon Phi coprocessors, for a combined total of 3,120,000 computing cores.

Check out the podcast below from InsideHPC to hear Jack Dongarra discuss the new TOP500 list and give his thoughts on China’s latest efforts in HPC: [audio:http://icl.utk.edu/newsletter/files/2013-07/audio/top500_podcast.mp3]

Rank | Site | System | Rmax (TFlop/s)
-----|------|--------|---------------
1 | National University of Defense Technology, China | Tianhe-2 (MilkyWay-2) – TH-IVB-FEP Cluster, NUDT | 33,862.7
2 | DOE/SC/Oak Ridge National Laboratory, United States | Titan – Cray XK7, Cray Inc. | 17,590.0
3 | DOE/NNSA/LLNL, United States | Sequoia – BlueGene/Q, IBM | 17,173.2
4 | RIKEN Advanced Institute for Computational Science (AICS), Japan | K computer, SPARC64 VIIIfx, Fujitsu | 10,510.0
5 | DOE/SC/Argonne National Laboratory, United States | Mira – BlueGene/Q, IBM | 8,586.6

See the full list at TOP500.org.

PaRSEC and BEAST Funded

ICL recently received major funding for two projects: DOE awarded funding for PaRSEC and the NSF funded BEAST. Congratulations to all those involved!

PaRSEC

The Parallel Runtime Scheduling and Execution Controller (PaRSEC) is a generic framework for architecture-aware scheduling and management of micro-tasks on distributed, many-core, heterogeneous architectures. The applications we consider can be expressed as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies in a totally distributed fashion.

PaRSEC assigns computation threads to the cores, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse. The framework includes libraries, a runtime system, and development tools to help application developers tackle the difficult task of porting their applications to highly heterogeneous and diverse environments.
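PaRSEC expresses DAGs in its own parameterized task-graph format; purely as an illustration of the idea (the task names, dependence pattern, and functions below are invented for this sketch and are not PaRSEC’s API), a compact, problem-size-independent DAG can be encoded as a function that derives a task’s predecessors from its indices on demand, rather than as an explicitly stored graph:

```python
# Illustration only -- not PaRSEC's API. The DAG is never materialized;
# edges are derived on demand from a task's indices, so the encoding
# stays the same size no matter how large the problem is.

def predecessors(task):
    """Return the tasks whose output `task` consumes."""
    name, i = task
    if name == "compute":
        # compute(i) reads the tile produced by load(i), and for i > 0
        # also the result of compute(i-1): a simple dependence chain.
        return [("load", i)] + ([("compute", i - 1)] if i > 0 else [])
    return []  # load(i) has no predecessors

def schedule(n):
    """Execute tasks for problem size n in a valid topological order."""
    done, order = set(), []

    def run(task):
        if task in done:
            return
        for pred in predecessors(task):  # discover dependencies on demand
            run(pred)
        done.add(task)
        order.append(task)

    for i in range(n):
        run(("compute", i))
    return order
```

For n = 2 this executes load(0), compute(0), load(1), compute(1); only the traversal grows with the problem size, while the dependence description stays constant.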

BEAST

The Bench-testing Environment for Automated Software Tuning (BEAST) creates a framework for exploring and optimizing the performance of computational kernels on hybrid processors that 1) applies to a diverse range of computational kernels, 2) (semi)automatically generates better performing implementations on various hybrid processor architectures, and 3) increases developer insight into why given kernel/processor combinations have the performance profiles they do.

To achieve this three-fold goal, it applies the model used for traditional application benchmarking in a completely novel way: it combines an abstract kernel specification and corresponding verification test, similar to standard benchmarking, with an automated testing engine and data analysis and machine learning tools, called the BEAST workbench.
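The sweep-verify-select loop described above can be sketched as follows; everything here (the sum-of-squares kernel, the unroll-factor tuning space, the function names) is invented for illustration and is not the actual BEAST workbench:

```python
# Illustration only -- not the BEAST workbench. Sweep a tuning space,
# verify each generated variant against an abstract kernel specification,
# and keep the fastest variant that passes the verification test.
import time

def reference(xs):
    """Abstract kernel specification: sum of squares."""
    return sum(x * x for x in xs)

def make_variant(unroll):
    """Generate one implementation variant, unrolled by `unroll`."""
    def kernel(xs):
        acc, tail = 0, len(xs) - len(xs) % unroll
        for i in range(0, tail, unroll):       # unrolled main loop
            for j in range(unroll):
                acc += xs[i + j] * xs[i + j]
        for k in range(tail, len(xs)):         # remainder loop
            acc += xs[k] * xs[k]
        return acc
    return kernel

def autotune(xs, space):
    """Return the unroll factor whose variant is correct and fastest."""
    best_param, best_time = None, float("inf")
    for unroll in space:
        kernel = make_variant(unroll)
        if kernel(xs) != reference(xs):        # verification test
            continue
        start = time.perf_counter()
        kernel(xs)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = unroll, elapsed
    return best_param
```

The real workbench additionally feeds the timing data into analysis and machine learning tools to explain *why* a given kernel/processor pairing performs as it does, rather than just selecting a winner.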

Conference Reports

International Supercomputing Conference

The 2013 International Supercomputing Conference (ISC13), the 28th meeting in the series, was held June 16 – 20 in Leipzig, Germany. This year’s conference drew 2,423 attendees from 47 nations, as well as 153 leading high performance computing vendors and research organizations from around the world.

ICL was well represented with Jack Dongarra, Jakub Kurzak, and Heike McCraw, along with ICL alumnus Hatem Ltaief. Jack chaired the Application of Supercomputing session, gave a talk about the new rankings at the TOP500 session, and gave two presentations: Toward a New (Another) Metric for Ranking High Performance Computing Systems and Critical Issues at Exascale for Algorithm & Software Design.

Jakub, Hatem, and Jack presented a tutorial on dense linear algebra software, including PLASMA/DPLASMA, QUARK, PaRSEC, and MAGMA, and Heike presented her paper, Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q.


International Conference on Supercomputing

This year’s International Conference on Supercomputing, the 27th meeting in the series, was held on June 10 – 14 in Eugene, Oregon. Quite a few ICLers attended the conference, including Piotr Luszczek, Aurelien Bouteiller, Thomas Herault, Mark Gates, and Gabriel Marin, as well as ICL alumnus and frequent collaborator Yves Robert.

Piotr and Aurelien hosted a tutorial called DLA on Multicore with Accelerators where the main objective was to show specific methods and implementations that deal with portability and scalability of high performance codes. Thomas and Yves gave a tutorial called An overview of fault-tolerant techniques for HPC where, as the name suggests, the objective was to present a comprehensive survey of techniques to deal with failures in high performance systems.

Mark Gates presented a paper, Toward a Scalable Multi-GPU Eigensolver via Compute-intensive Kernels and Efficient Communication, and Gabriel Marin presented his paper, Diagnosis and Optimization of Application Prefetching Performance.

Recent Releases

MAGMA 1.4 Beta 2 Released

MAGMA 1.4 Beta 2 is now available. This release provides performance improvements and support for the new NVIDIA Kepler GPUs. More information is given in the MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures SC12 presentation. The MAGMA 1.4 Beta 2 release adds the following new functionality:

  • Merge libmagmablas into libmagma to eliminate circular dependencies. Link with just -lmagma now;
  • Add multi-GPU Hessenberg and non-symmetric eigenvalue routines: geev_m, gehrd_m, unghr_m, ungqr_m;
  • Fix required workspace size in gels_gpu, gels3_gpu, geqrs_gpu, geqrs3_gpu;
  • Fix required workspace size in [zcsd]geqrf;
  • Add macro USE_INT64 to compile with int being 64-bit. See make.inc.int64;
  • Add panel factorizations for LU, QR, and Cholesky entirely on the GPU, correspondingly in [zcsd]getf2_gpu, [zcsd]geqr2_gpu, and [zcsd]potf2_gpu;
  • Add QR with pivoting in GPU interface (functions [zcsd]geqp3_gpu), and improve the performance for both CPU and GPU interface QRs with pivoting;
  • Add multi-GPU symmetric eigenvalue routines (one-stage):
    • [zhe|che|ssy|dsy]trd_mgpu, [zhe|che|ssy|dsy]evd_m, [zhe|che|ssy|dsy]evdx_m, [zhe|che|ssy|dsy]gvd_m, [zhe|che|ssy|dsy]gvdx_m;
  • Add single and multi-GPU symmetric eigenvalue routines (two-stage):
    • [zhe|che|ssy|dsy]evdx_2stage, [zhe|che|ssy|dsy]gvdx_2stage, [zhe|che|ssy|dsy]evdx_2stage_m, [zhe|che|ssy|dsy]gvdx_2stage_m.

Visit the MAGMA software page to download the tarball.

Interview

Khairul Kabir

Where are you from, originally?

I am from Bangladesh, a beautiful small country in South Asia.

Can you summarize your educational background?

I earned my bachelor’s degree in Computer Science and Engineering (CSE) from Bangladesh University of Engineering and Technology (BUET), and earned my master’s in Computer Science from the University of Tennessee, Knoxville (UTK). I am currently pursuing a PhD at UTK with Dr. Dongarra as my advisor.

Tell us how you first learned about ICL.

When I applied for admission at UTK, I learned about ICL from the university website. Later, I heard about ICL from one of my friends from Bangladesh who had worked at ICL for a couple of years. He gave me all the details about ICL: the research projects, the working environment, the help available from fellow researchers and students in the lab, his and other ICL people’s work, the publications, the staff, and so on.

What made you want to work for ICL?

Dr. Jack Dongarra and ICL are two famous names in the HPC arena. ICL not only provides leading edge tools for high performance computing problems, but also plays a major role in the development of standards for scientific computing. As I am interested in HPC, working with Jack Dongarra and ICL seemed like the obvious choice for someone given the opportunity.

What are you working on while at ICL?

Currently I am working on the MAGMA project, which aims to develop a dense linear algebra library for heterogeneous/hybrid architectures. We focus on developing a dense linear algebra library for “Multicore + Xeon Phi Coprocessor,” and we put our effort into designing task-based algorithms that extract the best performance from this heterogeneous system.

We have already released a linear algebra library with one-sided factorizations for both single and multiple Xeon Phi coprocessors, and now we are moving toward two-sided factorizations. Since MAGMA already has a multicore + GPU implementation, it helped us develop the library for multicore + Xeon Phi.

If you weren’t working at ICL, where would you like to be working and why?

If I wasn’t working at ICL, I would like to be working at Intel with the MKL team, as they are building a world-class library for linear algebra that is used by many people in the HPC community as a point of reference. Recently, they included a heterogeneous (multicore + Xeon Phi) implementation of dense linear algebra, called “AO – automatic offload,” in their package, which is similar to our work.

What are your interests/hobbies outside of work?

Outside of my work I like to watch movies, hang out with my friends & family, and travel to beautiful places.

Tell us something about yourself that might surprise people.

I am an ordinary guy, maybe that will surprise some people. Haha!

Recent Papers

  1. Wang, Y., M. Baboulin, J. Falcou, Y. Fraigneau, and O. Le Maître, “A Parallel Solver for Incompressible Fluid Flows,” International Conference on Computational Science (ICCS 2013), Barcelona, Spain, Elsevier B.V., June 2013. DOI: 10.1016/j.procs.2013.05.207
  2. McCraw, H., D. Terpstra, J. Dongarra, K. Davis, and R. Musselman, “Beyond the CPU: Hardware Performance Counter Monitoring on Blue Gene/Q,” International Supercomputing Conference 2013 (ISC'13), Leipzig, Germany, Springer, June 2013.
  3. Marin, G., C. McCurdy, and J. Vetter, “Diagnosis and Optimization of Application Prefetching Performance,” Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013. DOI: 10.1145/2464996.2465014
  4. Li, Y., A. YarKhan, J. Dongarra, K. Seymour, and A. Hurault, “Enabling Workflows in GridSolve: Request Sequencing and Service Trading,” Journal of Supercomputing, vol. 64, issue 3, pp. 1133-1152, June 2013. DOI: 10.1007/s11227-010-0549-1
  5. Haidar, A., S. Tomov, J. Dongarra, R. Solcà, and T. C. Schulthess, “Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations,” International Supercomputing Conference (ISC), Lecture Notes in Computer Science, vol. 7905, Leipzig, Germany, Springer Berlin Heidelberg, pp. 67-80, June 2013. DOI: 10.1007/978-3-642-38750-0_6
  6. Aupy, G., A. Benoit, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni, “On the Combination of Silent Error Detection and Checkpointing,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-710, June 2013.
  7. Heroux, M. A., and J. Dongarra, “Toward a New Metric for Ranking High Performance Computing Systems,” SAND2013-4744, June 2013.
  8. Haidar, A., M. Gates, S. Tomov, and J. Dongarra, “Toward a Scalable Multi-GPU Eigensolver via Compute-intensive Kernels and Efficient Communication,” Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13), Eugene, Oregon, USA, ACM Press, June 2013. DOI: 10.1145/2464996.2465438
  9. Jia, Y., P. Luszczek, and J. Dongarra, “Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-712, June 2013.
  10. Donfack, S., S. Tomov, and J. Dongarra, “Dynamically Balanced Synchronization-avoiding LU Factorization with Multicore and GPUs,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-713, July 2013.
  11. Bland, W., P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, “Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI,” Concurrency and Computation: Practice and Experience, July 2013. DOI: 10.1002/cpe.3100
  12. Dong, T., V. Dobrev, T. Kolev, R. Rieben, S. Tomov, and J. Dongarra, “Hydrodynamic Computation with Hybrid Programming on CPU-GPU Clusters,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-714, July 2013.
  13. Ma, T., G. Bosilca, A. Bouteiller, and J. Dongarra, “Kernel-assisted and Topology-aware MPI Collective Communications on Multi-core/Many-core Platforms,” Journal of Parallel and Distributed Computing, vol. 73, issue 7, pp. 1000-1010, July 2013. DOI: 10.1016/j.jpdc.2013.01.015
  14. Donfack, S., J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, and I. Yamazaki, “On Algorithmic Variants of Parallel Gaussian Elimination: Comparison of Implementations in Terms of Performance and Numerical Properties,” University of Tennessee Computer Science Technical Report, no. UT-CS-13-715, July 2013.

Recent Lunch Talks

  1. JUN 7: Vincent C. Betro (NICS), Performance of the fusion code GYRO on four generations of Cray Computers (PDF)
  2. JUN 14: Dong Li (ORNL), Toward Reliable and Power Efficient Exascale Systems (PDF)
  3. JUN 21: Hartwig Anzt, Energy Efficiency on Emerging Hardware (PDF)
  4. JUN 28: Haihang You (NICS), Optimizing utilization across XSEDE resources
  5. JUL 19: Anthony Danalis, Creating a new operation with DPLASMA: a step by step guide (PDF)

Upcoming Lunch Talks

  1. AUG 23: Michela Taufer (University of Delaware), On the effectiveness of application-aware self-management for scientific discovery in volunteer computing systems (PDF)
  2. AUG 30: Jeff Larkin (NVIDIA), OpenACC 2.0 Highlights (PDF)

Visitors

  1. Shirley Moore
    Shirley Moore from the University of Texas at El Paso will be visiting from July 15 through August 16. Shirley will be working with the performance analysis group during her visit.


Dates to Remember

ICL Closed on July 4th & 5th

Just a reminder: ICL will be closed on July 4th and 5th to observe Independence Day.

ICL Retreat 2013

Mark your calendars for August 15 – 16 for the 2013 ICL Retreat! This year, the retreat will be held at the RT Lodge in Maryville.