Today, a desktop computer with a multicore processor and a GPU or many-core accelerator can already deliver a teraFLOP/s of performance. This computational power can only be fully utilized with the appropriate software infrastructure. In scientific and engineering computing, a major part of the computational effort most often goes towards solving linear algebra sub-problems. This tutorial presents the design and optimization techniques used in state-of-the-art numerical libraries for solving problems in dense linear algebra.
The main objective of this tutorial is to show specific methods, and their implementations, that deal with the portability and scalability of high performance codes. Numerical linear algebra serves as a convenient use case for showing how these techniques achieve their goal: maximizing efficiency with respect to the metric of choice, the peak floating-point performance of the machine.
The tutorial consists of three parts. The first part focuses on the challenges of multicore programming. We show some of the ways of dealing with the prevalent need for parallelism, the pitfalls of concurrency, affinity and locality, varying task granularity, load imbalance, and separation of concerns. We compare our scheduling approach, based on DAGs (Directed Acyclic Graphs), against commonly known standards, libraries, and languages such as OpenMP and its tasks, Cilk's extensions to C, Intel's Threading Building Blocks for C++, and Apple's Grand Central Dispatch. The concepts are illustrated by the actual techniques applied within the PLASMA (Parallel Linear Algebra Software for Multicore Architectures) and QUARK (QUeuing And Runtime for Kernels) projects. The second part discusses GPU and/or coprocessor acceleration issues, including software heterogeneity, the system bus bottleneck, and the overlapping techniques available in the various ports of the MAGMA (Matrix Algebra on GPU and Multicore Architectures) project. Finally, the third part treats the ongoing efforts in linear algebra software for distributed-memory machines with heterogeneous nodes: the PARSEC (Parallel Runtime Scheduling and Execution Controller) and DPLASMA projects. The key concepts covered in this part are communication-computation overlap, modern techniques for flow control, data distribution, and dependence discovery and tracking through both compiler-oriented methods and runtime discovery.
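To make the idea of DAG-based scheduling with runtime dependence discovery concrete, the following minimal Python sketch may help. It is an illustration only and does not reproduce the QUARK, PLASMA, or PARSEC interfaces: the `TaskGraph` class, its `insert` and `waves` methods, and the `tile_cholesky_graph` helper are hypothetical names, and the choice of a tile Cholesky factorization as the workload is an assumption made for the example. Tasks are inserted in serial order with declared read and write sets, and the edges of the DAG are inferred from those declarations.

```python
from collections import defaultdict


class TaskGraph:
    """Toy task graph (hypothetical API): edges are inferred from data accesses."""

    def __init__(self):
        self.names = []                   # task id -> label
        self.deps = []                    # task id -> set of predecessor ids
        self.last_writer = {}             # datum -> id of the last writing task
        self.readers = defaultdict(list)  # datum -> readers since the last write

    def insert(self, name, reads=(), writes=()):
        """Insert a task in serial program order and record its dependences."""
        tid = len(self.names)
        deps = set()
        for d in reads:
            if d in self.last_writer:      # read-after-write
                deps.add(self.last_writer[d])
        for d in writes:
            if d in self.last_writer:      # write-after-write
                deps.add(self.last_writer[d])
            deps.update(self.readers[d])   # write-after-read
        for d in reads:
            self.readers[d].append(tid)
        for d in writes:
            self.last_writer[d] = tid
            self.readers[d] = []
        self.names.append(name)
        self.deps.append(deps)
        return tid

    def waves(self):
        """Group tasks by DAG depth; tasks in one wave are mutually independent."""
        level = []
        for deps in self.deps:
            # Predecessors always have smaller ids, so their levels are known.
            level.append(1 + max((level[p] for p in deps), default=-1))
        grouped = defaultdict(list)
        for tid, lv in enumerate(level):
            grouped[lv].append(self.names[tid])
        return [grouped[lv] for lv in sorted(grouped)]


def tile_cholesky_graph(nt):
    """Insert the tasks of a right-looking tile Cholesky on an nt x nt tile grid."""
    g = TaskGraph()
    for k in range(nt):
        g.insert(f"POTRF({k},{k})", reads=[(k, k)], writes=[(k, k)])
        for i in range(k + 1, nt):
            g.insert(f"TRSM({i},{k})", reads=[(k, k), (i, k)], writes=[(i, k)])
        for i in range(k + 1, nt):
            g.insert(f"SYRK({i},{i})", reads=[(i, k), (i, i)], writes=[(i, i)])
            for j in range(k + 1, i):
                g.insert(f"GEMM({i},{j})",
                         reads=[(i, k), (j, k), (i, j)], writes=[(i, j)])
    return g


if __name__ == "__main__":
    for depth, wave in enumerate(tile_cholesky_graph(3).waves()):
        print(depth, wave)
```

The caller writes what looks like a sequential loop nest, and the parallelism falls out of the declared data accesses. The wave grouping shown here is only a way to inspect the DAG; a real runtime would instead release each task as soon as its predecessors complete, which is what lets it absorb varying task granularity and load imbalance.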
The target audience consists mainly of users of parallel machines interested in advanced optimization techniques on distributed memory heterogeneous architectures as well as users of dense linear algebra libraries.
The prerequisite knowledge includes a basic understanding of modern hardware and familiarity with parallel software for multicore processors and accelerators.