DPLASMA

Overview

Converting Sequential Code to a DAG representation

Most applications can be described as a sequential SMPSS-like code. This sequential representation can be automatically translated in the JDF representation (described in the sections below) using our tool, the PaRSEC Compiler, which is based on the integer programming framework Omega-Test. The JDF representation of a DAG is then precompiled as C-code by our framework and linked in the final binary program, with the PaRSEC library. The figure below schemes the tool chain of PaRSEC.

The input Serial Code is typically a sequential code featuring loops of identified tasks, providing latent parallelism to be exploited by the system. The code below presents an example with the Tile QR Factorization:

Sequential code of the Tile-QR Factorization. Typical input for the PaRSEC Compiler.

The PaRSEC Compiler computes the symbolic analysis of the data flows, and transforms this in our internal representation of the DAG. An example of the analysis done on the QR factorization algorithm above is represented below, for the DGEQRT task:

Symbolic DAG Representation: the JDF Format

The JDF is the compact representation of DAGs used in PaRSEC; a language used to describe the DAGs of tasks in a synthetic and concise way. A realistic example of JDF, for the QR factorization of the Figure above is given below:

JDF representation of the QR Factorization

To explain the behavior of the PaRSEC runtime environment on this internal representation, let us look closely at the task DGEQRT. It takes a single input variable: V. After modifying V, DGEQRT sends it to memory and to other tasks, and produces an output variable, T, to be sent to memory and to another task. V is the name of the variable that represents A(k, k) in the QR Sequential code, and T is the name of the variable that represents T(k,k). Each of the blue dependencies of this figure translates as an input dependency for V, while each of the green dependencies translates as an output dependency for V or T. V can come either from memory (local to the node on which the task executes or located in a remote node), or from the output C2 of the task DSSMQR(k-1, k, k). One might notice that for output dependencies, the language supports ranges of tasks as targets for a data.

Each dependency line may end with a type qualifier between braces (e.g. lines 5 to 9). As seen in the sequential code, for the QR operation, the dependencies of lines 5 and 6 represent different dependencies applying to two different parts of the same tile (A(k, k)). This type qualifier provides a way for the application developer to specify which part of the tile is involved in each dependency. Dependencies on T (lines 8 and 9) also require a type qualifier, because the communication engine needs to know that T is not a regular tile of doubles.

The JDF representation requires that all dependencies are expressed symmetrically as an output dependency for a task, and an input for another. Most users don't need to consider this themselves, since the PaRSEC Compiler translates a sequential code into a correct JDF representation. At run time, the compiled opaque object from the JDF keeps a symbolic representation of the task system on each node. The runtime represents single tasks only by their name and parameter values, without unrolling (or traversing) the whole DAG in memory at any time, or requiring any communication. Operations like evaluating if this task is to be executed locally, or on a remote node, or computing the descending and ascending tasks of any given task, are done locally, by computing the context of the task from this symbolic representation.

DAG Scheduling in PaRSEC

From a high-level standpoint the scheduler in PaRSEC is divided in two parts. One is generic, handling the placement of ready tasks into the actual scheduling infrastructure. The second is generated by the pre-compiler and is algorithm dependent. This part uses the knowledge discovered by the pre-compiler during the JDF analysis, and encapsulates properties of the algorithm itself (such as priorities, tasks ordering and the critical path). As described previously, in PaRSEC, the scheduler is partly in the library and partly in the opaque object compiled from the JDF representation. Internally, PaRSEC creates a thread per core on the local machine and binds them on each core. Each thread runs its own version of the scheduler, alternating between times when it runs the body of the task, and times when it executes the local scheduler to find a new task to run. As a consequence, the scheduling is not only distributed between the nodes, it is also distributed between the cores.

To improve locality and data reuse on NUMA architectures, the schedule function (part of the PaRSEC library) favors the queuing of the new enabled tasks in the local queue of the calling thread. A task that is en-queued in this queue has a greater chance to be executed on the same core, maximizing cache and memory locality. If the thread-local queue is full, the tasks are put on a single node-local waiting-queue. When a task is completed, an epilogue, compiled from the JDF and found in the opaque object, is executed to compute which tasks are now enabled, and the thread looks up the next task to run. The most recently added task in the local queue is chosen; if no such task is found, the thread first tries to pop from the queues of the other threads and then from the node queue, following an order that maps the physical hierarchy of the cores on the node. The PaRSEC environment uses the HWLOC library to discover the architecture of the machine at run time and to define the work stealing strategy. The JDF representation, including both the generated code and the dynamic data structures, are specifically tailored to handle DAGs that enable simultaneously a large number of dependencies.

DAG of QR for a 4x4 tiles matrix on a 2x2 grid of processors

The Figure above represents the complete unrolling of the Tile QR JDF for SIZE = 4, on a 2 × 2 process grid. Plain arrows represent communications, and dotted arrows data that is passed from one task to another on the same computing node. The bold arrows represent the life cycle of the data pointed by V in the DGEQRT(0) task, that is sent to DTSQRT(0, 1), then continues towards DTSQRT(0, 2), and finishes in DTSQRT(0, 3).

As illustrated by the figure, in addition to tracking dependencies, PaRSEC needs to track data flows between tasks. When a task is completed, the epilogue stores pointers to each of the data produced by the task in a structure shared between the threads. When a new task is to be scheduled, the pointers corresponding to all local variables are retrieved from this shared structure, and assigned to the local variables, before the task is executed. When all tasks depending on a particular data are completed, the engine stops tracking this data, and releases all internal resources associated with it.

At the end of the epilogue, the thread has noted, using the binding between tasks and data in the JDF, which tasks, if any, will execute remotely, and which data from the completed task they require. The epilogue ends with a call to the Asynchronous Communication Engine to trigger the emission of the output data to the requesting nodes. The Asynchronous Communication Engine is implemented as a separate thread (one per node) that serves the data exchange needs of the tasks. A producer/consumer approach is taken: the orders are pushed into a queue, and the communication engine will serve them as soon as possible.

Task completions which enable remote tasks (by satisfying their dependencies) are signified from one communication engine to the others using small control messages containing the reference of the completed task. When a communication engine learns that a remote task has completed, it computes which local tasks this enables, and sends, when possible, a control message to receive the corresponding data, following a pull protocol.

In PaRSEC, communications are implicitly inferred from the data dependencies between tasks, according to the binding between tasks and data, as defined in the JDF. Asynchrony and dynamic scheduling are key concepts of PaRSEC, meaning that the communication engine has also to exhibit the same properties in order to effectively achieve communication/computation overlap and asynchronous progress of tasks in a distributed environment.

As a consequence, in PaRSEC, all communications are handled by a separate communication thread which unlike the computation threads is not bound to any specific core. This decision has been made in order to decrease the potential load imbalance between the work imposed on the cores. By not imposing a binding on the communication thread, we give the opportunity to the operating system to move the communication thread to an idle core, or to migrate it regularly between the cores. For portability and efficiency reasons, the communication infrastructure is implemented using asynchronous point-to-point MPI communications.

Project Handouts