# A performance model to execute workflows on high-bandwidth memory architectures









Anne Benoit<sup>1,3</sup> Swann Perarnau<sup>4</sup>

Loic Pottier<sup>1</sup> Yves Robert<sup>1,2</sup>

<sup>1</sup>ENS Lyon & Inria, France <sup>2</sup>University of Tennessee Knoxville, USA <sup>3</sup>Georgia Institute of Technology, USA <sup>4</sup>Argonne National Laboratory, USA

Yves.Robert@inria.fr

ICL - April 27, 2018

## Motivation

Focus on KNL architecture.

- Cache mode: MCDRAM used by hardware as last-level cache
- Flat mode: MCDRAM manually managed by programmers
- Hybrid mode: mixes both previous modes
- Cache mode easy to use (nothing to do)
- Flat mode?

Yves.Robert@inria.fr ICL - April 27, 2018 2 / 31

## Motivation

#### Focus on KNL architecture.

- Cache mode: MCDRAM used by hardware as last-level cache
- Flat mode: MCDRAM manually managed by programmers
- Hybrid mode: mixes both previous modes
- Cache mode easy to use (nothing to do)
- Flat mode?

no performance model available! (yet)

Yves.Robert@inria.fr ICL - April 27, 2018 2 / 31

## **Problem**

#### Given:

- a scientific workflow, represented as a DAG  $\rightarrow G = (V, E)$
- a deep-memory architecture with P cores (a.k.a. processors)

#### Goal

Find a scheduling and memory mapping that minimizes the execution time of the workflow (a.k.a. makespan)

Yves.Robert@inria.fr ICL - April 27, 2018 3 / 31

## Outline

- Model and complexity
- Heuristics
- Simulation results
- Experimental results on a 1D stencil
- Conclusion

Yves.Robert@inria.fr ICL - April 27, 2018

## Model

#### Architecture

- ullet Large slow-bandwidth memory  $M_s$  with bandwidth  $eta_s$
- Small high-bandwidth memory  $M_f$  with bandwidth  $\beta_f$
- P identical processors processing at speed s



Yves.Robert@inria.fr ICL - April 27, 2018

#### **Application**

- Acyclic graph G = (V, E)
- Node set  $V = \{v_1, \dots, v_n\}$ : **sequential** computational tasks
- Edges  $E \subseteq V^2$  are dependences between tasks
- Each node  $v_i$  has a number of computing operations:  $w_i$
- Each edge  $(v_i, v_j) \in E$  has a number of data blocks:  $e_{i,j}$



# How to compute the execution time?

Resources (bandwidth and memory) are concurrently used ©

 $\rightarrow$  partition execution into events  $(t_1, \ldots, t_k, \ldots, t_{2n})$ 

#### Chunk

A *chunk* is a period of time between two consecutive events. The chunk k is defined as  $t_{k+1} - t_k$ .

ightarrow during a *chunk*,both bandwidths and memory occupations remain constant  $\stackrel{\bigcirc}{\odot}$ 

Yves.Robert@inria.fr ICL - April 27, 2018 7 / 31

### **Events**

To express all scheduling constraints, we define two events:

- $\sigma_1(i) \rightarrow \text{time-step when } v_i \text{ starts}$
- $\sigma_2(i) \rightarrow \text{time-step when } v_i \text{ ends}$
- 2n events at most (recall, n = |V|)



Yves.Robert@inria.fr ICL - April 27, 2018

## **Events**

- $w_i^{(k)}$  number of operations executed by task  $v_i$  during chunk k
- $\beta_f^{(k)}$  fast bandwidth available for each task during chunk k (shared equally among executing tasks)
  - for slow bandwidth.  $\beta_s^{(k)}$
- $S_f^{(k)}$  number of blocs allocated into the fast memory  $M_f$ during chunk k

ICL - April 27, 2018

# Scheduling constraints

$$\forall (v_i, v_j) \in E, \ t_{\sigma_2(i)} \leq t_{\sigma_1(j)}$$

$$\forall 1 \leq k \leq 2n - 1, \ \mathsf{S}_f^{(k)} \leq \mathsf{S}_f$$

$$\left| \left\{ v_i \mid t_{\sigma_1(i)} \leq t < t_{\sigma_2(i)} \right\} \right| \leq \mathsf{P}$$

$$\forall i, \quad \sum_{i=1}^{2n-1} w_i^{(k)} = w_i$$

Yves.Robert@inria.fr ICL - April 27, 2018 10 / 31

## **Problem**

At schedule time, for a task  $v_i$ , choose for each successor  $v_j$ :

- $\bullet$   $e_{i,j}^f$  number of data blocks in fast memory  $M_f$
- $e_{i,j}^s$  number of data blocks in slow memory  $M_s$   $(e_{i,j}^f + e_{i,j}^s = e_{i,j})$

(And decide which task is scheduled on which available processor)

Yves.Robert@inria.fr ICL - April 27, 2018 11 / 31

## **Problem**

At schedule time, for a task  $v_i$ , choose for each successor  $v_j$ :

- $e_{i,j}^f$  number of data blocks in fast memory  $M_f$
- $e_{i,j}^s$  number of data blocks in slow memory  $M_s$   $(e_{i,j}^f + e_{i,j}^s = e_{i,j})$

(And decide which task is scheduled on which available processor)

#### **MEMDAG**

Given an acyclic graph G and the platform, find a memory mapping  $\mathcal{X}=\{e_{i,j}^f\}_{(v_i,v_j)\in E}$  and a schedule  $\{t_{\sigma_1(i)},t_{\sigma_2(i)}\}_{v_i\in V}$  satisfying all constraints and minimizing

$$\max_{v_i \in V} t_{\sigma_2(i)}$$

Yves.Robert@inria.fr ICL - April 27, 2018

## Execution time

The total number of blocks read/written from fast memory for  $v_i$ :

• 
$$\mathit{in}_i^f = \sum_{v_j \in \mathsf{pred}(v_i)} e_{j,i}^f$$
 and  $\mathit{out}_i^f = \sum_{v_j \in \mathsf{succ}(v_i)} e_{i,j}^f$ 

Yves.Robert@inria.fr ICL - April 27, 2018 12 / 31

## Execution time

The total number of blocks read/written from fast memory for  $v_i$ :

• 
$$\mathit{in}_i^f = \sum_{v_j \in \mathsf{pred}(v_i)} e_{j,i}^f$$
 and  $\mathit{out}_i^f = \sum_{v_j \in \mathsf{succ}(v_i)} e_{i,j}^f$ 

Computations 
$$\frac{w_i^{(k)}}{s} \leq t_{k+1} - t_k$$
 Fast communications 
$$\frac{w_i^{(k)}}{w_i} \times \frac{in_i^f + out_i^f}{\beta_f^{(k)}} \leq t_{k+1} - t_k$$
 Slow communications 
$$\frac{w_i^{(k)}}{w_i} \times \frac{in_i^s + out_i^s}{\beta_s^{(k)}} \leq t_{k+1} - t_k$$

Yves.Robert@inria.fr ICL - April 27, 2018

# Execution time (1/2)

Work done by  $v_i$  during chunk k:

$$w_i^{(k)} = (t_{k+1} - t_k) \min \left( s, \frac{\beta_f^{(k)} w_i}{i n_i^f + o u t_i^f}, \frac{\beta_s^{(k)} w_i}{i n_i^s + o u t_i^s} \right)$$

Pessimistic formulation:

$$\frac{w_i^{(k)}}{s} + \frac{w_i^{(k)}}{w_i} \cdot \frac{\inf_i^f + out_i^f}{\beta_f^{(k)}} + \frac{w_i^{(k)}}{w_i} \cdot \frac{\inf_i^s + out_i^s}{\beta_s^{(k)}} \le t_{k+1} - t_k$$

Yves.Robert@inria.fr ICL - April 27, 2018 13 / 31

# Execution time (2/2)

#### Next time-step

$$E_{i}^{(k)} = t_{k} + \frac{w_{i} - \sum_{k'=\sigma_{1}(i)}^{k-1} w_{i}^{(k')}}{\min\left(s, \frac{\beta_{f}^{(k)}w_{i}}{in_{i}^{f} + out_{i}^{f}}, \frac{\beta_{s}^{(k)}w_{i}}{in_{i}^{s} + out_{i}^{s}}\right)}.$$

$$t_{k+1} = \min_{v_{i} \in V} E_{i}^{(k)}$$

#### Completion

$$\sum_{k=1}^{2n-1} w_i^{(k)} = w_i, \quad \sum_{k=1}^{2n-1} i n_i^{f(k)} = i n_i^f, \quad \dots$$

### Makespan

$$\mathcal{T} = \max_{\mathbf{v} \in V} t_{\sigma_2(i)}$$

Yves.Robert@inria.fr

# Execution time (2/2)

### Next time-step

$$w_i - \sum_{i=1}^{k-1} w_i^{(k')}$$

### Candid complaint

And you said you had a "simplified model" 😊 😊

$$v_i \in V$$

#### Completion

Yves.Robert@inria.fr

# Complexity

- MemDag is NP-Complete in the strong sense (of course)
- Complexity unknown for linear chains

Yves.Robert@inria.fr ICL - April 27, 2018 15 / 31

## Linear chains

$$\stackrel{\mathbf{e_{0,1}}}{\rightarrow} v_1 \stackrel{e_{1,2}}{\rightarrow} v_2 \rightarrow \cdots \rightarrow v_i \stackrel{e_{i,i+1}}{\rightarrow} v_{i+1} \rightarrow \cdots \rightarrow v_n \stackrel{\mathbf{e_{n,n+1}}}{\rightarrow}$$

MINIMIZE  $\sum_{i=1}^{n} m_i$ 

$$\text{Subject to} \left\{ \begin{array}{ll} e_{i,i+1} = e_{i,i+1}^s + e_{i,i+1}^f & \text{for } 0 \leq i \leq n \\ \frac{w_i}{s} \leq m_i & \text{for } 1 \leq i \leq n \\ \\ \frac{e_{i-1,i}^s + e_{i,i+1}^s}{\beta_s} \leq m_i & \text{for } 1 \leq i \leq n \\ \frac{e_{i-1,i}^f + e_{i,i+1}^f}{\beta_f} \leq m_i & \text{for } 1 \leq i \leq n \\ e_{i-1,i}^f + e_{i,i+1}^f \leq \mathsf{S}_f & \text{for } 1 \leq i \leq n \end{array} \right.$$

Yves.Robert@inria.fr ICL - April 27, 2018

## Heuristics

NP-Complete problem  $\rightarrow$  polynomial-time heuristics. Need to:

- Decide which task executes first (processor allocation)
- Decide in which memory a given task will write its data blocks (memory mapping)

Yves.Robert@inria.fr ICL - April 27, 2018 17 / 31

## Metrics

Two metrics used to sort tasks:

### Critical Path of $v_i$

$$CP_i = \max\left(\frac{w_i}{s}, \frac{in_i + out_i}{\beta_s}\right) + \max_{j \in \operatorname{succ}(v_i)} CP_j.$$

#### Gain

The gain of a subgraph  $G_i$  rooted at  $v_i$  is defined as:

$$gain(G_i) = \frac{BI_f(G_i)}{BI_s(G_i)},$$

where  $Bl_s(G_i)$  is the makespan of  $G_i$  when using no fast memory and  $Bl_f(G_i)$  is the makespan when using an infinite fast memory.

Yves.Robert@inria.fr ICL - April 27, 2018

# Main algorithm

### Main algorithm based list scheduling:

- Ready tasks sorted according to scheduling policy
- For each scheduled task  $v_i$  at that step:
  - For each successor of  $v_i$ , a memory mapping decides the value of each  $e_{i,j}^f$
- And so on, until all tasks have been scheduled

Yves.Robert@inria.fr ICL - April 27, 2018 19 / 31

# Processor scheduling policies

Let  $L^{(k)}$  be the list of ready tasks to be executed at chunk k.

- CP sorts  $L^{(k)}$  according to their critical paths (non-increasing order of  $CP_i$  with  $v_i \in L^{(k)}$ )
- GG sorts  $L^{(k)}$  according to their potential gains (non-decreasing order of  $gain(G_i)$  with  $v_i \in L^{(k)}$ )

Yves.Robert@inria.fr ICL - April 27, 2018 20 / 31

# Memory mapping policies

A memory mapping must decide in which memory a task will write its data blocks:

- MEMCP greedily gives fast memory blocks to each successor of  $v_i$  according to their critical paths
- MEMGG greedily gives fast memory blocks to each successor of  $v_i$  according to their gains
- MEMFAIR greedily gives data blocks in fast memory to the tasks, according to their amount of computations, but accounting for other successors

Yves.Robert@inria.fr ICL - April 27, 2018 21 / 31

# Two memory mappings

#### Algorithm 1: Heuristic Memcp 1 procedure MEMCP $(v_i)$ begin Let U be the set of $v_i$ 's 2 successors ordered by $CP_i$ : $\mathcal{X} \leftarrow \emptyset$ : 3 foreach $i \in U$ do $e_{i,i}^f \leftarrow$ 5 $\min\left(\mathsf{S}_f-\mathsf{S}_f^{(k)},e_{i,j}\right);$ $\mathcal{X} \leftarrow \dot{\mathcal{X}} \cup \{e_{i,j}^f\}$ ; 6 $S_{\epsilon}^{(k)} \leftarrow S_{\epsilon}^{(k)} + e_{i,i}^{f}$ ; 7 return $\mathcal{X}$ ; 8

```
Algorithm 2: Heuristic MEMFAIR
```

```
1 procedure MemFair (v_i)
      begin
           Let U be the set of v_i's
2
             successors ordered by wi
           \mathcal{X} \leftarrow \emptyset:
3
           foreach j \in U do
                  e_{i,i}^f \leftarrow
                    \min\left(\left|\frac{S_f-S_f^{(k)}}{|\operatorname{succ}(v_i)|}\right|,e_{i,j}\right);
                 \mathcal{X} \leftarrow \mathcal{X} \cup \{e_{i,i}^f\};
6
                  S_f^{(k)} \leftarrow S_f^{(k)} + e_{i,i}^f;
           return \mathcal{X}:
```

## Baseline heuristics

- ullet CP +NoFast: no fast memory is available
- CP +INFFAST: infinite fast memory (finite bandwidth)
- CP +CcMode: imitation of the KNL cache mode behavior (fast memory divided into P slices, each task can use at most  $\frac{S_f}{P}$  data blocks)

Yves.Robert@inria.fr ICL - April 27, 2018 23 / 31

## Simulations results

Random graphs using Standard Tasks Graphs sets of 50 and 100 nodes (180 graphs per size)

- Two data sets for each size:
  - Sparse: the 20 graphs that exhibit the lower densities (only sparse results in this talk)
  - Dense: the 20 graphs that exhibit the higher densities

Parameters based on KNL architecture:

- $\beta_s = 90$  GB/s and  $\beta_f = 450$  GB/s (five times faster)
- Processor speed s = 1.4 GHz
- Fast memory of size  $S_f = 16$  GB (infinitely large slow memory)

Yves.Robert@inria.fr ICL - April 27, 2018 24 / 31

## Simulations results

Generating computation weights  $w_i$  and communication costs  $e_{i,j}$ :

- $w_i$  uniformly drawn in  $[w_i^{min} = 10^4, w_i^{max} = 10^6]$  flops
- Computations to Communication Ratio:

$$\mathsf{CCR} = \frac{w_i}{s} / \frac{e_{i,j}}{\beta_s}.$$

•  $e_{i,j}$  is uniformly and randomly pick in

$$\left[\frac{w_i^{\min} \times \beta_s}{s \times \mathsf{CCR}}, \frac{w_i^{\max} \times \beta_s}{s \times \mathsf{CCR}}\right]$$

Yves.Robert@inria.fr ICL - April 27, 2018

## Impact of the number of processors



Figure: Impact of the number of processors with 50 nodes and  $S_f = 1GB$  fast memory for  ${\rm CP}$  scheduling heuristic.

Heuristics very efficient (gain around 50%) CP + MEMFAIR even outperforms CP + CCMODE!

Yves.Robert@inria.fr ICL - April 27, 2018

## Impact of fast memory size





Figure: Impact of fast memory size with 50 nodes and 8 processors.

Yves.Robert@inria.fr ICL - April 27, 2018 27 / 31

## Experimental 1D stencil

#### Comparison of simulations and real experiments

- $\rightarrow$  assess model accuracy
  - 1D stencil kernel running on a KNL
  - Pipelined approach: for a given iteration in parallel:
    - prefetch future tiles in fast memory
    - compute on tiles already prefetched
    - flush computed tiles back into slow memory

#### Algorithm 3: 1D Gauss-Seidel algorithm

```
procedure 1D-GS(array) begin

for t = 1 \ to \dots do

for i = 1 \ to \dots do

Tile _i^t \leftarrow Gauss-Seidel (Tile _{i-1}^t, Tile _i^{t-1}, Tile _{i+1}^{t-1});
```

Yves.Robert@inria.fr ICL - April 27, 2018

## Task graph

But our model does allow (yet  $\odot$ ) data transfers:

- Update of each tile is decomposed into three sequential tasks:
  - R<sub>i</sub><sup>t</sup> transfers the tile from slow memory to fast memory
  - C<sub>i</sub><sup>t</sup> computes the tile in fast memory
  - W<sub>i</sub><sup>t</sup> writes the updated tile back into slow memory



Figure: t is the iteration index, i is the tile index, and m is the size of one tile.

Yves.Robert@inria.fr ICL - April 27, 2018 29 / 31

## Experimental results



Figure: Performance of a 1D stencil with 64 threads.

Good concordance between experiments and simulations! Best performance when access to slow memory is the bottleneck.

Yves.Robert@inria.fr ICL - April 27, 2018

## Conclusion

#### Contributions

- Detailed performance model for scheduling workflows on dual memory-systems
- Efficient polynomial heuristics
- Simulations and real experiments on a 1D stencil

#### Future work

- Extend simulations to other workflow graphs (fork-joins, etc)
- Allow prefetching into fast memory
- Additional experiments with more complicated workflows

Yves.Robert@inria.fr ICL - April 27, 2018 31 / 31