# Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors

Aurélien Cavelan<sup>1</sup>, Yves Robert<sup>1,2</sup>, Hongyang Sun<sup>1</sup> and Frédéric Vivien<sup>1</sup>

1. ENS Lyon & INRIA, France 2. University of Tennessee Knoxville, USA

Yves.Robert@inria.fr

ICL Lunch - March 27, 2015

Introduction

- Algorithmic Approach
- Simulations
- 4 Conclusion

- Introduction

## Dynamic Power Consumption

One can use Dynamic Voltage and Frequency Scaling (DVFS) to reduce power consumption.

Simulations

$$Power = \alpha f V^2$$

- $\bullet$   $\alpha$  effective capacitance
- f frequency
- V operating voltage

⇒Voltage has a quadratic impact on dynamic power.

## The Easy Way

Target: energy consumption, not execution time.

For any frequency value, there is a threshold voltage:

Find optimal frequency for application (given performance) constraints)

Simulations

Decrease voltage to threshold voltage

## The Easy Way

Target: energy consumption, not execution time.

For any frequency value, there is a threshold voltage:

- Find optimal frequency for application (given performance constraints)
- ② Decrease voltage to threshold voltage



Can we do better?

## Threshold Voltage



Figure: Set of voltages of a FPGA multiplier block and the associated error probabilities measured on random inputs at 90MHz and 27°C

000000

#### **Definition**

The results of some logic gates could be used before their output signals reach their final values.

- Occurs when  $V_{\rm DD} < V_{\rm TH}$
- Deterministic but unpredictable
- Induces Silent Data Corruptions (SDC)

## Timing Errors

#### Definition

The results of some logic gates could be used before their output signals reach their final values.

- ullet Occurs when  $V_{
  m DD} < V_{
  m TH}$
- Deterministic but unpredictable
- Induces Silent Data Corruptions (SDC)

Unlike lightning, timing errors always strike twice

## Timing Errors

#### Definition

The results of some logic gates could be used before their output signals reach their final values.

- ullet Occurs when  $V_{
  m DD} < V_{
  m TH}$
- Deterministic but unpredictable
- Induces Silent Data Corruptions (SDC)

Unlike lightning, timing errors always strike twice

Silent errors are detected only when corrupt data is activated

## Two Approaches

## Near-Threshold Computing ( $V_{\sf dd} pprox V_{\sf th}$ )

- Used in NTC circuits (hardware)
- Almost *safe* ©
- Great energy savings ©

## Voltage Overscaling ( $V_{\sf dd} < V_{\sf th}$ )

- Even more energy savings ©
- Purely software-based approach ©
- Generates timing errors ©
- Requires verification mechanism ©
- Requires knowledge of failure probabilities 🙁

## Question

Introduction 000000



Is it possible to obtain the (correct) result of a computation for a lower energy budget than that of the best DVFS / NTC solution?

- 2 Algorithmic Approach

Consider a task and a set of voltages V:

| Voltages                          | $V_1$                 | $V_2$                 | <br>$V_m = V_{\mathrm{TH}}$ |
|-----------------------------------|-----------------------|-----------------------|-----------------------------|
| $\mathbb{P}(V_{\ell}	ext{-fail})$ | $p_1$                 | <i>p</i> <sub>2</sub> | <br>$p_m=0$                 |
| Cost                              | <i>c</i> <sub>1</sub> | <i>c</i> <sub>2</sub> | <br>C <sub>m</sub>          |

## Model Assumptions

Consider a task and a set of voltages V:

| Voltages                                   | $V_1$                 | $V_2$                 | <br>$V_m = V_{\mathrm{TH}}$ |
|--------------------------------------------|-----------------------|-----------------------|-----------------------------|
| $\mathbb{P}(\mathit{V}_{\ell}	ext{-fail})$ | $p_1$                 | <i>p</i> <sub>2</sub> | <br>$p_{m} = 0$             |
| Cost                                       | <i>c</i> <sub>1</sub> | <i>c</i> <sub>2</sub> | <br>C <sub>m</sub>          |

Remember: timing errors always strike twice.

Consider a task and a set of voltages  $\mathcal{V}$ :

| Voltages                          | $V_1$                 | $V_2$                 | <br>$V_m = V_{\text{TH}}$ |
|-----------------------------------|-----------------------|-----------------------|---------------------------|
| $\mathbb{P}(V_{\ell}	ext{-fail})$ | $p_1$                 | <i>p</i> <sub>2</sub> | <br>$p_{m} = 0$           |
| Cost                              | <i>c</i> <sub>1</sub> | <i>c</i> <sub>2</sub> | <br>C <sub>m</sub>        |

Simulations

Remember: timing errors always strike twice.

- When an error strikes, a higher voltage *must* be used
- Switching from voltage  $V_{\ell}$  to  $V_{h}$  incurs a cost  $o_{\ell,h}$
- Execution at  $V_{\rm TH}$  always succeeds

## Model Assumptions

Consider a task and a set of voltages  $\mathcal{V}$ :

| Voltages                          | $V_1$                 | $V_2$                 | <br>$V_m = V_{\mathrm{TH}}$ |
|-----------------------------------|-----------------------|-----------------------|-----------------------------|
| $\mathbb{P}(V_{\ell}	ext{-fail})$ | $p_1$                 | <i>p</i> <sub>2</sub> | <br>$p_{m} = 0$             |
| Cost                              | <i>c</i> <sub>1</sub> | <i>c</i> <sub>2</sub> | <br>C <sub>m</sub>          |

Simulations

Remember: timing errors always strike twice.

- When an error strikes, a higher voltage *must* be used
- Switching from voltage  $V_{\ell}$  to  $V_h$  incurs a cost  $o_{\ell,h}$
- Execution at  $V_{\text{TH}}$  always succeeds

How to compute failure probabilities?

## Model Assumptions

Consider a task and a set of voltages  $\mathcal{V}$ :

| Voltages                                   | $V_1$                 | $V_2$                 | <br>$V_m = V_{\text{TH}}$ |
|--------------------------------------------|-----------------------|-----------------------|---------------------------|
| $\mathbb{P}(\mathit{V}_{\ell}	ext{-fail})$ | $p_1$                 | <i>p</i> <sub>2</sub> | <br>$p_{m} = 0$           |
| Cost                                       | <i>c</i> <sub>1</sub> | <i>c</i> <sub>2</sub> | <br>C <sub>m</sub>        |

Simulations

Remember: timing errors always strike twice.

- When an error strikes, a higher voltage *must* be used
- Switching from voltage  $V_{\ell}$  to  $V_h$  incurs a cost  $o_{\ell,h}$
- Execution at  $V_{\rm TH}$  always succeeds

How to compute failure probabilities?

And optimal sequence of voltages?

• Given an operation and an input I, there exists a *threshold* voltage  $V_{\text{TH}}(I)$ :

Simulations

- $V < V_{\rm TH}(I)$  will always lead to an incorrect result
- $V \geq V_{\text{th}}(I)$  will always lead to a successful execution

## Property of Timing Errors

- Given an operation and an input I, there exists a *threshold* voltage  $V_{\text{TH}}(I)$ :
  - $V < V_{\text{TH}}(I)$  will always lead to an incorrect result
  - $V \geq V_{\text{\tiny TH}}(I)$  will always lead to a successful execution
- ② Given an operation and a voltage  $V \in \mathcal{V}$ :
  - $\bullet$   $\, {\cal I}$  denotes the set of all possible inputs
  - $\mathcal{I}_f(V) \subseteq \mathcal{I}$  is the set of inputs that will fail at voltage V
  - Failure probability is computed as  $p_V = |\mathcal{I}_f(V)|/|\mathcal{I}|$
  - ullet For any two voltages  $V_1 \geq V_2$ , we have  $\mathcal{I}_f(V_1) \subseteq \mathcal{I}_f(V_2)$

## Property of Timing Errors

- Given an operation and an input I, there exists a threshold voltage V<sub>TH</sub>(I):
  - $V < V_{\text{TH}}(I)$  will always lead to an incorrect result
  - $V \geq V_{\text{\tiny TH}}(I)$  will always lead to a successful execution
- ② Given an operation and a voltage  $V \in \mathcal{V}$ :
  - $\bullet$   $\, {\cal I}$  denotes the set of all possible inputs
  - $\mathcal{I}_f(V) \subseteq \mathcal{I}$  is the set of inputs that will fail at voltage V
  - Failure probability is computed as  $p_V = |\mathcal{I}_f(V)|/|\mathcal{I}|$
  - ullet For any two voltages  $V_1 \geq V_2$ , we have  $\mathcal{I}_f(V_1) \subseteq \mathcal{I}_f(V_2)$

$$\mathbb{P}(V_{\ell}\text{-fail}\mid V_0V_1\cdots V_{\ell-1}\text{-fail}) = \frac{\mid \mathcal{I}_f(V_{\ell})\mid/\mid \mathcal{I}\mid}{\mid \mathcal{I}_f(V_{\ell-1})\mid/\mid \mathcal{I}\mid} = \frac{p_{\ell}}{p_{\ell-1}}$$

Consider a sequence L of k voltages  $V_1 < V_2 < \cdots < V_k = V_{\text{\tiny TH}}$ ,

Simulations

Execution starts at voltage  $V_1$ :

- 1 In case of success, return result ©
- In case of failure, go to next (higher) voltage

# Energy Consumption of a Single Task

Consider a sequence L of k voltages  $V_1 < V_2 < \cdots < V_k = V_{ ext{\tiny TH}}$ ,

Simulations

Execution starts at voltage  $V_1$ :

- In case of success, return result ©
- In case of failure, go to next (higher) voltage

$$E(L) = c_1 + p_1 \left( o_{1,2} + c_2 + \frac{p_2}{p_1} \left( o_{2,3} + c_3 + \dots \frac{p_{k-1}}{p_{k-2}} (o_{k-1,k} + c_k) \right) \right)$$
  
=  $c_1 + p_1 (o_{1,2} + c_2) + p_2 (o_{2,3} + c_3) + \dots + p_{k-1} (o_{k-1,k} + c_k)$ 

## Energy Consumption of a Single Task

Consider a sequence L of k voltages  $V_1 < V_2 < \cdots < V_k = V_{\text{TH}}$ ,

Execution starts at voltage  $V_1$ :

- In case of success, return result <a>©</a>
- In case of failure, go to next (higher) voltage

$$E(L) = c_1 + p_1 \left( o_{1,2} + c_2 + \frac{p_2}{p_1} \left( o_{2,3} + c_3 + \dots \frac{p_{k-1}}{p_{k-2}} (o_{k-1,k} + c_k) \right) \right)$$
  
=  $c_1 + p_1 (o_{1,2} + c_2) + p_2 (o_{2,3} + c_3) + \dots + p_{k-1} (o_{k-1,k} + c_k)$ 

We generalize:

$$E(L) = c_1 + \sum_{\ell=2}^{k} p_{\ell-1} \left( o_{\ell-1,\ell} + c_{\ell} \right) \tag{1}$$

Simulations

## Optimal Sequence of Voltages

#### Theorem

To minimize the expected energy consumption for a single task, the optimal sequence of voltages to execute the task with a preset voltage  $V_p \in \mathcal{V}$  of the system can be obtained by dynamic programming with complexity  $O(k^2)$ .

Simulations

## Optimal Sequence of Voltages

#### Theorem

To minimize the expected energy consumption for a single task, the optimal sequence of voltages to execute the task with a preset voltage  $V_p \in \mathcal{V}$  of the system can be obtained by dynamic programming with complexity  $O(k^2)$ .

$$E(L_s^*) = c_s + \min_{s < \ell < k} \{ E(L_\ell^*) - c_\ell + p_s(o_{s,\ell} + c_\ell) \}$$
 (2)

Simulations

and the optimal sequence starting with  $V_s$  is  $L_s^* = \langle V_s, L_{\ell}^* \rangle$  where

$$\ell' = \operatorname*{arg\,min}_{s<\ell< k} \left\{ E(L_\ell^*) + p_s o_{s,\ell} + (p_s-1)c_\ell \right\}.$$

The dynamic program is initialized with  $E(L_k^*) = c_k$  and  $L_k^* = \langle V_k \rangle$ 

## Chain of Tasks

• Without switching cost: optimal sequence for one task can be used to execute each task.

 Without switching cost: optimal sequence for one task can be used to execute each task.

Simulations

- With switching cost:
  - ullet After execution of a task, platform is left at voltage  $V_{
    m e}$
  - ullet Optimal sequence starts at voltage  $V_s$
  - Additional switching cost  $o_{e,s}$  must be paid
  - Algorithm for one task is no longer optimal

- Without switching cost: optimal sequence for one task can be used to execute each task.
- With switching cost:
  - After execution of a task, platform is left at voltage  $V_e$
  - Optimal sequence starts at voltage  $V_s$
  - Additional switching cost o<sub>e.s</sub> must be paid
  - Algorithm for one task is no longer optimal

#### Theorem

To minimize the expected energy consumption for a linear chain of tasks, the optimal sequence of voltages to execute each task, given the terminating voltage of its preceding task (or given the preset voltage  $V_p$  of the system for the first task), can be obtained by dynamic programming with complexity  $O(nk^2)$ .

Simulations

- Simulations

## Blocked Matrix-Matrix Multiplication

Consider the blocked matrix multiplication  $C = A \times B$ .

#### **Application Workflow**

$$\begin{array}{l} \text{for } i=1 \text{ to } \lceil \frac{m}{b} \rceil \text{ do} \\ \text{for } j=1 \text{ to } \lceil \frac{m}{b} \rceil \text{ do} \\ \text{for } k=1 \text{ to } \lceil \frac{m}{b} \rceil \text{ do} \\ C_{i,j} \leftarrow C_{i,j} + A_{i,k} \times B_{k,j} \end{array}$$

- m matrix size
- b block size

Simulations

# Blocked Matrix-Matrix Multiplication

Consider the blocked matrix multiplication  $C = A \times B$ .

#### **Application Workflow**

$$\begin{aligned} & \textbf{for } i = 1 \text{ to } \left\lceil \frac{m}{b} \right\rceil \, \textbf{do} \\ & \textbf{for } j = 1 \text{ to } \left\lceil \frac{m}{b} \right\rceil \, \textbf{do} \\ & \textbf{for } k = 1 \text{ to } \left\lceil \frac{m}{b} \right\rceil \, \textbf{do} \\ & C_{i,j} \leftarrow C_{i,j} + A_{i,k} \times B_{k,j} \end{aligned}$$

- m matrix size
- b block size

ABFT can be used to add per-block verification.

Let 
$$e^T = [1, 1, \dots, 1]$$
, we define

$$A^c := \begin{pmatrix} A \\ e^T A \end{pmatrix}, B^r := \begin{pmatrix} B & Be \end{pmatrix}, C^f := \begin{pmatrix} C & Ce \\ e^T C & e^T Ce \end{pmatrix}.$$

Where  $A^c$  is the column checksum matrix,  $B^r$  is the row checksum matrix and  $C^f$  is the full checksum matrix.

Simulations

Simulations

Let  $e^T = [1, 1, \dots, 1]$ , we define

$$A^c := \begin{pmatrix} A \\ e^T A \end{pmatrix}, B^r := \begin{pmatrix} B & Be \end{pmatrix}, C^f := \begin{pmatrix} C & Ce \\ e^T C & e^T Ce \end{pmatrix}.$$

Where  $A^c$  is the column checksum matrix,  $B^r$  is the row checksum matrix and  $C^f$  is the full checksum matrix.

$$A^{c} \times B^{r} = \begin{pmatrix} A \\ e^{T} A \end{pmatrix} \times \begin{pmatrix} B & Be \end{pmatrix}$$
$$= \begin{pmatrix} AB & ABe \\ e^{T} AB & e^{T} ABe \end{pmatrix} = \begin{pmatrix} C & Ce \\ e^{T} C & e^{T} Ce \end{pmatrix} = C^{f}$$

Consider the matrix multiplication as a chain of  $n = \lceil \frac{m}{h} \rceil^3$  tasks.

Simulations

000000000

#### Time to Execute one Task

- $t = \tau \cdot w/\eta$ 
  - $\tau = 1/f$  time to do one cycle
  - $\eta = 0.8$  peak performance
- $w = b(b+1)^2 + \sigma$ 
  - $\sigma = 8^3$  initialization overhead
  - (b+1) ABFT overhead

### From [Razor] for a FPGA at f = 90MHz and 27°C:

- Set of voltages
- Timing errors probabilities

#### **Failure Probabilities**

Consider a set of voltages  $\mathcal{V}.$  For any voltage  $V_\ell \in \mathcal{V}$ 

- $p_{\ell} = 1 (1 p_{\ell}^{(1)}/\gamma)^{w}$ 
  - $p_{\ell}^{(1)}$  probability of timing error for one random operation
  - $\gamma$  correction factor (circuit-level)

# Platform Settings (2/2)

## From [Razor] for a FPGA at f = 90MHz and $27^{\circ}C$ :

- Set of voltages
- Timing errors probabilities

#### **Dynamic Power Consumption**

- $P(V, f) = \alpha f V^2$
- ullet We assume (wlog) lpha f=1

# Platform Settings (2/2)

## From [Razor] for a FPGA at f = 90MHz and 27°C:

- Set of voltages
- Timing errors probabilities

#### **Dynamic Power Consumption**

- $P(V, f) = \alpha f V^2$
- We assume (wlog)  $\alpha f = 1$

### **Voltage Switching Cost**

- $\bullet o_{\ell,h} = \begin{cases} 0, & \text{if } \ell = h \\ \beta \cdot \frac{|V_\ell V_h|}{|V_k V_1|} & \text{otherwise} \end{cases}$   $\bullet \beta = o_{1,k}$

 N-Voltage: Baseline algorithm that applies NTC and always uses threshold voltage.

Simulations

000000000

- DP<sub>1</sub>-detect & DP<sub>1</sub>-correct: Optimal dynamic programming algorithms for a single task.
- $DP_n$ -detect &  $DP_n$ -correct: Optimal dynamic programming algorithms for a chain of tasks.

detect algorithms use ABFT for error detection.
correct algorithms use ABFT for detection and correction.

## Failure probabilities



Figure: Failure probabilities for one operation and for one task under different block sizes and voltages.

## Simulations (without switching cost)



Figure: Impact of b and  $\gamma$  on expected energy consumption, without voltage switching costs. For (ii), only results for  $DP_n$ -correct algorithm are shown.

(ii)

## Simulations (with switching cost)



Figure: Impact of b and  $\beta$  on expected energy consumption. Voltage switching cost equivalent to energy consumed by one  $32 \times 32$  matrix multiplication at threshold voltage without overhead.

- 4 Conclusion

#### Conclusion

We use dynamic voltage overscaling to reduce power consumption.

#### **Summary**

- Software based approach
- Model for timing errors
- Optimal polynomial-time solution for a chain of tasks
- Simulations for matrix multiplication using ABFT

## Conclusion

We use dynamic voltage overscaling to reduce power consumption.

#### **Summary**

- Software based approach
- Model for timing errors
- Optimal polynomial-time solution for a chain of tasks
- Simulations for matrix multiplication using ABFT

Original problem and encouraging results; needs further research.

### Conclusion

We use dynamic voltage overscaling to reduce power consumption.

#### **Summary**

- Software based approach
- Model for timing errors
- Optimal polynomial-time solution for a chain of tasks
- Simulations for matrix multiplication using ABFT

Original problem and encouraging results; needs further research.

#### Future Work

- Algorithms for other task graphs
- Additional simulations, emulations and experiments