



#### ATE-Accuracy Trade-Offs for Approximate Adders and Multipliers in Pipelined Processor Datapaths

M. Weißbrich<sup>1</sup>, A. Najafi<sup>2</sup>, A. García-Ortiz<sup>2</sup> and G. Payá-Vayá<sup>1</sup>

- 1) Institute of Microelectronic Systems, Leibniz Universität Hannover
- 2) Institute of Electrodynamics and Microelectronics, Universität Bremen



3<sup>rd</sup> Workshop on Approximate Computing (AxC18), 31.05.2018








- Precise computation results not necessary in many applications
  - (Noisy) image processing, edge & feature detection etc.
- Imprecise results in hardware by...
  - Approximate Computing: Deterministic designs without timing violations
  - Stochastic Computing: Deterministic designs with timing violations
    - Not considered here












- Pipelining of processor architectures for image processing:
  - Independent data, no pipeline conflicts
  - Direct performance boost for long vector operations
  - Pipelining not considered in comparisons of approximate arithmetic
- Influence of pipelining on:
  - Area efficiency
  - Energy efficiency
- Needs to be explored for architectural decisions





#### Implementation & Flow Strategy

#### 1. Generic VHDL description

- Precise arithmetic described behaviorally
- Tool selects efficient implementation considering user constraints

VHDL

```
C <= A + B;
```
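
For context, the one-line assignment above could sit in a complete design unit like the following. This is only a minimal sketch: the entity name `generic_adder`, the generic `WIDTH`, and the choice of `unsigned` operands are assumptions made for illustration and are not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; name, generic, and port types are illustrative only.
entity generic_adder is
    generic (
        WIDTH : positive := 32  -- operand width in bits
    );
    port (
        A : in  unsigned(WIDTH - 1 downto 0);
        B : in  unsigned(WIDTH - 1 downto 0);
        C : out unsigned(WIDTH - 1 downto 0)
    );
end entity generic_adder;

architecture behavioral of generic_adder is
begin
    -- No adder structure is spelled out here. The synthesis tool chooses
    -- an implementation based on the timing and area constraints it is given.
    C <= A + B;
end architecture behavioral;
```

Because the architecture is purely behavioral, the same source can map to different gate-level adders depending on the constraints.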








#### 2. Pipeline

- Described behaviorally by output registration (N registers)
- Tool performs retiming/register balancing

VHDL

```
C_tmp <= A + B;
process(clk)
begin
    if rising_edge(clk) then
        C <= C_tmp;
    end if;
end process;
```



Synthesis

1. One-Stage Synthesis
   - 1 timing constraint
2. Two-Stage Synthesis
   - *Relaxed* constraint for selection
   - *Desired* constraint for balancing

Figure (not reproduced): axis "Clock Period in ns (log2 scale)"
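
The single output register in the snippet above can be extended to N registers, giving the retiming step a configurable number of pipeline stages to distribute. The following is a rough sketch under stated assumptions: the entity name `pipelined_adder` and the generics `WIDTH` and `STAGES` are hypothetical, and omitting a reset is a design choice made here, not something stated on the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; names and generics are illustrative only.
entity pipelined_adder is
    generic (
        WIDTH  : positive := 32;  -- operand width in bits
        STAGES : positive := 2    -- number of output registers (N)
    );
    port (
        clk : in  std_logic;
        A   : in  unsigned(WIDTH - 1 downto 0);
        B   : in  unsigned(WIDTH - 1 downto 0);
        C   : out unsigned(WIDTH - 1 downto 0)
    );
end entity pipelined_adder;

architecture behavioral of pipelined_adder is
    type pipe_t is array (0 to STAGES - 1) of unsigned(WIDTH - 1 downto 0);
    signal pipe : pipe_t;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- First register captures the combinational result.
            pipe(0) <= A + B;
            -- Remaining registers form a plain shift chain at the output.
            -- For STAGES = 1 the loop range is empty and nothing is added.
            for i in 1 to STAGES - 1 loop
                pipe(i) <= pipe(i - 1);
            end loop;
        end if;
    end process;

    C <= pipe(STAGES - 1);
end architecture behavioral;
```

As written, all registers sit behind the adder. It is up to the synthesis tool's retiming/register balancing to move them into the arithmetic logic; the result latency stays at `STAGES` clock cycles either way.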





- Precise adders in VHDL inferred using the `+` operator
  - Precise reference architecture: Parallel-Prefix architecture (32b)
- Approximate adder designs:
  - Mohapatra et al., *Design of voltage-scalable meta-functions for approximate computing*, 2011
  - Zhu et al., *An enhanced low-power high-speed adder for error-tolerant application*, 2009
  - LOA: Mahdiani et al., *Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications*, 2010
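
To illustrate how an approximate adder can still infer its precise sub-component with the `+` operator, here is a sketch of a lower-part OR adder (LOA) as it is commonly described: the lower bits are a bitwise OR of the operands, and the AND of the two most significant lower-part bits feeds the carry-in of a precise upper adder. The entity name `loa_adder` and the generics `WIDTH` and `LOW_WIDTH` are hypothetical, and this is not the authors' library code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity; names and generics are illustrative only.
entity loa_adder is
    generic (
        WIDTH     : positive := 32;  -- total operand width in bits
        LOW_WIDTH : positive := 8    -- width of the approximate lower part
    );
    port (
        A : in  unsigned(WIDTH - 1 downto 0);
        B : in  unsigned(WIDTH - 1 downto 0);
        C : out unsigned(WIDTH - 1 downto 0)
    );
end entity loa_adder;

architecture behavioral of loa_adder is
    signal carry_in : unsigned(0 downto 0);
begin
    -- The upper part must keep at least one bit.
    assert LOW_WIDTH < WIDTH
        report "LOW_WIDTH must be smaller than WIDTH"
        severity failure;

    -- Approximate lower part: bitwise OR instead of an addition.
    C(LOW_WIDTH - 1 downto 0) <=
        A(LOW_WIDTH - 1 downto 0) or B(LOW_WIDTH - 1 downto 0);

    -- Carry estimate from the most significant bits of the lower part.
    carry_in(0) <= A(LOW_WIDTH - 1) and B(LOW_WIDTH - 1);

    -- Precise upper part: inferred with "+", so the tool selects the adder.
    C(WIDTH - 1 downto LOW_WIDTH) <=
        A(WIDTH - 1 downto LOW_WIDTH) + B(WIDTH - 1 downto LOW_WIDTH) + carry_in;
end architecture behavioral;
```

In this sketch the carry-out of the upper part is discarded, so the result wraps at `WIDTH` bits like the precise `C <= A + B;` with equally sized operands.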





- Precise multipliers in VHDL inferred using the `*` operator
  - Precise reference architecture: Radix-4 Booth (32b)
- Approximate multiplier designs:
  - ETM: Kyaw et al., *Low-power high-speed multiplier for error-tolerant application*, 2010
  - Mahdiani et al., *Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications*, 2010





### **ATE-Accuracy Profiling**





### **ATE-Accuracy Profiling: Approximate Multipliers**











#### Conclusion

- Exploration of approximate adders and multipliers with a generic design and a parameterizable number of pipeline stages
  - Pipelining applied by register balancing
- Besides higher performance, pipelining can increase the area and energy efficiency of arithmetic units at a fixed performance
  - Pipelining-aware two-phase synthesis flow
  - Area reduction of up to 20%
  - Energy reduction of up to 11%

Pipelining aids in meeting the target timing constraint without switching to an approximate unit with lower accuracy






# Thank you for your attention!






- Specialized architectures for operating on independent, massively vectorizable data
  - Found in image processing, e.g., filtering (2D convolution, MAC), image differences (pixel-wise subtraction)
  - Programming flexibility while maintaining high processing performance
  - Approximate arithmetic for higher performance, area- or energy-efficiency
    - Approximate adder and multiplier designs









# Backup: Analysis Framework for *Approximate* and *Stochastic Computing* Processor Architectures

- Generic VHDL description, exploit data-level parallelism of image processing algorithms
- Generic VHDL library of approximate adders & multipliers
- FLINT+ FPGA-based Timing Analysis Framework for Stochastic Computing Operation











- Generic VHDL implementation strategy for approximate adder and multiplier designs
  - Inferring optimized precise sub-components
- Pipelining-aware ASIC synthesis flow for area-efficient gate-level implementations
  - Configurable number of pipeline stages
- Area-Timing-Energy-Accuracy profiling for pipelined approximate adders and multipliers






1. Architecture selection and mapping with a *relaxed* timing constraint → more area-efficient
2. Incremental synthesis with retiming/register balancing at the *desired* timing constraint

