
































# Review: Loops

- By default, loops are rolled
  - Each C loop iteration → Implemented in the same state
  - Each C loop iteration → Implemented with same resources
  - Example loop: `Add: for (i=3; i>=0; i--) {`
- Loops can be unrolled if their indices are statically determinable at elaboration time
  - Not when the number of iterations is variable

Improving Performance 13-18

© Copyright 2013 Xilinx

XILINX ALL PROGRAMMABLE.
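As a rough illustration of the rolled versus unrolled distinction, the sketch below writes the `Add` loop with a fixed bound so that its trip count is known at elaboration time and an `UNROLL` directive can apply. The function name `foo_top`, the constant `N`, and the choice to apply the directive as a pragma are assumptions made for this example. An ordinary C compiler ignores the pragma, so the code still runs as plain C.

```c
#define N 4  /* fixed bound: the trip count is known at elaboration time */

/* Accumulate the elements of a[] into b (hypothetical example function). */
int foo_top(const int a[N], int b)
{
    int i;
Add:
    for (i = N - 1; i >= 0; i--) {
/* Vivado HLS directive requesting full unrolling of this loop.
   Without it, the loop stays rolled and each iteration reuses the same adder. */
#pragma HLS UNROLL
        b = a[i] + b;
    }
    return b;
}
```

If the bound were a function argument instead of the constant `N`, the loop could not be unrolled, because the number of iterations would be variable.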















# Loop Reports

- Vivado HLS reports the latency of loops
  - Shown in the report file and GUI
- Given a variable loop index, the latency cannot be reported
  - Vivado HLS does not know the loop iteration count
  - This results in latency reports showing unknown values
- The loop tripcount (iteration count) can be specified
  - Apply to the loop in the directives pane
  - Allows the reports to show an estimate
  - Impacts reporting, not synthesis

Figure: Performance Estimates before and after applying Directive: LOOP_TRIPCOUNT in the Vivado HLS Directive Editor (min: 200, max: 1280, average left empty). Timing in both reports: clock target 10.00 ns, estimated 8.74 ns, uncertainty 1.25 ns.

| Report                 | Latency min | Latency max | Interval min | Interval max | Type |
|------------------------|-------------|-------------|--------------|--------------|------|
| Without LOOP_TRIPCOUNT | ?           | ?           | ?            | ?            | none |
| With LOOP_TRIPCOUNT    | 561205      | 34417925    | 561206       | 34417926     | none |
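The same tripcount can also be expressed in the source as a pragma. The sketch below is a minimal illustration: `sum_row` and its parameters are hypothetical names, and only the `min=200 max=1280` values come from the slide.

```c
/* Sum the first `width` samples of a row; `width` is only known at run time,
   so the tool cannot work out the loop latency on its own. */
int sum_row(const int *row, int width)
{
    int total = 0;
    int x;
Row:
    for (x = 0; x < width; x++) {
/* Reporting hint only: gives the latency report a range to work with.
   It does not change the hardware that is synthesized. */
#pragma HLS LOOP_TRIPCOUNT min=200 max=1280
        total += row[x];
    }
    return total;
}
```

The function behaves identically with or without the pragma, including for values of `width` outside the stated range.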









# Dataflow Optimization Commands

- Dataflow is set using **a directive**
  - Vivado HLS will seek to create the highest performance design
  - Throughput of 1

Figure: Vivado HLS Directive Editor with Directive: DATAFLOW selected; the Destination can be the Source File or the Directive File.
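When the Source File destination is chosen, the directive appears in the code as a pragma. The following is a minimal sketch of what that might look like. The functions `scale`, `offset`, and `dataflow_top` are invented for this example and are not part of the course material.

```c
#define N 8

/* First task: double every sample. */
static void scale(const int in[N], int out[N])
{
    int i;
    for (i = 0; i < N; i++)
        out[i] = in[i] * 2;
}

/* Second task: add one to every sample. */
static void offset(const int in[N], int out[N])
{
    int i;
    for (i = 0; i < N; i++)
        out[i] = in[i] + 1;
}

/* Top-level function: the two tasks are candidates for concurrent execution. */
void dataflow_top(const int in[N], int out[N])
{
#pragma HLS DATAFLOW
    int tmp[N];  /* intermediate buffer between the two tasks */

    scale(in, tmp);
    offset(tmp, out);
}
```

In plain C the two calls run one after the other; the directive only affects how the tool schedules them in hardware.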





# Pipelining: Dataflow, Functions & Loops

### Dataflow Optimization

- Dataflow optimization is "coarse grain" pipelining at the function and loop level
- Increases concurrency between functions and loops
- Only works on functions or loops at the top-level of the hierarchy
  - Cannot be used in sub-functions

### Function & Loop Pipelining

- "Fine grain" pipelining at the level of the operators (\*, +, >>, etc.)
- Allows the operations inside the function or loop to operate in parallel
- Unrolls all sub-loops inside the function or loop being pipelined
  - Loops with variable bounds cannot be unrolled, which can prevent pipelining
  - Unrolling loops increases the number of operations and can increase memory and run time
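The interaction between pipelining and sub-loops can be sketched as follows. The names `mac_rows`, `ROWS`, and `TAPS` are assumptions for illustration. The point is that the inner `Tap` loop has a fixed bound, so it can be unrolled when the enclosing `Row` loop is pipelined.

```c
#define ROWS 4
#define TAPS 3

/* Multiply-accumulate each row of x against the coefficients c. */
void mac_rows(int x[ROWS][TAPS], const int c[TAPS], int y[ROWS])
{
    int r, t;
Row:
    for (r = 0; r < ROWS; r++) {
/* Pipelining the Row loop implies unrolling the Tap loop inside it. */
#pragma HLS PIPELINE II=1
        int acc = 0;
Tap:
        for (t = 0; t < TAPS; t++) {
            /* Fixed bound TAPS: unrolling yields three multiply-adds per row. */
            acc += x[r][t] * c[t];
        }
        y[r] = acc;
    }
}
```

Had `TAPS` been a run-time argument, the inner loop could not be unrolled, and that could prevent the outer loop from being pipelined.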
































# Array Dimensions

- The array options can be performed on dimensions
  - Directive option: `dimension (optional): 1`
  - For `my_array[10][6][4]`: Dimension 1 has size 10, Dimension 2 has size 6, Dimension 3 has size 4; Dimension 0 selects all dimensions
- Examples
  - `my_array[10][6][4]` → partition dimension 3 → `my_array_0[10][6]`, `my_array_1[10][6]`, `my_array_2[10][6]`, `my_array_3[10][6]`
  - `my_array[10][6][4]` → partition dimension 1 → `my_array_0[6][4]` through `my_array_9[6][4]`
  - `my_array[10][6][4]` → partition dimension 0 → 10x6x4 = 240 individual registers
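As a hedged sketch of how the dimension option might appear in source code, the example below partitions `my_array` along dimension 3. The function `sum_all` and the `D1`/`D2`/`D3` constants are illustrative assumptions; only the array shape and the dimension numbering come from the slide.

```c
#define D1 10
#define D2 6
#define D3 4

/* Copy a flat input into my_array[10][6][4] and return the sum of all elements. */
int sum_all(const int in[D1 * D2 * D3])
{
    int my_array[D1][D2][D3];
/* dim=3 splits along the last dimension, giving four [10][6] arrays.
   dim=1 would give ten [6][4] arrays; dim=0 partitions every dimension. */
#pragma HLS ARRAY_PARTITION variable=my_array complete dim=3
    int i, j, k;
    int total = 0;

    for (i = 0; i < D1; i++)
        for (j = 0; j < D2; j++)
            for (k = 0; k < D3; k++)
                my_array[i][j][k] = in[(i * D2 + j) * D3 + k];

    for (i = 0; i < D1; i++)
        for (j = 0; j < D2; j++)
            for (k = 0; k < D3; k++)
                total += my_array[i][j][k];

    return total;
}
```

Partitioning changes only how the array is mapped to memories or registers; the indexing in the C code stays the same.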











# Summary

- Optimizing Performance
  - Latency optimization
    - Specify latency directives
    - Unroll loops
    - Merge and Flatten loops to reduce loop transition overheads
  - Throughput optimization
    - Perform Dataflow optimization at the top-level
    - Pipeline individual functions and/or loops
    - Pipeline the entire function: beware of lots of operations, lots to schedule and it's not always possible
  - Array Optimizations
    - Focus on bottlenecks often caused by memory and port accesses
    - Removing bottlenecks improves latency and throughput
      - Use Array Partitioning, Reshaping, and Data packing directives to achieve throughput



## **Objectives**

- After completing this lab, you will be able to:
  - Add directives to your design
  - Understand the effect of INLINE-ing functions
  - Observe the effect of PIPELINE-ing functions
  - Improve the performance using various directives

Lab2 Intro 13a- 2




## **Procedure**

- … project by executing script from Vivado …
- Open the created project in Vivado HLS GUI and analyze
- Apply TRIPCOUNT directive using PRAGMA
- Apply PIPELINE directive, generate solution, and analyze the output
- Apply … directive to improve performance
- … the design

### Summary

In this lab you learned that even though this design could not be pipelined at the top level, a strategy of pipelining the individual loops and then using dataflow optimization to make the functions operate in parallel achieved the same high throughput, processing one pixel per clock. When the DATAFLOW directive is applied, default memory buffers (of ping-pong type) are automatically inserted between the functions. Because the design used only sequential (streaming) data accesses, the costly memory buffers associated with dataflow optimization could be replaced with simple two-element FIFOs using the Dataflow command configuration.
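The strategy described above can be sketched in a few lines. This is not the lab design: `invert`, `dim`, `filter_top`, and `PIXELS` are invented for illustration. The lab replaced the buffers through the Dataflow command configuration, whereas this sketch uses a per-array `STREAM` directive as a different way to request a two-element FIFO.

```c
#define PIXELS 16

/* Stage 1: invert each 8-bit pixel, reading and writing strictly in order. */
static void invert(const unsigned char in[PIXELS], unsigned char mid[PIXELS])
{
    int i;
    for (i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE II=1
        mid[i] = (unsigned char)(255 - in[i]);
    }
}

/* Stage 2: halve the brightness, again with sequential accesses only. */
static void dim(const unsigned char mid[PIXELS], unsigned char out[PIXELS])
{
    int i;
    for (i = 0; i < PIXELS; i++) {
#pragma HLS PIPELINE II=1
        out[i] = (unsigned char)(mid[i] / 2);
    }
}

/* Top level: the loops are pipelined individually and the functions overlap. */
void filter_top(const unsigned char in[PIXELS], unsigned char out[PIXELS])
{
#pragma HLS DATAFLOW
    unsigned char mid[PIXELS];
/* Assumption: request a 2-element FIFO here instead of the default
   ping-pong buffer. Valid only because mid[] is accessed sequentially. */
#pragma HLS STREAM variable=mid depth=2
    invert(in, mid);
    dim(mid, out);
}
```

A FIFO is only a safe substitute when every stage produces and consumes the data in the same order, which is the property the lab design relied on.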
