#### *Computer Architecture: Fundamentals, Tradeoffs, Challenges*

# Chapter 4: Because of Pipelines (out-of-order execution, branch prediction)

# Yale Patt The University of Texas at Austin

Austin, Texas Spring, 2023

# Outline

- Out-of-order Execution
  - The Tomasulo Algorithm (the IBM 360/91)
    - Pipelining with a hiccup
    - An example
  - What was wrong with it (did not have precise exceptions)
  - How do we fix it (the Reorder Buffer)
    - Pipelining with a second hiccup

#### Branch Prediction

- Other ways to deal with conditional branches
  - Predication (the THUMB IT instruction)
  - Multiple pipelines
  - Delayed branch
  - Combine branches
- Compile time branch prediction (BTFN, hint bits)
- Runtime (LT, 2bitCtr, Two-level, Hybrid, Perceptron, TAGE)
- The HEP (Burton Smith, Donelcor (~1978)

# An example

MUL R3,R1,R2 ADD R4,R2,R3 MUL R2,R6,R7 ADD R5,R2,R3 ADD R3,R3,R5

#### **Out-of-Order execution (example)**

#### • In-order

|     |     |     |    | 1 | 2 | 3 | 4 |   |   |   |      |   |   |   |   |   |   |      |   |   |   |    |   |   |   | 23  | £ |
|-----|-----|-----|----|---|---|---|---|---|---|---|------|---|---|---|---|---|---|------|---|---|---|----|---|---|---|-----|---|
| MUL | 13  | RI, | L2 | F | D | ~ | ~ | - | ~ | 1 | (R3) |   |   | 1 |   |   |   | -    |   |   |   | 1  |   |   |   |     |   |
|     |     | 22  |    |   | F | D |   |   |   |   |      | V | V | ~ | A | > |   |      |   |   |   |    |   |   |   |     |   |
| MUL | R2  | RG  | RT |   |   | F |   |   |   |   |      | D | V | V | L | V | ~ | (22) |   |   |   |    |   |   |   |     |   |
| ADD | RS  | RZ  | RS |   |   |   |   |   |   |   |      | F | D |   |   |   |   |      | V | V | ~ | RS | 2 |   |   |     |   |
| ADD | R3. | R3  | RS |   |   |   |   |   |   |   |      |   | F |   |   |   |   |      | P |   |   |    | ~ | ~ | - | R3) |   |

#### Out-of-Order

|     |    |     |     | 1 | 2             | 3 | 4 |   |   |   |     |   |      |   |    |   |    |   |   |   | 1   | 8 | <br>1 | <br> |  |
|-----|----|-----|-----|---|---------------|---|---|---|---|---|-----|---|------|---|----|---|----|---|---|---|-----|---|-------|------|--|
| MUL | R3 | RI, | RZ  | F | $\mathcal{D}$ | V | V | V | V | V | (3) |   |      |   |    |   |    |   |   |   |     |   |       |      |  |
| ADD | R4 | RZ  | R3  |   | F             | D |   |   |   |   |     | ~ | v    | V | R4 | ) |    |   |   |   |     |   |       |      |  |
| MUL | /  |     |     |   |               | F | D | V | V | ~ | ~   | ~ | (AZ) |   |    |   |    |   |   |   |     |   |       |      |  |
| ADD | RS | RZ  | 123 |   |               |   | F | D |   |   |     |   |      | V | V  | ~ | RS |   |   |   |     |   |       |      |  |
|     |    | R3  | 125 |   |               |   |   | F | D |   |     |   |      | • |    |   |    | 1 | V | / | R3) |   |       |      |  |

### **Tomasulo Algorithm Microarchitecture**



# Pipelining with "hiccups"

Tomasulo (~1965)



• HPS (~1985)



# **Out-of-order commit violates precise exceptions**

### Precise exceptions

- Identity of the faulting instruction is known
- State recovered to just before faulting instruction
- A simple example:

# *DIV R1,R2,R3 ADD R2,R3,R4*

ADD takes one cycle and stores result in R2 DIV eventually takes an exception. Not possible to recover R2 prior to DIV execution

# **Branch Prediction**

- Ways to deal with Conditional Branches
  - Compile time
    - BTFN
    - Hint bits
    - Predication (the THUMB IT instruction)
    - Delayed branch
    - Combine branches

#### - Run time Branch Prediction

- *LT*
- 2bit Counter
- Two-level
- Hybrid
- Perceptron
- TAGE
- The HEP (Burton Smith, Donelcor (~1978)

## **Predicated Execution (THUMB ISA)**

• Example 1

(4 instructions)





## **Dynamic branch prediction**

Last time taken

(Store a bit in the I Cache Tag Store)

• Saturating 2 bit counter (Use high bit to predict)



Two-level Predictor

Gshare Modification



# The Heterogeneous Element Processor (HEP)

- Burton Smith (at Donelcor ~1975)
- Introduction of Multithreading
  - Burton always credited CDC6600; I credit Burton!
- The mechanism (modified for explanation)
  - Four stages, HEP had 8
  - Four threads in progress, HEP had 8
  - Not shown: the hopper (20 threads waiting to participate)
    - IF a thread's instruction could not finish in time for next instruction

#### cycle1 cycle2 cycle3 cycle4 cycle5 cycle6 cycle7 cycle8 cycle9

fetch1 decod execu store fetch2 decod execu store fetch3 fetch1 decod execu store fetch2 decod execu store fetch1 decod execu store fetch2 decod execu fetch1 decod execu store fetch2 decod

#### Xie xie!