## Computer Architecture: Fundamentals, Tradeoffs, Challenges

## Chapter 12: Single Thread Parallelism

## Yale Patt The University of Texas at Austin

Austin, Texas Spring, 2023

# Outline

- Granularity of Concurrency
- Pipelining
- SIMD (both vector and array processing)
- VLIW
- DAE
- HPS
- Data Flow

## **Granularity of Concurrency**

- Intra-instruction (pipelining)
- Parallel instructions in a single thread (SIMD, VLIW)
- Tightly-coupled (multiprocessor)
- Loosely-coupled (multi-computer network)

#### Pipelining

Pipelined:

|                |                | S1             | E, | D1 | F <sub>1</sub> |
|----------------|----------------|----------------|----|----|----------------|
|                | S2             | E2             | D2 | F2 |                |
| s <sub>3</sub> | E3             | D3             | F3 |    |                |
|                | D <sub>4</sub> | F <sub>4</sub> |    | 18 |                |
|                | Fg             |                | 2  |    |                |

Superscalar:

| F <sub>1</sub> | D1             | E.             | S1             |                |                |
|----------------|----------------|----------------|----------------|----------------|----------------|
| F2             | D2             | E2             | S2             | 1              |                |
| F3             | D3             | E3             | 53             | 1              |                |
| F <sub>4</sub> | D <sub>4</sub> | E <sub>4</sub> | S4             | 1              |                |
|                | F <sub>5</sub> | Dg             | E <sub>5</sub> | s <sub>5</sub> |                |
|                |                | F6             | D <sub>6</sub> | E <sub>6</sub> | s <sub>6</sub> |
|                |                | F7             | D7             | E7             | 87             |
|                |                | Fg             | D <sub>8</sub> | Eg             | S <sub>8</sub> |
|                |                |                | F <sub>9</sub> |                |                |

Superpipelined:

| F | 1  | 1  | 4  | 1  | 1              | Г              | s1 |                |    |     |   |
|---|----|----|----|----|----------------|----------------|----|----------------|----|-----|---|
| T | F, | 2  | D  | 2  | E              | 2              | 1  | 52             | 1  |     |   |
| Ĩ | Τ  | F3 |    | 0  |                | E.3            |    | S              | 3  | Ì., |   |
|   | -  | T  | F4 | 1  | D <sub>4</sub> | 8              | 4  | T              | s4 | ٦   |   |
|   |    |    | F  | 5  | D              | 5              | E  | 5              | 5  | 5   | 1 |
|   |    | 8  | T  | Fe | Τ              | D <sub>6</sub> | T  | E <sub>6</sub> | T  | Se  | í |

#### SIMD/MIMD

SISD The Typical Pentium-Pro, for example MISD SIMD Array Processor, Vector Processor MIMD Multiprocessor

and, Note:



2





## Vector processing example

 The scalar code: for i=1,50 A(i) = (B(i)+C(i))/2; Vectorizable!

## • The vector code:

vadd V2, V0, V1

vshfr V3,V2,1

Ivl 50

vld V0,B

vld V1,C

vst V3,A

Ivs 1 ; load vector stride

- ; load vector length
  - ; load V0 from memory, starting at address B
  - ; load V1 from memory, starting at address C
    - ; V2 < --- V0 + V1

; V3 < --- V2 divided by 2 (shift right one bit)

; store V3 to memory, starting at address A

## Vector processing example (continued)

- Baseline: with a Scalar Processor:
  - Loads/Stores take 11 cycles
  - Add takes 4 cycles
  - Shift takes 1 cycle
  - Iteration Control takes 2 cycles
- 50 iterations of (LD, LD, Add, Shift, Store, Iteration Ctl)
  - 50 x (Load, Load, Add, Shift, Store, Iteration\_Ctl)
  - $-50 \times (11 + 11 + 4 + 1 + 11 \ 2) = 50 \times 40 = 2000 \text{ clock cycles}$

# Vector processing example (continued) Vector Processor Timing

Vector code (no vector chaining): 285 clock cycles



Vector code (with chaining): 182 clock cycles



• Vector code (with 2 load, 1 store port to memory):



79 clock cycles

# Important to note that SIMD can be either Vector Processors or Array Processors



#### SIMD

### Vector Processors, Array Processors



#### VLIW

#### Static Scheduling \*

- Everything in lock step Trace Scheduling -
- -

#### Generic Model \*



## **Trace Scheduling**

## Instruction flow

VLIW Code





## The Decoupled Access Execute (DAE) Paradigm (by Andrew Pleszkun (SMA), Jim Smith (DAE)





# \* Andrew Plezskun, Univ. of Illinois

#### SMA

## ★ James E. Smith, Univ of Wisconsin

DAE

## DAE (an example)



ORIGINAL CODE

## Timing of the DAE Example





# The HPS Paradigm

- Processing micro-ops! (Restricted Data Flow)
- Incorporated the following:
  - Aggressive branch prediction
  - Speculative execution
  - Wide issue
  - Out-of-order execution
  - In-order retirement
- First published in Micro-18 (1985)
  - Patt, Hwu, Shebanow: Introduction to HPS
  - Patt, Melvin, Hwu, Shebanow: Critical issues



FOR EXAMPLE, THE VAX INSTRUCTION :









HPS - WHAT IS IT ?



÷

\* RESTRICTED DATA FLOW

82 (n. 25)

HPS



# Data Flow

## Data Driven execution of inst-level Graphical code

- Nodes are operators
- Arcs specify a producer/consumer relationship
- Only REAL (i.e., flow) dependencies constrain processing
  - Flow dependencies do (read-after-write)
  - Anti-dependencies don't (write-after-read)
  - Output dependencies don't (write-after-write)
  - NO sequential I-stream (No program counter)
- Operations execute ASYNCHRONOUSLY
- Instructions do not reference memory

- (at least memory as we understand it)

• Execution is triggered by presence of data

#### A Unit of Computation:

#### The Data Flow Node



OR,

| * | R | ARG1 | R | ARG2 | Dest. Of Result |
|---|---|------|---|------|-----------------|



The Firing Rule:



## An Example Data Flow Program: <u>Factorial</u> (Done, Iteratively)



# **Concerns with Full Data Flow**

- Difficulty with taking an interrupt or exception
  - Too much state information for a consistent state
    - Some instructions have no data yet
    - Some instructions have one operand and not the other
  - Too much latency (time to create a consistent state
- Difficulty in verifying the data flow engine
  - The complexity of the engine
  - Multiple instruction flows
- Size of the representation of an algorithm
- Handling recursion
  - One operand is produced before the recursive call
  - One operand is produced after the recursive call
  - Another descriptor required to keep track of the iterations

teşekkür ederim

## Tesekkur ederim!