# Department of Electrical and Computer Engineering 

The University of Texas at Austin

EE 460N Fall 2018
Y. N. Patt, Instructor

Chirag Sakhuja, John MacKay, Aniket Deshmukh, Mohammad Behnia, TAs
Final Exam
December 14, 2018

Name:

## Part A

Problem 1 (10 points): $\qquad$

Problem 2 (10 points): $\qquad$
Problem 3 (10 points): $\qquad$

Problem 4 (10 points): $\qquad$
Problem 5 (10 points): $\qquad$
$\qquad$ Part B Total (80 points): $\qquad$

Total (130 points): $\qquad$

Note: Please be sure that your answers to all questions (and all supporting work that is required) are contained in the space provided.
Note: Please be sure your name is recorded on each sheet of the exam.

Please read the following sentence, and if you agree, sign where requested: I have not given nor received any unauthorized help on this exam.

Signature: $\qquad$

Name: $\qquad$

Problem 1 (10 points): An LC-3b system, with a 16-bit address space, has a 16 KB physically indexed, physically tagged cache. The cache is 4 way set associative, write-back, and uses a victim/next-victim replacement policy. A cache line is 64 B .

How many bits of storage are required for the tag store?

Name: $\qquad$

Problem 2 ( 10 points): A given program takes time $T$ to run on a processor. To improve performance, you consider using a newer processor that provides a $\mathbf{2 x}$ speedup over the older one. You also have the option of using any number of older processors to create a multi-core system. After analyzing the program, you find that $\mathbf{5 0 \%}$ of the program can be parallelized.

Part a (5 points): Which option performs better? Explain.
$\square$

To further improve performance you decide to construct a system consisting of the new processor and four older processors together. Due to energy constraints, either only the new processor or the four-core system can run at any given time.

Part b ( $\mathbf{5}$ points): What is the best speedup over the original execution time ( T ), given this configuration?

Name: $\qquad$

Problem 3 ( 10 points): For the LC-3b, an in-order pipeline was designed consisting of 5 stages: Fetch, Decode, Address Generation/Execute, Memory and Writeback. Each stage takes one cycle. Assume registers and condition codes are read in the Address Generation/Execute stage. Consider two different schemes:

1. No branch prediction or data forwarding. For a branch instruction, Fetch of the instruction following the branch occurs in the same cycle as when the branch reaches Writeback.
2. Perfect branch prediction, no data forwarding. Fetch of the next instruction following the branch occurs in the next cycle after the branch instruction is fetched.

The function FOO is executed on this processor.

```
FOO AND R0, R0, #0
    ADD R0, R0, #-1
    BRZ SKIP
    ADD R6, R6, #1
SKIP ADD R6, R6, #-1
    RET
```

The table below contains the instructions in FOO and the cycle in which each reaches the Writeback stage for the two schemes.

Your task: Fill in the missing entries.

| Instruction | Scheme 1 | Scheme 2 |
| :---: | :---: | :---: |
| AND R0, R0, \#0 | 5 | 5 |
| ADD R0, R0, \#-1 |  |  |
| BRZ SKIP |  |  |
| ADD R6, R6, \#1 |  |  |
| ADD R6, R6, \#-1 |  |  |
| RET |  |  |

We have provided on the next page four copies of a timing diagram that you can use for scratch work. Use as much or as little of it as you need.

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AND R0, R0, \#0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R0, R0, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BRZ SKIP |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| RET |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |


| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AND R0, R0, \#0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R0, R0, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BRZ SKIP |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| RET |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |


| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AND R0, R0, \#0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R0, R0, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BRZ SKIP |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| RET |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |


| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| AND R0, R0, \#0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R0, R0, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BRZ SKIP |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ADD R6, R6, \#-1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| RET |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Name: $\qquad$

Problem 4 ( 10 points): Consider an N-bit IEEE-style floating point representation. The BIAS is determined in exactly the same way as it is for 32-bit and 64-bit IEEE representations. For our N-bit representation, we are able to represent the value $12 \frac{1}{8}$ exactly, but we cannot represent the value $34 \frac{1}{4}$ exactly. The smallest positive normalized value we can represent is $2^{-62}$.

Of our N bits, how many are fraction bits?

$$
\text { Number of Fraction Bits: } \square
$$

Of our N bits, how many are exponents bits?
$\square$
What is N ?


What is the bias?
$\square$

Name: $\qquad$

Problem 5 ( 10 points): Recall my handout on virtual addressing, showing the VAX process but reducing the sizes of everything to make the example less tedious. For example: VA is 9 bits, as follows:


Physical memory is 256 bytes.

Page table entries are 8 bits each and are broken down as follows.


The following snippets of physical memory are known.

| Addr | Data |
| :---: | :---: |
| $\ldots$ | $\ldots$ |
| $0 \times 10$ | $0 \times 0 \mathrm{~F}$ |
| $0 \times 11$ | $0 \times 88$ |
| $0 \times 12$ | $0 \times 83$ |
| $0 \times 13$ | $0 \times 89$ |
| $\ldots$ | $\ldots$ |


| Addr | Data |
| :---: | :---: |
| $\ldots$ | $\ldots$ |
| $0 \times 30$ | $0 \times 80$ |
| $0 \times 31$ | $0 \times 89$ |
| $0 \times 32$ | $0 \times 88$ |
| $0 \times 33$ | $0 \times 8 \mathrm{~A}$ |
| $\ldots$ | $\ldots$ |


| Addr | Data |
| :---: | :---: |
| $\ldots$ | $\ldots$ |
| 0x90 | 0x0F |
| 0x91 | $0 \times 88$ |
| 0x92 | 0x83 |
| 0x93 | $0 \times 89$ |
| $\ldots$ | $\ldots$ |


| Addr | Data |
| :---: | :---: |
| $\ldots$ | $\ldots$ |
| 0xF0 | 0x03 |
| 0xF1 | 0x0E |
| 0xF2 | 0x81 |
| 0xF3 | 0x80 |
| $\ldots$ | $\ldots$ |

The instruction LOADBYTE R12,A is executed, where the virtual address A is $0 x 032$.

After the instruction is fetched and decoded, three physical memory accesses are necessary to obtain the byte to be loaded into R12. Fill in the physical addresses and data that are accessed.

The P0BR and SBR are 0x120 and 0xF0, respectively.


Name: $\qquad$

Problem 6 ( 20 points): This problem is similar to the Tomasulo question from Exam 1, with some key differences. Consider an out-of-order processor which executes its instructions based on the Tomasulo algorithm with in-order retirement. The ISA specifies 8 registers, R0 to R7. The execute stage of the pipeline contains one pipelined adder and one pipelined multiplier. Fetch and Decode take one cycle each. The ADD instruction takes additional 4 cycles to execute; the MUL instruction takes additional 5 cycles to execute. At the end of the last cycle of execution, each instruction will store its result into any reservation stations waiting for it and its ROB (Reorder Buffer) entry. Both ADD and MUL need one additional cycle to retire. Only a single instruction can retire at a time.

The adder and multiplier each have 2-entry reservation stations. The reservation stations and the ROB are initially empty and are filled from top to bottom. Each instruction remains in the reservation station until the end of the cycle in which it writes its result to its ROB entry.

In this example, the computer will execute a program fragment consisting of five instructions.
Part a (10 points): Determine the five instructions in the program fragment.

|  | Opcode | DR | SR1 | SR2 |
| :--- | :--- | :--- | :--- | :--- |
| I1 |  |  |  |  |
| I2 |  |  |  |  |
| I3 |  |  |  |  |
| I4 |  |  |  |  |
| I5 |  |  |  |  |

To help you we have included the state of the register file before the first instruction is fetched, and after all five instructions have retired (the bolded values indicate the values that changed). We have also included a snapshot of the ROB (Reorder Buffer) at the beginning of cycle X .

Note that each ROB entry contains 3 bits of state information. The Valid (V) bit indicates if the entry is in use, i.e. $\mathrm{V}=0$ indicates an empty slot in the Reorder Buffer. The Executed bit indicates that the entry corresponds to an instruction that has successfully executed. The Retirement bit indicates that the executed instruction has retired.

|  | V |  | Tag |
| :---: | :---: | :---: | :---: |
| Value |  |  |  |
| R0 | 1 | - | 4 |
| R1 | 1 | - | 11 |
| R2 | 1 | - | -6 |
| R3 | 1 | - | 3 |
| R4 | 1 | - | 205 |
| R5 | 1 | - | 49 |
| R6 | 1 | - | 12 |
| R7 | 1 | - | 25 |
|  |  |  |  |


|  | V | Tag | Value |
| :---: | :---: | :---: | :---: |
| R0 | 1 | - | 4 |
| R1 | 1 | - | $\mathbf{4 4}$ |
| R2 | 1 | - | -6 |
| R3 | 1 | - | $\mathbf{- 6 6}$ |
| R4 | 1 | - | $\mathbf{1}$ |
| R5 | 1 | - | 49 |
| R6 | 1 | - | 12 |
| R7 | 1 | - | 25 |
|  |  |  |  |

Register File Before Cycle 1 Register File After Execution

| V | Executed | Retired | DR |  |
| :---: | :---: | :---: | :---: | :---: |
| Value |  |  |  |  |
| 0 | 1 | 1 | R4 | -3 |
| 0 | 1 | 1 | R3 | -33 |
| 1 | 1 | 0 | R1 | 44 |
| 1 | 0 | 0 | R3 | - |
| 1 | 1 | 0 | R4 | 1 |
| 0 | 0 | 0 | - | - |

ROB at the beginning of cycle X

Part b (3 points): What is the value of X ?


PROBLEM CONTINUES ON NEXT PAGE

Name: $\qquad$

We have included a timing diagram template for your scratch work. Use as much of it as you deem useful.

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| I1 | F | D |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I2 |  | F |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I3 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I4 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| I5 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Part c ( 7 points): Complete the values of the reservation stations, ROB, and register file after cycle 5.


Name: $\qquad$

Problem 7 (20 points): The unused opcode ' 1011 ' in the LC-3b ISA is used for a new instruction. The new instruction has the following format.


The state machine for the new instruction is shown below.

Your job: Augment the LC-3b data path and microsequencer shown on the next two pages to add the new instruction to the LC-3b ISA, and describe what the new instruction does.


Name: $\qquad$

Part a, The data path ( 9 points): We have provided registers for $\mathrm{A}, \mathrm{B}$, and TEMP, as well as GateTEMP tri-state buffer, a left-shift-by-1 unit, and a subtractor. We also provide inputs from the bus and ZEXT IR[4:0] to the large box shown. Your job is to implement the changes needed to support the new instruction by adding the necessary logic to the box and connecting the necessary structures to the LC-3b datapath. You may only add two $4: 1$ muxes, wires, and constant values. You may not add additional inputs or outputs to the box.

Note: ALUK is a 3-bit code in order to provide the ALU with the ability to subtract. This is needed in state 38 .


Name: $\qquad$

Part b, The microsequencer ( 3 points): To make this work, we need to add to the microsequencer a COND2 control signal and an input X. Fill in the box labeled X.


Part c (8 points): In 15 words or fewer, explain what the new instruction does. (Hint: You really only need three words)

Name: $\qquad$

Problem 8 (20 points): A byte addressable processor with a 32-bit wide data bus executes the following function.

```
int check_palindrome (int * input) {
    int flag = 0;
    for(int i = 0; i < 16; i++) {
        if(input[i] != input[31-i]) flag = -1;
    }
    return flag;
}
```

This function determines whether an array of 32 integers is a palindrome. Integers are 32 bits. You may assume local variables are stored in registers and only array accesses go to memory. The array begins at address x 8000 .

Suppose we have an 8-way interleaved DRAM memory, with the following address scheme:

| $15 \quad 11$ | 10 | $\mathbf{Y + 1}$ | $\mathbf{Y} \quad \mathbf{X + 1}$ | $\mathbf{X}$ |
| :---: | :---: | :---: | :---: | :---: |
| Row | Column | Bank | BoB |  |

- A row access takes 6 cycles, and subsequent column accesses take 1 cycle.
- Memory address and data buses are independent.
- Memory requests arrive at the memory controller 1 per cycle.
- Memory requests must be issued in order.
- If a request comes to a busy bank, it (and future requests) must wait until the bank is free.

Part a (2 points): Fill in the values for X and Y .


Part b (12 points): To execute this function, 32 memory locations must be accessed. What is the total number of clock cycles that these 32 memory accesses take?

Part c ( 6 points): Will moving to a 16-way interleaved memory system improve performance? Explain.

Name: $\qquad$

Problem 9 (20 points): The asynchronous bus discussed in class has two controllers relevant to this question, a cache controller and a memory controller. The bus can transfer one byte at a time.


Your job is to implement the state machine for a cache controller so it can load a cache line from memory in one bus cycle. A cache line consists of eight bytes.

The cache controller gets the following inputs from the cache:

- REQ: a request from the cache to get a cache line from memory
- BYTE_A: address of the next byte to be accessed (initially the address of the first byte in the cache line; updated in the cache and latched by the cache controller)
- CACHE_I: a control signal from the cache to the cache controller
- DONE: 1 if all of the bytes have been transferred; 0 if more bytes still need to be transferred

The cache controller has the following outputs to the cache:

- BYTE_D: byte of data
- CACHE_O: a control signal from the cache controller to the cache

The cache controller gets the following inputs from the bus:

- BBSY_I: 1 if the bus is busy; 0 if it is free
- SSYN: a control signal from the slave to the master
- DATA: the byte on the bus (produced by the memory controller and latched by the cache controller)

The cache controller has the following outputs to the bus:

- BBSY_O: 1 if the bus is busy; 0 if it is free
- MSYN: a control signal from the master to the slave
- ADDR: the address to access on the bus


## PROBLEM CONTINUES ON NEXT PAGE

Name: $\qquad$

Part a (10 points): Complete the following asynchronous timing diagram by showing all the signals that are transmitted. Note that we have labeled the states A, B, C, and D of the cache controller at various points in time. Those states correspond to the states in the state machine on the next page.


Name: $\qquad$

Part b ( $\mathbf{1 0}$ points): Complete the state machine for the cache controller by filling in the bolded boxes showing how it loads a cache line from memory in one bus cycle.


