## Department of Electrical and Computer Engineering

### The University of Texas at Austin

EE 360N, Fall 2004
Study Questions (covering some of the topics covered in class after Problem Set 5)
Due date: Not to be turned in
Yale N. Patt, Instructor
Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs

These questions are to aid you in your studies. They are not to be turned in and they do not cover all the topics covered in class after Problem Set 5.

1. Suppose we have the following loop executing on a pipelined LC-3b machine.
```
DOIT     STW   R1, R6, #0
         AND   R3, R1, R2
         BRz   EVEN
         BRp   DOIT
         BRp   DOIT
```

Assume that before the loop starts, the registers are initialized to the following integer values:
R1 <- 0
R2 <- 1
R5 <- 5
R6 <- 4000
R7 <- 5

"Fetch" takes 1 cycle, "Decode" takes 1 cycle, "Execute" takes a variable number of cycles depending on the type of instruction (see below), and "Store Result" takes 1 cycle.

All execution units (including the load/store unit) are fully pipelined and the following instructions that use these units take the indicated number of cycles:

STW: 3 cycles
AND: 2 cycles
BR: 1 cycle

For example, the execution of an ADD instruction followed by a BR would look like:

```
ADD       F | D | E1 | E2 | E3 | ST
BR            F | D  | -  | -  | E1 | ST
TARGET                                F  | D
```

This figure shows several things about the structure of the pipeline:

• Whenever possible, data forwarding is used. An instruction that depends on a previous instruction can use that instruction's result in the cycle right after the previous instruction finishes the "Execute" stage.
• Branch instructions require 1 "Execute" cycle to resolve the branch. Hence, the target instruction can be fetched when the BR instruction is in the ST stage.

Also, you are given the following information about the pipeline and the ISA:

• The pipeline implements "in-order execution". A scoreboarding scheme is used as discussed in class.
• The pipeline emulates the LC-3b ISA. Hence, the above instructions are all LC-3b instructions.

a) How many cycles does the loop above take to execute if no branch prediction is used?

b) Suppose that a static BTFN (backward taken, forward not taken) branch prediction scheme is used to predict branches.

i. How many cycles does the above loop take to execute with this scheme?

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?
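As a rough illustration of the static BTFN rule in part (b), the sketch below predicts "taken" exactly when the branch target lies at a lower address than the branch itself. The function name `btfn_predict` and the example addresses are hypothetical and are not part of the problem.

```python
def btfn_predict(branch_pc: int, target_pc: int) -> bool:
    """Static BTFN: predict taken for backward branches, not taken for forward ones."""
    return target_pc < branch_pc


# Hypothetical addresses, chosen only for illustration.
backward = btfn_predict(branch_pc=0x3010, target_pc=0x3000)  # loop-closing branch
forward = btfn_predict(branch_pc=0x3006, target_pc=0x3010)   # skip-ahead branch
print(backward, forward)  # True False
```

Because the prediction depends only on the branch direction, it never changes at run time, so the accuracy of each branch is determined entirely by how often that branch's actual outcome matches its direction.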

c) Suppose that two-bit saturating up/down counters (as discussed in lecture) are used for branch prediction. Each branch instruction has its own counter. The counters are initialized to the '10' state. The top bit of the counter is used as the prediction. Hence, the first time a branch is seen, it will be predicted taken.

i. How many cycles does the above loop take to execute if two-bit counters are used for branch prediction?

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?
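The two-bit counter described in part (c) can be sketched as follows. This is a minimal model that assumes the simplest update rule (increment on taken, decrement on not taken, saturating at '11' and '00'); the class name, the helper `accuracy`, and the outcome sequence are hypothetical and are not the loop from the problem.

```python
class TwoBitCounter:
    """Two-bit saturating up/down counter; the top bit gives the prediction."""

    def __init__(self, state: int = 0b10) -> None:
        self.state = state  # 0b10 is the initial state given in part (c)

    def predict(self) -> bool:
        return (self.state >> 1) == 1  # top bit set means "predict taken"

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(self.state + 1, 0b11)  # saturate at 11
        else:
            self.state = max(self.state - 1, 0b00)  # saturate at 00


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of correct predictions for one branch with its own counter."""
    counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


# Hypothetical outcome sequence for a single branch.
print(accuracy([True, True, False, True]))  # 0.75
```

To apply this to a program with several branches, one counter object would be kept per branch instruction, as the problem statement specifies.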

2. From Tanenbaum, 4th edition, Appendix B, 4.

The following binary floating-point number consists of a sign bit, an excess 63, radix 2 exponent, and a 16-bit fraction. Express the value of this number as a decimal number.

0 0111111 0000001111111111
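One way to check a hand conversion for a format like this is the sketch below. It assumes the fraction is read as 0.f (binary point to the left of the fraction, no hidden bit), so the value is (-1)^sign × 0.f × 2^(exponent − bias); that interpretation and the function name `decode` are assumptions, and the bit pattern in the example is deliberately not the one from the question.

```python
def decode(bits: str, exp_bits: int = 7, bias: int = 63) -> float:
    """Decode a 'sign exponent fraction' bit string; assumes no hidden bit."""
    bits = bits.replace(" ", "")
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:1 + exp_bits], 2) - bias
    frac_field = bits[1 + exp_bits:]
    fraction = int(frac_field, 2) / (1 << len(frac_field))  # 0.f in binary
    return sign * fraction * 2.0 ** exponent


# Hypothetical pattern: fraction 0.1 (binary), exponent field 64, so 0.5 * 2^1.
print(decode("0 1000000 1000000000000000"))  # 1.0
```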

3. From Tanenbaum, 4th edition, Appendix B, 5.

To add two floating point numbers, you must adjust the exponents (by shifting the fraction) to make them the same. Then you can add the fractions and normalize the result, if need be. Add the single precision IEEE floating-point numbers 3EE00000H and 3D800000H and express the normalized result in hexadecimal. ['H' is a notation indicating these numbers are in hexadecimal]
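A hand computation of this kind can be checked with Python's standard `struct` module, as in the sketch below. It does not perform the alignment and normalization steps itself; it only splits each operand into its fields and lets the machine do the addition. The helper names are hypothetical, and the operands shown are illustrative values rather than the ones in the question.

```python
import struct


def hex_to_float(h: str) -> float:
    """Interpret an 8-digit hex string as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def float_to_hex(x: float) -> str:
    """Round x to single precision and return its bit pattern in hexadecimal."""
    return struct.pack(">f", x).hex().upper()


def fields(h: str) -> tuple[int, int, int]:
    """Split a single-precision pattern into (sign, biased exponent, fraction)."""
    word = int(h, 16)
    return word >> 31, (word >> 23) & 0xFF, word & 0x7FFFFF


# Example operands chosen for illustration: 1.0 and 0.5.
a, b = "3F800000", "3F000000"
print(fields(a), fields(b))  # (0, 127, 0) (0, 126, 0)
print(float_to_hex(hex_to_float(a) + hex_to_float(b)))  # 3FC00000
```

The biased exponents printed by `fields` show how far the smaller operand's fraction must be shifted before the fractions can be added.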

4. From Tanenbaum, 4th edition, Appendix B, 6.

The Tightwad Computer Company has decided to come out with a machine having 16-bit floating-point numbers. The model 0.001 has a floating-point format with a sign bit, 7-bit, excess 63 exponent and 8-bit fraction. Model 0.002 has a sign bit, 5-bit, excess 15 exponent and a 10-bit fraction. Both use radix 2 exponentiation. What are the smallest and largest positive normalized numbers on both models? About how many decimal digits of precision does each have? Would you buy either one?

5. In an Omega network as presented in class, assume that there are n inputs and n outputs. Let k be the size of each switch. For k taking the values 2, 4, 8, and 64, answer the following questions. (Assume that the cost of each switch is k^2.)

a. What is the cost of the network as a function of n?
b. What is the latency of the network?
c. Assume that n=64. What k value would you choose? Why? State your assumptions and design point.
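The sketch below tabulates stage count and total switch cost for a given n and k. It assumes the usual construction, log_k(n) stages of n/k switches each, with n an exact power of k, and it takes the number of stages as a stand-in for latency; these are assumptions, and the network presented in class may differ. The function name and the size n = 16 are hypothetical.

```python
def omega_network(n: int, k: int) -> tuple[int, int, int]:
    """Return (stages, switches_per_stage, total_cost) for an n x n Omega
    network built from k x k switches, assuming n is an exact power of k."""
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("this sketch assumes n is a power of k")
    switches_per_stage = n // k
    cost = stages * switches_per_stage * k * k  # each switch costs k^2
    return stages, switches_per_stage, cost


# Hypothetical size n = 16, used only for illustration.
for k in (2, 4, 16):
    print(k, omega_network(16, k))
# 2 (4, 8, 128)
# 4 (2, 4, 128)
# 16 (1, 1, 256)
```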

6. We want to compute the following expression:
```
    a*x^6 + b*x^5 + c*x^4 + d*x^3 + e*x^2 + f*x + g
```
• How many operations and time-steps will the computation take on a single-processor system? (Use the smallest number of operations possible.)
• How many operations and time-steps will the computation take on a multiprocessor system with 4 processors? (Use the smallest number of operations possible.)
• What is the speedup of the multiprocessor system over the single-processor system?
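For the single-processor case, Horner's rule is one common way to keep the operation count low; whether it is the intended approach here is an assumption. The sketch below evaluates a polynomial this way and counts the multiplications and additions it performs. The function name and the cubic used in the example are hypothetical.

```python
def horner(coeffs: list[float], x: float) -> tuple[float, int, int]:
    """Evaluate a polynomial by Horner's rule.

    coeffs lists coefficients from the highest power down to the constant.
    Returns (value, multiplications, additions).
    """
    value, mults, adds = coeffs[0], 0, 0
    for c in coeffs[1:]:
        value = value * x + c  # one multiply and one add per coefficient
        mults += 1
        adds += 1
    return value, mults, adds


# Hypothetical cubic 2x^3 + 0x^2 - x + 5 evaluated at x = 3.
print(horner([2, 0, -1, 5], 3))  # (56, 3, 3)
```

Each step of Horner's rule depends on the previous one, so the chain is inherently serial; a multiprocessor schedule would need to regroup the terms so that independent products can be computed in the same time-step.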

7. The state diagram for the Goodman cache consistency scheme makes one assumption about the size of the cache blocks. What is it? (Hint: Focus on the case in which a block is in the DIRTY state and a BW signal comes in. Where do we go? Why?) If that assumption is not made, what will be the change in the state diagram? Draw the new state diagram.