The University of Texas at Austin

EE 360N

Spring 2001

Y. N. Patt, Instructor

Kameswar Subramaniam, Onur Mutlu, TAs

Problem Set 4

Due: April 16, 2001

**1.** Given the following code segment:

Write Cray-like assembly code to perform the calculation. Then compute the number of cycles required for the code segment to execute on the following machines:for (i=1; i<=100; i++)A[i] = (B[i] * C[i] + D[i])/2;

a) Scalar processorAssume each machine has vector registers of length 64. The multiply, add, and load units are pipelined and take 6, 4, and 11 cycles, respectively, to complete one operation. Memory is 16-way interleaved.

b) Vector processor without chaining, 1 port to memory (1 load or store / cycle)

c) Vector processor with chaining, 1 port to memory

d) Vector processor with chaining, 2 read ports and 1 write port to memory

**2.** (Tanenbaum, 4th edition: Appendix A, 10,11,12) Signed decimal
numbers consisting of n digits can be represented in n+1 digits without
a sign.Positive numbers have 0 as the leftmost digit. Negative numbers
are formed by subtracting each each digit from 9. Such numbers are called
nine's complement numbers and are analogous to one's complement binary
numbers.Express the following as three-digit nine's complement numbers:
6, -2, 100, -14, -1,0.

Determine the rule for addition of nine's complement numbers and then perform the following additions.

0001 0001 9997 9241+9999 +9998 +9996 +0802----- ----- ----- -----

Ten's complement is analogous to two's complement.A ten's complement
negative number is formed by adding 1 to the corresponding nine's complement
number, ignoring the carry. What is the rule for ten's complement addition?

**3.** (Tanenbaum, 4th edition: Appendix B, 4,5,6) The following
binary floating-point numbers consist of a sign bit, an excess 63, radix
2 exponent, an a 16-bit fraction. Normalize them.

a.To add two floating point numbers, you must adjust the exponents(by shifting the fraction) to make them the same. Then you can add the fractions and normalize the result if need be.Add the single precision IEEE numbers 3EE00000H and 3D800000H and express the normalized result in hexadecimal.0 1000000 0001010100000001

b.0 0111111 0000001111111111

c.0 1000011 1000000000000000

The Tightwad Computer Company has decided to come out with a machine
having 16-bit floating point numbers. The model 0.001 has a floating-point
format with a sign bit, 7 bit, excess 63 exponent and 8-bit fraction.Model
0.002 has a sign bit, 5-bit excess 15 exponent and a 10-bit fraction.Both
use radix 2 exponentiation. Assume that the all 0s and all 1s exponents are
reserved similar to the IEEE standard. What are the smallest and largest
positive normalized numbers on both models? About how many decimal digits of
precision does each have? Would you buy either one?

**4.**
Booth's algorithm for multiplication, speeds up the shift/add process
by consuming 2 bits during each shift/add step. But why stop at 2 bits?
Your task: design the data path and control (like the 2bit Booth's algorithm)
for consuming up to 4 bits at each step. Each step will be one or two shifts
combined with either an add or a subtract.

Control inputs to the data path are: number of bits to be shifted before the add/subtract, whether to add, subtract, or add 0, and the number of bits to be shifted after the add/subtract/zero operation. Also whether to set or clear the *carry-control* bit. Complete the truth table below. Each input combination corresponds to the low-order four bits of the multiplier at the time of the iteration step.

Your job:Carry-ctl mpy bits | carry-ctl pre-sh +/-/0 post-sh--------------------|--------------------------------------0 0000 | Clear 0 0 40 0001 | Clear 0 + 40 0010 | Clear 1 + 30 0011 | Set 0 - 20 0100 | Clear 2 + 20 0101 | Clear 0 + 20 0110 | Set 1 - 20 0111 |0 1000 |0 1001 |0 1010 |0 1011 |0 1100 |0 1101 |0 1110 | Set 1 - 30 1111 | Set 0 - 41 0000 | Clear 0 + 41 0001 |1 0010 |1 0011 |1 0100 |1 0101 |1 0110 |1 0111 |1 1000 |1 1001 |1 1010 |1 1011 |1 1100 |1 1101 |1 1110 |1 1111 | Set 0 0 4

--------

Step 1: Design the data path for this functional unit.

Step 2: Complete the output columns of the truth table above, which provides the control signals.

Step 3: Compute the average number of bits consumed on each iteration of an integer multiply instruction. (Hint: more than 3)

**5.** Write an assembly program to add two 32-bit integers. Assume that the contents of memory are as follows:

x4000: Low 16 bits of first integer

x4001: High 16 bits of first integer

x4002: Low 16 bits of second integer

x4003: High 16 bits of second integer

x4004: Low 16 bits of the result

x4005: High 16 bits of the result

Assume the LC-2 architecture has the following features:

a. A carry bit in addition to the condition codes.which adds the contents of R1 and R2 together with the carry of the previous operation to generate the sum in R1 along with the carry resulting from the operation.

b. A new instruction in the instruction set namely,ADDC R1, R2, R3

**6.**Suppose we have the following loop executing on a pipelined LC-2 machine.

DOIT STR R1, R6, #0 ADD R6, R6, #1 AND R3, R1, R2 BRz EVEN ADD R1, R1, #3 ADD R5, R5, #-1 BRp DOIT EVEN ADD R1, R1, #1 ADD R7, R7, #-1 BRp DOIT

Assume that before the loop starts, the registers are initialized to the following integer values:

R1 <- 0

R2 <- 1

R5 <- 49

R6 <- 4000

R7 <- 49

"Fetch" takes 1 cycle, "Decode" takes 1 cycle, "Execute" stage takes
variable number of cycles depending on the type of instruction (see below), and "Store Result" stage takes
1 cycle.

All execution units (including the load/store unit) are fully pipelined and the following instructions that
use these units take the indicated number of cycles:

STR: 3 cycles

ADD: 3 cycles

AND: 2 cycles

BR : 1 cycle

For example, the execution of an ADD instruction followed by a BR would look like:

ADD F | D | E1 | E2 | E3 | ST BR F | D | - | - | E1 | ST TARGET F | D

This figure shows several things about the structure of the pipeline:

- Whenever possible, data forwarding is used. Instructions that are dependent on the previous instructions can make use of the results produced before right after the previous instruction finishes the "Execute" stage.
- Branches are resolved at the end of 1 "execute" cycle. Hence, the target instruction can be fetched when the BR instruction is in ST1 stage.

- The pipeline implements "in-order execution". Scoreboarding scheme is used as discussed in class.
- The pipeline emulates the LC-2 ISA. Hence, the above instructions are all LC-2 instructions. Assume that NOP is in the LC-2 ISA.
- STR instruction accesses the memory during the "Execute" and "store result" stages.

Answer the following questions:

a) How many cycles does the loop above take to execute if no branch prediction is used?

b) Suppose that a static BTFNT (backward taken-forward not taken) branch prediction scheme is used to predict branches.

i. How many cycles does the above loop take to execute with this scheme?

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?

c) Suppose that we would like to use "delayed branch" instead of predicting the branches. The branch delay slot comprises 3 instructions. This means that the three instructions that are after the branch instruction in the straightline program are always executed (regardless of whether or not the branch is taken). Compiler finds the appropriate instructions to insert into these slots.

i. You are the compiler. Modify the above code so that it takes into account the branch delay slots and makes the best possible use of them.

ii. How many cycles does the modified code take to execute? (Assume that NOP instructions exist and spend 1 cycle in the "Execute" stage.

d) Suppose that two-bit saturating up/down counters (as discussed in lecture) are used for branch prediction. Each branch instruction has its own counter. The counters are initialized to '10' state. Top bit of the counter is used as the prediction. Hence, the first time a branch is seen it will be predicted taken.

i. How many cycles does the above loop take to execute if two-bit counters are used for branch prediction.

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?