Department of Electrical and Computer Engineering
The University of Texas at Austin

EE 360N
Spring 2001
Y. N. Patt, Instructor
Kameswar Subramaniam, Onur Mutlu, TAs
Problem Set 4
Due: April 16, 2001

1. Given the following code segment:

for (i=1; i<=100; i++)
A[i] = (B[i] * C[i] + D[i])/2;

Write Cray-like assembly code to perform the calculation. Then compute the number of cycles required for the code segment to execute on the following machines:

a) Scalar processor
b) Vector processor without chaining, 1 port to memory (1 load or store / cycle)
c) Vector processor with chaining, 1 port to memory
d) Vector processor with chaining, 2 read ports and 1 write port to memory

Assume each machine has vector registers of length 64. The multiply, add, and load units are pipelined and take 6, 4, and 11 cycles, respectively, to complete one operation. Memory is 16-way interleaved.

2. (Tanenbaum, 4th edition: Appendix A, 10,11,12) Signed decimal numbers consisting of n digits can be represented in n+1 digits without a sign.Positive numbers have 0 as the leftmost digit. Negative numbers are formed by subtracting each each digit from 9. Such numbers are called nine's complement numbers and are analogous to one's complement binary numbers.Express the following as three-digit nine's complement numbers: 6, -2, 100, -14, -1,0.

Determine the rule for addition of nine's complement numbers and then perform the following additions.

0001 0001 9997 9241
+9999 +9998 +9996 +0802
----- ----- ----- -----

Ten's complement is analogous to two's complement.A ten's complement negative number is formed by adding 1 to the corresponding nine's complement number, ignoring the carry. What is the rule for ten's complement addition?

3. (Tanenbaum, 4th edition: Appendix B, 4,5,6) The following binary floating-point numbers consist of a sign bit, an excess 63, radix 2 exponent, an a 16-bit fraction. Normalize them.

a. 0 1000000 0001010100000001
b. 0 0111111 0000001111111111
c. 0 1000011 1000000000000000

To add two floating point numbers, you must adjust the exponents(by shifting the fraction) to make them the same. Then you can add the fractions and normalize the result if need be.Add the single precision IEEE numbers 3EE00000H and 3D800000H and express the normalized result in hexadecimal.

The Tightwad Computer Company has decided to come out with a machine having 16-bit floating point numbers. The model 0.001 has a floating-point format with a sign bit, 7 bit, excess 63 exponent and 8-bit fraction.Model 0.002 has a sign bit, 5-bit excess 15 exponent and a 10-bit fraction.Both use radix 2 exponentiation. Assume that the all 0s and all 1s exponents are reserved similar to the IEEE standard. What are the smallest and largest positive normalized numbers on both models? About how many decimal digits of precision does each have? Would you buy either one?

4. Booth's algorithm for multiplication, speeds up the shift/add process by consuming 2 bits during each shift/add step. But why stop at 2 bits? Your task: design the data path and control (like the 2bit Booth's algorithm) for consuming up to 4 bits at each step. Each step will be one or two shifts combined with either an add or a subtract.

Control inputs to the data path are: number of bits to be shifted before the add/subtract, whether to add, subtract, or add 0, and the number of bits to be shifted after the add/subtract/zero operation. Also whether to set or clear the *carry-control* bit. Complete the truth table below. Each input combination corresponds to the low-order four bits of the multiplier at the time of the iteration step.

Carry-ctl mpy bits | carry-ctl pre-sh +/-/0 post-sh
--------------------|--------------------------------------
             0 0000 | Clear        0     0      4
             0 0001 | Clear        0     +      4
             0 0010 | Clear        1     +      3
             0 0011 | Set          0     -      2
             0 0100 | Clear        2     +      2
             0 0101 | Clear        0     +      2
             0 0110 | Set          1     -      2
             0 0111 |
             0 1000 |
             0 1001 |
             0 1010 |
             0 1011 |
             0 1100 |
             0 1101 |
             0 1110 | Set          1     -      3
             0 1111 | Set          0     -      4
             1 0000 | Clear        0     +      4
             1 0001 |
             1 0010 |
             1 0011 |
             1 0100 |
             1 0101 |
             1 0110 |
             1 0111 |
             1 1000 |
             1 1001 |
             1 1010 |
             1 1011 |
             1 1100 |
             1 1101 |
             1 1110 |
             1 1111 | Set          0     0      4

Your job:
--------

Step 1: Design the data path for this functional unit.
Step 2: Complete the output columns of the truth table above, which provides the control signals.
Step 3: Compute the average number of bits consumed on each iteration of an integer multiply instruction. (Hint: more than 3)

5. Write an assembly program to add two 32-bit integers. Assume that the contents of memory are as follows:
x4000: Low 16 bits of first integer
x4001: High 16 bits of first integer
x4002: Low 16 bits of second integer
x4003: High 16 bits of second integer
x4004: Low 16 bits of the result
x4005: High 16 bits of the result
Assume the LC-2 architecture has the following features:

a. A carry bit in addition to the condition codes.
b. A new instruction in the instruction set namely,
ADDC R1, R2, R3

which adds the contents of R1 and R2 together with the carry of the previous operation to generate the sum in R1 along with the carry resulting from the operation.

6.Suppose we have the following loop executing on a pipelined LC-2 machine.

DOIT     STR   R1, R6, #0
         ADD   R6, R6, #1
         AND   R3, R1, R2
         BRz   EVEN
         ADD   R1, R1, #3
         ADD   R5, R5, #-1
         BRp   DOIT
EVEN     ADD   R1, R1, #1
         ADD   R7, R7, #-1 
         BRp   DOIT

Assume that before the loop starts, the registers are initialized to the following integer values:
R1 <- 0
R2 <- 1
R5 <- 49
R6 <- 4000
R7 <- 49

"Fetch" takes 1 cycle, "Decode" takes 1 cycle, "Execute" stage takes variable number of cycles depending on the type of instruction (see below), and "Store Result" stage takes 1 cycle.

All execution units (including the load/store unit) are fully pipelined and the following instructions that use these units take the indicated number of cycles:

STR: 3 cycles
ADD: 3 cycles
AND: 2 cycles
BR : 1 cycle

For example, the execution of an ADD instruction followed by a BR would look like:

ADD       F | D | E1 | E2 | E3 | ST
BR            F | D  | -  | -  | E1  | ST
TARGET                                 F  | D

This figure shows several things about the structure of the pipeline:

Whenever possible, data forwarding is used. Instructions that are dependent on the previous instructions can make use of the results produced before right after the previous instruction finishes the "Execute" stage.
Branches are resolved at the end of 1 "execute" cycle. Hence, the target instruction can be fetched when the BR instruction is in ST1 stage.

Also, you are given the following information about the pipeline and the ISA:

The pipeline implements "in-order execution". Scoreboarding scheme is used as discussed in class.
The pipeline emulates the LC-2 ISA. Hence, the above instructions are all LC-2 instructions. Assume that NOP is in the LC-2 ISA.
STR instruction accesses the memory during the "Execute" and "store result" stages.

Answer the following questions:

a) How many cycles does the loop above take to execute if no branch prediction is used?

b) Suppose that a static BTFNT (backward taken-forward not taken) branch prediction scheme is used to predict branches.

i. How many cycles does the above loop take to execute with this scheme?

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?

c) Suppose that we would like to use "delayed branch" instead of predicting the branches. The branch delay slot comprises 3 instructions. This means that the three instructions that are after the branch instruction in the straightline program are always executed (regardless of whether or not the branch is taken). Compiler finds the appropriate instructions to insert into these slots.

i. You are the compiler. Modify the above code so that it takes into account the branch delay slots and makes the best possible use of them.

ii. How many cycles does the modified code take to execute? (Assume that NOP instructions exist and spend 1 cycle in the "Execute" stage.

d) Suppose that two-bit saturating up/down counters (as discussed in lecture) are used for branch prediction. Each branch instruction has its own counter. The counters are initialized to '10' state. Top bit of the counter is used as the prediction. Hence, the first time a branch is seen it will be predicted taken.

i. How many cycles does the above loop take to execute if two-bit counters are used for branch prediction.

ii. What is the branch prediction accuracy?

iii. What is the prediction accuracy for each branch?