EE 360N
Spring 2001
Y. N. Patt, Instructor
Kameswar Subramaniam, Onur Mutlu, TAs
Problem Set 4
Due: April 16, 2001
1. Given the following code segment:
for (i=1; i<=100; i++)Write Cray-like assembly code to perform the calculation. Then compute the number of cycles required for the code segment to execute on the following machines:
A[i] = (B[i] * C[i] + D[i])/2;
a) Scalar processorAssume each machine has vector registers of length 64. The multiply, add, and load units are pipelined and take 6, 4, and 11 cycles, respectively, to complete one operation. Memory is 16-way interleaved.
b) Vector processor without chaining, 1 port to memory (1 load or store / cycle)
c) Vector processor with chaining, 1 port to memory
d) Vector processor with chaining, 2 read ports and 1 write port to memory
2. (Tanenbaum, 4th edition: Appendix A, 10,11,12) Signed decimal numbers consisting of n digits can be represented in n+1 digits without a sign.Positive numbers have 0 as the leftmost digit. Negative numbers are formed by subtracting each each digit from 9. Such numbers are called nine's complement numbers and are analogous to one's complement binary numbers.Express the following as three-digit nine's complement numbers: 6, -2, 100, -14, -1,0.
Determine the rule for addition of nine's complement numbers and then perform the following additions.
0001 0001 9997 9241
+9999 +9998 +9996 +0802
----- ----- ----- -----
Ten's complement is analogous to two's complement.A ten's complement
negative number is formed by adding 1 to the corresponding nine's complement
number, ignoring the carry. What is the rule for ten's complement addition?
3. (Tanenbaum, 4th edition: Appendix B, 4,5,6) The following binary floating-point numbers consist of a sign bit, an excess 63, radix 2 exponent, an a 16-bit fraction. Normalize them.
a. 0 1000000 0001010100000001To add two floating point numbers, you must adjust the exponents(by shifting the fraction) to make them the same. Then you can add the fractions and normalize the result if need be.Add the single precision IEEE numbers 3EE00000H and 3D800000H and express the normalized result in hexadecimal.
b. 0 0111111 0000001111111111
c. 0 1000011 1000000000000000
The Tightwad Computer Company has decided to come out with a machine
having 16-bit floating point numbers. The model 0.001 has a floating-point
format with a sign bit, 7 bit, excess 63 exponent and 8-bit fraction.Model
0.002 has a sign bit, 5-bit excess 15 exponent and a 10-bit fraction.Both
use radix 2 exponentiation. Assume that the all 0s and all 1s exponents are
reserved similar to the IEEE standard. What are the smallest and largest
positive normalized numbers on both models? About how many decimal digits of
precision does each have? Would you buy either one?
4. Booth's algorithm for multiplication, speeds up the shift/add process by consuming 2 bits during each shift/add step. But why stop at 2 bits? Your task: design the data path and control (like the 2bit Booth's algorithm) for consuming up to 4 bits at each step. Each step will be one or two shifts combined with either an add or a subtract.
Control inputs to the data path are: number of bits to be shifted before the add/subtract, whether to add, subtract, or add 0, and the number of bits to be shifted after the add/subtract/zero operation. Also whether to set or clear the *carry-control* bit. Complete the truth table below. Each input combination corresponds to the low-order four bits of the multiplier at the time of the iteration step.
Carry-ctl mpy bits | carry-ctl pre-sh +/-/0 post-shYour job:
--------------------|--------------------------------------
0 0000 | Clear 0 0 4
0 0001 | Clear 0 + 4
0 0010 | Clear 1 + 3
0 0011 | Set 0 - 2
0 0100 | Clear 2 + 2
0 0101 | Clear 0 + 2
0 0110 | Set 1 - 2
0 0111 |
0 1000 |
0 1001 |
0 1010 |
0 1011 |
0 1100 |
0 1101 |
0 1110 | Set 1 - 3
0 1111 | Set 0 - 4
1 0000 | Clear 0 + 4
1 0001 |
1 0010 |
1 0011 |
1 0100 |
1 0101 |
1 0110 |
1 0111 |
1 1000 |
1 1001 |
1 1010 |
1 1011 |
1 1100 |
1 1101 |
1 1110 |
1 1111 | Set 0 0 4
Step 1: Design the data path for this functional unit.
Step 2: Complete the output columns of the truth table above, which provides the control signals.
Step 3: Compute the average number of bits consumed on each iteration of an integer multiply instruction. (Hint: more than 3)
5. Write an assembly program to add two 32-bit integers. Assume that the contents of memory are as follows:
x4000: Low 16 bits of first integer
x4001: High 16 bits of first integer
x4002: Low 16 bits of second integer
x4003: High 16 bits of second integer
x4004: Low 16 bits of the result
x4005: High 16 bits of the result
Assume the LC-2 architecture has the following features:
a. A carry bit in addition to the condition codes.which adds the contents of R1 and R2 together with the carry of the previous operation to generate the sum in R1 along with the carry resulting from the operation.
b. A new instruction in the instruction set namely,ADDC R1, R2, R3
6.Suppose we have the following loop executing on a pipelined LC-2 machine.
DOIT STR R1, R6, #0 ADD R6, R6, #1 AND R3, R1, R2 BRz EVEN ADD R1, R1, #3 ADD R5, R5, #-1 BRp DOIT EVEN ADD R1, R1, #1 ADD R7, R7, #-1 BRp DOIT
"Fetch" takes 1 cycle, "Decode" takes 1 cycle, "Execute" stage takes
variable number of cycles depending on the type of instruction (see below), and "Store Result" stage takes
1 cycle.
All execution units (including the load/store unit) are fully pipelined and the following instructions that
use these units take the indicated number of cycles:
STR: 3 cycles
ADD: 3 cycles
AND: 2 cycles
BR : 1 cycle
For example, the execution of an ADD instruction followed by a BR would look like:
ADD F | D | E1 | E2 | E3 | ST BR F | D | - | - | E1 | ST TARGET F | D
This figure shows several things about the structure of the pipeline:
Answer the following questions:
a) How many cycles does the loop above take to execute if no branch prediction is used?
b) Suppose that a static BTFNT (backward taken-forward not taken) branch prediction scheme is used to predict branches.
i. How many cycles does the above loop take to execute with this scheme?
ii. What is the branch prediction accuracy?
iii. What is the prediction accuracy for each branch?
c) Suppose that we would like to use "delayed branch" instead of predicting the branches. The branch delay slot comprises 3 instructions. This means that the three instructions that are after the branch instruction in the straightline program are always executed (regardless of whether or not the branch is taken). Compiler finds the appropriate instructions to insert into these slots.
i. You are the compiler. Modify the above code so that it takes into account the branch delay slots and makes the best possible use of them.
ii. How many cycles does the modified code take to execute? (Assume that NOP instructions exist and spend 1 cycle in the "Execute" stage.
d) Suppose that two-bit saturating up/down counters (as discussed in lecture) are used for branch prediction. Each branch instruction has its own counter. The counters are initialized to '10' state. Top bit of the counter is used as the prediction. Hence, the first time a branch is seen it will be predicted taken.
i. How many cycles does the above loop take to execute if two-bit counters are used for branch prediction.
ii. What is the branch prediction accuracy?
iii. What is the prediction accuracy for each branch?