Exam 1 Solutions

1. (solution provided separately as a pdf)
(1) opcode:83 mode:00 r/m:101 disp:04030f88 imm:ffffff88 dest:MEM[disp32] src1:MEM[disp32] src2:SEXT(imm8)

(2) opcode:FC dest:EFLAGS(DF)

(3) opcode:0F A8 dest:SS:ESP src1:GS

(4) prefix:66 opcode:F7 mode:10 r/m:000 disp:01020301 dest:MEM[EAX+disp32] src1:MEM[EAX+disp32]
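The addressing-mode decoding used in parts (1) and (4) can be sketched as a small helper (hypothetical, covering only the 32-bit, non-SIB forms selected by the IA-32 ModR/M byte):

```python
def modrm_addressing(mod, rm):
    """Describe the effective-address form selected by the mod and r/m
    fields of an IA-32 ModR/M byte (32-bit modes, no SIB decoding)."""
    regs = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
    if mod == 0b11:
        return regs[rm]                      # register operand
    if rm == 0b100:
        return "SIB byte follows"            # scaled-index form, not decoded here
    if mod == 0b00:
        # special case: mod=00, r/m=101 means a bare 32-bit displacement
        return "MEM[disp32]" if rm == 0b101 else f"MEM[{regs[rm]}]"
    if mod == 0b01:
        return f"MEM[{regs[rm]}+disp8]"
    return f"MEM[{regs[rm]}+disp32]"         # mod == 0b10

# The two memory forms from parts (1) and (4):
print(modrm_addressing(0b00, 0b101))  # MEM[disp32]
print(modrm_addressing(0b10, 0b000))  # MEM[EAX+disp32]
```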

3. Option A will mispredict 10% of the time; Option B will mispredict 2% of the time. With a single-issue, in-order, 5-stage pipeline, the mispredict penalty is probably 3 cycles. So, for each branch, on average, option A incurs a penalty of 0.1 times 3 cycles = 0.3 cycles. Option B incurs a penalty of 0.02 times 3 cycles = 0.06 cycles.

Option A will take a cache miss 1% of the time; Option B will take a cache miss 2% of the time. So, for each data memory access, on average, option A incurs a penalty of 0.01 times 100 cycles = 1 cycle. Option B incurs a penalty of 0.02 times 100 cycles = 2 cycles.

Assuming equal numbers of branches and memory accesses, option A incurs an average penalty, per memory access and branch, of 0.3 + 1 = 1.3 cycles. Option B incurs an average penalty, per memory access and branch, of 0.06 + 2 = 2.06 cycles. So, if there are equal numbers of data accesses and branches, Option A incurs the smaller penalty. That difference gets larger as the ratio of memory accesses to branches increases. Only if the ratio of memory accesses to branches were absurdly, unrealistically small would Option B win. What would that ratio have to be?
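A quick check of that break-even ratio, using the numbers above (exact arithmetic via fractions):

```python
from fractions import Fraction

# Per-branch penalties (cycles): mispredict rate x 3-cycle penalty
pen_branch_A = Fraction(10, 100) * 3   # 0.3 cycles
pen_branch_B = Fraction(2, 100) * 3    # 0.06 cycles

# Per-memory-access penalties (cycles): miss rate x 100-cycle miss
pen_mem_A = Fraction(1, 100) * 100     # 1 cycle
pen_mem_B = Fraction(2, 100) * 100     # 2 cycles

# With r memory accesses per branch, B wins when
#   pen_branch_B + r*pen_mem_B < pen_branch_A + r*pen_mem_A.
# Solving for the break-even r:
r = (pen_branch_A - pen_branch_B) / (pen_mem_B - pen_mem_A)
print(r)  # 6/25, i.e. 0.24 memory accesses per branch
```

So Option B wins only when there are fewer than 0.24 memory accesses per branch.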

4. (a) the 8-way cache has a higher miss ratio for mcf.

(b) Example (recall, it asked for an example): Suppose the following nine lines compete for slots in the same 8-entry set: A1, A2, A3, A4, B1, B2, B3, B4, and B5. If the cache were 4-way, the four lines Ai would all compete for slots in one set, and the five lines Bi would all compete for slots in a second set. Suppose the access pattern is to data in lines as follows: A1,A2,A3,A4,B1,B2,B3,B4,B5,A1,A2,A3,A4,B1,B2,B3,B4,B5,A1,A2,A3,A4,B1,B2,... Then the 4-way cache has a hit ratio of 4/9 (after the initial run) and the 8-way cache has a hit ratio of 0.
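The claimed hit ratios can be checked with a small LRU-set simulation (a sketch; `set_of` is a hypothetical helper mapping each line to its set index):

```python
from collections import OrderedDict

def hit_ratio(trace, ways, set_of):
    """Simulate LRU sets of the given associativity over an access trace."""
    sets, hits = {}, 0
    for line in trace:
        s = sets.setdefault(set_of(line), OrderedDict())
        if line in s:
            hits += 1
            s.move_to_end(line)          # refresh LRU position
        else:
            if len(s) == ways:
                s.popitem(last=False)    # evict least-recently-used line
            s[line] = True
    return hits / len(trace)

lines = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4", "B5"]
trace = lines * 100                      # repeat the access pattern

# 8-way: all nine lines land in one set -> LRU thrashes, hit ratio 0
print(hit_ratio(trace, 8, lambda l: 0))          # 0.0
# 4-way: the four Ai hit after warm-up, the five Bi thrash their set
print(hit_ratio(trace, 4, lambda l: l[0]))       # 0.44 (-> 4/9 in steady state)
```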

5. Enlarging the block provides more producer/consumer pairs within the enlarged block, and fewer live-outs, reducing the pressure on the register file. Also, the larger the entity, the more parallelism it is likely to expose.

The bad news is (a) code bloat and (b) re-execution of the basic block that terminates in the "faulting" branch.

6. Branch promotion allows fuller use of the trace segment storage. Rather than terminating the trace at the third branch, the fill unit can go on to the fourth branch before stopping. On a misprediction of the promoted branch, the machine takes an exception.

Inactive issue keeps the non-predicted-path instructions in the reservation stations, rather than trashing them. The plus: on a misprediction, they do not have to be refetched and reissued to the reservation stations. The minus: they take up space in the reservation stations. [Note: Inactive issue does not execute these instructions; it keeps them in the reservation station until they are either needed due to a branch misprediction or not needed and can be trashed.]

7. The ability to exploit irregular parallelism. The "readiness" of an operation to execute is managed NOT by a global controller, but by the operation itself. [Note: data flow is NOT the fastest way to execute programs exhibiting regular parallelism.]

8. a. reservation station. holds operations until they are ready to fire.

b. reorder buffer. holds operations to enable in-order retirement.

c. in order. to maintain sequential dependencies, program semantics.

d. no, phase 3 can pretty much be done at the microarchitecture's leisure.

[Note: On grading part d, I noticed an overwhelming attention to a balanced design, leading one to answer yes. I decided to be very lenient in the grading of 8 (d). The concept of balanced design is a correct concept, but one must go deeper to understand its meaning. In the case of retirement, one has to ask if retiring slowly somehow adversely affects the ability to aggressively fetch, decode, rename, execute out-of-order, etc.]

9. a. (1) to hold the data until we can retire the ST to memory in program order.

(2) to enable memory disambiguation.

bi. process LD immediately.
bii. satisfy the LD with the data associated with the ST.
biii. process LD immediately.
biv. (a) put LD in the queue and wait until ST address is known. Safe, although slower.
(b) assume addresses are different and process LD. Aggressive, but runs the danger of having to back up from this point if the addresses turn out to be the same.
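The cases in part b can be summarized as a decision function over the store queue (a hypothetical sketch; `handle_load` and its return tags are made up for illustration):

```python
def handle_load(ld_addr, store_queue, aggressive=False):
    """Decide how to handle a LD against older, unretired STs.
    store_queue: oldest-first list of (addr_or_None, data); None means
    the ST address is not yet known.  Scans youngest ST first, since the
    nearest matching ST supplies the LD's data."""
    for addr, data in reversed(store_queue):
        if addr is None:                # case biv: unknown ST address
            # (a) safe: wait for the address; (b) aggressive: assume no match
            return ("speculate",) if aggressive else ("stall",)
        if addr == ld_addr:             # case bii: address match -> forward
            return ("forward", data)
    return ("proceed",)                 # cases bi/biii: no conflicting ST

print(handle_load(0x100, []))                            # ('proceed',)
print(handle_load(0x100, [(0x100, 42)]))                 # ('forward', 42)
print(handle_load(0x100, [(0x200, 7)]))                  # ('proceed',)
print(handle_load(0x100, [(None, 0)]))                   # ('stall',)
print(handle_load(0x100, [(None, 0)], aggressive=True))  # ('speculate',)
```

The aggressive path is the one that must be able to back up (squash and replay the LD) if the ST address later turns out to match.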

10. a. gshare reduces negative interference by XORing the branch address with the global history, thereby directing the (identical) histories of two different branches to two different 2-bit counters.
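A minimal sketch of the gshare index computation (the table size and branch addresses below are made-up values):

```python
def gshare_index(pc, history, bits=12):
    """gshare: XOR the branch PC with the global history register to
    select one of 2**bits 2-bit counters."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ history) & mask   # drop the PC's byte-offset bits

history = 0b101101            # same recent global history at both branches
b1, b2 = 0x4010, 0x80C4      # two hypothetical branch addresses

# Indexing by history alone would alias both branches to one counter;
# XORing in the PC separates them:
print(gshare_index(b1, history) != gshare_index(b2, history))  # True
```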

b. the resulting hybrid allows the currently more accurate predictor to drive the prediction. It also allows the 2-bit counter to be used while the two-level predictor is warming up.

c. the agree predictor turns negative interference into positive interference by having both branches predict according to their "saved" bias direction, rather than the absolute direction; two branches that both follow their (different) biases then train a shared counter the same way.
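A small illustration of that conversion (the `to_agree_stream` helper is hypothetical):

```python
def to_agree_stream(directions, bias):
    """Agree predictor: the shared counter trains on 'did the branch agree
    with its stored bias bit?' rather than on the raw taken/not-taken bit."""
    return [d == bias for d in directions]

# Two branches aliasing to the same counter, with opposite strong biases:
br1 = [True] * 8    # almost always taken;     bias bit = True
br2 = [False] * 8   # almost always not-taken; bias bit = False

# Raw directions conflict: the shared counter sees both True and False.
print(set(br1 + br2))
# Converted to agree bits, both branches push the counter the same way:
print(set(to_agree_stream(br1, True) + to_agree_stream(br2, False)))  # {True}
```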

d. the path-based predictor distinguishes predictions made along different paths to a problematic branch, rather than lumping many paths together when they happen to have the same recent sequence of directions.