Exam 1. Solutions
1. pdf
2.
(1) opcode:83 mode:00 r/m:101 disp:04030f88 imm:ffffff88 dest:MEM[disp32] src1: MEM[disp32] src2:SEXT(imm8)
(2) opcode:FC dest:EFLAGS(DF)
(3) opcode:0F A8 dest:SS:ESP src1: GS
(4) prefix:66 opcode:F7 mode:10 r/m:000 disp:01020301 dest:MEM[EAX+disp32] src1:MEM[EAX+disp32]
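[Note: a small Python sketch, purely illustrative, of how the mode/reg/r-m fields used above come out of the ModRM byte. The helper name is made up; only the bit layout (mod = bits 7-6, reg = bits 5-3, r/m = bits 2-0) is the x86 encoding.]

    def decode_modrm(byte):
        # mod = bits 7-6, reg = bits 5-3, r/m = bits 2-0
        mod = (byte >> 6) & 0b11
        reg = (byte >> 3) & 0b111
        rm  = byte & 0b111
        return mod, reg, rm

    # mod=00 with r/m=101 selects a disp32-only memory operand, as in (1) above.
    print(decode_modrm(0b00_000_101))   # -> (0, 0, 5)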
3. Option A will mispredict 10% of the time, Option B will mispredict
2% of the time. With a single-issue, in-order, 5 stage pipeline, the
mispredict penalty is probably 3 cycles. So, for each branch, on
average, option A incurs a penalty of 0.1 times 3 cycles = 0.3 cycles.
Option B incurs a penalty of 0.02 times 3 cycles = 0.06 cycles.
Option A will take a cache miss 1% of the time, Option B will take a
cache miss 2% of the time. So, for each data memory access, on
average, option A incurs a penalty of 0.01 times 100 cycles = 1 cycle.
Option B incurs a penalty of 0.02 times 100 cycles = 2 cycles.
Assuming equal numbers of branches and memory accesses, option A incurs
an average penalty, per memory access and branch, of 0.3 + 1 = 1.3 cycles.
Option B incurs an average penalty per memory access and branch of 0.06 + 2
= 2.06 cycles. So, if there are equal numbers of data accesses and branches,
Option A incurs a smaller penalty. That difference gets larger as the ratio
of memory accesses to branches increases. Only if the ratio of memory accesses
to branches is absurdly, unrealistically small would option B win. What would
that ratio have to be?
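[Note: a small Python sketch as a sanity check on the arithmetic above; the helper name is made up. The 3-cycle mispredict penalty and 100-cycle miss penalty are the assumptions already stated, and the break-even ratio falls out of setting the two average penalties equal.]

    MISPREDICT_PENALTY = 3     # cycles, assumed above
    MISS_PENALTY = 100         # cycles, given above

    def avg_penalty(mispredict_rate, miss_rate, mem_per_branch):
        # average penalty in cycles per branch, with mem_per_branch data accesses per branch
        return mispredict_rate * MISPREDICT_PENALTY + mem_per_branch * miss_rate * MISS_PENALTY

    print(avg_penalty(0.10, 0.01, 1.0))   # Option A: 1.30 cycles
    print(avg_penalty(0.02, 0.02, 1.0))   # Option B: 2.06 cycles

    # Break-even: 0.10*3 + r*0.01*100 = 0.02*3 + r*0.02*100  ->  r = 0.24
    r = (0.10 - 0.02) * MISPREDICT_PENALTY / ((0.02 - 0.01) * MISS_PENALTY)
    print(r)   # Option B wins only below about 0.24 memory accesses per branch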
4. (a) the 8-way cache has a higher miss ratio for mcf.
(b) Example (recall, it asked for an example): Suppose the
following nine lines compete for slots in the same 8-entry set: A1,
A2, A3, A4, B1, B2, B3, B4, and B5. If the cache were 4-way instead, the
four lines Ai would all compete for slots in one set, and the five lines Bi
would all compete for slots in a second set. Suppose the access pattern is
to data in lines as follows: A1,A2,A3,A4,B1,B2,
B3,B4,B5,A1,A2,A3,A4,B1,B2,B3,B4,B5,A1,A2,A3,A4,B1,B2,... Then, assuming
LRU replacement, the 4-way cache has a hit ratio of 4/9 (after the initial
run) and the 8-way has a hit ratio of 0.
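[Note: a small Python sketch of one LRU set, a toy model rather than any real simulator, that reproduces the hit ratios claimed above.]

    from collections import OrderedDict

    class LRUSet:
        # one set of a 'ways'-way cache with LRU replacement
        def __init__(self, ways):
            self.ways = ways
            self.lines = OrderedDict()
        def access(self, tag):
            hit = tag in self.lines
            if hit:
                self.lines.move_to_end(tag)          # mark most recently used
            else:
                if len(self.lines) >= self.ways:
                    self.lines.popitem(last=False)   # evict the least recently used line
                self.lines[tag] = True
            return hit

    pattern = ["A1","A2","A3","A4","B1","B2","B3","B4","B5"] * 10

    s8 = LRUSet(8)                                   # all nine lines fight over one 8-entry set
    hits8 = sum(s8.access(t) for t in pattern)

    sA, sB = LRUSet(4), LRUSet(4)                    # A lines map to one set, B lines to another
    hits4 = sum((sA if t[0] == "A" else sB).access(t) for t in pattern)

    print(hits8 / len(pattern))   # 0: the 8-way set thrashes on every access
    print(hits4 / len(pattern))   # 0.4 here; 4/9 per pass once the A lines are resident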
5. Enlarging the block provides more producer/consumer pairs within the
enlarged block, and fewer liveouts, reducing the pressure on the register file.
Also, the larger the entity, the more parallelism is likely to be available.
The bad news is (a) code bloat and (b) re-execution of the basic block that
terminates in the "faulting" branch.
6. Branch promotion allows fuller use of the trace segment storage. Rather than
terminating the trace at the third branch, the fill unit can go on to the 4th
branch before stopping. On a misprediction of the promoted branch, the
machine takes an exception.
Inactive issue keeps the non-predicted path instructions in the reservation
station, rather than trashing them. The plus: on a misprediction, they do not
have to be refetched and reissued to the reservation stations. The minus: they
take up space in the reservation stations. [Note: Inactive issue does not
execute these instructions, it keeps them in the reservation station until
they are either needed due to branch misprediction or not needed and can be
trashed.]
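[Note: one way to picture inactive issue, as a Python sketch; the structure and field names are made up for illustration. Entries from the not-predicted path sit in the reservation station but are not eligible to execute until the branch resolves.]

    from dataclasses import dataclass

    @dataclass
    class RSEntry:
        op: str
        from_predicted_path: bool    # False => issued "inactive", held but never selected

    rs = [
        RSEntry("add r1,r2", True),    # predicted path: may execute when operands are ready
        RSEntry("sub r3,r4", False),   # not-predicted path: occupies a slot, held inactive
    ]

    def resolve_branch(rs, mispredicted):
        # keep whichever path turned out to be correct; the survivors never had to be
        # refetched or reissued (the plus) -- the cost is the slots they occupied (the minus)
        keep_predicted = not mispredicted
        return [e for e in rs if e.from_predicted_path == keep_predicted]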
7. ability to exploit irregular parallelism. The "readiness" of an operation
to execute is managed NOT by a global controller, but rather by the operation,
itself. [Note: data flow is NOT the fastest way to execute programs exhibiting
regular parallelism.]
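[Note: a purely illustrative Python sketch of the local firing rule: a node fires as soon as all of its input tokens have arrived, with no global controller or program counter deciding when.]

    class DataflowNode:
        def __init__(self, name, num_inputs, fn):
            self.name, self.fn = name, fn
            self.slots = [None] * num_inputs
        def deliver(self, index, value):
            self.slots[index] = value
            if all(s is not None for s in self.slots):   # readiness is local to the node
                print(self.name, "fires ->", self.fn(*self.slots))

    add = DataflowNode("add", 2, lambda a, b: a + b)
    add.deliver(0, 3)   # only one token so far: nothing happens
    add.deliver(1, 4)   # second token arrives, the node fires on its own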
8. a. reservation station. holds operations until they are ready to fire.
b. reorder buffer. holds operations to enable in-order retirement.
c. in order. to maintain sequential dependencies, program semantics.
d. no, phase 3 can pretty much be done at the microarchitecture's leisure.
[Note: On grading part d, I noticed an overwhelming attention to a balanced
design, leading one to answer yes. I decided to be very lenient in the
grading of 8 (d). The concept of balanced design is a correct concept, but
one must go deeper to understand its meaning. In the case of retirement, one
has to ask if retiring slowly somehow adversely affects the ability to
aggressively fetch, decode, rename, execute out-of-order, etc.]
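[Note: for parts 8(b) and 8(c), the point is that only the completed instruction at the head of the reorder buffer may retire; a small Python sketch, with made-up structure names, of strictly in-order retirement.]

    from collections import deque

    rob = deque()    # (instruction, completed?) entries kept in program order

    def retire(rob):
        # completion may happen out of order, but retirement only drains from the head
        retired = []
        while rob and rob[0][1]:
            retired.append(rob.popleft()[0])
        return retired

    rob.extend([("ld r1", False), ("add r2", True), ("st r3", True)])
    print(retire(rob))    # []: the add and st are done but cannot retire past the pending ld
    rob[0] = ("ld r1", True)
    print(retire(rob))    # ['ld r1', 'add r2', 'st r3'], in program order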
9.
a. (1) to hold the data until we can retire the ST to memory in program order.
(2) to enable memory disambiguation.
bi. process LD immediately.
bii. satisfy the LD with the data associated with the ST.
biii. process LD immediately.
biv. (a) put LD in the queue and wait until ST address is known. Safe, although slower.
(b) assume addresses are different and process LD. Aggressive, but
runs the danger of having to back up from this point if the addresses
turn out to be the same.
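[Note: the cases above amount to scanning the store queue, youngest first, for an older store whose address matches or is still unknown; a Python sketch with made-up names, loosely labeled with the cases.]

    def handle_load(load_addr, store_queue, speculate_on_unknown=False):
        # store_queue: (addr, data) pairs for not-yet-retired older stores, oldest first;
        # addr is None when that store's address has not been computed yet
        for addr, data in reversed(store_queue):       # youngest matching older store wins
            if addr == load_addr:
                return ("forward", data)               # case (ii): take the data from the ST
            if addr is None:
                if speculate_on_unknown:
                    continue                           # case (iv)(b): assume no conflict, may have to back up
                return ("stall", None)                 # case (iv)(a): wait for the ST address
        return ("go to memory", None)                  # no conflicting older store: process the LD immediately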
10.
a. gshare reduces negative interference by XORing the branch address with the
global history, directing the histories of two different branches to two
different 2-bit counters (a sketch follows part d below).
b. the resulting hybrid allows the currently more accurate predictor to drive
the prediction. It also allows the 2-bit counter to be used while the two-level
predictor is warming up.
c. the agree predictor turns negative interference into positive interference
by having both interfering branches predict according to their "saved" (bias)
direction, rather than the absolute direction; since most branches follow their
bias, aliasing branches tend to push the shared counter the same way.
d. the path predictor distinguishes predictions made along different paths to a
problematic branch, rather than lumping many paths together just because they
happen to have the same recent sequence of branch directions.
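[Note: for part (a), the mechanism is the XOR of the branch address with the global history when indexing the table of 2-bit counters; a small Python sketch, with arbitrary sizes chosen only for illustration.]

    HIST_BITS = 12
    TABLE_SIZE = 1 << HIST_BITS
    counters = [2] * TABLE_SIZE       # 2-bit saturating counters, start weakly taken
    ghr = 0                           # global history register

    def gshare_index(pc):
        # XORing the branch address into the history spreads two different branches
        # that share the same history across different counters
        return ((pc >> 2) ^ ghr) & (TABLE_SIZE - 1)

    def predict(pc):
        return counters[gshare_index(pc)] >= 2        # taken if the counter is 2 or 3

    def update(pc, taken):
        global ghr
        i = gshare_index(pc)
        counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
        ghr = ((ghr << 1) | int(taken)) & (TABLE_SIZE - 1)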