Exam 1 Solution Sheet

Key ideas I was looking for on each of the problems of the midterm.

1. RAS -- always part of the microarchitecture. The Alpha ISA does provide
hint bits within its instructions, but compilers are not obliged to use them.

Delayed branch -- part of ISA. Semantics of branch instruction reflects
the delay slot.

second-level cache -- depends; historically part of microarchitecture, more
and more today it is part of ISA, with prefetch instructions that manipulate it.

2. Unless POLYH contributes to the bread and butter, he should not be allowed
to waste those transistors making it run fast. Bread and butter design: make
sure everything works tolerably, but invest most of the transistors in
improving those things that matter, that is, that the machine will be called
upon to do a lot.

Maximum IPC is 2. ALU and LD/ST unit are the bottleneck. Balanced design is
violated: the issue width is 8 and the reservation station is very large (by
today's standards), but there are only two functional units. Balanced design
says invest in some more functional units.

3. predicated execution -- removes the control dependency and saves a
misprediction penalty.

static scheduling -- e.g., move loads up so cache miss latency can be hidden.

superblock scheduling -- make fall through (on branches) the more common case.

insert prefetch instructions

eliminate branches -- RS 6000 -- by combining multiple relationals into one
composite predicate

organize data to make cache lines more useful
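
The last point, organizing data so cache lines are more useful, can be
illustrated with a little arithmetic. The sketch below counts the cache lines
touched when one field is read from each of many records. The 64-byte line,
the 32-byte record, and the 4-byte field are assumed numbers chosen for
illustration, and `lines_touched` is a hypothetical helper.

```python
import math


def lines_touched(num_records, stride_bytes, line_bytes=64):
    """Cache lines touched when reading one field from every record.

    stride_bytes is the distance between consecutive copies of the field.
    Assumes the data starts on a line boundary and that a field never
    straddles two lines.
    """
    span_lines = math.ceil(num_records * stride_bytes / line_bytes)
    # Each access touches exactly one line, so the count cannot exceed
    # the number of records.
    return min(num_records, span_lines)


# Field embedded in a 32-byte record (array of records): stride is 32 bytes.
print(lines_touched(1000, 32))
# Same field packed contiguously (separate array): stride is 4 bytes.
print(lines_touched(1000, 4))
```

With these assumed sizes the packed layout touches 63 lines instead of 500 for
the same 1000 reads.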

4. Predicated instructions remove a branch from the instruction stream.
Fewer branches mean fewer misprediction penalties. Use when prediction
accuracy is not high, since the additional flow dependency is better than
the cost of a misprediction penalty. Don't use when branch prediction
accuracy is high, since the additional data dependency will slow you down;
better to speculate and go. Secondary negative effect of predication is
increased ifetch bandwidth, which means weaker cache utilization. If this
is an issue, it can work against predicating, depending on the degree of
code bloat due to fetching down both paths.
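
The tradeoff can be put as a rough expected-cost comparison. This is a
simplified sketch, not a model from the course: it assumes a fixed
misprediction penalty in cycles and lumps the extra data dependency and the
extra fetch into one `predication_overhead` number. All names and numbers are
hypothetical.

```python
def expected_branch_cost(accuracy, mispredict_penalty):
    """Average cycles lost per branch to misprediction."""
    return (1.0 - accuracy) * mispredict_penalty


def should_predicate(accuracy, mispredict_penalty, predication_overhead):
    """True when predicating is cheaper on average than predicting.

    predication_overhead: assumed cycles lost per predicated branch to the
    added data dependency and to fetching both paths.
    """
    return expected_branch_cost(accuracy, mispredict_penalty) > predication_overhead


# Hard-to-predict branch: predication wins.
print(should_predicate(accuracy=0.60, mispredict_penalty=15, predication_overhead=2))
# Well-predicted branch: better to speculate and go.
print(should_predicate(accuracy=0.99, mispredict_penalty=15, predication_overhead=2))
```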

5. The crux of this problem was to examine and compare two paradigms that
deal with wide-issue of a single thread. Paradigms that do not deal with
wide-issue of a single thread were not so helpful to the discussion.

Superscalar: advantage -- packing; disadvantage -- dependency check at
rename can stretch cycle time, or add cycles to decode/rename.
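
To see why the rename-time check grows uncomfortable with width, count the
comparisons. The sketch assumes each instruction has two source registers and
must compare each of them against the destination of every earlier instruction
in the same issue group. Real rename logic does more than this, and
`rename_comparators` is a hypothetical name.

```python
def rename_comparators(issue_width, sources_per_instruction=2):
    """Source-vs-earlier-destination comparisons within one issue group."""
    # Instruction i (0-based) has i earlier instructions in the group.
    earlier_pairs = issue_width * (issue_width - 1) // 2
    return sources_per_instruction * earlier_pairs


for width in (2, 4, 8):
    print(width, rename_comparators(width))
```

The count grows quadratically with issue width, which is what threatens the
cycle time.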

VLIW: advantage -- no dependency check, potential for shorter cycle;
disadvantage -- the fixed length usually requires too many no-ops, which means
worse use of cache and of off-chip bandwidth.
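
The no-op cost of a fixed-length long word is easy to quantify. In this
sketch, a schedule is a list giving the number of useful operations the
compiler found for each long instruction word. The bundle width and the
example schedule are made-up numbers, and `nop_fraction` is a hypothetical
helper.

```python
def nop_fraction(ops_per_bundle, bundle_width):
    """Fraction of instruction slots filled with no-ops."""
    total_slots = len(ops_per_bundle) * bundle_width
    useful_slots = sum(ops_per_bundle)
    return (total_slots - useful_slots) / total_slots


# Four 4-wide words holding 4, 1, 2, and 1 useful operations.
print(nop_fraction([4, 1, 2, 1], bundle_width=4))
```

In this example half of the fetched slots, and so half of the cache space and
fetch bandwidth they occupy, carry no work.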

Two examples:

imminent commercial product: I had in mind EPIC, which is VLIW when you need
it to be (template bits). Some people indicated Trace Cache, which is indeed
imminent in a commercial product.

strictly research: Block-structured ISA. Much as I believe in it, no one
has embraced it YET. Wide-issue organization of the block at compile time;
allows dependencies in the block, which are dealt with at run time. Internal
producer/consumer does not affect cycle time because it is established at
compile time.

6. Condition codes: plus -- an extra piece of work in the same instruction,
more effectively uses cache, also does not tie up gprs with the result of
a relational; minus -- forces (except for RS 6000) serialization, since
if you don't use the cc right away, the next instruction will probably clobber them.

RS 6000 has multiple sets of cc, so one instruction can set one set, and several
instructions downstream, that set can be tested.

RS 6000 bonus: combining multiple relationals into one predicate eliminates
a branch.
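
The transformation behind the bonus can be sketched at the source level.
Python has no condition registers, so this only illustrates the shape of the
change: two relationals, each tested by its own branch, become two relationals
combined into one predicate that is tested once. Function names are
hypothetical.

```python
def in_range_two_branches(x, lo, hi):
    """One conditional branch per relational."""
    if x < lo:   # first compare and branch
        return False
    if x > hi:   # second compare and branch
        return False
    return True


def in_range_one_predicate(x, lo, hi):
    """Both relationals are evaluated, then combined into one predicate."""
    ge_lo = x >= lo   # would set one condition-code set
    le_hi = x <= hi   # would set a different condition-code set
    # Non-short-circuit AND of the two results; only this combined value
    # would need to be tested by a branch.
    return ge_lo & le_hi


print(in_range_one_predicate(5, 3, 7))
```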

7. Variable length: advantage -- denser code
Fixed length: advantage -- easier decode

In the future: wider issue means decoding is a bigger problem for variable
length. Faster on-chip frequencies (or greater disparity between on-chip and
off-chip) mean denser code yields better use of caches and lower off-chip
bandwidth requirements.

Some people took features of particular fixed length ISAs that had nothing
to do with fixed length vs variable length and argued that such a feature
made fixed length better or worse. Not good.

8. Major advantage of John Cocke's approach: No wasted microcycles. However,
it does require a more complex compiler, and it causes code bloat, which
translates into both lesser cache effectiveness and greater memory bandwidth
need.

9. Load/Store ISA is one where the only way you get data into the data path
is via a LD or a ST, AND you are not allowed to operate on a datum in the
same instruction that you perform a memory access on it. IA-32 is not a
LD/ST ISA; Alpha, PowerPC, and SPARC are three examples of LD/ST ISAs.

Advantage of LD/ST: more flexible static scheduling since memory access and
operates are decoupled at ISA level.

Advantage of non-LD/ST: denser code, resulting in better cache utilization,
smaller demand for memory bandwidth. Secondary consideration: usually, non-
LD/ST yields a simpler compiler that can match patterns of HLL to the available
instructions in the ISA. Not always a win, since it depends on how good this
matching is. Sometimes it can actually be a disaster, since the stuff is
implemented but not terribly useful (but I digress).

LD/ST advantage is less important today with ooo execution that decouples
the non-LD/ST instruction into its component pieces. In fact, some
manufacturers even go so far as to call these pieces *RISCops*! Dense
encoding advantage of non-LD/ST is even more relevant today with higher
off-chip latencies.
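
The decoupling described above can be pictured as a small decode step. The
sketch splits a register-memory instruction into a load followed by a
register-only operate. The tuple encoding, the `crack` function, and the
temporary register `tmp0` are all invented for illustration and do not
describe any manufacturer's actual scheme.

```python
def crack(instr):
    """Split a hypothetical register-memory instruction into micro-ops.

    instr is (opcode, dest_reg, src), where src is either a register name
    or a ("mem", address_expr) pair.
    """
    opcode, dest, src = instr
    if isinstance(src, tuple) and src[0] == "mem":
        tmp = "tmp0"  # hypothetical internal temporary register
        return [("LD", tmp, src[1]), (opcode, dest, tmp)]
    # Register-only instructions pass through unchanged.
    return [instr]


print(crack(("ADD", "r1", ("mem", "r2+8"))))
```

Once the pieces are separate, the out-of-order core can schedule the load
early on its own, which is the flexibility the LD/ST ISA offered statically.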

Wide-issue decoding is not an issue, since one can have non-LD/ST and fixed
length instructions. That is, non-LD/ST does not demand variable length
instructions.

10. Always the case that cost, power are issues. Always the case that cost
vs. performance is the basic tradeoff. More than that, the focus to decide
which to build should involve the nature of the applications that form the
bread and butter.

Some characteristics that would make the choice a no-brainer:

importance of aggressive branch predictor, wide-issue enabler:
that is, how badly do I need these aggressive features.

availability of multiple threads on behalf of the same task

multiple threads with limited ILP but lots of interprocessor
communication.

I looked for justification for your choice. No points for simply choosing
wide-issue superscalar without strong justification. Lots of points for
telling me wide-issue superscalar is a crock with substantive back-up!