Key ideas I was looking for on each of the problems of the midterm.
1. RAS -- always part of the microarchitecture. The Alpha ISA does provide
hint bits within its instructions, but compilers are not obliged to use them.
Delayed branch -- part of the ISA. The semantics of the branch instruction
define the delay slot.
Second-level cache -- depends; historically part of the microarchitecture,
but more and more today it is part of the ISA, with prefetch instructions that manipulate it.
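The return address stack above is easy to sketch. A minimal toy model (not any particular machine's design; the class name and depth are hypothetical) of call/return target prediction:

```python
# Minimal sketch of a return address stack (RAS) predictor.
# On a call, push the return address; on a return, pop it as the prediction.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth          # hardware stacks are small and fixed-size
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)       # overflow: discard the oldest entry
        self.stack.append(return_addr)

    def on_return(self):
        # Underflow yields no prediction (fall back to the BTB, say)
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1004)                 # call at 0x1000; return address 0x1004
ras.on_call(0x2008)                 # nested call
assert ras.on_return() == 0x2008    # returns unwind in LIFO order
assert ras.on_return() == 0x1004
```

Nothing here needs ISA support, which is the point of the answer: the stack lives entirely in the microarchitecture.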
2. Unless POLYH contributes to the bread and butter, he should not be
willing to waste those transistors making it run fast. Bread and butter design: make
sure everything works tolerably, but invest most of the transistors in
improving those things that matter, that is, that the machine will be called
upon to do a lot.
Maximum IPC is 2. The ALU and the LD/ST unit are the bottleneck. Balance is
violated: issue width is 8, very large (by today's standards) reservation
station, but only two functional units. Balanced design says invest in some
more functional units.
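The bottleneck arithmetic is worth making explicit: sustained IPC cannot exceed the narrowest stage of the machine. A one-liner with the numbers as described above:

```python
# Sustained IPC is bounded by the narrowest stage of the machine.
issue_width = 8            # instructions issued per cycle
functional_units = 2       # one ALU + one LD/ST unit
max_ipc = min(issue_width, functional_units)
assert max_ipc == 2        # the functional units, not issue width, set the limit
```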
3. predicated execution -- removes the control dependency and saves the misprediction penalty.
static scheduling -- e.g., move loads up so cache miss latency can be hidden.
superblock scheduling -- make fall through (on branches) the more common case.
insert prefetch instructions
eliminate branches -- RS 6000 -- by combining multiple relationals into one predicate.
organize data to make cache lines more useful
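The branch-elimination idea can be shown at source level. A hypothetical range check: two dependent branches versus one combined predicate (the function names are mine, and Python is standing in for what the compiler would do on the relationals):

```python
# Combining multiple relationals into one predicate removes a branch.

# Two-branch form: each relational is tested separately.
def in_range_two_branches(x, lo, hi):
    if x < lo:
        return False
    if x > hi:
        return False
    return True

# One-branch form: the relationals are merged into a single predicate,
# which a machine like the RS 6000 can evaluate and test once.
def in_range_one_branch(x, lo, hi):
    return lo <= x <= hi

# Both forms compute the same answer; only the branch count differs.
assert in_range_two_branches(5, 1, 10) == in_range_one_branch(5, 1, 10) == True
assert in_range_two_branches(0, 1, 10) == in_range_one_branch(0, 1, 10) == False
```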
4. Predicated instructions remove a branch from the instruction stream.
Fewer branches mean fewer misprediction penalties. Use when prediction
accuracy is not high, since the additional flow dependency is better than
the negative of a misprediction penalty. Don't use when branch prediction
accuracy is high, since the additional data dependency will slow you down;
better to speculate and go. Secondary negative effect of predication is
increased ifetch bandwidth, which means weaker cache utilization. If this
is an issue, it can work against predicating, depending on the degree of
code bloat due to fetching down both paths.
5. The crux of this problem was to examine and compare two paradigms that
deal with wide-issue of a single thread. Paradigms that do not deal with
wide-issue of a single thread were not so helpful to the discussion.
Superscalar: advantage -- packing; disadvantage -- dependency check and
rename can stretch cycle time, or add cycles to decode/rename.
vliw: advantage -- no dependency check, potential for shorter cycle; disadvantage
is that the fixed length usually requires too many no-ops, hence worse use of the cache.
imminent commercial product: I had in mind EPIC, which is VLIW when the compiler wants
it to be (template bits). Some people indicated the Trace Cache, which is indeed
imminently on a commercial product.
strictly research: Block-structured ISA. Much as I believe in it, no one
has embraced it YET. Wide-issue organization of the block at compile time
allows dependencies in the block, which are dealt with at run time. Internal
producer/consumer does not affect cycle time because it is established at
compile time.
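The VLIW no-op cost mentioned above can be made concrete. A sketch that packs a dependent instruction stream into fixed-width words (the width, the mnemonics, and the dependence pattern are all hypothetical):

```python
# Sketch: packing a dependent instruction stream into 4-wide VLIW words.
# Each inner list is one cycle's worth of mutually independent operations.
schedule = [["add", "ld"], ["mul"], ["st", "add", "sub"]]
width = 4

words = []
for group in schedule:
    word = group + ["nop"] * (width - len(group))   # pad to the fixed width
    words.append(word)

nops = sum(w.count("nop") for w in words)
total_slots = len(words) * width
assert nops == 6 and total_slots == 12
# Half the slots are no-ops: that wasted space is exactly the
# cache-utilization and ifetch-bandwidth cost of the fixed format.
```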
6. Condition codes: plus -- an extra piece of work in the same instruction
more effectively uses the cache, and also does not tie up gprs with the result of
a relational; minus -- forces (except for the RS 6000) serialization,
since if you don't use the cc right away, the next instruction will probably clobber them.
The RS 6000 has multiple sets of cc, so one instruction can set one set, and several
instructions downstream, that set can be tested.
RS 6000 bonus: combining multiple relationals into one predicate eliminates branches.
7. Variable-length: advantage -- denser code.
Fixed length: advantage -- easier decode.
In the future: wider issue means decoding is a bigger problem for variable
length; faster on-chip frequencies (or greater disparity between on-chip and
off-chip) means denser code yields better use of caches and less off-chip traffic.
Some people took features of particular fixed length ISAs that had nothing
to do with fixed length vs variable length and argued that such a feature
made fixed length better or worse. Not good.
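The density argument is simple byte counting. A sketch with a hypothetical instruction mix (the sizes are invented, merely shaped like an IA-32-style encoding against a 4-byte fixed format):

```python
# Hypothetical instruction mix: per-instruction sizes in bytes under a
# variable-length encoding, vs. a uniform 4-byte fixed-length encoding.
variable_sizes = [1, 2, 2, 3, 5, 2, 1, 4]        # invented, IA-32-flavored
fixed_size = 4

variable_bytes = sum(variable_sizes)             # 20 bytes
fixed_bytes = fixed_size * len(variable_sizes)   # 32 bytes
assert variable_bytes < fixed_bytes
# Denser code -> more instructions per cache line and less off-chip traffic;
# the price is that instruction boundaries must be found before decode.
```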
8. Major advantage of John Cocke's approach: no wasted microcycles. Minus:
it does require a more complex compiler, and code bloat, which translates into
both lesser cache effectiveness and greater memory bandwidth need.
9. A Load/Store ISA is one where the only way you get data into the datapath
is via a LD or a ST, AND you are not allowed to operate on a datum in the
same instruction that performs a memory access on it. IA-32 is not a
LD/ST ISA; Alpha, PowerPC, and SPARC are three examples of LD/ST ISAs.
Advantage of LD/ST: more flexible static scheduling, since memory accesses and
operates are decoupled at the ISA level.
Advantage of non-LD/ST: denser code, resulting in better cache utilization,
smaller demand for memory bandwidth. Secondary consideration: usually, non-
LD/ST yields a simpler compiler that can match patterns of HLL to the available
instructions in the ISA. Not always a win, since it depends on how good the
matching is. Sometimes it can actually be a disaster, since the stuff is
implemented but not terribly useful (but I digress).
The LD/ST advantage is less important today with ooo execution that decouples
the non-LD/ST instruction into its component pieces. In fact, some
manufacturers even go so far as to call these pieces *RISCops*! The dense
encoding advantage of non-LD/ST is even more relevant today with higher
on-chip frequencies. Wide-issue decoding is not an issue, since one can have
non-LD/ST and fixed-length instructions. That is, non-LD/ST does not demand
variable-length encoding.
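The decoupling done by ooo machines can be sketched: a memory-operand instruction cracked into LD/ST-style pieces. The tuple format, the "tmp" register, and the function name are all hypothetical:

```python
# Cracking a non-LD/ST instruction into LD/ST-style pieces ("RISCops").
# An instruction is a (opcode, dest, src) tuple; "[r2]" marks a memory operand.
def crack(instr):
    op, dst, src = instr
    if src.startswith("["):
        # "add r1, [r2]" becomes a load into a temp, then a register add --
        # exactly the two pieces a LD/ST ISA would have written explicitly.
        return [("ld", "tmp", src), (op, dst, "tmp")]
    return [instr]             # pure register ops pass through unchanged

assert crack(("add", "r1", "[r2]")) == [("ld", "tmp", "[r2]"),
                                        ("add", "r1", "tmp")]
assert crack(("add", "r1", "r3")) == [("add", "r1", "r3")]
```

Once cracked, the scheduler sees the same decoupled pieces it would see from a LD/ST ISA, which is why the static-scheduling advantage has faded.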
10. Always the case that cost and power are issues. Always the case that cost
vs. performance is the basic tradeoff. More than that, the focus to decide
which to build should involve the nature of the applications that form the
bread and butter.
Some characteristics that would make the choice a no-brainer:
importance of an aggressive branch predictor and other wide-issue enablers --
that is, how badly do I need these aggressive features;
availability of multiple threads on behalf of the same task;
multiple threads with limited ILP but lots of interprocessor communication.
I looked for justification for your choice. No points for simply choosing
wide-issue superscalar without strong justification. Lots of points for
telling me wide-issue superscalar is a crock with substantive back-up!