Key ideas I was looking for on each of the problems of the midterm.
1. RAS -- always part of the microarchitecture. The Alpha ISA does provide
hint bits within its instructions, but compilers are not obliged to use them.
Delayed branch -- part of the ISA. The semantics of the branch instruction
reflect the delay slot.
second-level cache -- depends; historically part of the microarchitecture,
but more and more today it is part of the ISA, with prefetch instructions
that manipulate it.
2. Unless POLYH contributes to the bread and butter, he should not be allowed
to waste those transistors making it run fast. Bread and butter design: make
sure everything works tolerably, but invest most of the transistors in
improving those things that matter, that is, the things the machine will be
called upon to do a lot.
Maximum IPC is 2. The ALU and LD/ST unit are the bottleneck.
Balanced design is violated. Issue width is 8, with a very large (by today's
standards) reservation station, but only two functional units. Balanced
design says invest in some more functional units.
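
The IPC bound above can be checked with a little arithmetic. This is a rough sketch, not the grading model: the function name `ipc_bound`, the unit classes, and the instruction mixes are my own assumptions, and each functional unit is assumed to be fully pipelined and to accept one instruction per cycle.

```python
def ipc_bound(issue_width, units, mix):
    """Upper bound on sustained IPC for a given instruction mix.

    units: number of functional units per instruction class (hypothetical names).
    mix:   fraction of the instruction stream in each class (sums to 1).
    Assumes each unit accepts one instruction per cycle.
    """
    bound = float(issue_width)
    for cls, frac in mix.items():
        if frac > 0:
            # A class with `units[cls]` units can sustain units/frac total IPC.
            bound = min(bound, units[cls] / frac)
    return bound


# 8-wide issue, one ALU and one LD/ST unit, evenly split mix (assumed).
print(ipc_bound(8, {"alu": 1, "ldst": 1}, {"alu": 0.5, "ldst": 0.5}))  # 2.0

# Investing in more functional units moves the bound toward the issue width.
print(ipc_bound(8, {"alu": 4, "ldst": 2}, {"alu": 0.5, "ldst": 0.5}))  # 4.0
```

With any less even mix the two-unit machine does worse than 2, which is the balanced-design point: the 8-wide front end is paid for but cannot be used.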
3. predicated execution -- removes the control dependency and saves a
misprediction penalty.
static scheduling -- e.g., move loads up so cache miss latency can be hidden.
superblock scheduling -- make fall through (on branches) the more common case.
insert prefetch instructions
eliminate branches -- RS 6000 -- by combining multiple relationals into one
composite predicate
organize data to make cache lines more useful
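
To make the last item concrete, here is a small sketch that counts how many cache lines a loop touches under two data layouts. The 64-byte line, the 64-byte record, and the 8-byte hot field are all assumed numbers chosen for illustration.

```python
LINE_BYTES = 64  # assumed cache line size


def lines_touched(addresses, line_bytes=LINE_BYTES):
    """Number of distinct cache lines covering the given byte addresses."""
    return len({addr // line_bytes for addr in addresses})


N = 1024
RECORD_BYTES = 64  # hypothetical record: one 8-byte hot field plus 56 cold bytes
FIELD_BYTES = 8

# Array of records: the hot field of record i sits at i * RECORD_BYTES.
array_of_records = [i * RECORD_BYTES for i in range(N)]

# Hot fields packed contiguously: field i sits at i * FIELD_BYTES.
packed_fields = [i * FIELD_BYTES for i in range(N)]

print(lines_touched(array_of_records))  # 1024
print(lines_touched(packed_fields))     # 128
```

Under these assumptions the packed layout brings in one eighth as many lines for the same useful data, so every line fetched is fully used.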
4. Predicated instructions remove a branch from the instruction stream.
Fewer branches mean fewer misprediction penalties. Use when prediction
accuracy is not high, since the additional flow dependency is better than
the cost of a misprediction penalty. Don't use when branch prediction
accuracy is high, since the additional data dependency will slow you down;
better to speculate and go. Secondary negative effect of predication is
increased ifetch bandwidth, which means weaker cache utilization. If this
is an issue, it can work against predicating, depending on the degree of
code bloat due to fetching down both paths.
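
The tradeoff in this answer can be written as a back-of-the-envelope cost comparison. This is only a sketch: the function names, the 20-cycle penalty, and the 3-cycle predication overhead (extra dependency plus fetching both paths) are assumed values, not numbers from the exam.

```python
def expected_branch_cost(accuracy, mispredict_penalty):
    """Expected cycles lost per branch to misprediction."""
    return (1.0 - accuracy) * mispredict_penalty


def should_predicate(accuracy, mispredict_penalty, predication_overhead):
    """Predicate when its fixed overhead is below the expected branch cost.

    predication_overhead lumps together the added data dependency and the
    cost of fetching both paths (an assumed, per-branch cycle count).
    """
    return predication_overhead < expected_branch_cost(accuracy, mispredict_penalty)


# Poorly predicted branch: predication wins.
print(should_predicate(accuracy=0.70, mispredict_penalty=20, predication_overhead=3))  # True

# Well predicted branch: better to speculate and go.
print(should_predicate(accuracy=0.98, mispredict_penalty=20, predication_overhead=3))  # False
```

More code bloat from fetching both paths raises `predication_overhead`, which pushes the break-even point toward lower prediction accuracy.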
5. The crux of this problem was to examine and compare two paradigms that
deal with wide-issue of a single thread. Paradigms that do not deal with
wide-issue of a single thread were not so helpful to the discussion.
Superscalar: advantage -- packing; disadvantage -- dependency check at
rename can stretch cycle time, or add cycles to decode/rename.
VLIW: advantage -- no dependency check, potential for shorter cycle;
disadvantage -- the fixed length usually requires too many no-ops, which
means worse use of cache and of off-chip bandwidth.
Two examples:
imminent commercial product: I had in mind EPIC, which is VLIW when you need
it to be (template bits). Some people indicated Trace Cache, which is indeed
imminent in a commercial product.
strictly research: Block-structured ISA. Much as I believe in it, no one
has embraced it YET. Wide-issue organization of the block at compile time;
allows dependencies in the block, which are dealt with at run time. Internal
producer/consumer does not affect cycle time because it is established at
compile time.
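
To put rough numbers on the superscalar and VLIW disadvantages in this answer, here is a sketch of both costs. The function names are mine, two source operands per instruction is an assumption, and the per-word operation counts are made up for illustration.

```python
def rename_comparators(width, sources_per_instr=2):
    """Comparisons needed to check dependencies within one rename group.

    Each instruction compares every source register against the destination
    of every earlier instruction in the same group.
    """
    return sum(sources_per_instr * earlier for earlier in range(width))


def vliw_noop_fraction(ops_per_word, word_slots):
    """Fraction of slots holding no-ops in a fixed-length VLIW encoding."""
    total_slots = len(ops_per_word) * word_slots
    return (total_slots - sum(ops_per_word)) / total_slots


# Superscalar: the check grows roughly with the square of the issue width.
for width in (2, 4, 8):
    print(width, rename_comparators(width))  # 2 2, then 4 12, then 8 56

# VLIW: four 4-slot words carrying 3, 1, 2, and 1 useful operations (assumed).
print(vliw_noop_fraction([3, 1, 2, 1], word_slots=4))  # 0.5625
```

The first number is what can stretch the cycle time or add stages at rename; the second is the cache and off-chip bandwidth spent on no-ops.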
6. Condition codes: plus -- an extra piece of work in the same instruction,
more effectively uses cache, also does not tie up GPRs with the result of
a relational; minus -- forces (except for RS 6000) serialization, since if
you don't use the cc right away, the next instruction will probably clobber
them.
RS 6000 has multiple sets of cc, so one instruction can set one set, and
several instructions downstream, that set can be tested.
RS 6000 bonus: combining multiple relationals into one predicate eliminates
a branch.
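
The multiple-cc-set idea can be modeled in a few lines. This is a toy model, not the RS 6000 definition: the class name, the method names, and the choice of eight fields with lt/gt/eq bits are assumptions made for illustration.

```python
class ConditionRegister:
    """Toy model of a condition register with several independent fields."""

    def __init__(self, n_fields=8):  # eight fields is an assumption here
        self.fields = [None] * n_fields

    def compare(self, field, a, b):
        """A compare names the field it sets, so other fields survive."""
        self.fields[field] = {"lt": a < b, "gt": a > b, "eq": a == b}

    def bit(self, field, name):
        return self.fields[field][name]


def in_range(cr, x, lo, hi):
    """Evaluate lo <= x <= hi with two compares and ONE final test."""
    cr.compare(0, x, lo)   # sets field 0
    cr.compare(1, x, hi)   # sets field 1; field 0 is not clobbered
    # Combine the two relationals into one composite predicate. A single
    # branch on this result replaces two separate compare-and-branch pairs.
    return (not cr.bit(0, "lt")) & (not cr.bit(1, "gt"))


cr = ConditionRegister()
print(in_range(cr, 5, 1, 10))   # True
print(in_range(cr, 11, 1, 10))  # False
```

With a single cc set, the second compare would overwrite the first result, and the first test would have to be branched on immediately.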
7. Variable length: advantage -- denser code.
Fixed length: advantage -- easier decode.
In the future: wider issue means decoding is a bigger problem for variable
length. Faster on-chip frequencies (or greater disparity between on-chip and
off-chip) mean denser code yields better use of caches and lower off-chip
bandwidth requirements.
Some people took features of particular fixed length ISAs that had nothing
to do with fixed length vs variable length and argued that such a feature
made fixed length better or worse. Not good.
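
Why wide-issue decode is harder for variable length can be seen from how instruction boundaries are found. The sketch below uses a made-up encoding (the low two bits of the first byte give the length minus one); it is not any real ISA.

```python
def fixed_starts(n_bytes, length=4):
    """Fixed length: every start address is known without looking at the code."""
    return list(range(0, n_bytes, length))


def hypothetical_length(first_byte):
    """Made-up encoding: low two bits of the first byte hold length - 1."""
    return (first_byte & 0x3) + 1


def variable_starts(code, length_of=hypothetical_length):
    """Variable length: each start depends on decoding the previous instruction."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += length_of(code[pc])  # serial chain: next pc needs this length
    return starts


print(fixed_starts(16))  # [0, 4, 8, 12]

code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x00])
print(variable_starts(code))  # [0, 2, 3, 6]
```

In the fixed-length case all the decoders of a wide-issue machine can start in parallel. In the variable-length case the chain through `pc` has to be broken with extra hardware or extra cycles.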
8. Major advantage of John Cocke's approach: No wasted microcycles. However,
it does require a more complex compiler, and it causes code bloat, which
translates into both lesser cache effectiveness and greater memory bandwidth
need.
9. Load/Store ISA is one where the only way you move data between memory and
the data path is via a LD or a ST, AND you are not allowed to operate on a
datum in the same instruction that performs a memory access on it. IA-32 is
not a LD/ST ISA; Alpha, PowerPC, and SPARC are three examples of LD/ST ISAs.
Advantage of LD/ST: more flexible static scheduling since memory accesses and
operates are decoupled at ISA level.
Advantage of non-LD/ST: denser code, resulting in better cache utilization,
smaller demand for memory bandwidth. Secondary consideration: usually,
non-LD/ST yields a simpler compiler that can match patterns of HLL to the
available instructions in the ISA. Not always a win, since it depends on how
good this matching is. Sometimes it can actually be a disaster, since the
stuff is implemented but not terribly useful (but I digress).
LD/ST advantage is less important today with ooo execution that decouples
the non-LD/ST instruction into its component pieces. In fact, some
manufacturers even go so far as to call these pieces *RISCops*! Dense
encoding advantage of non-LD/ST is even more relevant today with higher
off-chip latencies.
Wide-issue decoding is not an issue, since one can have non-LD/ST and fixed
length instructions. That is, non-LD/ST does not demand variable length
instructions.
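
The decoupling into component pieces can be sketched as a tiny instruction cracker. The assembly syntax, the `tmp` register, and the function name `crack` are all hypothetical; real hardware does this in the decoder, not on strings.

```python
def crack(instr):
    """Split a hypothetical 'OP dst, src' instruction into LD/ST-style pieces.

    If src is a memory operand written as '[addr]', emit a load followed by
    a register-only operate; otherwise emit the operate alone.
    """
    op, operands = instr.split(None, 1)
    dst, src = [s.strip() for s in operands.split(",")]
    if src.startswith("[") and src.endswith("]"):
        return [f"LD tmp, {src}", f"{op} {dst}, {dst}, tmp"]
    return [f"{op} {dst}, {dst}, {src}"]


print(crack("ADD R1, [R2]"))  # ['LD tmp, [R2]', 'ADD R1, R1, tmp']
print(crack("ADD R1, R3"))    # ['ADD R1, R1, R3']
```

Once the pieces are separate, the out-of-order core can schedule the load early on its own, which is the flexibility the LD/ST ISA gave the compiler. The dense single-instruction encoding is still what sits in the cache.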
10. Always the case that cost, power are issues. Always the case that cost
vs. performance is the basic tradeoff. More than that, the decision of
which to build should focus on the nature of the applications that form the
bread and butter.
Some characteristics that would make the choice a no-brainer:
importance of aggressive branch predictor, wide-issue enabler: that is, how
badly do I need these aggressive features.
availability of multiple threads on behalf of the same task
multiple threads with limited ILP but lots of interprocessor communication.
I looked for justification for your choice. No points for simply choosing
wide-issue superscalar without strong justification. Lots of points for
telling me wide-issue superscalar is a crock with substantive back-up!