On Tue, Mar 31, 2020 at 6:52 AM Yale N. Patt wrote:

My students,

I was delighted to listen to some of the discussion during Aniket's review session. I thought he did an excellent job. A few items I thought I would comment on:

1. Braids. Think of a program as a huge data flow graph. Draw a circle around the whole thing. To process it you need a very expensive superscalar out-of-order processor if you are serious about performance. Instead, draw a circle around a chain of nodes (micro-ops). That you can process with a very simple processor; call it a chain execution unit. The problem is that there are so many dependencies between the chains (one's live-out is a source of another's live-in) that even though the chains could execute fast, they end up stalling a lot waiting for values produced by another chain.

The soul of braids: you want to put a circle around a piece of the data flow graph (the braid) that is more complex than a chain but has very few live-in/live-out dependencies. If the compiler gets it right, the braid can be executed very fast by a simple processor and will stall on a live-in/live-out dependency only infrequently. The idea is that since the braids can execute pretty much in parallel on braid execution units, a collection of very simple braid execution units gets you almost the performance of the heavyweight superscalar out-of-order processor.

SSMT. First, the value of SMT comes from the fact that the multiple threads can keep all the functional units working almost all the time. That is, the capacity of the microarchitecture is heavily used. But what if the application cannot be broken into more than one thread? The result would be heavily underutilized functional units. SSMT threads (or helper threads, as others have correctly renamed them) are small threads that do not process code from the single-threaded application that is running, but rather sample the behavior of the on-chip structures in order to improve their ability to do their job.
Jobs like: a better branch predictor, or a better cache replacement policy, or re-compilation on the basis of the actual runtime data.

Superpipelined. I talked about it more to give you a more comprehensive picture of superscalar, etc. Superpipelined is actually a silly idea, since it allows non-pipelined designs to be superpipelined, and since every design has some structure that takes longer than the fetch time, essentially every machine is superpipelined. So, saying a processor is superpipelined does not provide much useful information.

Good fortune. Something you are able to do because something lucky happened that you did not have anything to do with. Examples: no need for a separate 487 floating-point co-processor; the addition of MMX instructions to the Pentium chip. Both happened because Moore's Law provided extra transistors.

One more thing: I hope you are taking advantage of your classmates and picking their brains in preparation for the exam. That is a big benefit of study groups, although I will admit it is much more difficult during the current nightmare.

Good luck on the exam.
Yale Patt
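The braid criterion described above — circle a piece of the data flow graph that is more complex than a chain but has very few live-in/live-out dependencies — can be sketched as a small scoring function over a graph. This is a minimal sketch, not the actual braid-formation algorithm: the adjacency representation and the function name `live_in_out` are illustrative assumptions.

```python
# Sketch: count the live-ins and live-outs of a candidate braid (a set
# of micro-op nodes) in a data flow graph. A good braid has many nodes
# inside the circle but few edges crossing its boundary.

def live_in_out(graph, braid):
    """graph: dict mapping node -> set of successor nodes (dataflow edges).
    braid: set of nodes proposed as one braid.
    Returns (live_ins, live_outs), the edge counts crossing the boundary."""
    live_ins = 0
    live_outs = 0
    for src, dsts in graph.items():
        for dst in dsts:
            if src in braid and dst not in braid:
                live_outs += 1   # value produced inside, consumed outside
            elif src not in braid and dst in braid:
                live_ins += 1    # value produced outside, consumed inside
    return live_ins, live_outs

# A chain a->b->c->d plus one side value e feeding c:
graph = {'a': {'b'}, 'b': {'c'}, 'c': {'d'}, 'd': set(), 'e': {'c'}}

# Circling {a, b, c, d} as one braid leaves a single live-in (e->c) and
# no live-outs: complex inside, almost nothing crossing the boundary.
print(live_in_out(graph, {'a', 'b', 'c', 'd'}))  # -> (1, 0)
```

A compiler forming braids would search for cuts like this one, trading off braid size against boundary-edge count, so that a simple braid execution unit rarely has to stall waiting on another unit's live-out.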