On Tue, Mar 31, 2020 at 6:52 AM Yale N. Patt wrote:

My students,

I was delighted to listen to some of the discussion during Aniket's review session. I thought he did an excellent job. A few items I thought I would comment on:

1. Braids. Think of a program as a huge data flow graph. Draw a circle around the whole thing. To process it you need a very expensive superscalar out-of-order processor if you are serious about performance. Instead, draw a circle around a chain of nodes (micro-ops). That you can process with a very simple processor; call it a chain execution unit. The problem is that there are so many dependencies between the chains (one's live-out is a source of another's live-in) that even though the chains could execute fast, they end up stalling a lot waiting for values produced by another chain.

The soul of braids: you want to put a circle around a piece of the data flow graph (the braid) that is more complex than a chain but has very few live-in/live-out dependencies. If the compiler gets it right, the braid can be executed very fast by a simple processor and will stall on a live-in/live-out dependency only infrequently. The idea is that since the braids can execute pretty much in parallel on braid execution units, a collection of very simple braid execution units gets you almost the performance of the heavyweight superscalar out-of-order processor.

SSMT. First, the value of SMT comes from the fact that the multiple threads can keep all the functional units working almost all the time. That is, the capacity of the microarchitecture is heavily used. But what if the application cannot be broken into more than one thread? The result would be heavily underutilized functional units. SSMT threads (or helper threads, as others have correctly renamed them) are small threads that do not process code from the single-threaded application that is running, but rather sample the behavior of the on-chip structures in order to improve their ability to do their job.
Jobs like: a better branch predictor, or a better cache replacement policy, or re-compilation on the basis of the actual runtime data.

Superpipelined. I talked about it more to give you a more comprehensive picture of superscalar, etc. Superpipelined is actually a silly idea, since it allows non-pipelined designs to be superpipelined, and since every design has some structure that takes longer than the fetch time, essentially every machine is superpipelined. So, saying a processor is superpipelined does not provide much useful information.

Good fortune. Something you are able to do because something lucky happened that you did not have anything to do with. Examples: no need for a separate 487 floating-point co-processor; the addition of MMX instructions to the Pentium chip. Both happened because Moore's Law provided extra transistors.

One more thing: I hope you are taking advantage of your classmates and picking their brains in preparation for the exam. That is a big benefit of study groups, although I will admit it is much more difficult during the current nightmare.

Good luck on the exam.
Yale Patt
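The braid criterion described above — circle a piece of the data flow graph that is more complex than a chain but has very few live-in/live-out dependencies — can be sketched as a small scoring function over a graph. This is a minimal sketch, not the actual braid-formation algorithm: the adjacency representation and the function name `live_in_out` are illustrative assumptions.

```python
# Sketch: count the live-ins and live-outs of a candidate braid (a set
# of micro-op nodes) in a data flow graph. A good braid has many nodes
# inside the circle but few edges crossing its boundary.

def live_in_out(graph, braid):
    """graph: dict mapping node -> set of successor nodes (dataflow edges).
    braid: set of nodes proposed as one braid.
    Returns (live_ins, live_outs), the edge counts crossing the boundary."""
    live_ins = 0
    live_outs = 0
    for src, dsts in graph.items():
        for dst in dsts:
            if src in braid and dst not in braid:
                live_outs += 1   # value produced inside, consumed outside
            elif src not in braid and dst in braid:
                live_ins += 1    # value produced outside, consumed inside
    return live_ins, live_outs

# A chain a->b->c->d plus one side value e feeding c:
graph = {'a': {'b'}, 'b': {'c'}, 'c': {'d'}, 'd': set(), 'e': {'c'}}

# Circling {a, b, c, d} as one braid leaves a single live-in (e->c) and
# no live-outs: complex inside, almost nothing crossing the boundary.
print(live_in_out(graph, {'a', 'b', 'c', 'd'}))  # -> (1, 0)
```

A compiler forming braids would search for cuts like this one, trading off braid size against boundary-edge count, so that a simple braid execution unit rarely has to stall waiting on another unit's live-out.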