Thu, 12 Feb 2009, 19:03

I wanted to give you all a more complete introduction to pipelining
on Wednesday, but per usual, time got away from me.  Given that you
have a lab due Sunday night, feel free to put this on the stack
until the lab is turned in. 

We will certainly revisit pipelining again and again in the context
of performance enhancements such as branch prediction and out-of-order
execution.  But before I leave the subject for now, there are a few
key ideas I want to be sure I get across, particularly in the context
of the earlier lectures on the microcoded LC-3b implementation.

I think you all understand that a pipeline (or assembly line, as I prefer
to call it) consists of stages, with one piece of the data path and the
relevant control signals present in each stage.  The relevant picture (for
a five-stage pipeline) is as follows:


    |<--fetch--->|<--decode-->|<--third--->|<--fourth-->|<--fifth--->| 
    |          | |          | |          | |          | |          | |
    |          | |          | |          | |          | |          | |
    |          | |          | |          | |          | |          | |
    |          | |          | |          | |          | |          | |
    |          | |          | |          | |          | |          | |
    |          | |          | |          | |          | |          | |

                L            L            L            L            L

Processing goes on as follows:

        cycle1 cycle2 cycle3 cycle4 cycle5 cycle6 cycle7 cycle8 
inst 1:   f      d      3      4      5
inst 2:          f      d      3      4      5
inst 3:                 f      d      3      4      5
inst 4:                        f      d      3      4      5
inst 5:                               f      d      3      4      
inst 6:                                      f      d      3    
inst 7:                                             f      d  
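
If you want to convince yourself of the pattern, here is a toy C sketch
(mine, not part of any lab) that prints the schedule above: with no bubbles,
instruction i simply occupies stage s during cycle i + s.

    #include <stdio.h>

    #define STAGES 5
    #define INSTS  7

    /* Stage labels as in the table above. */
    static const char *label[STAGES] = { "f", "d", "3", "4", "5" };

    int main(void)
    {
        /* Instruction i (0-indexed) is fetched in cycle i and occupies
           stage s during cycle i + s, so a new instruction enters fetch
           every cycle. */
        for (int i = 0; i < INSTS; i++) {
            printf("inst %d:", i + 1);
            for (int c = 0; c < i; c++)
                printf("       ");              /* cycles before its fetch */
            for (int s = 0; s < STAGES; s++)
                printf("   %s   ", label[s]);   /* one stage per cycle */
            printf("\n");
        }
        return 0;
    }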

A few points:

1. The clock cycle is just long enough for everything that needs to get
done in each stage to get done during the cycle, so that the results can be
latched at the end of the cycle.  Latches are labeled L above.

2. At the start of each clock cycle, each stage has the state information
(BEN, IR[11], R) available in the latches it is sourcing.  Recall the
question about loading the BEN internal one-bit register during the
DECODE stage.

3. And each stage ends with its computed results loaded into the latches
on its right in the diagram above.

4. A good design would have all the stages take approximately the same time.
The cycle time is determined by the propagation delay through the slowest
stage.  Therefore, that amount of time is available to every stage, so the
circuitry in each stage may as well make use of it.

5. If we can keep fetching every cycle (no bubbles), the effective processing
rate is one instruction every clock cycle.  We could break each stage into
two stages, like so:


	|<---cycle--->|            |<-new->|<-new->|
	|           | |            |     | |     | |
	|           | |            |     | |     | |
	|  work     | |  --->      |1/2  | |1/2  | |
	|           | |            |wk   | |wk   | |
	|           | |            |     | |     | |

Then we still finish off the program one instruction at a time, but now each
cycle is half as long.  So the program executes twice as fast.
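
To put some numbers on points 4 and 5, here is a small C sketch (mine, not
part of any lab; the stage delays and instruction count are made up for
illustration).  The clock has to accommodate the slowest stage, and the last
of n instructions comes out after n + k - 1 cycles, so splitting every stage
in half roughly halves the running time of a long program:

    #include <stdio.h>

    /* Time for n instructions through a k-stage pipeline: the clock must
       fit the slowest stage, and the last instruction finishes after
       n + k - 1 cycles. */
    static double run_time(const double *delay, int k, long n)
    {
        double cycle = 0.0;
        for (int i = 0; i < k; i++)
            if (delay[i] > cycle)
                cycle = delay[i];
        return cycle * (double)(n + k - 1);
    }

    int main(void)
    {
        long n = 1000000;   /* long program, so fill time hardly matters */
        double five[5] = { 2.0, 2.0, 2.0, 2.0, 2.0 };   /* ns, made up */
        double ten[10] = { 1.0, 1.0, 1.0, 1.0, 1.0,     /* each stage */
                           1.0, 1.0, 1.0, 1.0, 1.0 };   /* split in half */

        printf("5 stages:  %.0f ns\n", run_time(five, 5, n));
        printf("10 stages: %.0f ns\n", run_time(ten, 10, n));
        return 0;
    }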

We do this to some extent.  The Pentium had a 5-stage pipeline, the Pentium
Pro had 12 stages, and the Pentium 4 had around 20.  I am personally aware
of Intel having suggested designs with more than 30 stages.  Perhaps more.
As we do this, we keep getting the program done faster.  Then why doesn't
everyone agree that more stages are better?  Answer:

a. We cannot keep the pipeline constantly full (e.g., conditional branches). 

b. Part of each cycle is wasted (set-up time, clock skew), so when we divide
a cycle into two half-cycles, the sum of the useful propagation delays
through the two half-cycles is less than the original full cycle.
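
Here is a back-of-the-envelope C model of both effects together (every
constant below is an assumption I picked for illustration, not a
measurement).  As the number of stages k grows, the cycle shrinks more
slowly than 1/k because the latch overhead does not divide, and a flushed
branch costs more cycles because more instructions are in flight behind it:

    #include <stdio.h>

    int main(void)
    {
        double logic    = 10.0;  /* useful propagation delay, ns (assumed) */
        double overhead = 0.2;   /* set-up time + skew per stage, ns (assumed) */
        double br_freq  = 0.2;   /* fraction of instructions that branch (assumed) */
        double flushed  = 0.5;   /* fraction of those that flush the pipe (assumed) */

        for (int k = 1; k <= 32; k *= 2) {
            /* Point b: the per-stage overhead is not divided by k. */
            double cycle = logic / k + overhead;
            /* Point a: say the branch resolves in the last stage, so a
               flush throws away the k - 1 instructions fetched behind it. */
            double cpi = 1.0 + br_freq * flushed * (k - 1);
            printf("k = %2d: cycle = %5.2f ns, CPI = %4.2f, time/inst = %5.2f ns\n",
                   k, cycle, cpi, cpi * cycle);
        }
        return 0;
    }

Run it and the time per instruction bottoms out well short of 32 stages;
exactly where depends on the overhead and the branch behavior, which is
precisely the design argument.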

Questions on anything above are welcome in class or discussion at any time.
As I said, we will see it again, AFTER we get through some fundamental
concepts of physical memory, virtual memory, cache memory, etc.

Good luck getting the second program done by Sunday night.

Yale Patt