Thu, 12 Feb 2009, 19:03
I wanted to give you all a more complete introduction to pipelining on
Wednesday, but per usual, time got away from me.  Given that you have a
lab due Sunday night, feel free to put this on the stack until the lab
is turned in.  We will certainly revisit pipelining again and again
within the context of performance enhancements such as branch prediction
and out-of-order execution.  But before I left the subject on Wednesday,
there are a few key ideas I wanted to be sure I got across, particularly
in the context of the earlier lectures on the microcoded LC-3b
implementation.

I think you all understand that a pipeline (or assembly line, as I
prefer it) consists of stages, with one piece of the data path and the
relevant control signals present in each stage.  The relevant picture
(for a five-stage pipeline) is as follows:

   |<--fetch--->|<--decode-->|<--third--->|<--fourth-->|<--fifth--->|
   |            |            |            |            |            |
   |            |            |            |            |            |
   |            |            |            |            |            |
                L            L            L            L            L

Processing goes on as follows:

           cycle1  cycle2  cycle3  cycle4  cycle5  cycle6  cycle7  cycle8
   inst 1:   f       d       3       4       5
   inst 2:           f       d       3       4       5
   inst 3:                   f       d       3       4       5
   inst 4:                           f       d       3       4       5
   inst 5:                                   f       d       3       4
   inst 6:                                           f       d       3
   inst 7:                                                   f       d

A few points:

1. The clock cycle is just long enough for everything that is necessary
   to get done in each stage to get done during the cycle, so the
   results can be latched at the end of the cycle.  Latches are labeled
   L above.

2. At the start of each clock cycle, each stage has the state
   information (BEN, IR[11], R) available in the latches it is sourcing.
   Recall the question about loading the BEN internal one-bit register
   during the DECODE stage.

3. And, each stage ends with the computed results loaded into the
   latches at the right on the diagram above.

4. A good design would have all the stages take approximately the same
   time.  The cycle time is determined by the propagation delays in the
   longest stage.  Therefore, that amount of time is available to every
   stage.
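If you want to play with the timing yourself, here is a small Python
sketch (mine, not from the lecture) that reproduces the processing
table above and the cycles-to-finish count: each instruction enters
fetch one cycle after the previous one and occupies one stage per
cycle, so with no bubbles, N instructions take depth + N - 1 cycles.

```python
STAGES = ["f", "d", "3", "4", "5"]  # five-stage pipeline, as in the diagram


def pipeline_table(num_insts, num_cycles):
    """Rows of stage labels: rows[i][c] is what instruction i is
    doing in cycle c ("" if it is not in the pipeline that cycle)."""
    rows = []
    for i in range(num_insts):
        row = [""] * num_cycles
        for s, label in enumerate(STAGES):
            c = i + s  # instruction i reaches stage s in cycle i + s
            if c < num_cycles:
                row[c] = label
        rows.append(row)
    return rows


def cycles_to_finish(num_insts, depth=len(STAGES)):
    """With no bubbles, N instructions take depth + N - 1 cycles."""
    return depth + num_insts - 1


if __name__ == "__main__":
    for i, row in enumerate(pipeline_table(7, 8), start=1):
        print(f"inst {i}: " + " ".join(x or "." for x in row))
    print("cycles for 100 instructions:", cycles_to_finish(100))  # 104
```

Running it prints the same staircase pattern as the table: instruction 1
finishes in cycle 5, and every instruction after that finishes one cycle
later, which is the "one instruction per cycle" rate of point 5.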
   So the circuitry may as well make use of it.

5. If we can keep fetching every cycle (no bubbles), the effective
   processing rate is one instruction every clock cycle.

We could break all the stages into two stages, like so:

   |<---cycle--->|          |<-new->|<-new->|
   |             |          |       |       |
   |             |          |       |       |
   |    work     |   --->   | 1/2   | 1/2   |
   |             |          |  wk   |  wk   |
   |             |          |       |       |

Then we finish off the program, still one instruction at a time, but
now the cycle time is half.  So the program executes twice as fast.

We do this to some extent.  The Pentium had a 5-stage pipeline, the
Pentium Pro had 12 stages, and the Pentium 4 had around 20.  Intel has
suggested designs with more than 30 stages that I am personally aware
of.  Perhaps more.  As we do this, we keep getting the program done
faster.

Then why doesn't everyone agree that more stages are better?  Answer:

a. We cannot keep the pipeline constantly full (e.g., conditional
   branches).

b. Part of each cycle is wasted (set-up time, skew), so when we divide
   a cycle into two half-cycles, the sum of the useful propagation
   delays through the two half-cycles is less than the original full
   cycle.

Questions on anything above are welcome in class or in discussion at
any time.  As I said, we will see it again, AFTER we get through some
fundamental concepts of physical memory, virtual memory, cache memory,
etc.

Good luck getting the second program done by Sunday night.

Yale Patt