Fri, 14 Feb, 2025
My students,
I thought I would attempt to help with understanding the questions on
pipelines.
First, the simple notion that I talked about in class: the assembly line
in an automobile assembly plant can complete a car every 2 minutes, even
though each car takes 2 hours to move from the beginning of the line to the
end, because I have 60 mechanics, each spending 2 minutes on each car as it
passes by, and all 60 mechanics working concurrently. It takes 2 hours to
fill the assembly line, but once that's done, every 2 minutes another car
completes its assembly.
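(In numbers: 60 mechanics x 2 minutes each = 120 minutes = 2 hours for any
one car to travel the whole line. The 2 hours is the time for one car; the
2 minutes is the rate at which finished cars come off the end once the line
is full.)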
Our microarchitecture pipeline works the same way. The instruction cycle
has stages. Suppose each stage takes one clock cycle. In cycle 1, we fetch
instruction 1; in cycle 2, we fetch instruction 2 and decode instruction 1;
in cycle 3, we fetch instruction 3, decode instruction 2, and execute
instruction 1. Between each pair of stages we have a set of flip-flops.
The work done in cycle n is loaded into those flip-flops at the end of
clock cycle n, and used as input by the next stage in cycle n+1. Because
instructions in a program are implicitly executed in order, the pipeline is
kept full and we get the desired result (our version of a car every 2
minutes: an instruction completing every cycle) as long as nothing bad
happens. However, unlike the cars, which are all independent, instructions
can depend on one another, so things can disrupt this. As long as nothing
does, we get great performance.
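If it helps to see the overlap as a little program, here is a toy sketch in
C (my own illustration; the 5-instruction program and the printing are made
up, and it is nothing like real hardware): in cycle n, instruction n is
fetched, instruction n-1 is decoded, and instruction n-2 is executed.

    /* Toy sketch of three overlapped stages: Fetch, Decode, Execute.
       It only illustrates the timing; it is not how hardware is built. */
    #include <stdio.h>

    int main(void) {
        int num_insts = 5;                 /* a pretend 5-instruction program */
        for (int cycle = 1; cycle <= num_insts + 2; cycle++) {
            int f = cycle;                 /* instruction in Fetch this cycle   */
            int d = cycle - 1;             /* instruction in Decode this cycle  */
            int e = cycle - 2;             /* instruction in Execute this cycle */
            printf("cycle %d:", cycle);
            if (f <= num_insts) printf("  Fetch %d", f);
            if (d >= 1 && d <= num_insts) printf("  Decode %d", d);
            if (e >= 1) printf("  Execute %d", e);
            printf("\n");
        }
        return 0;
    }

Run it and you will see that from cycle 3 on, an instruction completes every
cycle, just as a car comes off the line every 2 minutes once the line is
full.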
When bad things do happen, we need to be able to deal with them. We talked
about the flow dependency, such as in the following two-instruction sequence:
MUL R1,R2,R3
ADD R4,R5,R1
We saw in class that when the ADD wants to source R1 in cycle 3, the MUL has
not yet completed, which means the ADD would get an old, incorrect value. So
we added a bit (we called it a scoreboard bit) to make sure we do not use bad
data. When the MUL is decoded, the value in R1 is no longer the value future
instructions should see, so we set R1's scoreboard bit to 0. When the ADD
tries to source R1, it sees that the scoreboard bit is 0, so it waits. It
waits until the MUL finishes, the result of the MUL is loaded into R1, and
R1's scoreboard bit is set to 1. Since
instructions flow in-order, the ADD remains in the Decode stage as long as R1's
scoreboard bit is 0. That means the instruction after the ADD cannot move
to Decode because the ADD is still in Decode. So the instruction after the
ADD has to stay in Fetch. ...and the instruction after that cannot even enter
Fetch.
That is how computers used to work. Correctly, which is most important, but
slowly because too often the pipeline was clogged.
IN SHORT: IN ORDER FOR AN INSTRUCTION TO MOVE TO THE NEXT STAGE, THE INSTRUCTION
THAT WAS IN THAT STAGE HAS TO MOVE ON.
Otherwise, the pipeline stays clogged and performance suffers.
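To make the scoreboard idea concrete, here is a toy sketch in C (the register
contents, the cycle numbers, and the pretend 5-cycle MUL are all made up for
illustration): the ADD keeps checking R1's scoreboard bit and cannot leave
Decode until the MUL writes R1 and sets the bit back to 1.

    /* Toy sketch of the scoreboard bit, in the spirit of the MUL/ADD example.
       Register contents, cycle counts, and structure are invented here.      */
    #include <stdio.h>

    int reg_value[8];
    int reg_sb[8];                       /* scoreboard bit: 1 = value is good */

    int main(void) {
        for (int r = 0; r < 8; r++) { reg_value[r] = r; reg_sb[r] = 1; }

        /* Decode MUL R1,R2,R3: R1 will be overwritten, so instructions after
           the MUL must not use what is in R1 now -> clear R1's scoreboard bit. */
        reg_sb[1] = 0;
        int mul_cycles_left = 5;         /* pretend the MUL takes 5 more cycles */

        /* The ADD R4,R5,R1 reaches Decode and tries to source R1. */
        int cycle = 3;
        while (reg_sb[1] == 0) {         /* bit is 0, so the ADD stalls */
            printf("cycle %d: ADD stalls in Decode, R1 scoreboard bit is 0\n",
                   cycle);
            cycle++;
            if (--mul_cycles_left == 0) {    /* the MUL finally writes back */
                reg_value[1] = reg_value[2] * reg_value[3];
                reg_sb[1] = 1;               /* scoreboard bit back to 1 */
            }
        }
        printf("cycle %d: ADD sources R1 = %d and moves on\n", cycle,
               reg_value[1]);
        return 0;
    }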
What to do?
The answer was provided by Bob Tomasulo, who devised a mechanism to allow
instructions to execute out of order as long as no flow dependency was
violated.
He used his mechanism on the IBM System 360/91 in the Floating Point functional
unit. That is, he used it only for floating point instructions. My students
and I expanded it to allow it to be used for ALL instructions in the program
and the industry has done it our way.
The problem was unclogging the pipeline.
We will explain how we did that in the context of our example of a MUL
followed by an ADD.
Tomasulo unclogged the pipeline by adding another item (which he called a tag)
to each register, so registers now had three items: a value (valid if the
scoreboard bit was a 1), a scoreboard bit, and a tag (which identified the
result before that result was known, i.e., while the scoreboard bit was a 0).
Let's give the tag the name x. Instructions entered
the pipeline in the order prescribed by the program: inst 1 fetched in cycle 1,
decoded in cycle 2, ...inst n fetched in cycle n, decoded in cycle n+1, ...
After decode, instead of clogging the pipeline waiting for a scoreboard bit
to be 1, the MUL left the pipeline, sat idle for a while in a location called
a reservation station, eventually was executed in a functional unit, and
eventually produced a result.
Meanwhile, any later instruction (in our example, the ADD) entering the
pipeline that wanted to source the result of the MUL would instead record the
tag (in our example, x). The ADD's entry would show not the value (since it
had not been computed yet), but rather the tag associated with, in this case,
the MUL instruction it was waiting for. When the MUL finished execution, the
scoreboard bit would be set to 1, and the ADD instruction, waiting for the
result of the MUL, would match its tag x with the tag x of the MUL and source
the result of the MUL.
No constraints as to when the MUL would finish or when the ADD would start
execution, allowing instructions to execute out of the order specified by
the program, without having to deal with a clogged pipeline.
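If you want to see the bookkeeping as a little program, here is a toy sketch
in C (the structures, the names, the tag value, and the single "result shows
up and everyone waiting on tag x grabs it" step are all my simplification for
this note, not the actual 360/91 design): each register carries a value, a
scoreboard bit, and a tag; the ADD records tag x instead of stalling; when
the MUL's result appears with tag x, both R1 and the waiting ADD pick it up.

    /* Toy sketch of Tomasulo-style tags for the MUL/ADD example.
       Everything here is a simplification invented for illustration.        */
    #include <stdio.h>

    typedef struct {
        int value;      /* meaningful only when sb == 1                       */
        int sb;         /* scoreboard bit: 1 = value is current               */
        int tag;        /* names the pending result, used when sb == 0        */
    } Reg;

    typedef struct {
        int busy;       /* this reservation station entry holds an instruction */
        int src_ready;  /* 1 once the needed source value has arrived          */
        int src_tag;    /* the tag we are waiting on, when src_ready == 0      */
        int src_value;  /* the source value, once known                        */
    } RSEntry;

    int main(void) {
        Reg r1 = { .value = 0, .sb = 1, .tag = 0 };
        int x = 7;                       /* the tag given to the MUL's result  */

        /* Decode MUL R1,R2,R3: R1's value is no longer what later instructions
           should see, so clear the scoreboard bit and leave tag x behind.     */
        r1.sb = 0;
        r1.tag = x;

        /* Decode ADD R4,R5,R1: R1's scoreboard bit is 0, so instead of waiting
           in Decode, the ADD records tag x in its reservation station entry.  */
        RSEntry add = { .busy = 1, .src_ready = 0, .src_tag = r1.tag,
                        .src_value = 0 };
        printf("ADD waits on tag %d instead of clogging Decode\n", add.src_tag);

        /* The MUL finishes, whenever it finishes, and its result appears with
           tag x. Everyone waiting on tag x (R1 and the ADD) grabs the value.  */
        int mul_result = 600;            /* made-up result of the MUL          */
        if (r1.sb == 0 && r1.tag == x) { r1.value = mul_result; r1.sb = 1; }
        if (add.busy && !add.src_ready && add.src_tag == x) {
            add.src_value = mul_result;
            add.src_ready = 1;           /* the ADD can now execute            */
        }
        printf("R1 = %d (scoreboard bit %d); ADD source = %d, ready = %d\n",
               r1.value, r1.sb, add.src_value, add.src_ready);
        return 0;
    }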
Software believes that programs should execute in the order specified by the
program, so we still have more to do to make that happen. We will talk about
that on Monday.
Enjoy the weekend.
YNP