Fri, 14 Feb, 2025
My students,
I thought I would attempt to help with understanding the questions on
pipelines.
First, the simple notion that I talked about in class: the assembly line
in an automobile assembly plant can complete a car every 2 minutes, even
though each car takes 2 hours to move from the beginning of the line to the
end, because I have 60 mechanics, each spending 2 minutes on each car as it
passes by, and all 60 mechanics working concurrently. It takes 2 hours to
fill the assembly line, but once that's done, every 2 minutes another car
completes its assembly.
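(In numbers: 60 mechanics x 2 minutes each = 120 minutes = 2 hours for any
one car to travel the whole line. The 2 hours is the time for one car; the
2 minutes is the rate at which finished cars come off the end once the line
is full.)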
Our microarchitecture pipeline works the same way. The instruction cycle
has stages. Suppose each stage takes one clock cycle. In cycle 1, we fetch
instruction 1; in cycle 2, we fetch instruction 2 and decode instruction 1;
in cycle 3, we fetch instruction 3, decode instruction 2, and execute
instruction 1. Between each pair of stages we have a set of flip-flops.
The work done in cycle n is loaded into those flip-flops at the end of
clock cycle n, and used as input by the next stage in cycle n+1. Because
instructions in a program are implicitly executed in order, the pipeline is
kept full and we get the desired result (our version of a car every 2
minutes: an instruction completing every cycle) as long as nothing bad
happens. However, unlike the cars, which are all independent, instructions
can depend on one another, so things can disrupt this. As long as nothing
does, we get great performance.
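If it helps to see the overlap as a little program, here is a toy sketch in
C (my own illustration; the 5-instruction program and the printing are made
up, and it is nothing like real hardware): in cycle n, instruction n is
fetched, instruction n-1 is decoded, and instruction n-2 is executed.

    /* Toy sketch of three overlapped stages: Fetch, Decode, Execute.
       It only illustrates the timing; it is not how hardware is built. */
    #include <stdio.h>

    int main(void) {
        int num_insts = 5;                 /* a pretend 5-instruction program */
        for (int cycle = 1; cycle <= num_insts + 2; cycle++) {
            int f = cycle;                 /* instruction in Fetch this cycle   */
            int d = cycle - 1;             /* instruction in Decode this cycle  */
            int e = cycle - 2;             /* instruction in Execute this cycle */
            printf("cycle %d:", cycle);
            if (f <= num_insts) printf("  Fetch %d", f);
            if (d >= 1 && d <= num_insts) printf("  Decode %d", d);
            if (e >= 1) printf("  Execute %d", e);
            printf("\n");
        }
        return 0;
    }

Run it and you will see that from cycle 3 on, an instruction completes every
cycle, just as a car comes off the line every 2 minutes once the line is
full.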
When bad things do happen, we need to be able to deal with them. We talked
about the flow dependency, such as in the following two-instruction sequence:
MUL R1,R2,R3
ADD R4,R5,R1
We saw in class that when the ADD wants to source R1 in cycle 3, the MUL has
not yet completed, which means the ADD would get an old, incorrect value. So
we added a bit (we called it a scoreboard bit) to make sure we do not use bad
data. When the MUL is decoded, the value in R1 is no longer the value future
instructions should see, so we set R1's scoreboard bit to 0. When the ADD
tries to source R1, it sees that the scoreboard bit is 0, so it waits. It
waits until the MUL finishes, the result of the MUL is loaded into R1, and
R1's scoreboard bit is set to 1. Since
instructions flow in-order, the ADD remains in the Decode stage as long as R1's
scoreboard bit is 0. That means the instruction after the ADD cannot move
to Decode because the ADD is still in Decode. So the instruction after the
ADD has to stay in Fetch. ...and the instruction after that cannot even enter
Fetch.
That is how computers used to work. Correctly, which is most important, but
slowly because too often the pipeline was clogged.
IN SHORT: IN ORDER FOR AN INSTRUCTION TO MOVE TO THE NEXT STAGE, THE INSTRUCTION
THAT WAS IN THAT STAGE HAS TO MOVE ON.
Otherwise, the pipeline stays clogged and performance suffers.
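To make the scoreboard idea concrete, here is a toy sketch in C (the register
contents, the cycle numbers, and the pretend 5-cycle MUL are all made up for
illustration): the ADD keeps checking R1's scoreboard bit and cannot leave
Decode until the MUL writes R1 and sets the bit back to 1.

    /* Toy sketch of the scoreboard bit, in the spirit of the MUL/ADD example.
       Register contents, cycle counts, and structure are invented here.      */
    #include <stdio.h>

    int reg_value[8];
    int reg_sb[8];                       /* scoreboard bit: 1 = value is good */

    int main(void) {
        for (int r = 0; r < 8; r++) { reg_value[r] = r; reg_sb[r] = 1; }

        /* Decode MUL R1,R2,R3: R1 will be overwritten, so instructions after
           the MUL must not use what is in R1 now -> clear R1's scoreboard bit. */
        reg_sb[1] = 0;
        int mul_cycles_left = 5;         /* pretend the MUL takes 5 more cycles */

        /* The ADD R4,R5,R1 reaches Decode and tries to source R1. */
        int cycle = 3;
        while (reg_sb[1] == 0) {         /* bit is 0, so the ADD stalls */
            printf("cycle %d: ADD stalls in Decode, R1 scoreboard bit is 0\n",
                   cycle);
            cycle++;
            if (--mul_cycles_left == 0) {    /* the MUL finally writes back */
                reg_value[1] = reg_value[2] * reg_value[3];
                reg_sb[1] = 1;               /* scoreboard bit back to 1 */
            }
        }
        printf("cycle %d: ADD sources R1 = %d and moves on\n", cycle,
               reg_value[1]);
        return 0;
    }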
What to do?
The answer was provided by Bob Tomasulo, who devised a mechanism to allow
instructions to execute out of order as long as no flow dependency was
violated.
He used his mechanism on the IBM System 360/91 in the Floating Point functional
unit. That is, he used it only for floating point instructions. My students
and I expanded it to allow it to be used for ALL instructions in the program
and the industry has done it our way.
The problem was unclogging the pipeline.
We will explain how we did that in the context of our example of a MUL
followed by an ADD.
Tomasulo unclogged the pipeline by adding another item (which he called a tag)
to each register, so registers now had three items: a value (valid if the
scoreboard bit was a 1), a scoreboard bit, and a tag (which identified the
result before that result was known, i.e., while the scoreboard bit was a 0).
Let's give the tag the name x. Instructions entered
the pipeline in the order prescribed by the program: inst 1 fetched in cycle 1,
decoded in cycle 2, ...inst n fetched in cycle n, decoded in cycle n+1, ...
After decode, instead of clogging the pipeline waiting for a scoreboard bit
to be 1, the MUL left the pipeline, sat idle for a while in a location called
a reservation station, eventually was executed in a functional unit, and
eventually produced a result.
Meanwhile, any later instruction (in our example, the ADD) entering the
pipeline that wanted to source the result of the MUL would instead record the
tag (in our example, x). The ADD's entry would show not the value (since it
had not been computed yet), but rather the tag associated with, in this case,
the MUL instruction it was waiting for. When the MUL finished execution, the
scoreboard bit would be set to 1, and the ADD instruction, waiting for the
result of the MUL, would match its tag x with the tag x of the MUL and source
the result of the MUL.
No constraints as to when the MUL would finish or when the ADD would start
execution, allowing instructions to execute out of the order specified by
the program, without having to deal with a clogged pipeline.
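If you want to see the bookkeeping as a little program, here is a toy sketch
in C (the structures, the names, the tag value, and the single "result shows
up and everyone waiting on tag x grabs it" step are all my simplification for
this note, not the actual 360/91 design): each register carries a value, a
scoreboard bit, and a tag; the ADD records tag x instead of stalling; when
the MUL's result appears with tag x, both R1 and the waiting ADD pick it up.

    /* Toy sketch of Tomasulo-style tags for the MUL/ADD example.
       Everything here is a simplification invented for illustration.        */
    #include <stdio.h>

    typedef struct {
        int value;      /* meaningful only when sb == 1                       */
        int sb;         /* scoreboard bit: 1 = value is current               */
        int tag;        /* names the pending result, used when sb == 0        */
    } Reg;

    typedef struct {
        int busy;       /* this reservation station entry holds an instruction */
        int src_ready;  /* 1 once the needed source value has arrived          */
        int src_tag;    /* the tag we are waiting on, when src_ready == 0      */
        int src_value;  /* the source value, once known                        */
    } RSEntry;

    int main(void) {
        Reg r1 = { .value = 0, .sb = 1, .tag = 0 };
        int x = 7;                       /* the tag given to the MUL's result  */

        /* Decode MUL R1,R2,R3: R1's value is no longer what later instructions
           should see, so clear the scoreboard bit and leave tag x behind.     */
        r1.sb = 0;
        r1.tag = x;

        /* Decode ADD R4,R5,R1: R1's scoreboard bit is 0, so instead of waiting
           in Decode, the ADD records tag x in its reservation station entry.  */
        RSEntry add = { .busy = 1, .src_ready = 0, .src_tag = r1.tag,
                        .src_value = 0 };
        printf("ADD waits on tag %d instead of clogging Decode\n", add.src_tag);

        /* The MUL finishes, whenever it finishes, and its result appears with
           tag x. Everyone waiting on tag x (R1 and the ADD) grabs the value.  */
        int mul_result = 600;            /* made-up result of the MUL          */
        if (r1.sb == 0 && r1.tag == x) { r1.value = mul_result; r1.sb = 1; }
        if (add.busy && !add.src_ready && add.src_tag == x) {
            add.src_value = mul_result;
            add.src_ready = 1;           /* the ADD can now execute            */
        }
        printf("R1 = %d (scoreboard bit %d); ADD source = %d, ready = %d\n",
               r1.value, r1.sb, add.src_value, add.src_ready);
        return 0;
    }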
Software believes that programs should execute in the order specified by the
program, so we still have more to do to make that happen. We will talk about
that on Monday.
Enjoy the weekend.
YNP