Fri, 14 Feb, 2025
My students, I thought I would attempt to help with understanding the questions on pipelines.

First, the simple notion that I talked about in class: the assembly line in an automobile assembly plant can assemble a car every 2 minutes, even though the line takes 2 hours to move each car from beginning to end, because I have 60 mechanics, each spending 2 minutes on each car as it passes by, and all 60 mechanics working concurrently. It takes 2 hours to fill the assembly line, but once that is done, every two minutes another car completes its assembly.

Our microarchitecture pipeline works the same way. The instruction cycle has stages. Suppose each stage takes one clock cycle. In cycle 1, we fetch instruction 1; in cycle 2, we fetch instruction 2 and decode instruction 1; in cycle 3, we fetch instruction 3, decode instruction 2, and execute instruction 1. Between each pair of stages we have a set of flip-flops. The work done in cycle n is loaded into those flip-flops at the end of clock cycle n and used as input by the next stage in cycle n+1. Because instructions in a program are implicitly executed in order, the pipeline is kept full and we get the desired result (a car every two minutes) as long as nothing bad happens. However, unlike the cars, which are all independent, instructions are not: things can disrupt the flow. As long as nothing does, we get great performance; when bad things do happen, we need to be able to deal with them.

We talked about the flow dependency, such as the following two-instruction sequence:

    MUL R1,R2,R3
    ADD R4,R5,R1

We saw in class that when the ADD wants to source R1 in cycle 3, the MUL has not yet completed, so the ADD would get an old, incorrect value. So we added a bit (we called it a scoreboard bit) to make sure we do not use bad data. When the MUL is decoded, since the value in R1 is no longer correct for future instructions, we set R1's scoreboard bit to 0. When the ADD tries to source R1, it sees that the scoreboard bit is 0, so it waits. It waits until the MUL finishes, the result of the MUL is loaded into R1, and R1's scoreboard bit is set back to 1.

Since instructions flow in order, the ADD remains in the Decode stage as long as R1's scoreboard bit is 0. That means the instruction after the ADD cannot move to Decode because the ADD is still in Decode, so it has to stay in Fetch ...and the instruction after that cannot even enter Fetch. That is how computers used to work. Correctly, which is most important, but slowly, because too often the pipeline was clogged. IN SHORT: IN ORDER FOR AN INSTRUCTION TO MOVE TO THE NEXT STAGE, THE INSTRUCTION THAT WAS IN THAT STAGE HAS TO MOVE ON. Otherwise, the pipeline stays clogged and performance suffers.

What to do? The answer was provided by Bob Tomasulo, who devised a mechanism that allows instructions to execute out of order as long as no flow dependency is violated. He used his mechanism on the IBM System 360/91, in the floating point functional unit; that is, he used it only for floating point instructions. My students and I expanded it so it could be used for ALL instructions in the program, and the industry has done it our way. The problem was unclogging the pipeline. We will explain how we did that in the context of our example of a MUL followed by an ADD.
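Before getting to Tomasulo's fix, for those who find a few lines of code helpful, here is a rough sketch of the scoreboard bit in Python rather than in hardware. The class and function names are mine, chosen just for this note; the sketch only captures the check, stall, and writeback idea for our MUL/ADD pair, not the actual datapath.

    # Each register carries a value and a scoreboard (valid) bit.
    class Reg:
        def __init__(self, value=0):
            self.value = value
            self.valid = 1            # the scoreboard bit: 1 = value is correct

    regfile = {name: Reg() for name in ("R1", "R2", "R3", "R4", "R5")}

    def can_leave_decode(sources):
        # An instruction may leave Decode only if every source register's
        # scoreboard bit is 1; otherwise it sits in Decode and clogs the
        # stages behind it.
        return all(regfile[r].valid == 1 for r in sources)

    def decode_writer(dest):
        # MUL is decoded: the old value in R1 must not be used by later
        # instructions, so its scoreboard bit is set to 0.
        regfile[dest].valid = 0

    def writeback(dest, value):
        # MUL finishes: the result is loaded into R1 and the bit goes back to 1.
        regfile[dest].value = value
        regfile[dest].valid = 1

    # MUL R1,R2,R3 followed by ADD R4,R5,R1
    decode_writer("R1")                               # MUL decoded
    print(can_leave_decode(["R5", "R1"]))             # False: ADD stalls in Decode
    writeback("R1", regfile["R2"].value * regfile["R3"].value)
    print(can_leave_decode(["R5", "R1"]))             # True: ADD can now source R1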
Tomasulo unclogged the pipeline by adding another item (which he called a tag) to each register, so registers now had three items: a value (correct if the scoreboard bit was 1), a scoreboard bit, and a tag, which identified the result before that result was known (i.e., while the scoreboard bit was 0). Let's give the tag the name x.

Instructions entered the pipeline in the order prescribed by the program: inst 1 fetched in cycle 1, decoded in cycle 2, ... inst n fetched in cycle n, decoded in cycle n+1, ... After decode, instead of clogging the pipeline waiting for a scoreboard bit to be 1, the MUL left the pipeline, sat idle for a while in a location called a reservation station, was eventually executed in a functional unit, and eventually produced a result. Meanwhile, any later instruction (in our example, the ADD) entering the pipeline that wanted to source the result of the MUL would instead record the tag (in our example, x). Its entry would show not the value (since it had not been computed yet), but rather the tag associated with the instruction it was waiting for, in this case the MUL. When the MUL finished execution, the scoreboard bit would be set back to 1, and the ADD, waiting for the result of the MUL, would match its tag x against the tag x of the MUL and source the result of the MUL. There are no constraints on when the MUL finishes or when the ADD starts execution, which allows instructions to execute out of the order specified by the program, without having to deal with a clogged pipeline.

Software believes that programs should execute in the order specified by the program, so we still have more to do to make that happen. We will talk about that on Monday.

Enjoy the weekend.

YNP
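P.S. For those who want to see the tag mechanism written down the same way, here is a rough sketch in the same spirit as the one above (it stands on its own). The tag name x comes from our example; the class names, the waiting list, and the broadcast function are only my illustration of the idea, not the exact structures Tomasulo built into the 360/91.

    # Each register now carries a value, a scoreboard bit, and a tag.
    class Reg:
        def __init__(self, value=0):
            self.value = value
            self.valid = 1
            self.tag = None           # names the pending result when valid == 0

    class Operand:
        # A source operand copied into a reservation station at decode:
        # either the value (if the register was valid) or the tag to wait for.
        def __init__(self, reg):
            if reg.valid:
                self.value, self.tag = reg.value, None
            else:
                self.value, self.tag = None, reg.tag

    regfile = {name: Reg() for name in ("R1", "R2", "R3", "R4", "R5")}
    waiting = []                      # operands sitting in reservation stations

    def decode_writer(dest, tag):
        # MUL is decoded: R1 now names a result that has not been computed yet,
        # so clear the scoreboard bit and remember the tag.
        regfile[dest].valid = 0
        regfile[dest].tag = tag

    def broadcast(tag, value):
        # MUL finishes: the (tag, value) pair is broadcast. Every waiting
        # operand whose tag matches grabs the value, and so does any register
        # still waiting on that tag (its scoreboard bit goes back to 1).
        for op in waiting:
            if op.tag == tag:
                op.value, op.tag = value, None
        for r in regfile.values():
            if r.valid == 0 and r.tag == tag:
                r.value, r.valid, r.tag = value, 1, None

    # MUL R1,R2,R3 is decoded and assigned tag "x"
    decode_writer("R1", "x")
    # ADD R4,R5,R1 is decoded: it records the tag x instead of clogging Decode
    add_src = Operand(regfile["R1"])
    waiting.append(add_src)
    # Later (no constraint on when), the MUL broadcasts its result under tag x
    broadcast("x", regfile["R2"].value * regfile["R3"].value)
    print(add_src.value, regfile["R1"].valid)   # ADD has its operand; R1 is valid again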