Department of Electrical and Computer Engineering

The University of Texas at Austin

EE 360N, Spring 2009
Problem Set 5 Solution
Yale N. Patt, Instructor
Ramapriyan Chakravarthy, Khubaib, Vivekanand Venugopal, TAs
  1. Assume an ADD operation is executed like this in the pipeline:

    1234567
    FDAAAAS

    and a MUL operation is executed like this in the pipeline:

    123456789
    FDMMMMMMS

    F: Fetch, D: Decode, A: Execute stage (for ADD), M: Execute stage for MUL, S: Store result (Write-back)

    1. ADDs require 7 cycles (fetch, decode, 4 execute, store), and MULs require 9 cycles (fetch, decode, 6 execute, store). For 3 ADD instructions and 3 MUL instructions, the execution time is 3 × 7 + 3 × 9 = 48 cycles.
    2. Pipeline with scoreboarding and five adders and five multipliers (assuming one instruction fetched per cycle):
      1. No data forwarding: the destination register is marked valid in the S stage (a dependent instruction starts executing after the S stage of the instruction it depends on)
        1234567891011121314151617181920212223242526
        FDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDAAAAS
        FDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDMMMMMMS

        Execution time: 26 cycles

      2. With forwarding:
        123456789101112131415161718192021222324
        FDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDAAAAS
        FDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDMMMMMMS

        Execution time: 24 cycles

    3. Pipeline with scoreboarding and one adder and one multiplier (assuming one instruction fetched per cycle):

      1. The adder and multiplier are not pipelined and there is no data forwarding:

        1234567891011121314151617181920212223242526272829
        FDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDDDDAAAAS
        FFFFDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDMMMMMMS

        Execution time: 29 cycles

      2. The adder and multiplier are not pipelined and there is data forwarding:

        123456789101112131415161718192021222324252627
        FDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDDDDAAAAS
        FFFDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDMMMMMMS

        Execution time: 27 cycles

      3. The adder and multiplier are pipelined and there is no data forwarding:

        1234567891011121314151617181920212223242526
        FDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDAAAAS
        FDMMMMMMS
        FDDDDDDDAAAAS
        FFFFFFFDMMMMMMS

        Execution time: 26 cycles

      4. The adder and multiplier are pipelined and there is data forwarding:

        123456789101112131415161718192021222324
        FDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDAAAAS
        FDMMMMMMS
        FDDDDDDAAAAS
        FFFFFFDMMMMMMS

        Execution time: 24 cycles

  2. Assumptions:
    1. Pipeline diagram:

      Instruction123456789101112131415
      Start of first iteration (R1 is even)
      STWFDEEES
      ADDFDEEES
      ANDFDEES
      BRzFDDES
      ADDFDEEES
      ADDFDEEES
      BRpFDDDES
      End of the first iteration (R1 is odd now)
      STWF

      The loop takes the same number of cycles to execute for even and odd values of R1. Each iteration takes 14 cycles in the steady state. There are 5 iterations for even values of R1 and 4 iterations for odd values of R1. The total number of cycles is:

      (14 × 5) + (14 × 4) + 1 = 127

      The extra 1 cycle comes from the last iteration (Store result stage of the BRp instruction).

    2. Pipeline diagram:

      Instruction12345678910111213141516
      Start of first iteration (R1 is even)
      STWFDEEES
      ADDFDEEES
      ANDFDEES
      BRzFDDES
      ADDFFDEEES
      ADDFDEEES
      BRpFDDDES
      End of the first iteration (R1 is odd now)
      STWFFFDEEES

      The loop takes the same number of cycles to execute for even and odd values of R1. Each iteration takes 13 cycles but 3 cycles can be overlapped with the next iteration. The total number of cycles is:

      (10 × 9) + 3 = 93
    3. BRz instruction will always be predicted not taken. It is taken when R1 is even. So it will be mispredicted when R1 is even and correctly predicted when R1 is odd. The following diagram shows three consecutive iterations of the loop. In the first iteration, BRz is mispredicted, in the second iteration it is correctly predicted.

      The first BRp instruction is always predicted taken. It is always predicted correctly. The second BRp instruction is also always predicted taken. It is mispredicted only once in the last iteration of the loop.

              1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33    
      -------------------------------------------------------------------------------------------------------------- Start of first iteration (R1 is even)
      STW     F | D | E | E | E | S
      ADD         F | D | E | E | E | S
      AND             F | D | E | E | S
      BRz                 F | D | D | E | S  (Mispredicted)
      ADD R1, R1, #3          F | F | D | Flushed
      ADD R5, R5, #-1                 F | Flushed
      ADD R1, R1, #1                      F | D | E | E | E | S
      ADD R7, R7, #-1                         F | D | E | E | E | S
      BRp DOIT                                    F | D | D | D | E | S (Correctly predicted)
      ------------------------------------------------------------------------------------------------------------- Start of second iteration (R1 is odd)
      STW                                             F | F | F | D | E | E | E | S
      ADD                                                         F | D | E | E | E | S
      AND                                                             F | D | E | E | S
      BRz                                                                 F | D | D | E | S (Correctly predicted)
      ADD R1, R1, #3                                                          F | F | D | E | E | E | S
      ADD R5, R5, #-1                                                                 F | D | E | E | E | S
      BRp DOIT                                                                            F | D | D | D | E | S (Correctly predicted)
      ------------------------------------------------------------------------------------------------------------ Third iteration (R1 is even)
      STW                                                                                     F | F | F | D | E | E | S 
      ADD                                                                                                 F | D | E | E | E | S
      AND                                                                                                     F | D | E | E | S
      BRz                                                                                                         F | D | D | E | S  (Mispredicted)
      ADD R1, R1, #3                                                                                                  F | F | D | Flushed
      ADD R5, R5, #-1                                                                                                         F | Flushed
      ADD R1, R1, #1                                                                                                              F | D | E | E | E | S
      ADD R7, R7, #-1                                                                                                                 F | D | E | E | E | S
      BRp DOIT                                                                                                                            F | D | D | D | E | S (Correctly predicted)
              <--------- beginning of the loop ----><------------------------ steady state (22 cycles) ------------------------------------>
      

      Loop steady state is shown above. It takes 22 cycles and it is repeated 4 times. The beginning of the loop (until the steady state) takes 10 cycles as shown above. The end of the loop (part of the last iteration which is not in steady state) takes 5 more cycles to execute. The total number of cycles is:

      10 + (4 × 22) + 5 = 103

      Prediction accuracies for each branch are:

      • BRz = 4/9 (Correctly predicted when R1 is odd, R1 is odd for 4 iterations of the loop)
      • first BRp = 4/4
      • second BRp = 4/5 (Mispredicted only in the last iteration - Note that this misprediction does not affect the number of cycles it takes to execute the loop)

      Combined branch prediction accuracy = 12/18 = 67%


    1. Filled out instruction sequence

    2. Filled out reservation stations

    3. Filled out register alias table
  3. Assume the destination register is indicated first, followed by the source(s).

    (As stated in the question, cycle counts assume a 16-way interleaved memory so that a new access can be started each cycle. Also, the adder and multiplier are pipelined.)

    1. Scalar processor (one of the possible solutions)

              MOVI   R0, 0            (1 cycle)
              LEA    R4, A            (1 cycle)
              LEA    R5, B            (1 cycle)
              LEA    R6, C            (1 cycle)
              LEA    R7, D            (1 cycle)
        LOOP  LD     R1, R5, R0       (11 cycles)
              LD     R2, R6, R0       (11 cycles)
              LD     R3, R7, R0       (11 cycles)
              MUL    R1, R1, R2       (6 cycles)
              ADD    R1, R1, R3       (4 cycles)
              RSHFA  R1, R1, 1        (1 cycle)
              ST     R1, R4, R0       (11 cycles)
              ADD    R0, R0, 1        (4 cycles)
              ADD    R1, R0, -100     (4 cycles)
              BNZ    LOOP             (1 cycle)
      
      5 × 1 + 100 × (11 + 11 + 11 + 6 + 4 + 1 + 11 + 4 + 4 + 1) = 6405 cycles
    2. Vector Processor:

      The loop could be split into two parts as 64 and 36. Assume the vector code looks as follows: (This solution assumes that the addressing mode VLD V0, B+64 exists - if it doesn't, then you would need 4 + 3 cycles using a pipelined adder to add 64 to A,B,C,D)

              LD     Vln, #64           (1 cycle)
              LD     Vst, #1            (1 cycle)
              VLD    V0, B              (11 + 63 cycles)
              VLD    V1, C              (11 + 63 cycles)
              Vmul   V2, V0, V1         (6 + 63 cycles)
              VLD    V3, D              (11 + 63 cycles)
              Vadd   V4, V2, V3         (4 + 63 cycles)
              Vrshfa V5, V4, 1          (1 + 63 cycles)
              VST    V5, A              (11 + 63 cycles)
              LD     Vln, #36           (1 cycle)
              VLD    V0, B+64           (11 + 35 cycles)
              VLD    V1, C+64           (11 + 35 cycles)
              Vmul   V2, V0, V1         (6 + 35 cycles)
              VLD    V3, D+64           (11 + 35 cycles)
              Vadd   V4, V2, V3         (4 + 35 cycles)
              Vrshfa V5, V4, 1          (1 + 35 cycles)
              VST    V5, A+64           (11 + 35 cycles)
      
      1. Vector processor without chaining (vector instructions done serially)

        First part:

        |-1-|-1-|--11--|---63---|--11--|---63---|-6-|---63---|--11--|---63---|-4-|---63---|-1-|---63---|--11--|---63---|
        

        Second part:

        |-1-|--11--|--35--|--11--|--35--|-6-|--35--|--11--|--35--|-4-|--35--|-1-|--35--|--11--|--35--|
        
        2 + (74 × 4 + 69 + 67 + 64) + 1 + (46 × 4 + 41 + 39 + 36) = 799 cycles
      2. Vector processor with chaining, 1 port to memory.

        Chaining means the machine begins the next operation as soon as the operands are ready.

        First part:

         |-1-|-1-|--11--|---63---|--11--|---63---|--11--|---63---|--11--|---63---|
                                        |-6-|---63---|  |-4-|---63---|
                                                            |-1-|---63---|
        

        Second part:

         |-1-|--11--|--35--|--11--|--35--|--11--|--35--|--11--|--35--|
                                  |-6-|--35--|  |-4-|--35--|
                                                    |-1-|--35--|
        

        Chaining, in this instance, hides the VMULT, VADD, and VSHL operations. Memory becomes the primary bottleneck.

        483 cycles
      3. Vector processor with chaining; 2 loads, 1 store per cycle.

        First part:

         |-1-|-1-|--11--|---63---|
                  |--11--|---63---|
                         |-6-|---63---|
                                 |--11--|---63---|
                                        |-4-|---63---|
                                            |-1-|---63---|
                                                |--11--|---63---|
        
        #cycles = 1 + 1 + 11 + 63 + 11 + 4 + 1 + 11 + 63 = 166

        Second part:

         |-1-|--11--|--35--|
              |--11--|--35--|
                     |-6-|--35--|
                           |--11--|--35--|
                                  |-4-|--35--|
                                      |-1-|--35--|
                                          |--11--|--35--|
        
        #cycles = 1 + 11 + 35 + 11 + 4 + 1 + 11 + 35 = 109
        Total cycles = 166 + 109 = 275 cycles

      Another solution is to split the loop into two equal parts as 50 and 50.

              LD     Vln, #50           (1 cycle)
              LD     Vst, #1            (1 cycle)
              VLD    V0, B              (11 + 49 cycles)
              VLD    V1, C              (11 + 49 cycles)
              Vmul   V2, V0, V1         (6 + 49 cycles)
              VLD    V3, D              (11 + 49 cycles)
              Vadd   V4, V2, V3         (4 + 49 cycles)
              Vrshfa V5, V4, 1          (1 + 49 cycles)
              VST    V5, A              (11 + 49 cycles)
              VLD    V0, B+50           (11 + 49 cycles)
              VLD    V1, C+50           (11 + 49 cycles)
              Vmul   V2, V0, V1         (6 + 49 cycles)
              VLD    V3, D+50           (11 + 49 cycles)
              Vadd   V4, V2, V3         (4 + 49 cycles)
              Vrshfa V5, V4, 1          (1 + 49 cycles)
              VST    V5, A+50           (11 + 49 cycles)
      
      1. Vector processor without chaining (vector instructions done serially)

        First part:

         |-1-|-1-|--11--|---63---|--11--|---63---|-6-|---63---|--11--|---63---|-4-|---63---|-1-|---63---|--11--|---63---|
        

        Second part:

         |--11--|--35--|--11--|--35--|-6-|--35--|--11--|--35--|-4-|--35--|-1-|--35--|--11--|--35--|
        
        2 + (74 × 4 + 69 + 67 + 64) + (46 × 4 + 41 + 39 + 36) = 798 cycles
      2. Vector processor with chaining, 1 port to memory. Again, this would save only 1 cycle over solution A, since the first load for the second part must wait till the store of the first part finishes. Total = 482 cycles

      3. Vector processor with chaining; 2 loads, 1 store per cycle

         |-1-|-1-|-11-|-----49-----|
                  |-11-|-----49-----|
                       |-6-|-----49-----|
                                   |-11-|-----49-----|   ** LD D would need to wait for LD B to finish
                                        |-4-|-----49-----|
                                            |-1-|-----49-----|
                                                |-11-|-----49-----|
                                                 |-11-|-----49-----|
                                                     |-11-|-----49-----|  ** LD C+50 would need to wait till the LD D from the first part finishes
                                                          |-6-|-----49-----|
                                                                   |-11-|-----49-----|  ** LD D+50 would need to wait for LD B+50 to finish 
                                                                        |-4-|-----49-----|
                                                                            |-1-|-----49-----|
        
                                                                                |-11-|-----49-----|
        
        1 + 1 + (11 + 49) + 11 + 4 + 1 + 1 + (11 + 49) + 11 + 4 + 1 + (11 + 49) = 215 cycles
    1. X is the Vector Index Register, which holds the pointer to the element that is being processed by the vector instruction.

    2. The two control signals are RESET.VINDEX and INCREMENT.VINDEX


    3. Logic diagram of the grey box
    4. The following solution assumes a change to the microsequencer to branch according to the Z condition code.

      Modified state diagram
  4. The state diagram for the FSM device controller is shown below. Note that in this figure, the transition from BGout to IDLE is based on the SACK signal. This is to illustrate the first race condition which is corrected by basing the transition on (NOT BGin).

    The device controller state diagram

    The input and output signals of the controller are:

    SignalTypeFunction
    DEVInput Asserted when the device needs to initiate a bus transaction
    BGinInput Incoming bus grant signal, asserted by the priority arbitration unit
    BBSYinInput Asserted by the current bus master. Negative edge indicates the end of a bus cycle.
    MSYNInput Master-side handshaking signal that controls the bus transaction between the bus master and the slave
    SSYNInput Slave-side handshaking signal that controls the bus transaction between the bus master and the slave
    BRoutOutput Asserted to request the bus. Goes to the priority arbitration unit.
    SACKOutput Asserted by the device that has won the arbitration
    BBSYoutOutput Same as BBSYin
    BGoutOutput Asserted when the controller needs to pass the bus grant signal down the daisy chain.

    Two race conditions are:

    1. Let's say device controller D1 is in BGout state. This means that some device D2 that is down the same daisy chain as D1 had requested and is granted the bus. Let's say the device of D1 asserts the DEV signal while D1 is in BGout state. D3 will eventually receive the BGin signal and transition to the SACK state. It will take some time for the SACK signal to travel to the priority arbitration unit. The SACK signal probably reaches D1 before it reaches the priority arbitration unit. Hence, when the SACK signal reaches D1 the BGin input of D1 is still being asserted. Therefore, upon receiving the SACK signal D1 will immediately transition to IDLE to BRout to SACK states. Hence, both D1 and D3 will be asserting the SACK signal which is not desirable. A simple solution that fixes this race condition is not transitioning to IDLE state if the BGin signal is still high.

    2. This race condition is a little bit more subtle. From the PAU side the PAU asserts BGj, which works its way down the daisy chain to all devices at BRj. Before SACK is asserted, a controller asserts BRk, where k is higher priority than j. PAU now asserts BGk, and you have two BG signals propagating, which will result in two controllers thinking they are the next bus master.

      Solution: PAU latches BR signals when it sees NOT-SACK, indicating it is okay to grant the bus again. NOT-SACK is also gated (after sufficient delay) with the BG signals, guaranteeing that a BG signal cannot be asserted until after PAU logic has taken effect. Any subsequent BR signal does not get latched and so can not affect the PAU logic.

    1. The solution is shown below:

      The bus logic diagram
    2. It is possible. Consider a case where device 3 and 4 just alternate the bus.

    3. The solution is shown below.

      The device controller state machine