Department of Electrical and Computer Engineering

The University of Texas at Austin

EE 460N, Fall 2014
Problem Set 3 Solutions
Yale N. Patt, Instructor
Stephen Pruett, Emily Bragg, Siavash Zangeneh, TAs

Problem Set 3 Solutions

  1. Assume the destination register is indicated first, followed by the source(s).

    (As stated in the question, cycle counts assume a 16-way interleaved memory so that a new access can be started each cycle. Also, the adder and multiplier are pipelined.)

    1. Scalar processor (one of the possible solutions)

                MOVI   R0, 0            (1 cycle)
                LEA    R4, A            (1 cycle)
                LEA    R5, B            (1 cycle)
                LEA    R6, C            (1 cycle)
                LEA    R7, D            (1 cycle)
          LOOP  LD     R1, R5, R0       (11 cycles)
                LD     R2, R6, R0       (11 cycles)
                LD     R3, R7, R0       (11 cycles)
                MUL    R1, R1, R2       (6 cycles)
                ADD    R1, R1, R3       (4 cycles)
                RSHFA  R1, R1, 1        (1 cycle)
                ST     R1, R4, R0       (11 cycles)
                ADD    R0, R0, 1        (4 cycles)
                ADD    R1, R0, -100     (4 cycles)
                BNZ    LOOP             (1 cycle)
      
      5 × 1 + 100 × (11 + 11 + 11 + 6 + 4 + 1 + 11 + 4 + 4 + 1) = 6405 cycles
    2. Vector Processor:

      The loop could be split into two parts as 64 and 36. Assume the vector code looks as follows: (This solution assumes that the addressing mode VLD V0, B+64 exists - if it doesn't, then you would need 4 + 3 cycles using a pipelined adder to add 64 to A,B,C,D)

              LD     Vln, #64           (1 cycle)
              LD     Vst, #1            (1 cycle)
              VLD    V0, B              (11 + 63 cycles)
              VLD    V1, C              (11 + 63 cycles)
              Vmul   V2, V0, V1         (6 + 63 cycles)
              VLD    V3, D              (11 + 63 cycles)
              Vadd   V4, V2, V3         (4 + 63 cycles)
              Vrshfa V5, V4, 1          (1 + 63 cycles)
              VST    V5, A              (11 + 63 cycles)
              LD     Vln, #36           (1 cycle)
              VLD    V0, B+64           (11 + 35 cycles)
              VLD    V1, C+64           (11 + 35 cycles)
              Vmul   V2, V0, V1         (6 + 35 cycles)
              VLD    V3, D+64           (11 + 35 cycles)
              Vadd   V4, V2, V3         (4 + 35 cycles)
              Vrshfa V5, V4, 1          (1 + 35 cycles)
              VST    V5, A+64           (11 + 35 cycles)
      
      1. Vector processor without chaining (vector instructions done serially)

        First part:

        |-1-|-1-|--11--|---63---|--11--|---63---|-6-|---63---|--11--|---63---|-4-|---63---|-1-|---63---|--11--|---63---|
        

        Second part:

        |-1-|--11--|--35--|--11--|--35--|-6-|--35--|--11--|--35--|-4-|--35--|-1-|--35--|--11--|--35--|
        
        2 + (74 × 4 + 69 + 67 + 64) + 1 + (46 × 4 + 41 + 39 + 36) = 799 cycles
      2. Vector processor with chaining, 1 port to memory.

        Chaining means the machine begins the next operation as soon as the operands are ready.

         |-1-|-1-|--11--|---63---|--11--|---63---|--11--|---63---|--11--|---63---|--11--|--35--|--11--|--35--|--11--|--35--|--11--|--35--|
                                        |-6-|---63---|  |-4-|---63---|                                |-6-|--35--|  |-4-|--35--|
                                                            |-1-|---63---|                                              |-1-|--35--|
                                                                   |-1-|
        

        Chaining, in this instance, hides the VMULT, VADD, and VSHL operations. Memory becomes the primary bottleneck.

        482 cycles
      3. Vector processor with chaining; 2 loads, 1 store per cycle.

         |-1-|-1-|--11--|---63---|       VLD V0, B
                  |--11--|---63---|       VLD V1, C
                         |-6-|---63---|
                                 |--11--|---------------63-----------------|     VLD V3, D  Needs to wait until VLD B is finished
                                        |-4-|---63---|
                                            |-1-|---63---|
                                                |--11--|----------------63-----------------|      VST V5,A  
                                                  |-1-|--11--|---35---|         VLD V0, B+64 (this finishes sooner than VLD D)
                                                                      |--11--|----35----|           VLD V1, C+64 Needs to wait for VLF B+64 
                                                                             |-6-|----35----|
                                                                                 |--11--|--35--|
                                                                                        |-4-|--35--|
                                                                                            |-1-|--35--|
                                                                                                |--11--|--35--|
        
        #cycles = 1 + 1 + (11 + 63) + 11 + 4 + 1 + 1 + 1 + (11 + 35) + 11 + 6 + 11 + 4 + 1 + (11 + 35) = 219 cycles

      Another solution is to split the loop into two equal parts as 50 and 50.

              LD     Vln, #50           (1 cycle)
              LD     Vst, #1            (1 cycle)
              VLD    V0, B              (11 + 49 cycles)
              VLD    V1, C              (11 + 49 cycles)
              Vmul   V2, V0, V1         (6 + 49 cycles)
              VLD    V3, D              (11 + 49 cycles)
              Vadd   V4, V2, V3         (4 + 49 cycles)
              Vrshfa V5, V4, 1          (1 + 49 cycles)
              VST    V5, A              (11 + 49 cycles)
              VLD    V0, B+50           (11 + 49 cycles)
              VLD    V1, C+50           (11 + 49 cycles)
              Vmul   V2, V0, V1         (6 + 49 cycles)
              VLD    V3, D+50           (11 + 49 cycles)
              Vadd   V4, V2, V3         (4 + 49 cycles)
              Vrshfa V5, V4, 1          (1 + 49 cycles)
              VST    V5, A+50           (11 + 49 cycles)
      
      1. Vector processor without chaining (vector instructions done serially)

        First part:

         |-1-|-1-|--11--|---63---|--11--|---63---|-6-|---63---|--11--|---63---|-4-|---63---|-1-|---63---|--11--|---63---|
        

        Second part:

         |--11--|--35--|--11--|--35--|-6-|--35--|--11--|--35--|-4-|--35--|-1-|--35--|--11--|--35--|
        
        2 + (74 × 4 + 69 + 67 + 64) + (46 × 4 + 41 + 39 + 36) = 798 cycles
      2. Vector processor with chaining, 1 port to memory. This part for solution B will be the same as solution B. Total = 482 cycles

      3. Vector processor with chaining; 2 loads, 1 store per cycle

         |-1-|-1-|-11-|-----49-----|
                  |-11-|-----49-----|
                       |-6-|-----49-----|
                                   |-11-|-----49-----|   ** LD D would need to wait for LD B to finish
                                        |-4-|-----49-----|
                                            |-1-|-----49-----|
                                                |-11-|-----49-----|
                                                 |-11-|-----49-----|
                                                     |-11-|-----49-----|  ** LD C+50 would need to wait till the LD D from the first part finishes
                                                          |-6-|-----49-----|
                                                                   |-11-|-----49-----|  ** LD D+50 would need to wait for LD B+50 to finish 
                                                                        |-4-|-----49-----|
                                                                            |-1-|-----49-----|
        
                                                                                |-11-|-----49-----|
        
        1 + 1 + (11 + 49) + 11 + 4 + 1 + 1 + (11 + 49) + 11 + 4 + 1 + (11 + 49) = 215 cycles
    1. X is the Vector Index Register, which holds the pointer to the element that is being processed by the vector instruction.

    2. The two control signals are RESET.VINDEX and INCREMENT.VINDEX


    3. Logic diagram of the grey box
    4. The following solution assumes a change to the microsequencer to branch according to the Z condition code.

      Modified state diagram
  2. Postponed until HW 4
  3. In this problem, we assume that both of the rotators are right rotators. If the rotator for read is a right rotator and the rotator for write is a left rotator, PA[1:0] can be used as the control for both rotators.

    PA[1:0]SIZERD/WR1st/2nd LD.MDR[3:0]ROT[1:0]WE[3:0]
    00BRDX XXX1000000
    00BWRX XXX0000001
    00HRDX XX11000000
    00HWRX XX00000011
    00WRDX 1111000000
    00WWRX 0000001111
    01BRDX XXX1010000
    01BWRX XXX0110010
    01HRDX XX11010000
    01HWRX XX00110110
    01WRD1st X111010000
    01WRD2nd 1000010000
    01WWR1st 0000111110
    01WWR2nd 0000110001
    10BRDX XXX1100000
    10BWRX XXX0100100
    10HRDX XX11100000
    10HWRX XX00101100
    10WRD1st XX11100000
    10WRD2nd 1100100000
    10WWR1st 0000101100
    10WWR2nd 0000100011
    11BRDX XXX1110000
    11BWRX XXX0011000
    11HRD1st XXX1110000
    11HRD2nd XX10110000
    11HWR1st XX00011000
    11HWR2nd XX00010001
    11WRD1st XXX1110000
    11WRD2nd 1110110000
    11WWR1st 0000011000
    11WWR2nd 0000010111
    Legend
    B(yte)00
    H(alf word)01
    W(ord)10
    RD(read)0
    WR(write)1
    1st0
    2nd1
  4. Interleave into 64 banks in order to hide the latency in sequential accesses (note, minimum needed is 37 banks but one would really prefer to use a power of 2, therefore 64).