You are encouraged to work on the problem set in groups and turn in one problem set for the entire group. Remember to put all your names on the solution sheet. Also remember to put the name of the TA in whose discussion section you would like the problem set returned to you.
Consider the following piece of code:
for (i = 0; i < 100; i++)
    A[i] = ((B[i] * C[i]) + D[i]) / 2;
Translate this code into assembly language using the following instructions in the ISA (note the number of cycles each instruction takes is shown with each instruction):
Opcode | Operands | Number of Cycles | Description |
---|---|---|---|
LEA | Ri, X | 1 | Ri ← address of X |
LD | Ri, Rj, Rk | 11 | Ri ← MEM[Rj + Rk] |
ST | Ri, Rj, Rk | 11 | MEM[Rj + Rk] ← Ri |
MOVI | Ri, Imm | 1 | Ri ← Imm |
MUL | Ri, Rj, Rk | 6 | Ri ← Rj × Rk |
ADD | Ri, Rj, Rk | 4 | Ri ← Rj + Rk |
ADD | Ri, Rj, Imm | 4 | Ri ← Rj + Imm |
RSHFA | Ri, Rj, amount | 1 | Ri ← RSHFA (Rj, amount) |
BRcc | X | 1 | Branch to X based on condition codes |
Assume it takes one memory location to store each element of the array. Also assume that there are 8 registers (R0-R7).
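The ISA above has no divide instruction, so the division by 2 would presumably be done with RSHFA. The following C sketch compares the loop as written with a shift-based form. The function names `kernel_div` and `kernel_shift` are illustrative only, and the shift form assumes every intermediate sum is non-negative, because division and right shift round differently for negative odd values.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form of the loop, exactly as written in the problem. */
void kernel_div(int *A, const int *B, const int *C, const int *D, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = ((B[i] * C[i]) + D[i]) / 2;
}

/* Same loop with the divide replaced by a right shift by one,
 * mirroring a MUL, ADD, RSHFA sequence.
 * Assumption: the sum is non-negative (checked by the assert). */
void kernel_shift(int *A, const int *B, const int *C, const int *D, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int t = (B[i] * C[i]) + D[i];
        assert(t >= 0);
        A[i] = t >> 1;
    }
}
```

For non-negative inputs, both functions fill `A` with the same values.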
How many cycles does it take to execute the program?
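One way to check a hand count is a small helper that adds up the per-instruction cycle counts from the table. This is only a sketch, assuming instructions execute strictly one after another with no overlap. The names `sequence_cycles` and `program_cycles` are hypothetical, and the instruction sequences passed in are whatever translation you come up with.

```c
#include <assert.h>
#include <stddef.h>

/* Opcodes from the scalar ISA table (ADD covers both the register
 * and the immediate form, which cost the same). */
enum opcode { LEA, LD, ST, MOVI, MUL, ADD, RSHFA, BRCC, NUM_OPCODES };

/* Cycle counts copied from the table. */
static const int CYCLES[NUM_OPCODES] = {
    [LEA] = 1, [LD] = 11, [ST] = 11, [MOVI] = 1,
    [MUL] = 6, [ADD] = 4, [RSHFA] = 1, [BRCC] = 1
};

/* Sum of the latencies of a straight-line instruction sequence. */
long sequence_cycles(const enum opcode *seq, size_t len)
{
    assert(seq != NULL || len == 0);
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += CYCLES[seq[i]];
    return total;
}

/* Setup code runs once; the loop body runs `iterations` times. */
long program_cycles(const enum opcode *setup, size_t setup_len,
                    const enum opcode *body, size_t body_len,
                    long iterations)
{
    assert(iterations >= 0);
    return sequence_cycles(setup, setup_len)
         + iterations * sequence_cycles(body, body_len);
}
```

For example, a toy setup of MOVI, LEA followed by a toy body of LD, ADD, ST, BRcc repeated 10 times would cost 2 + 10 × 27 = 272 cycles under this model.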
Now write Cray-like vector/assembly code to perform this operation in the shortest time possible. Assume that there are 8 vector registers and the length of each vector register is 64. Use the following instructions in the vector ISA:
Opcode | Operands | Number of Cycles | Description |
---|---|---|---|
LD | Vst, #n | 1 | Vst ← n |
LD | Vln, #n | 1 | Vln ← n |
VLD | Vi, X + offset | 11, pipelined | |
VST | Vi, X + offset | 11, pipelined | |
Vmul | Vi, Vj, Vk | 6, pipelined | |
Vadd | Vi, Vj, Vk | 4, pipelined | |
Vrshfa | Vi, Vj, amount | 1 | |
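Because the loop covers 100 elements and a vector register holds 64, the work has to be split into strips (strip mining). The sketch below computes the strip lengths that would be loaded into Vln. Processing the odd-sized strip first is one common convention, not something the problem mandates, and `strip_mine` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Split n elements into strips of at most max_vl elements.
 * The remainder strip (n % max_vl), if any, comes first.
 * Strip lengths are written to out (capacity cap);
 * the return value is the number of strips. */
size_t strip_mine(size_t n, size_t max_vl, size_t *out, size_t cap)
{
    assert(max_vl > 0);
    size_t count = 0;
    size_t first = n % max_vl;
    if (first != 0) {
        assert(count < cap);
        out[count++] = first;
        n -= first;
    }
    while (n > 0) {
        assert(count < cap);
        out[count++] = max_vl;
        n -= max_vl;
    }
    return count;
}
```

Calling `strip_mine(100, 64, out, 4)` yields two strips, of lengths 36 and 64.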
How many cycles does it take to execute the program on the following processors? Assume that memory is 16-way interleaved.
Little Computer Inc. is now planning to build a new computer that is more suited for scientific applications. LC-3b can be modified for such applications by replacing the data type Byte with Vector. The new computer will be called LmmVC-3 (Little 'mickey mouse' Vector Computer 3). Your job is to help us implement the datapath for LmmVC-3. The LmmVC-3 ISA will support all the scalar operations that LC-3b currently supports, except that LDB and STB will be replaced with VLD and VST, respectively. Our datapath will need to support the following new instructions:
Note: VDR means “Vector Destination Register” and VSR means “Vector Source Register.”
VLD, VST, and VADD do not modify the content of Vstride and Vlength registers.
The following five hardware structures have been added to LC-3b in order to implement LmmVC-3.
These structures are shown in the LmmVC-3 datapath diagram:
A 6-bit input to the Vector Register file has been labeled X on the datapath diagram. What is the purpose of this input? (Answer in fewer than 10 words.)
The logic structure X contains a 6-bit register and some additional logic. X has two control signals as its inputs. What are these signals used for?
Grey box A contains several additional muxes on both input lines to the ALU. Complete the logic diagram of grey box A (shown below) by showing all muxes and interconnects. You will need to add new signals to the control store; be sure to clearly label them in the logic diagram.
We show the beginning of the state diagram necessary to implement VLD. Using the notation of the LC-3b State Diagram, add the states you need to implement VLD. Inside each state describe what happens in that state. You can assume that you are allowed to make any changes to the microsequencer that you find necessary. You do not have to make/show these changes. You can modify BaseR and the condition codes. Make sure your design works when Vlength equals 0. Full credit will be awarded to solutions that require no more than 7 states.
Consider the following piece of code:
for (i = 0; i < 8; ++i) {
    for (j = 0; j < 8; ++j) {
        sum = sum + A[i][j];
    }
}
The figure below shows an 8-way interleaved, byte-addressable memory. The total size of the memory is 4KB. The elements of the 2-dimensional array, A, are 4 bytes in length and are stored in the memory in column-major order (i.e., columns of A are stored in consecutive memory locations) as shown. The width of the bus is 32 bits, and each memory access takes 10 cycles.
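The column-major layout determines how far apart consecutive accesses of the inner loop are. The sketch below computes the byte offset of A[i][j] relative to the start of the array. The actual base address of A comes from the figure and is not assumed here, and the helper names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS ((size_t)8)        /* A is 8 x 8 */
#define COLS ((size_t)8)
#define ELEM_BYTES ((size_t)4)  /* each element is 4 bytes */

/* Byte offset of A[i][j] from the start of A in column-major order:
 * all 8 elements of column j are stored before column j + 1 begins. */
size_t col_major_offset(size_t i, size_t j)
{
    assert(i < ROWS && j < COLS);
    return (j * ROWS + i) * ELEM_BYTES;
}

/* Byte distance between consecutive inner-loop accesses
 * (same i, j advancing by one). */
size_t inner_loop_stride(void)
{
    return col_major_offset(0, 1) - col_major_offset(0, 0);
}
```

With this layout, A[1][0] sits 4 bytes after A[0][0], while A[0][1] sits 32 bytes after it, so the loop as written walks memory with a 32-byte stride.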
A more detailed picture of the memory chips in Rank 0 of Bank 0 is shown below.
Since the address space of the memory is 4KB, 12 bits are needed to uniquely identify each memory location, i.e., Addr[11:0]. Specify which bits of the address will be used for:
Addr[_____:_____]
Addr[_____:_____]
Addr[_____:_____]
Addr[_____:_____]
How many cycles are spent accessing memory during the execution of the above code? Compare this with the number of memory access cycles it would take if the memory were not interleaved (i.e., a single 4-byte wide array).
Can any change be made to the current interleaving scheme to optimize the number of cycles spent accessing memory? If yes, which bits of the address will be used to specify the byte on bus, interleaving, etc. (use the same format as in part a)? With the new interleaving scheme, how many cycles are spent accessing memory? Remember that the elements of A will still be stored in column-major order.
Using the original interleaving scheme, what small changes can be made to the piece of code to optimize the number of cycles spent accessing memory? How many cycles are spent accessing memory using the modified code?
The figure below illustrates the logic and memory to support 512 MB (byte addressable) of physical memory, supporting unaligned accesses. The ISA contains LDByte, LDHalfWord, LDWord, STByte, STHalfWord and STWord instructions, where a word is 32 bits. Bit 28 serves as a chip enable (active high). If this bit is high the data of the memory is loaded on the bus, otherwise the output of the memory chip floats (tri-stated).
Note: the byte rotators in the figure are right rotators.
Construct the truth table to implement the LOGIC block, having inputs SIZE, R/W, 1st or 2nd access, PHYS_ADDR[1:0] and the outputs shown in the above figure. Assume that the value of SIZE can be Byte (00), HalfWord (01), and Word (10). Clearly explain what function each output serves.
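When filling in the truth table, it helps to know which combinations of SIZE and PHYS_ADDR[1:0] need a second access at all. The sketch below assumes that each memory access delivers one aligned 32-bit word, which is an assumption about the figure rather than something stated in the text. `needs_second_access` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an access of size_bytes (1, 2, or 4) starting at phys_addr
 * spills past the end of its aligned 4-byte word, so a second access
 * is required; returns 0 otherwise.
 * Assumption: memory is accessed one aligned 32-bit word at a time. */
int needs_second_access(uint32_t phys_addr, unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    unsigned offset = phys_addr & 0x3u;   /* PHYS_ADDR[1:0] */
    return offset + size_bytes > 4;
}
```

Under that assumption, a byte access never needs a second access, a halfword needs one only at offset 3, and a word needs one at any nonzero offset.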
If the latency of a DRAM memory bank is 37 cycles, into how many banks would you interleave this memory in order to fully hide this latency when making sequential memory accesses?
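The reasoning can be sketched as a small helper. It assumes that one new sequential access can be issued every cycle, that a bank is busy for the full latency of an access, and that the bank count is a power of two so the bank number is a simple address bit field. None of these assumptions is stated in the question, and `banks_to_hide_latency` is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

/* Smallest power-of-two bank count that is at least latency_cycles.
 * Assumptions: one access is issued per cycle, each bank is busy for
 * latency_cycles, and the bank count must be a power of two. */
unsigned banks_to_hide_latency(unsigned latency_cycles)
{
    assert(latency_cycles > 0);
    assert(latency_cycles <= UINT_MAX / 2 + 1);  /* keep the shift from overflowing */
    unsigned banks = 1;
    while (banks < latency_cycles)
        banks <<= 1;
    return banks;
}
```

For instance, a 10-cycle bank latency would call for 16 banks under these assumptions.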