## Department of Electrical and Computer Engineering

### The University of Texas at Austin

EE 360N, Fall 2003
Problem Set 4
Due: 15 October 2003, before class
Yale N. Patt, Instructor
Santhosh Srinath, Danny Lynch, TAs

Instructions:
You are encouraged to work on the problem set in groups and turn in one problem set for the entire group. Remember to put all your names on the solution sheet. Also remember to put the name of the TA in whose discussion section you would like the problem set returned to you.

1. (Hamacher, pg.255, question 5.13)  A byte-addressable computer has a small data cache capable of holding eight 32-bit words. Each cache block consists of one 32-bit word. When a given program is executed, the processor reads data from the following sequence of hex addresses:
`     200, 204, 208, 20C, 2F4, 2F0, 200, 204, 218, 21C, 24C, 2F4`
This pattern is repeated four times.
a. Show the contents of the cache at the end of each pass throughout this loop if a direct-mapped cache is used. Compute the hit rate for this example. Assume that the cache is initially empty.

b. Repeat part (a) for a fully-associative cache that uses the LRU-replacement algorithm.

c. Repeat part (a) for a four-way set-associative cache that uses the LRU replacement algorithm.

2. Clarifications (Updated 10/14/03)
As we did not talk about the priority based pseudo-LRU policy in detail in class, you are free to use the Victim/Next-Victim instead if you wish. Please state the policy you used in your solution.

a. Give a sequence of addresses that results in a different block to be replaced from the cache if the replacement policy is the priority-based pseudo-LRU policy Professor Patt discussed in class instead of a true LRU replacement policy. Remember that this scheme gives priority to one half of the cache on a cache hit so that the block that happens to be in the same half as the accessed block is marked as the second most recently used block in the same set. Assume that the cache is initially empty. The cache is 4-way set associative, its size is 256 bytes and each block is 4 bytes. Assume that the address space is 10 bits and memory is byte-addressable.

b. Here is an LC-3b assembly language program that performs array accesses. Show the address trace produced by load instructions. Suppose that we have a 1024-byte, 4-way set-associative cache with 4-byte blocks. Calculate the data cache hit ratio when we have a perfect LRU replacement scheme.

```  	 .ORIG x5000
AND R0, R0, #0 ; R0 : array sum
AND R3, R3, #0
AND R1, R1, #0 ; R1 : array index
ADD R3, R3, #2 ; R3 : outloop index
LEA R6, DATA
OUTLOOP LDW R4, R6, #0
AND R2, R2, #0
ADD R2, R2, #5  ; R2: inloop index
LDB R7, R1, #0
BRP INLOOP
BRZP OUTLOOP
HALT
DATA	 .FILL x400
.END
```
3. A computer has an 8KB write-through cache. Each cache block is 64 bits, the cache is 4-way set associative and uses a victim/next-victim pair of bits in each block for its replacement policy. Assume a 24-bit address space and byte-addresable memory. How big (in bits) is the tag store?
4. Clarifications (Updated 10/14/03)
For this problem, please serialize the accesses to the caches. So if you miss in the L1, and hit in the L2, it takes 11 cycles. Also, start the memory access once you know that you miss in the two caches. Assume that it takes the time needed to access the cache to know if an access was a hit or miss. If you had done the problem assuming that you can access the L2 in parallel with the L1 and so on, please state this assumption. You don't need to redo the problem.
When you receive data from memory, you can access the data as the writes to the caches are done in parallel.

(Hamacher et al., p. 255, question 5.18) You are working with a computer that has a primary and secondary cache. The cache block consists of 8 words. Make the following assumptions:
• The hit rate is the same for both the caches: 0.95 for instructions and 0.90 for data.
• The times needed to access an 8-word block in these caches are C1 = 1 cycle and C2 = 10 cycles.
• It takes one clock cycle to send an address to main memory.
• It takes one clock cycle to send one word from the memory to the cache.
• 30 percent of all instructions are loads and stores.

1. What is the average access time experienced by the CPU if the main memory uses 4 way interleaving? Assume that the memory is built with DRAM chips that allow the first word to be accesses in 8 cycles, but subsequent words of the block are in 4 clock cycles per word. Also, one clock cycle is required to send one word to the cache.

2. What is the average access time if the main memory is not interleaved?

3. What is the improvement obtained with interleaving?

5. Below, we have given you four different sequences of addresses generated by a program running on a processor with a data cache. Cache hit ratio for each sequence is also shown below. Assuming that the cache is initially empty at the beginning of each sequence, find out the following parameters of the processor's data cache:
* Associativity (1, 2, or 4 ways)
* Block size (1, 2, 4, 8, 16, or 32 bytes)
* Total cache size (256B, or 512B)
* Replacement policy (LRU or FIFO)
Assumptions:All memory accesses are one byte accessses. All addresses are byte addresses.
sequence 1:  0, 2, 4, 8, 16, 32                                                    hit ratio: 0.33
sequence 2:  0, 512, 1024, 1536, 2048, 1536, 1024, 512, 0      hit ratio: 0.33
sequence 3:  0, 64, 128, 256, 512, 256, 128, 64, 0                    hit ratio: 0.33
sequence 4:  0, 512, 1024, 0, 1536, 0, 2048, 512                      hit ratio: 0.25
6. Explain the differences between exceptions and interrupts. Be concise in your explanations.

Explain the similarities of exceptions and interrupts. Clearly describe the steps required to handle an exception or an interrupt.

7. Postponed to the next problem set
In class, we discussed two types of busses: "pending bus" and "split transaction bus". What is the advantage of a split-transaction bus over a pending bus?

8. Postponed to the next problem set
In class, we discussed the asynchronous finite state machine for the device controller of an input-output device within the context of a priority arbitration system. Draw the state diagram for this device controller (as drawn in lecture), identify the input and output signals, and briefly explain the function of each input and output signal.

9. Postponed to the next problem set
A group of students have decided to build a computer system using the LC-3b. The system will have one LC-3b processor connected to physical memory and several disk units via a shared bus. The disk units have the ability to transfer data directly to and from memory via the Direct Memory Access controller.

Every time a disk unit finishes a transfer, the LC-3b is interrupted, and the disk unit is given another transfer operation. The unit of transfer between the disk and the memory is a 212 B page and the disk units are capable of maintaining a transfer rate of 218 B/s. The bus itself is the fastest technology and is able to keep up with the transfer rate of the disk units (i.e., the bus does not slow down the transfer between disk and memory).

After a few experiments, the students found that the average disk transfer consisted of 2 pages of data. The disk interrupt handler on the LC-3b was known to take 5 ms of processing time per interrupt. The goal of their experiment was to figure out how many disk units could be connected to the system and fully utilized. Help them out.

10. AND, for those who cannot get enough of this stuff, the following problem:
(Note: You can get full credit without doing this problem)

You will be given a cache simulator (just the executable) with a hard-coded configuration. Your job is to determine the configuration of the cache. The simulator takes a trace of memory addresses as input and provides a hit ratio as output. Find the following:

* Associativity (1, 2, 4, or 8 ways)
* Block size (1, 2, 4, 8, 16, or 32 bytes)
* Total cache size (256B, 512B, or 1024B)
* Replacement policy (LRU or Pseudo-LRU)
Show the traces you used to determine each parameter of the cache. Assumptions:All memory accesses are one byte accessses. All addresses are byte addresses.
Simulator
The syntax for running the program is:
%./cachesim   <trace.txt>
The traces are just text files with one integer memory address per line. For example, the following trace would cause conflict misses in a direct-mapped, 256B cache:
0
256
512
0
256
Simulator for Linux
Simulator for Solaris