## Department of Electrical and Computer Engineering

### The University of Texas at Austin

EE 360N, Spring 2003
Problem Set 3
Due: 3 March 2003, before class
Yale N. Patt, Instructor
Hyesoon Kim, Moinuddin Qureshi, Onur Mutlu, Santhosh Srinath, TAs

Instructions:
You are encouraged to work on the problem set in groups and turn in one problem set for the entire group. Remember to put all your names on the solution sheet. Also remember to put the name of the TA in whose discussion section you would like the problem set returned to you.

1. If the latency of a DRAM memory bank is 37 cycles, into how many banks would you interleave this memory in order to fully hide this latency when making sequential memory accesses?
2. An ISA supports an 8-bit, byte-addressable virtual address space. The corresponding physical memory has only 128 bytes. Each page contains 16 bytes. A simple, one-level translation scheme is used, and the page table resides in physical memory. The initial contents of the frames of physical memory are shown below.

| Frame | Contents |
|-------|------------|
| 0 | empty |
| 1 | Page 13 |
| 2 | Page 5 |
| 3 | Page 10 |
| 4 | empty |
| 5 | Page 1 |
| 6 | empty |
| 7 | Page Table |

A three-entry Translation Lookaside Buffer (TLB) that uses LRU replacement is added to this system. Initially, this TLB contains the entries for pages 0, 2, and 13. For the following sequence of references, put a circle around those that generate a TLB hit and a rectangle around those that generate a page fault. What is the TLB hit rate for this sequence of references? (Note: an LRU policy is also used to select pages for replacement in physical memory.)

References (to pages): 14, 14, 13, 14, 6, 6, 13, 14, 15, 14, 15, 13, 4, 3.

At the end of this sequence, what three entries are contained in the TLB?
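For checking a trace like this by hand, a short simulation helps. The Python sketch below is ours, not part of the assignment; in particular, it assumes the three initial TLB entries are ordered oldest-first, which the problem does not specify:

```python
from collections import OrderedDict

def simulate_tlb(refs, capacity=3, initial=()):
    """Count hits for a page-reference string in an LRU TLB.

    `initial` lists pages already cached, oldest first (an assumption --
    the problem does not state their recency order).
    """
    tlb = OrderedDict((p, None) for p in initial)  # oldest -> newest
    hits = 0
    for page in refs:
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # page becomes most recently used
        else:
            if len(tlb) >= capacity:
                tlb.popitem(last=False)  # evict the least recently used
            tlb[page] = None
    return hits, list(tlb)

refs = [14, 14, 13, 14, 6, 6, 13, 14, 15, 14, 15, 13, 4, 3]
hits, final = simulate_tlb(refs, capacity=3, initial=(0, 2, 13))
print(hits, hits / len(refs), final)
```

Keep in mind that TLB hits and page faults are separate questions: a reference can miss the TLB yet still find its page resident in physical memory.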

3. We have been referring to the LC-3b memory as 2^16 bytes of memory, byte-addressable. This is the memory that the user sees, and it may bear no relationship to the actual physical memory. Suppose that the actual physical address space is 8K bytes and our page size is 512 bytes. What is the size of the PFN? Suppose we have a virtual memory system similar to the VAX, in which virtual memory is divided into User Space (P0) and System Space, and the System Page Table remains resident in physical memory. System Space includes the trap vector table, the interrupt vector table, the operating system, and the supervisor stack, as shown in Figure A.1 in Appendix A. If each PTE contains, in addition to the PFN, a valid bit, a modified bit, and two bits of access control, how many bits of physical memory are required to store the System Page Table?
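The bit widths in this problem fall straight out of the sizes, and the arithmetic can be sanity-checked with a few lines of Python (our sketch, not part of the assignment; it stops short of the System Page Table part, which also needs the PTE layout):

```python
import math

phys_bytes = 8 * 1024   # physical address space: 8K bytes
page_bytes = 512        # page (and frame) size
virt_bytes = 2 ** 16    # LC-3b virtual address space

frames = phys_bytes // page_bytes
offset_bits = int(math.log2(page_bytes))             # byte within a page
pfn_bits = int(math.log2(frames))                    # names a physical frame
vpn_bits = int(math.log2(virt_bytes)) - offset_bits  # names a virtual page
print(frames, offset_bits, pfn_bits, vpn_bits)       # -> 16 9 4 7
```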
4. Let's say we added a virtual memory system to the LC-3b. Which instructions can possibly generate a page fault? What is the maximum number of page faults an instruction can possibly generate while it is being processed? Which instructions can possibly generate that maximum number of page faults?

Assume that the virtual memory system added uses a one-level translation scheme and the page table is always resident in physical memory.

An instruction is said to generate a page fault if a page fault occurs at any time during the processing of that instruction.

5. (Hamacher, pg. 255, question 5.20) A 1024x1024 array of 32-bit numbers is to be "normalized" as follows: for each column, the largest element is found, and all elements of the column are divided by this maximum value. Assume that each page in the virtual memory consists of 4K bytes and that 1M bytes of the main memory are allocated for storing data during this computation. Suppose that it takes 40 ms to load a page from the disk into main memory when a page fault occurs (assume that when we start, the main memory is empty).
a. How many page faults would occur if the elements of the array are stored in column order in the virtual memory?

b. How many page faults would occur if the elements are stored in row order?

c. Estimate the total time needed to perform this normalization for both arrangements (a) and (b). Assume that it takes 2 ns to do a comparison, 20 ns to do a divide, and 100 ns to do a load/store to memory.
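To build intuition for parts (a) and (b) without working the full-size numbers, here is a scaled-down experiment (our Python sketch, not part of the assignment): it replays the normalization's page references, two passes per column, through an LRU-managed memory and counts faults for both storage orders.

```python
from collections import OrderedDict

def page_faults(refs, mem_pages):
    """LRU page-fault count for a sequence of page numbers."""
    resident = OrderedDict()
    faults = 0
    for p in refs:
        if p in resident:
            resident.move_to_end(p)
        else:
            faults += 1
            if len(resident) >= mem_pages:
                resident.popitem(last=False)   # evict the LRU page
            resident[p] = None
    return faults

def normalize_refs(rows, cols, words_per_page, column_major):
    """Page trace for the normalization: two passes down each column
    (pass 1 finds the max, pass 2 divides every element)."""
    refs = []
    for c in range(cols):
        for _ in range(2):
            for r in range(rows):
                word = c * rows + r if column_major else r * cols + c
                refs.append(word // words_per_page)
    return refs

# scaled-down stand-in: 16x16 array, 4-word pages, 2 pages of main memory
print(page_faults(normalize_refs(16, 16, 4, True), 2),   # column order
      page_faults(normalize_refs(16, 16, 4, False), 2))  # row order
```

Scaling the parameters back up to a 1024x1024 word array, 1K-word (4K-byte) pages, and 256 resident pages reproduces the situation in the problem.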

6. (Hamacher, pg. 255, question 5.13) A byte-addressable computer has a small data cache capable of holding eight 32-bit words. Each cache block consists of one 32-bit word. When a given program is executed, the processor reads data from the following sequence of hex addresses:

`200, 204, 208, 20C, 2F4, 2F0, 200, 204, 218, 21C, 24C, 2F4`
This pattern is repeated four times.
a. Show the contents of the cache at the end of each pass through this loop if a direct-mapped cache is used. Compute the hit rate for this example. Assume that the cache is initially empty.

b. Repeat part (a) for a fully associative cache that uses the LRU replacement algorithm.

c. Repeat part (a) for a four-way set-associative cache that uses the LRU replacement algorithm.
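Hand-simulating three cache organizations over 48 references is error-prone, so a small checker is handy. The sketch below (ours, not part of the assignment) counts hits in an LRU set-associative cache; `ways=1` gives the direct-mapped cache of part (a), `ways=8` the fully associative cache of part (b), and `ways=4` part (c):

```python
def cache_hits(addrs, block_bytes, num_blocks, ways):
    """Count hits for a byte-address trace in an LRU set-associative cache."""
    sets = num_blocks // ways
    cache = [[] for _ in range(sets)]      # each set: LRU list, MRU last
    hits = 0
    for a in addrs:
        block = a // block_bytes
        s = cache[block % sets]
        if block in s:
            hits += 1
            s.remove(block)
        elif len(s) >= ways:
            s.pop(0)                       # evict the least recently used
        s.append(block)                    # block becomes most recently used
    return hits

trace = [0x200, 0x204, 0x208, 0x20C, 0x2F4, 0x2F0,
         0x200, 0x204, 0x218, 0x21C, 0x24C, 0x2F4] * 4
for ways in (1, 8, 4):                     # parts (a), (b), (c)
    print(ways, cache_hits(trace, block_bytes=4, num_blocks=8, ways=ways))
```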

7. a. Give a sequence of addresses that results in a different block being replaced from the cache when the replacement policy is the priority-based pseudo-LRU policy Professor Patt discussed in class rather than a true LRU replacement policy. Recall that this scheme gives priority to one half of the cache on a cache hit: the block that happens to be in the same half as the accessed block is marked as the second most recently used block in the set. Assume that the cache is initially empty. The cache is 4-way set associative, its size is 256 bytes, and each block is 4 bytes. Assume that the address space is 10 bits and memory is byte-addressable.

b. Here is an LC-3b assembly language program that performs array accesses. Show the address trace produced by the load instructions. Suppose that we have a 1024-byte, 4-way set-associative cache with 4-byte blocks. Calculate the data cache hit ratio assuming a perfect LRU replacement scheme.

```
        .ORIG x5000
        AND  R0, R0, #0   ; R0: array sum
        AND  R3, R3, #0
        AND  R1, R1, #0   ; R1: array index
        ADD  R3, R3, #2   ; R3: outer-loop index
        LEA  R6, DATA
OUTLOOP LDW  R4, R6, #0
        AND  R2, R2, #0
        ADD  R2, R2, #5   ; R2: inner-loop index
        LDB  R7, R1, #0
        BRP  INLOOP
        BRZP OUTLOOP
        HALT
DATA    .FILL x400
        .END
```
8. A computer has an 8KB write-through cache. Each cache block is 64 bits, the cache is 4-way set associative, and it uses a victim/next-victim pair of bits in each block for its replacement policy. Assume a 24-bit address space and byte-addressable memory. How big (in bits) is the tag store?
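The tag-store computation generalizes to any configuration: count the blocks, split the address into tag/index/offset, and add the per-block bookkeeping bits. The sketch below (ours) shows the method on a deliberately different, hypothetical configuration so it does not hand over the answer; which bookkeeping bits to count (valid, victim/next-victim, and so on) depends on the problem's conventions:

```python
import math

def tag_store_bits(cache_bytes, block_bytes, ways, addr_bits, state_bits):
    """Tag-store size in bits: per-block tag plus per-block state bits."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return blocks * (tag_bits + state_bits)

# hypothetical example (NOT the cache above): 1 KB, 16 B blocks, 2-way,
# 16-bit addresses, 1 valid bit per block
print(tag_store_bits(1024, 16, 2, 16, 1))   # -> 512
```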
9. (Hamacher et al., p. 255, question 5.18) You are working with a computer that has a primary and secondary cache. The cache block consists of 8 words. Make the following assumptions:
• The hit rate is the same for both caches: 0.95 for instructions and 0.90 for data.
• The times needed to access an 8-word block in these caches are C1 = 1 cycle and C2 = 10 cycles.
• It takes one clock cycle to send an address to main memory.
• It takes one clock cycle to send one word from the memory to the cache.
• 30 percent of all instructions are loads and stores.

a. What is the average access time experienced by the CPU if the main memory uses 4-way interleaving? Assume that the memory is built with DRAM chips that allow the first word to be accessed in 8 clock cycles, with subsequent words of the block accessed in 4 clock cycles per word. Also, one clock cycle is required to send one word to the cache.

b. What is the average access time if the main memory is not interleaved (i.e., there is a single DRAM chip)?

c. What is the improvement obtained with interleaving?

10. Below, we have given you four different sequences of addresses generated by a program running on a processor with a data cache. The cache hit ratio for each sequence is also shown. Assuming that the cache is initially empty at the beginning of each sequence, determine the following parameters of the processor's data cache:
* Associativity (1, 2, or 4 ways)
* Block size (1, 2, 4, 8, 16, or 32 bytes)
* Total cache size (256B or 512B)
* Replacement policy (LRU or FIFO)
Assumptions: All memory accesses are one-byte accesses. All addresses are byte addresses.

| Sequence | References (byte addresses) | Hit ratio |
|----------|-----------------------------|-----------|
| 1 | 0, 2, 4, 8, 16, 32 | 0.33 |
| 2 | 0, 512, 1024, 1536, 2048, 1536, 1024, 512, 0 | 0.33 |
| 3 | 0, 64, 128, 256, 512, 256, 128, 64, 0 | 0.33 |
| 4 | 0, 512, 1024, 0, 1536, 0, 2048, 512 | 0.25 |

AND, for those who cannot get enough of this stuff, the following problem:
(Note: You can get full credit without doing this problem)

You will be given a cache simulator (just the executable) with a hard-coded configuration. Your job is to determine the configuration of the cache. The simulator takes a trace of memory addresses as input and provides a hit ratio as output. Find the following:

* Associativity (1, 2, 4, or 8 ways)
* Block size (1, 2, 4, 8, 16, or 32 bytes)
* Total cache size (256B, 512B, or 1024B)
* Replacement policy (LRU or Pseudo-LRU)
Show the traces you used to determine each parameter of the cache. Assumptions: All memory accesses are one-byte accesses. All addresses are byte addresses.
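As an example of how such probe traces can be designed, the Python sketch below (ours; the stride*.txt file names are arbitrary) writes one trace per candidate stride to expose the block size. Each trace accesses pairs of addresses a fixed stride apart: while the stride is smaller than the block size, the second access of every pair hits and the simulator reports a hit ratio near 0.5; the first stride that drives the ratio toward 0 equals the block size.

```python
def write_trace(path, addrs):
    # one decimal byte address per line, as cachesim expects
    with open(path, "w") as f:
        for a in addrs:
            f.write(f"{a}\n")

# pairs (base, base + stride); bases are far apart so pairs never share blocks
for stride in (1, 2, 4, 8, 16, 32):
    addrs = []
    for k in range(64):
        base = k * 4096
        addrs += [base, base + stride]
    write_trace(f"stride{stride}.txt", addrs)
```

Similar pair-and-conflict traces, with the stride swept across larger powers of two, can separate cache size, associativity, and replacement policy.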
**Simulator**

The syntax for running the program is:

```
% ./cachesim <trace.txt>
```

The traces are just text files with one integer memory address per line. For example, the following trace would cause conflict misses in a direct-mapped, 256B cache:

```
0
256
512
0
256
```
* Simulator for Linux
* Simulator for Solaris