Exam 2 Annotated Buzzwords
Spring 2007
Note: buzzwords marked with an asterisk (*) are annotated in the
Fall 2005 Exam 2 Buzzwords (except for cache buzzwords which are in Fall 2005 Exam 1 Buzzwords).
Interrupts and Exceptions
- interrupt *
- polling *
- exception *
- interrupt/exception handler *
- precise exception *
- consistent state *
- priority/privilege
- Priority is about urgency; privilege is about the right
to do something. They are orthogonal issues. We sometimes confuse them in
our human experience because they often go hand-in-hand. Example: the
fire truck runs a red light to get to the fire. Running the red light is
about privilege. We give it that privilege because of the urgency of getting
to the fire. There are examples where they do not go hand-in-hand. Recall
my example in class of the mail clerk who had to go to the bathroom while
waiting for the elevator on the 55th floor. In computer technology, they
really are orthogonal. Real-time experiments may run at higher priority
because they need to sample the sensors at a precise moment in time, but
they cannot execute RTI, turn off the computer, or access certain files.
System administrators can do those things, even though they may run at a low
priority level: it is not urgent when they do their job, just that they do it.
- interrupt vector *
- exception vector *
- post increment (auto increment)
- This is an addressing mode that is common
on many older machines and a few more recent machines. The address of the
memory location containing the operand is in a register. After the address
is read, the register is incremented, so the next time that instruction is
executed, the register will point to the next data element. Side note: since
data usually occupies some number of bytes, the register is incremented by
the number of bytes of the operand.
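The behavior described above can be sketched in a few lines. This is a toy model, not any real machine's semantics; the word-addressed memory and register names are invented for illustration.

```python
# Toy sketch of post-increment (auto-increment) addressing: the register
# holds the address of the operand; after the access, the register is
# advanced by the operand size so it points at the next data element.

def load_post_increment(memory, regs, reg, size=4):
    """Read the operand at regs[reg], then bump the register by `size`."""
    addr = regs[reg]
    value = memory[addr]        # one operand per word in this toy memory
    regs[reg] = addr + size     # auto-increment by the operand size in bytes
    return value

memory = {0x1000: 11, 0x1004: 22, 0x1008: 33}   # illustrative word memory
regs = {"R1": 0x1000}

a = load_post_increment(memory, regs, "R1")   # reads 11, R1 now 0x1004
b = load_post_increment(memory, regs, "R1")   # reads 22, R1 now 0x1008
```

Re-executing the same "instruction" naturally walks through the array, which is exactly why the mode is convenient for loops.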
Memory Hierarchy (Caches)
- cache *
- allocate on write miss *
- write back(copy back)/ write through *
- spatial locality *
- temporal locality *
- cache line (block) *
- direct-mapped cache *
- fully-associative cache *
- set-associative cache *
- tag/data store *
- cache hit/miss *
- hit ratio *
- dirty bit *
- sector cache *
- replacement policy (LRU, FIFO, random, victim/next-victim) *
- virtually-indexed, virtually-tagged cache *
- physically-indexed, physically-tagged cache *
- virtually-indexed, physically-tagged cache *
- associative access *
- multiple levels of caches *
- inclusion property *
- register spill
- Registers are useful for storing values that will need to
be sourced again, rather than writing them to memory and having to read them
from memory again. More registers mean more values can be kept in registers.
Unfortunately there are never enough registers, so at some point we need to
write the values of some registers to memory in order to make those registers
available to hold new temporary values. We call that process of writing the
registers to memory to temporarily get them out of the way and make room
"register spill," and the compiler output that does this "register spill code."
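A minimal sketch of the idea, with an invented two-register machine: when both registers are live, the oldest value is spilled to a memory slot to make room.

```python
# Toy illustration of register spill: with only NUM_REGS registers, the
# oldest live value is written ("spilled") to memory to free a register.
# The allocator and its policy here are illustrative, not a real compiler's.

NUM_REGS = 2

def allocate(live, memory, name, value):
    """Place `value` in a register, spilling the oldest value if all are full."""
    if len(live) == NUM_REGS:
        spilled_name, spilled_value = live.pop(0)   # oldest live value
        memory[spilled_name] = spilled_value        # the "spill code" store
    live.append((name, value))

live, memory = [], {}
allocate(live, memory, "a", 1)
allocate(live, memory, "b", 2)
allocate(live, memory, "c", 3)   # registers full: "a" is spilled to memory
```

After the third allocation, "a" lives in memory and would have to be reloaded before use; that reload is the other half of spill code.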
- write back buffer (store buffer)
- Loads must be serviced as soon as possible,
since subsequent instructions are usually dependent on the data being loaded.
Stores, on the other hand, can take place when convenient. [Note: in a
multiprocessor, that may not be quite so simplistically stated, but that is
an issue we will deal with later this semester.] A store buffer allows the
processor to put values to be written to the side and not tie up bus
bandwidth unnecessarily. For example, on a load miss to a write back cache,
put aside the dirty cache line to write to memory until after the new cache
line is read in. For example, on a write through cache hit, we may want to
buffer multiple writes to the same cache line, so as to only have to do one
memory write and tie up bus bandwidth once. The write buffer is an extension
of the memory system and needs to be considered as such. That is, an access
to the cache also accesses the write buffer.
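The two points above (stores wait for a convenient time; loads must also check the buffer, since it is part of the memory system) can be sketched as follows. The class and its interface are invented for illustration.

```python
# Sketch of a write (store) buffer: stores are queued rather than sent to
# memory immediately, repeated writes to one address are coalesced, and a
# later load must check the buffer before memory, since the buffer holds
# the newest value.

class StoreBuffer:
    def __init__(self):
        self.pending = {}              # address -> youngest buffered value

    def store(self, addr, value):
        self.pending[addr] = value     # coalesce writes to the same address

    def load(self, addr, memory):
        # The buffer is an extension of the memory system: check it first.
        if addr in self.pending:
            return self.pending[addr]
        return memory[addr]

    def drain(self, memory):
        # Write buffered values to memory when the bus is otherwise idle.
        memory.update(self.pending)
        self.pending.clear()

memory = {0x10: 0}
sb = StoreBuffer()
sb.store(0x10, 42)
v = sb.load(0x10, memory)   # sees 42 from the buffer, not the stale 0
sb.drain(memory)            # memory now holds 42
```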
- uniprocessor cache consistency
- We probably also ought to note here that
the rest of the world would call this “coherency” not “consistency.” Normally,
we think of cache coherency only when we have more than one processor that can
write to its own private cache. We will deal with this more after the exam.
It is worth noting however that in a uniprocessor environment with intelligent
I/O devices, the I/O device (e.g., DMA controller) can independently write to
memory, in which case data in the cache may no longer be valid. Ergo, in an
environment with intelligent I/O devices, the cache must monitor independent
accesses to memory from the I/O subsystem in order to be sure that data in
the cache has not been invalidated by what the I/O device wrote to memory.
- multiprocessor cache consistency *
- direct memory access (DMA) *
Input/Output
- media *
- device *
- transducer *
- volatile/non-volatile
- Volatile storage loses its data when power
is turned off; main memory is volatile. Non-volatile storage retains its data
when power is turned off; CD-ROMs, disks, and tapes are all non-volatile.
- synchronous bus *
- asynchronous bus *
- carrier sense multiple access/collision detect (CSMA/CD)
- The way the Ethernet
works. The Ethernet is an asynchronous bus. Device controllers are attached
to the bus, and any controller can access the bus at any time. How do we keep
from transmitting garbage? Answer: CSMA/CD. Each controller senses the
carrier to see if anyone else is accessing the bus. If so, the controller
does not access the bus. All (multiple) the controllers are sensing the
carrier all the time. If no one else is using the bus, the controller is free
to put signals on the bus. However, due to the nature of the asynchronous bus
and the propagation time of those signals, controller A may think there is no
one else using the bus because the signals being propagated by controller B
may not have propagated all the way to A yet. Thus, controller A must
continue to monitor the bus to be sure it is always the case that the
signals controller A is putting on the bus are the same as the signals
controller A is monitoring. If they differ, controller A has detected a
collision (CD). In this case, controller A stops transmitting, waits a random
amount of time, and then tries again. Meanwhile, controller B will have
detected the collision due to A, so it stops transmitting, waits a random
amount of time, and then tries again. The hope is that randomA is not equal
to randomB. If they are equal, we repeat.
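One round of the collide-and-back-off behavior can be sketched as below. The timing model is deliberately crude (whole rounds instead of propagation delays), and the backoff range is invented; real Ethernet uses binary exponential backoff.

```python
# Toy CSMA/CD round: every controller with zero backoff transmits; if more
# than one transmits, what each drives differs from what it senses, so all
# of them detect the collision and pick a random number of slots to wait.

import random

def transmit_round(controllers):
    """Return the set of controller names that collided this round."""
    senders = [c for c in controllers if c["backoff"] == 0]
    if len(senders) > 1:
        for c in senders:
            c["backoff"] = random.randint(1, 4)   # random wait before retry
        return {c["name"] for c in senders}
    return set()

random.seed(1)   # deterministic for the sketch
controllers = [{"name": "A", "backoff": 0}, {"name": "B", "backoff": 0}]
collided = transmit_round(controllers)   # A and B both sent: collision
```

If A and B happen to draw the same backoff, they will collide again on the retry, which is the "if they are equal, we repeat" case above.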
- protocol
- Taken straight from our embassies. The sequence of actions that
must take place in order to transact business. In the case of an asynchronous
bus, business is the transmission of a value from A to B. Also called
handshaking.
- race condition
- A condition where the result depends on the unpredictable relative
propagation delays of signals from one device to another.
- arbitration *
- central arbitration *
- distributed arbitration *
- priority arbitration unit (PAU) *
- starvation
- A condition in which a controller may never get the bus it is
requesting. Recall the example of the daisy chain grant of the BG signal.
- bus cycle *
- bus master *
- slave *
- data lines *
- address lines *
- control lines *
- multiplex bus *
- pending bus *
- split-transaction bus *
- vertical priority *
- horizontal priority *
- device controller *
- bus request *
- bus grant *
- bus busy
- A control signal indicating that a bus cycle is in progress.
- daisy chaining *
- SACK *
- MSYN *
- SSYN *
Pipelining and Branch Prediction
- pipelining *
- pipeline stages *
- pipeline bubble *
- branch misprediction penalty *
- data dependency *
- flow dependency *
- anti-dependency *
- output dependency *
- control dependency *
- data forwarding *
- sequential programming model *
- in-order execution *
- scoreboard *
- out-of-order execution *
- dual fetch
- delayed branch
- A branch instruction whose redirection of the instruction stream does
not take effect immediately. That is, the semantics of the branch
instruction is: execute n more instructions and then branch.
- delay slot
- The number of instructions (n) between the branch and when it takes
effect. Also refers to the instructions contained in those n
locations after the branch, which will execute before the branch redirects
the instruction stream.
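The semantics can be made concrete with a toy executor for a one-slot delayed branch. The instruction encoding here is invented purely for illustration.

```python
# Toy model of a 1-slot delayed branch: the instruction right after the
# branch (the delay slot) executes, and only then is the PC redirected.

def run(program):
    pc, trace, pending_target = 0, [], None
    while pc < len(program):
        op = program[pc]
        next_pc = pc + 1
        if pending_target is not None:      # the branch takes effect now,
            next_pc, pending_target = pending_target, None   # after the slot
        if op[0] == "BR":
            pending_target = op[1]          # redirect AFTER one more instr
        else:
            trace.append(op[1])             # record which instr executed
        pc = next_pc
    return trace

prog = [("ADD", "i0"), ("BR", 4), ("ADD", "slot"),
        ("ADD", "skipped"), ("ADD", "target")]
trace = run(prog)   # the slot instruction runs; "skipped" never does
```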
- fall through problem
- performance metric
- The basic equation for performance is:
1 ÷ (N × CPI × t)
N is the number of instructions in the program being measured. CPI is
the average number of cycles it takes to complete (retire) each instruction.
t is the cycle time of the clock. Note that if I make the clock cycle
smaller, I probably increase CPI. CPI vs. t is a fundamental tradeoff. Note
also that t is 1/frequency, and CPI is 1/IPC. Most machines quote their IPC
values (bigger is better), although the microarchitect is more interested in
CPI, since the microarchitect is interested in eliminating some of the cycles
that contribute to the CPI of an instruction. Recall, when we talked about
virtual memory, I mentioned that the VAX-11/780 had a CPI of 10.5, that each
VAX instruction on average made one memory reference, and that the page table
translation mechanism took 22 cycles. The reason the CPI is only 10.5 is
that there is a TLB on that machine which takes no extra time when it hits.
Since it hits 95% of the time, only 5% of the time does the 22-cycle
translation take place. 5% of 22 cycles is approximately one cycle. That is,
approximately one cycle of the 10.5 CPI is due to virtual-to-physical
address translation.
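The equation and the TLB arithmetic above can be checked directly:

```python
# Execution time is N * CPI * t; performance is its reciprocal.

def exec_time(n_instructions, cpi, cycle_time):
    return n_instructions * cpi * cycle_time

t = exec_time(100, 2, 1)   # 100 instructions at CPI 2, unit cycle time

# VAX-11/780 figure from the note: one memory reference per instruction,
# a 22-cycle page-table translation, and a 95% TLB hit rate, so only a
# 5% miss rate pays the 22 cycles.
miss_rate = 0.05
translation_cycles = 22
avg_translation_cpi = miss_rate * translation_cycles   # ~1.1 cycles of CPI
```

The roughly 1.1 cycles is the "approximately one cycle" of the 10.5 CPI attributed to address translation.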
- branch target buffer (branch history table)
- A mechanism that is helpful
in predicting branches. It generally contains the target address of a taken
branch plus a bit to tell the instruction fetch mechanism whether to predict
taken or the fall-through path. The prediction can be used to select the
source of a mux whose output gets loaded into the PC for the next cycle. If
the prediction is taken, we load the address from the BTB. If we predict not
taken, we load the existing PC plus the instruction size in bytes.
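The mux selection described above amounts to a few lines. The BTB is modeled here as a dictionary keyed by PC; a real BTB is a small tagged cache, and the 4-byte instruction size is an assumption.

```python
# Sketch of next-PC selection with a BTB: on a hit that predicts taken, the
# PC gets the stored target; otherwise it gets the fall-through address.

INSTR_SIZE = 4   # assumed fixed instruction size in bytes

def next_pc(pc, btb):
    entry = btb.get(pc)                  # hit: {"target", "predict_taken"}
    if entry and entry["predict_taken"]:
        return entry["target"]           # mux selects the BTB target
    return pc + INSTR_SIZE               # mux selects the fall-through path

btb = {0x100: {"target": 0x200, "predict_taken": True}}
taken = next_pc(0x100, btb)      # BTB hit, predicted taken
fallthru = next_pc(0x104, btb)   # BTB miss: PC + instruction size
```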
- branch prediction *
- always taken/always not taken *
- compile-time predictor *
- run-time predictor *
- backward taken forward not-taken (BTFN) *
- last time predictor *
- 2-bit saturating counter *
- 2-level adaptive branch predictor
- The branch predictor we invented in 1990,
where the branch decision is based on what the processor did in previous
cycles when its history was the same as the current history. We need a
history register to store the current history, and a table of two-bit
counters to store what the processor did in the past when it had exactly
the same history.
- history register
- See above.
- pattern history table
- The table of 2-bit counters. If the history register
contains n bits, the table contains 2^n 2-bit counters.
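Putting the last three entries together, a minimal sketch of the two-level scheme with a single global history register (the table sizes and initial counter values are illustrative choices):

```python
# Sketch of a two-level adaptive branch predictor: an n-bit history register
# indexes a pattern history table of 2^n two-bit saturating counters.

HISTORY_BITS = 4

class TwoLevelPredictor:
    def __init__(self):
        self.history = 0                        # history register (n bits)
        self.pht = [1] * (2 ** HISTORY_BITS)    # 2^n counters, weakly not-taken

    def predict(self):
        return self.pht[self.history] >= 2      # counter in {2,3} means taken

    def update(self, taken):
        ctr = self.pht[self.history]            # train the counter this
        self.pht[self.history] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        mask = (1 << HISTORY_BITS) - 1          # shift the outcome into the
        self.history = ((self.history << 1) | int(taken)) & mask  # history

p = TwoLevelPredictor()
for outcome in [True] * 16:    # an always-taken branch
    p.update(outcome)
# After training, history is all ones and that PHT entry predicts taken.
```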
- profiling *
- representativeness *
Out-of-Order Execution
- Tomasulo's algorithm *
- reservation station *
- common data bus *
- tag *
- register renaming (aliasing)
- Recall that because we do not have
an infinite number of registers, we need to use the same register to hold
different values at different times in the flow of the program. Due to the
nature of out-of-order execution, it is likely that two instructions
that write to the same register could be active at the same time. Thus it is
important to separately identify each INSTANCE of each destination write. We
do this by assigning a tag (usually a physical register number) to each
destination operand. In Tomasulo's time, that tag was referred to as an alias.
Recall our figure in class – the Greek letters represented tags. Recall the
data flow example – the tags correspond to arcs.
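A minimal sketch of the tag assignment: each destination write gets a fresh tag, and readers pick up the tag of the newest instance. The free-running tag counter stands in for a real free list of physical registers.

```python
# Sketch of register renaming: every write to an architectural register is
# given a fresh tag (physical register number), so two in-flight writes to
# the same register remain distinct instances.

class Renamer:
    def __init__(self):
        self.next_tag = 0
        self.alias = {}                  # architectural reg -> current tag

    def rename_dest(self, arch_reg):
        tag = self.next_tag              # fresh tag for this write instance
        self.next_tag += 1
        self.alias[arch_reg] = tag
        return tag

    def rename_src(self, arch_reg):
        return self.alias[arch_reg]      # a reader sees the newest instance

r = Renamer()
t1 = r.rename_dest("R1")     # first write to R1: tag 0
src = r.rename_src("R1")     # a dependent reader gets tag 0
t2 = r.rename_dest("R1")     # second write to R1: distinct tag 1
```

The two writes to R1 can now execute out of order without clobbering each other, because their tags (the arcs in the data flow graph) differ.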
- retirement (commit, completion)
- Important in an out-of-order execution
machine, where precise exceptions must be maintained. Instructions execute
when they are ready, but they must commit their results in the order of the
program (in program order) – hence the term "in-order retirement."
We use the word retire or commit to designate
the act of permanently changing the destination register.
- in-order retirement *
- reorder buffer (ROB)
- A convenient register file that allows destination
writes to take place in any order, but reads for the purpose of retirement
must be done in program order.
- architectural/physical register
- Architectural registers are what the
software knows about. Physical registers are internal storage that the
microarchitecture uses to maintain the correctness of the execution. The
re-order buffer for example is a file of physical registers.
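The last three entries fit together in a short sketch: out-of-order writeback into the ROB, in-order commit from its head into the architectural registers. The entry format is invented for illustration.

```python
# Sketch of a reorder buffer: results are written back in any order, but
# retirement (permanently updating architectural state) proceeds strictly
# in program order from the head of the buffer.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # entries in program order

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self, arch_regs):
        # Retire only completed instructions, oldest first.
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            arch_regs[e["dest"]] = e["value"]   # the permanent commit

rob, regs = ReorderBuffer(), {}
e1 = rob.allocate("R1")            # older instruction
e2 = rob.allocate("R2")            # younger instruction
e2.update(value=7, done=True)      # younger one finishes FIRST
rob.retire(regs)
snapshot = dict(regs)              # still empty: R1 blocks retirement
e1.update(value=3, done=True)
rob.retire(regs)                   # now both commit, in program order
```

Because nothing past the head has committed, an exception on the older instruction can still be taken precisely.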
- address unknown problem (memory disambiguation problem)
- A ST instruction
has an addressing mode that requires the contents of a register. A younger
LD instruction has a known address. Can the load take place before the store?
It depends on whether the addresses they are accessing are different. What
if the address associated with the store is not yet known? We have a problem:
maybe yes, maybe no. I called it the unknown address problem. Josh Fisher
carried the day with his name for it: memory disambiguation – the requirement
to disambiguate these two addresses. That is what everyone calls it today.
What to do about it:
- Conservative: Wait until the address of the store is
known. The plus is that it is simple. The minus is that it loses performance.
- Aggressive: Assume the addresses are different. The plus is you don't wait.
The minus is that if you are wrong, you need to throw away a lot of work and
start over. The logic to do that is substantial.
- memory order buffer (MOB)
- The structure that keeps track of the memory
accesses so as to be able to quickly tell if there is a disambiguation problem.
- data flow *
- data flow graph *
- nodes *
- edges *
Vector Processors
- vectorizable *
- vector processing *
- single instruction multiple data (SIMD) *
- array processing *
- vector register *
- vector length register *
- stride *
- vector stride register *
- vector chaining *
- load/store pipe
- scalar computer *
- superscalar processor
- A processor that fetches more than one instruction
at a time and, by virtue of the logic it entails (multiple decode, branch
prediction, out-of-order execution), is able to obtain (hopefully) an IPC
greater than 1. Note: it does not have to obtain an IPC greater than 1;
rather, it has the enablements to accomplish that. Some have argued that
as long as more than one instruction is being fetched and decoded each
cycle, and that number is determined at run-time, it is a superscalar
processor. The name derives from the fact that before these multiple-fetch
processors came around, the only processors that had an IPC greater
than one were vector processors. Ergo, these scalar processors that provide
in some sense a capability previously reserved for vector processors were
not simply scalar processors; they were super scalar processors.
- backup registers *
- loop buffers *
- software managed cache *
- stripmining *
- Flynn's bottleneck
- As long as we fetch one instruction at a time into our
pipeline, we can never get an IPC at the other end of the pipeline greater
than one.
- loop unrolling (software pipelining)
- The name given to the process of
executing multiple loop iterations before encountering a branch. For
example, a loop that iterates ten times could be unrolled once, such that
each iteration now consists of what was before two iterations. That is,
the code would do iteration one, iteration two, and then test to see whether
to branch back. The technique is useful in those cases where results produced
in one iteration are needed by the next iteration, wherein if the loop is
unrolled the whole second iteration does not have to wait for the completion
of the whole first iteration.
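The ten-iteration example above, unrolled once, looks like this (a simple sum is used as the illustrative loop body; the unrolled version assumes the trip count is even, which is what stripmining cleanup code handles in general):

```python
# Unrolling a 10-iteration loop once: each pass through the unrolled body
# does two of the original iterations, halving the number of branch tests.

def rolled(n):
    total = 0
    for i in range(n):          # one branch test per original iteration
        total += i
    return total

def unrolled_by_two(n):
    total, i = 0, 0
    while i < n:                # one branch test per TWO iterations
        total += i              # what was iteration one
        total += i + 1          # what was iteration two
        i += 2
    return total

a = rolled(10)
b = unrolled_by_two(10)         # same result, half the branches
```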