Exam 2 Annotated Buzzwords
Spring 2007
Note: buzzwords marked with an asterisk (*) are annotated in the
Fall 2005 Exam 2 Buzzwords (except for cache buzzwords which are in Fall 2005 Exam 1 Buzzwords).
Interrupts and Exceptions
- interrupt *
- polling *
- exception *
- interrupt/exception handler *
- precise exception *
- consistent state *
- priority/privilege
- Priority is about urgency; privilege is about the right
to do something. They are orthogonal issues. We sometimes confuse them in
our human experience because they often go hand-in-hand. Example: the
fire truck runs a red light to get to the fire. Running the red light is
about privilege. We give it that privilege because of the urgency of getting
to the fire. There are examples where they do not go hand-in-hand. Recall
my example in class of the mail clerk who had to go to the bathroom while
waiting for the elevator on the 55th floor. In computer technology, they
really are orthogonal. Real-time experiments may run at higher priority
because they need to sample the sensors at a precise moment in time, but
they cannot execute RTI, turn off the computer, or access certain files.
System administrators can do those things, even though they may run at a low
priority level: it is not urgent when they do their job, just that they do it.
- interrupt vector *
- exception vector *
- post increment (auto increment)
- This is an addressing mode that is common
on many older machines and a few more recent machines. The address of the
memory location containing the operand is in a register. After the address
is read, the register is incremented, so the next time that instruction is
executed, the register will point to the next data element. Side note: since
data usually occupies some number of bytes, the register is incremented by
the number of bytes of the operand.
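The behavior described above can be sketched in a few lines. This is a toy model, not any real machine's semantics; the word-addressed memory and register names are invented for illustration.

```python
# Toy sketch of post-increment (auto-increment) addressing: the register
# holds the address of the operand; after the access, the register is
# advanced by the operand size so it points at the next data element.

def load_post_increment(memory, regs, reg, size=4):
    """Read the operand at regs[reg], then bump the register by `size`."""
    addr = regs[reg]
    value = memory[addr]        # one operand per word in this toy memory
    regs[reg] = addr + size     # auto-increment by the operand size in bytes
    return value

memory = {0x1000: 11, 0x1004: 22, 0x1008: 33}   # illustrative word memory
regs = {"R1": 0x1000}

a = load_post_increment(memory, regs, "R1")   # reads 11, R1 now 0x1004
b = load_post_increment(memory, regs, "R1")   # reads 22, R1 now 0x1008
```

Re-executing the same "instruction" naturally walks through the array, which is exactly why the mode is convenient for loops.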
Memory Hierarchy (Caches)
- cache *
- allocate on write miss *
- write back(copy back)/ write through *
- spatial locality *
- temporal locality *
- cache line (block) *
- direct-mapped cache *
- fully-associative cache *
- set-associative cache *
- tag/data store *
- cache hit/miss *
- hit ratio *
- dirty bit *
- sector cache *
- replacement policy (LRU, FIFO, random, victim/next-victim) *
- virtually-indexed, virtually-tagged cache *
- physically-indexed, physically-tagged cache *
- virtually-indexed, physically-tagged cache *
- associative access *
- multiple levels of caches *
- inclusion property *
- register spill
- Registers are useful for storing values that will need to
be sourced again, rather than writing them to memory and having to read them
from memory again. More registers mean more values can be kept in registers.
Unfortunately there are never enough registers, so at some point we need to
write the values of some registers to memory in order to make those registers
available to hold new temporary values. We call that process of writing the
registers to memory to temporarily get them out of the way and make room
"register spill," and the compiler output that does this "register spill code."
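A minimal sketch of the idea, with an invented two-register machine: when both registers are live, the oldest value is spilled to a memory slot to make room.

```python
# Toy illustration of register spill: with only NUM_REGS registers, the
# oldest live value is written ("spilled") to memory to free a register.
# The allocator and its policy here are illustrative, not a real compiler's.

NUM_REGS = 2

def allocate(live, memory, name, value):
    """Place `value` in a register, spilling the oldest value if all are full."""
    if len(live) == NUM_REGS:
        spilled_name, spilled_value = live.pop(0)   # oldest live value
        memory[spilled_name] = spilled_value        # the "spill code" store
    live.append((name, value))

live, memory = [], {}
allocate(live, memory, "a", 1)
allocate(live, memory, "b", 2)
allocate(live, memory, "c", 3)   # registers full: "a" is spilled to memory
```

After the third allocation, "a" lives in memory and would have to be reloaded before use; that reload is the other half of spill code.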
- write back buffer (store buffer)
- Loads must be serviced as soon as possible,
since subsequent instructions are usually dependent on the data being loaded.
Stores, on the other hand, can take place when convenient. [Note: in a
multiprocessor, that may not be quite so simplistically stated, but that is
an issue we will deal with later this semester.] A store buffer allows the
processor to put values to be written to the side and not tie up bus
bandwidth unnecessarily. For example, on a load miss to a write back cache,
put aside the dirty cache line to write to memory until after the new cache
line is read in. For example, on a write through cache hit, we may want to
buffer multiple writes to the same cache line, so as to only have to do one
memory write and tie up bus bandwidth once. The write buffer is an extension
of the memory system and needs to be considered as such. That is, an access
to the cache also accesses the write buffer.
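The two points above (stores wait for a convenient time; loads must also check the buffer, since it is part of the memory system) can be sketched as follows. The class and its interface are invented for illustration.

```python
# Sketch of a write (store) buffer: stores are queued rather than sent to
# memory immediately, repeated writes to one address are coalesced, and a
# later load must check the buffer before memory, since the buffer holds
# the newest value.

class StoreBuffer:
    def __init__(self):
        self.pending = {}              # address -> youngest buffered value

    def store(self, addr, value):
        self.pending[addr] = value     # coalesce writes to the same address

    def load(self, addr, memory):
        # The buffer is an extension of the memory system: check it first.
        if addr in self.pending:
            return self.pending[addr]
        return memory[addr]

    def drain(self, memory):
        # Write buffered values to memory when the bus is otherwise idle.
        memory.update(self.pending)
        self.pending.clear()

memory = {0x10: 0}
sb = StoreBuffer()
sb.store(0x10, 42)
v = sb.load(0x10, memory)   # sees 42 from the buffer, not the stale 0
sb.drain(memory)            # memory now holds 42
```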
- uniprocessor cache consistency
- We probably also ought to note here that
the rest of the world would call this “coherency” not “consistency.” Normally,
we think of cache coherency only when we have more than one processor that can
write to its own private cache. We will deal with this more after the exam.
It is worth noting however that in a uniprocessor environment with intelligent
I/O devices, the I/O device (e.g., DMA controller) can independently write to
memory, in which case data in the cache may no longer be valid. Ergo, in an
environment with intelligent I/O devices, the cache must monitor independent
accesses to memory from the I/O subsystem in order to be sure that data in
the cache has not been invalidated by what the I/O device wrote to memory.
- multiprocessor cache consistency *
- direct memory access (DMA) *
Input/Output
- media *
- device *
- transducer *
- volatile/non-volatile
- Volatile storage loses its data when power
is turned off; main memory is volatile. Non-volatile storage retains its data
when power is turned off; CD-ROMs, disks, and tapes are all non-volatile.
- synchronous bus *
- asynchronous bus *
- carrier sense multiple access/collision detect (CSMA/CD)
- The way the Ethernet
works. The Ethernet is an asynchronous bus. Device controllers are attached
to the bus, and any controller can access the bus at any time. How do we keep
from transmitting garbage? Answer: CSMA/CD. Each controller senses the
carrier to see if anyone else is accessing the bus. If so, the controller
does not access the bus. All (multiple) the controllers are sensing the
carrier all the time. If no one else is using the bus, the controller is free
to put signals on the bus. However, due to the nature of the asynchronous bus
and the propagation time of those signals, controller A may think there is no
one else using the bus because the signals being propagated by controller B
may not have propagated all the way to A yet. Thus, controller A must
continue to monitor the bus to be sure it is always the case that the
signals controller A is putting on the bus are the same as the signals
controller A is monitoring. If they differ, controller A has detected a
collision (CD). In this case, controller A stops transmitting, waits a random
amount of time, and then tries again. Meanwhile, controller B will have
detected the collision due to A, so it stops transmitting, waits a random
amount of time, and then tries again. The hope is that randomA is not equal
to randomB. If they are equal, we repeat.
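One round of the collide-and-back-off behavior can be sketched as below. The timing model is deliberately crude (whole rounds instead of propagation delays), and the backoff range is invented; real Ethernet uses binary exponential backoff.

```python
# Toy CSMA/CD round: every controller with zero backoff transmits; if more
# than one transmits, what each drives differs from what it senses, so all
# of them detect the collision and pick a random number of slots to wait.

import random

def transmit_round(controllers):
    """Return the set of controller names that collided this round."""
    senders = [c for c in controllers if c["backoff"] == 0]
    if len(senders) > 1:
        for c in senders:
            c["backoff"] = random.randint(1, 4)   # random wait before retry
        return {c["name"] for c in senders}
    return set()

random.seed(1)   # deterministic for the sketch
controllers = [{"name": "A", "backoff": 0}, {"name": "B", "backoff": 0}]
collided = transmit_round(controllers)   # A and B both sent: collision
```

If A and B happen to draw the same backoff, they will collide again on the retry, which is the "if they are equal, we repeat" case above.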
- protocol
- Taken straight from our embassies. The sequence of actions that
must take place in order to transact business. In the case of an asynchronous
bus, business is the transmission of a value from A to B. Also called
handshaking.
- race condition
- A condition where the result depends on the unpredictable relative
propagation delays of signals from one device to another.
- arbitration *
- central arbitration *
- distributed arbitration *
- priority arbitration unit (PAU) *
- starvation
- A condition in which a controller may never get the bus it is
requesting. Recall the example of the daisy chain grant of the BG signal.
- bus cycle *
- bus master *
- slave *
- data lines *
- address lines *
- control lines *
- multiplex bus *
- pending bus *
- split-transaction bus *
- vertical priority *
- horizontal priority *
- device controller *
- bus request *
- bus grant *
- bus busy
- A control signal indicating that a bus cycle is in progress.
- daisy chaining *
- SACK *
- MSYN *
- SSYN *
Pipelining and Branch Prediction
- pipelining *
- pipeline stages *
- pipeline bubble *
- branch misprediction penalty *
- data dependency *
- flow dependency *
- anti-dependency *
- output dependency *
- control dependency *
- data forwarding *
- sequential programming model *
- in-order execution *
- scoreboard *
- out-of-order execution *
- dual fetch
- delayed branch
- A branch instruction whose redirection of the instruction stream does
not take effect immediately. That is, the semantics of the branch
instruction is: execute n more instructions and then branch.
- delay slot
- The number of instructions (n) between the branch and when it takes
effect. Also refers to the instructions contained in those n
locations after the branch, which will execute before the branch redirects
the instruction stream.
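The semantics can be made concrete with a toy executor for a one-slot delayed branch. The instruction encoding here is invented purely for illustration.

```python
# Toy model of a 1-slot delayed branch: the instruction right after the
# branch (the delay slot) executes, and only then is the PC redirected.

def run(program):
    pc, trace, pending_target = 0, [], None
    while pc < len(program):
        op = program[pc]
        next_pc = pc + 1
        if pending_target is not None:      # the branch takes effect now,
            next_pc, pending_target = pending_target, None   # after the slot
        if op[0] == "BR":
            pending_target = op[1]          # redirect AFTER one more instr
        else:
            trace.append(op[1])             # record which instr executed
        pc = next_pc
    return trace

prog = [("ADD", "i0"), ("BR", 4), ("ADD", "slot"),
        ("ADD", "skipped"), ("ADD", "target")]
trace = run(prog)   # the slot instruction runs; "skipped" never does
```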
- fall through problem
- performance metric
- The basic equation for performance is:
1 ÷ (N × CPI × t)
N is the number of instructions in the program being measured. CPI is
the average number of cycles it takes to complete (retire) each instruction.
t is the cycle time of the clock. Note that if I make the clock cycle
smaller, I probably increase CPI. CPI vs. t is a fundamental tradeoff. Note
also that t is 1/frequency, and CPI is 1/IPC. Most machines quote their IPC
values (bigger is better), although the microarchitect is more interested in
CPI, since the microarchitect is interested in eliminating some of the cycles
that contribute to the CPI of an instruction. Recall, when we talked about
virtual memory, I mentioned that the VAX-11/780 had a CPI of 10.5, that each
VAX instruction on average made one memory reference, and that the page table
translation mechanism took 22 cycles. The reason the CPI is only 10.5 is
that there is a TLB on that machine which takes no extra time when it hits.
Since it hits 95% of the time, only 5% of the time does the 22-cycle
translation take place. 5% of 22 cycles is approximately one cycle. That is,
approximately one cycle of the 10.5 CPI is due to virtual-to-physical
address translation.
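The equation and the TLB arithmetic above can be checked directly:

```python
# Execution time is N * CPI * t; performance is its reciprocal.

def exec_time(n_instructions, cpi, cycle_time):
    return n_instructions * cpi * cycle_time

t = exec_time(100, 2, 1)   # 100 instructions at CPI 2, unit cycle time

# VAX-11/780 figure from the note: one memory reference per instruction,
# a 22-cycle page-table translation, and a 95% TLB hit rate, so only a
# 5% miss rate pays the 22 cycles.
miss_rate = 0.05
translation_cycles = 22
avg_translation_cpi = miss_rate * translation_cycles   # ~1.1 cycles of CPI
```

The roughly 1.1 cycles is the "approximately one cycle" of the 10.5 CPI attributed to address translation.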
- branch target buffer (branch history table)
- A mechanism that is helpful
in predicting branches. It generally contains the target address of a taken
branch plus a bit to tell the instruction fetch mechanism whether to predict
taken or the fall-through path. The prediction can be used to select the
source of a mux whose output gets loaded into the PC for the next cycle. If
the prediction is taken, we load the address from the BTB. If we predict not
taken, we load the existing PC plus the instruction size in bytes.
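The mux selection described above amounts to a few lines. The BTB is modeled here as a dictionary keyed by PC; a real BTB is a small tagged cache, and the 4-byte instruction size is an assumption.

```python
# Sketch of next-PC selection with a BTB: on a hit that predicts taken, the
# PC gets the stored target; otherwise it gets the fall-through address.

INSTR_SIZE = 4   # assumed fixed instruction size in bytes

def next_pc(pc, btb):
    entry = btb.get(pc)                  # hit: {"target", "predict_taken"}
    if entry and entry["predict_taken"]:
        return entry["target"]           # mux selects the BTB target
    return pc + INSTR_SIZE               # mux selects the fall-through path

btb = {0x100: {"target": 0x200, "predict_taken": True}}
taken = next_pc(0x100, btb)      # BTB hit, predicted taken
fallthru = next_pc(0x104, btb)   # BTB miss: PC + instruction size
```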
- branch prediction *
- always taken/always not taken *
- compile-time predictor *
- run-time predictor *
- backward taken forward not-taken (BTFN) *
- last time predictor *
- 2-bit saturating counter *
- 2-level adaptive branch predictor
- The branch predictor we invented in 1990,
where the branch decision is based on what the processor did in previous
cycles when its history was the same as the current history. We need a
history register to store the current history, and a table of two-bit
counters to store what the processor did in the past when it had exactly
the same history.
- history register
- See above.
- pattern history table
- The table of 2-bit counters. If the history register
contains n bits, the table contains 2^n 2-bit counters.
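Putting the last three entries together, a minimal sketch of the two-level scheme with a single global history register (the table sizes and initial counter values are illustrative choices):

```python
# Sketch of a two-level adaptive branch predictor: an n-bit history register
# indexes a pattern history table of 2^n two-bit saturating counters.

HISTORY_BITS = 4

class TwoLevelPredictor:
    def __init__(self):
        self.history = 0                        # history register (n bits)
        self.pht = [1] * (2 ** HISTORY_BITS)    # 2^n counters, weakly not-taken

    def predict(self):
        return self.pht[self.history] >= 2      # counter in {2,3} means taken

    def update(self, taken):
        ctr = self.pht[self.history]            # train the counter this
        self.pht[self.history] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        mask = (1 << HISTORY_BITS) - 1          # shift the outcome into the
        self.history = ((self.history << 1) | int(taken)) & mask  # history

p = TwoLevelPredictor()
for outcome in [True] * 16:    # an always-taken branch
    p.update(outcome)
# After training, history is all ones and that PHT entry predicts taken.
```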
- profiling *
- representativeness *
Out-of-Order Execution
- Tomasulo's algorithm *
- reservation station *
- common data bus *
- tag *
- register renaming (aliasing)
- Recall that because we do not have
an infinite number of registers, we need to use the same register to hold
different values at different times in the flow of the program. Due to the
nature of out-of-order execution, it is likely that two instructions
that write to the same register could be active at the same time. Thus it is
important to separately identify each INSTANCE of each destination write. We
do this by assigning a tag (usually a physical register number) to each
destination operand. In Tomasulo's time, that tag was referred to as an alias.
Recall our figure in class – the Greek letters represented tags. Recall the
data flow example – the tags correspond to arcs.
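A minimal sketch of the tag assignment: each destination write gets a fresh tag, and readers pick up the tag of the newest instance. The free-running tag counter stands in for a real free list of physical registers.

```python
# Sketch of register renaming: every write to an architectural register is
# given a fresh tag (physical register number), so two in-flight writes to
# the same register remain distinct instances.

class Renamer:
    def __init__(self):
        self.next_tag = 0
        self.alias = {}                  # architectural reg -> current tag

    def rename_dest(self, arch_reg):
        tag = self.next_tag              # fresh tag for this write instance
        self.next_tag += 1
        self.alias[arch_reg] = tag
        return tag

    def rename_src(self, arch_reg):
        return self.alias[arch_reg]      # a reader sees the newest instance

r = Renamer()
t1 = r.rename_dest("R1")     # first write to R1: tag 0
src = r.rename_src("R1")     # a dependent reader gets tag 0
t2 = r.rename_dest("R1")     # second write to R1: distinct tag 1
```

The two writes to R1 can now execute out of order without clobbering each other, because their tags (the arcs in the data flow graph) differ.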
- retirement (commit, completion)
- Important in an out-of-order execution
machine, where precise exceptions must be maintained. Instructions execute
when they are ready, but they must commit their results in the order of the
program (in program order) – hence the term "in-order retirement."
We use the word retire or commit to designate
the act of permanently changing the destination register.
- in-order retirement *
- reorder buffer (ROB)
- A convenient register file that allows destination
writes to take place in any order, but reads for the purpose of retirement
must be done in program order.
- architectural/physical register
- Architectural registers are what the
software knows about. Physical registers are internal storage that the
microarchitecture uses to maintain the correctness of the execution. The
re-order buffer for example is a file of physical registers.
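The last three entries fit together in a short sketch: out-of-order writeback into the ROB, in-order commit from its head into the architectural registers. The entry format is invented for illustration.

```python
# Sketch of a reorder buffer: results are written back in any order, but
# retirement (permanently updating architectural state) proceeds strictly
# in program order from the head of the buffer.

from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # entries in program order

    def allocate(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self, arch_regs):
        # Retire only completed instructions, oldest first.
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            arch_regs[e["dest"]] = e["value"]   # the permanent commit

rob, regs = ReorderBuffer(), {}
e1 = rob.allocate("R1")            # older instruction
e2 = rob.allocate("R2")            # younger instruction
e2.update(value=7, done=True)      # younger one finishes FIRST
rob.retire(regs)
snapshot = dict(regs)              # still empty: R1 blocks retirement
e1.update(value=3, done=True)
rob.retire(regs)                   # now both commit, in program order
```

Because nothing past the head has committed, an exception on the older instruction can still be taken precisely.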
- address unknown problem (memory disambiguation problem)
- A ST instruction
has an addressing mode that requires the contents of a register. A younger
LD instruction has a known address. Can the load take place before the store?
It depends on whether the addresses they are accessing are different. What
if the address associated with the store is not yet known? We have a problem:
maybe yes, maybe no. I called it the unknown address problem. Josh Fisher
carried the day with his name for it: memory disambiguation – the requirement
to disambiguate these two addresses. That is what everyone calls it today.
What to do about it:
- Conservative: Wait until the address of the store is
known. The plus is that it is simple. The minus is that it loses performance.
- Aggressive: Assume the addresses are different. The plus is you don't wait.
The minus is that if you are wrong, you need to throw away a lot of work and
start over. The logic to do that is substantial.
- memory order buffer (MOB)
- The structure that keeps track of the memory
accesses so as to be able to quickly tell if there is a disambiguation problem.
- data flow *
- data flow graph *
- nodes *
- edges *
Vector Processors
- vectorizable *
- vector processing *
- single instruction multiple data (SIMD) *
- array processing *
- vector register *
- vector length register *
- stride *
- vector stride register *
- vector chaining *
- load/store pipe
- scalar computer *
- superscalar processor
- A processor that fetches more than one instruction
at a time and, by virtue of the logic it entails (multiple decode, branch
prediction, out-of-order execution), is able to obtain (hopefully) an IPC
greater than 1. Note: it does not have to obtain an IPC greater than 1;
rather, it has the enablements to accomplish that. Some have argued that
as long as more than one instruction is being fetched and decoded each
cycle, and that number is determined at run-time, it is a superscalar
processor. The name derives from the fact that before these multiple-fetch
processors came around, the only processors that had an IPC greater
than one were vector processors. Ergo, these scalar processors that provide
in some sense a capability previously reserved for vector processors were
not simply scalar processors; they were super scalar processors.
- backup registers *
- loop buffers *
- software managed cache *
- stripmining *
- Flynn's bottleneck
- As long as we fetch one instruction at a time into our
pipeline, we can never get an IPC at the other end of the pipeline greater
than one.
- loop unrolling (software pipelining)
- The name given to the process of
executing multiple loop iterations before encountering a branch. For
example, a loop that iterates ten times could be unrolled once, such that
each iteration now consists of what was before two iterations. That is,
the code would do iteration one, iteration two, and then test to see whether
to branch back. The technique is useful in those cases where results produced
in one iteration are needed by the next iteration, wherein if the loop is
unrolled the whole second iteration does not have to wait for the completion
of the whole first iteration.
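The ten-iteration example above, unrolled once, looks like this (a simple sum is used as the illustrative loop body; the unrolled version assumes the trip count is even, which is what stripmining cleanup code handles in general):

```python
# Unrolling a 10-iteration loop once: each pass through the unrolled body
# does two of the original iterations, halving the number of branch tests.

def rolled(n):
    total = 0
    for i in range(n):          # one branch test per original iteration
        total += i
    return total

def unrolled_by_two(n):
    total, i = 0, 0
    while i < n:                # one branch test per TWO iterations
        total += i              # what was iteration one
        total += i + 1          # what was iteration two
        i += 2
    return total

a = rolled(10)
b = unrolled_by_two(10)         # same result, half the branches
```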