- profiling
- the act, usually as part of the compilation process, of executing
a program on a sample input data set to get a feel for the behavior of branches
and other run-time activities. The intent is to have the compiler generate a
better object file as a result of this information. Whether profiling provides
value or not is highly dependent on how close the sample input data set
resembles the actual input data set when the program is executed for real. We
call this "closeness" representativeness.
- representativeness (of data)
- see profiling.
- compile-time (static)
- that which is done before the program is actually run.
For example, a compile-time branch predictor makes its prediction based on
information that is available before the program starts execution.
- runtime (dynamic)
- that which is done while the program is running. For
example, a run-time branch predictor makes its prediction based on information
that is collected while the program is being executed. It is important to
note that in the old days, there was a clear distinction between compile time
and run time. That is, the program was compiled. Then the program was
executed. Today, more and more research has gone into the notion that the
program should be amenable to re-compiling while it is running. That is,
once the hardware knows that a decision that was made at compile time was
a bad decision, the hardware ought to be able to reinvoke the compiler with
this information and generate new object code that takes advantage of this
new information. This concept is still in its infancy.
- levels of transformation
- my term for the transformations that must take place
in order to get a problem stated in some natural language like English, Spanish,
Gujarati or Tagalog to be solved by the electrons that do the work. Problem -->
algorithm --> high level or assembly language program --> object code in the
ISA of the machine --> control signals that drive the microarchitecture -->
logic --> circuits --> electrons.
- algorithm
- a step by step procedure, wherein each step is definite (precisely
defined), effectively computable (able to be carried out by a computer), and
wherein the procedure is guaranteed to terminate (finite). For a problem to
be solved with a computer, one must be able to construct an algorithm to solve
it.
- ISA
- Instruction Set Architecture. The specification of all aspects of the
interface between the software and the hardware; for example, instruction
formats, opcodes, data types, addressing modes, memory address space, memory
addressibility, etc. This allows the software to specify a sequence of
operations for the hardware to carry out (the program) and for the hardware
to know what each of those instructions is asking it (the hardware) to do.
For example: x86, Power-PC, SPARC, MIPS, VAX, Alpha are all ISAs.
- microarchitecture
- a particular data path and its control that carries out
the specification of an ISA. For example, the x86 ISA has been implemented
by Intel by the 8086, 386, Pentium, Pentium-Pro, Pentium 4, Pentium M, etc.
and by AMD by the Athlon, Opteron, etc.
- opcode
- the part of an instruction that specifies the operation(s) that the
hardware is expected to perform in carrying out this instruction.
- prefix property
- the property that a bit pattern is a prefix of a longer
bit pattern. For example, 1001 is a prefix of 1001000001111.
- addressing modes
- mechanisms for obtaining operands required by an instruction.
For example, the LDW instruction obtains the address of the memory location
to be accessed by adding an offset to a Base Register. We say the addressing
mode is Base+offset. The JSR instruction computes the starting address of the
subroutine by adding an offset to the PC. We say the addressing mode is
PC-relative.
- data type
- a representation of information for which the ISA provides
instructions that operate on that representation. For example, in the LC-3b,
the ADD instruction operates on values represented as 2's complement integers.
We say the 2's complement integer is a data type. There is no floating point
data type in the LC-3b ISA since there is no instruction in the LC-3b that
operates on values represented as floating point numbers. Note that we can
still operate on floating point representations by writing a subroutine to
operate on these representation. For example, if the LC-3b did have a FLOATADD
instruction, the LC-3b could compute a floating point add in ONE instruction.
Since it does not, one could write a procedure that isolates the exponent and
fractional part of each value, compares the exponents, shifts right the smaller
fraction accordingly, does the add, and renormalizes the result. However, such
a procedure would take tens of LC-3b instructions.
- addressability
- the number of bits stored at each memory location. Most
modern computers are byte addressable. That is, 8 bits are stored at each
location.
- byte-addressable
- see addressability.
- address space
- the number of distinct memory addresses. That is, the maximum
number of memory locations possible such that each can be specified by a
unique address. Older machines, the original x86, for example, had an
address space of 2^16. We sometimes say: the x86 had 16 bits of address
space. The original IBM 360, I believe, had an address space of 2^24.
Most ISAs today, have an address space of 2^32, and some have an address
space of 2^64.
- endianness (big-endian/little-endian)
- Suppose a 32-bit value is to be stored
in a memory having byte-addressability. Four memory locations are required.
We store the 32-bit value in four consecutive memory locations, say those
haveing addresses A, A+1, A+2, and A+3. The question still remains: which
8 bits of the value get stored in which location. If we store bits 31:24
into location A, 23:16 into A+1, etc., we say the memory system is big-endian.
If we store bits 7:0 into location A, 15:8 into A+1, etc., we say the memory
system is little-endian. That is, if the number of bits required to represent
a value is a multiple of the addressability of the memory, we store the value
in contiguous locations of memory, starting with the low-numbered address.
If we store the most significant bits at that address, we say we are storing
according to the big-endian convention. If we store the least significant
bits there, we say we are storing according to the little-endian convention.
- fixed length vs. variable length
- used to describe the number of bits used
to encode an entity, usually used to refer to instructions, although data can
also be fixed or variable length. If fixed length, the entity is always the
same size. If variable length, the size is determined by the specific entity
involved. LC-3b instructions are fixed length; every one consists of 16 bits.
The x86 instruction is variable length. Depending on the instruction, there
may be anywhere from one to 15 bytes constituting a single instruction.
- uniform vs. non-uniform decode
- fixed length instructions may have the
property that the various fields of the instruction always occur in the same
place, simplifying the decode process. We call such an encoding of the
instruction format: uniform decode. The LC-3b for example has the opcode
ALWAYS in bits 15:12 and the result operand always in bits 11:9.
- memory-mapped I/O
- I/O activity can be specified by special instructions in
the ISA or by using the load and store instructions, but assigning each I/O
device register an address from the memory address space. The latter is
referred to as memory-mapped I/O. For example, with memory-mapped I/O,
keyboard input is specified by a load instruction where the address is the
one assigned to the keyboard data register.
- operate instructions
- instructions that operate on data: ADD,SUB,XOR,SHR, etc.
- logically complete
- a set of logical primitives that allows one to specify
all logical functions. NAND, for example is by itself logically complete.
AND is not by itself logically complete. The set {AND, NOT}, however, is
logically complete.
- arithmetic/logical shift
- A right shift is conveniently used to perform an
integer divide by 2, where the value being divided is expressed in 2's
complement representation. Thus, we need to insert into the high bit position
the sign bit of the value. On the other hand, bit patterns are often
right-shifted for other (logical) purposes, in which case we need to insert
a 0 into the the high (left-most) bit position. Given these two uses, we
need to differentiate an arithmetic shift where we insert the sign bit into
the left-most position from a logical shift where we insert a 0 into the
left-most position. [Note that an arithmetic left-shift, corresponding to
multiply by 2, requires inserting a 0 into the right-most bit position,
exactly the same as what we would do for a logical left-shift. Therefore, we
generally do not differentiate arithmetic left shift from logical left shift.]
- 2-pass assembler
- A simple assembler that makes two passes at an assembly
language program, first to identify all the symbols and generate the symbol
table, and second, to generate the actual bit patterns of the code.
- pseudo-op
- a directive provided by the assembly language programmer to the
assembler to help the assembler translate the assembly language program to
the 0s and 1s of the machine's ISA. In LC-3b assembly language, .ORIG, .FILL,
and .STRINGZ are examples of pseudo-ops. Pseudo-ops are not executable at
run-time. They are strictly messages to help the assembler assemble the
program.
- label
- a symbol that identifies the location of an instruction. Necessary
if that location contains the targe of a branch or a location that contains
data that will be accessed by a load or a store instruction.
- data movement instructions
- instructions that move data. Loads/stores to/from
memory and I/O instructions.
- control instructions
- instructions that redirect the instruction stream.
The default in instruction processing is to increment the PC to point to
the next instruction. If the next instruction is other than the next
sequential instruction, a control instruction is needed to provide the new
PC. The LC-3b control instructions are BR, JMP, JSR(R), TRAP, and RTI.
- condition codes
- single-bit registers (often referred to as "flags") which
provide additional information about the result of the last instruction. The
LC-3b has three condition codes (N,Z,P). Upon completion of an operate
instruction, the condition codes specify whether the result of that operate
was negative (N=1,Z=0,P=0), zero (N=0,Z=1,P=0), or positive (N=0,Z=0,P=1).
Similarly, upon completion of a load, the condition codes specify whether the
value loaded into the destination register was negative, zero, or positive.
Most real ISAs also have C and V condition codes. C designates the carry
out of an add instruction. V designates whether the previous integer
arithmetic instruction resulted in an overflow condition. One example is the
case where the source operands of an ADD are both positive 2's complement
numbers, but the result produced by the ALU is the representation of a
negative number. [Say, I have 4-bit integers. +6 = 0110, +5 = 0101. If I
tell the ALU to add them, the output would be 1011, which is the representation
for -5. The ALU is saying +6 plus +5 = -5. Ergo, set V.]
- 0, 1, 2, 3 address machine
- An n address machine is an ISA where n indicates
the number of operands that need to specified EXPLICITY for binary operations
in the ISA. The ADD is a good example of a binary operate instruction. In a
0 address machine, none need to be specified, since the microarchitecture
obtains both sources from the top of the stack, and pushes the result onto
the stack. In a 3 address machine, like the LC-3b, for example, both sources
and the destination ALL need to be specified explicitly (ADD R1,R2,R3). The
x86 is a 2 address machine.
- load/store ISA
- An ISA, where operate instructions can only operate on operands
that are stored in the registers. That is, one can not load a value from
memory and operate on that value in the SAME instruction, for example. The
LC-3b is an example of a load/store ISA. The x86 is not.
- fetch
- the process of accessing an instruction from the instruction memory.
- hardware interlock
- a hardware signal that prevents improper behavior from
occurring in the execution of instructions. For example, consider the
pipelined processing of a MUL that stores to R3, immediately followed by an
ADD that sources R3. At the time the ADD wishes to source R3, the value
stored there is the old value, not the value that will be produced in a few
cycles by the MUL instruction. A hardware interlock (called a scoreboard bit)
prevents the ADD from sourcing R3 until after the MUL stores its result there.
- semantic gap
- an old term, used to describe the closeness of fit between the
instructions in a high level language and the opcodes in an ISA. For example,
the VAX had an opcode AOBLEQ which added 1 to an operand, and branched if the
result of the add was less than or equal to some limit operand. This
instruction was clearly in the ISA to accomodate "for" loops. The semantic
gap between the "for" and AOBLEQ was tiny. In the old days, there was strong
interest in specifying opcodes that had a small semantic gap relative to the
high level languages commonly in use at that time. That interest has waned
during the past 20 years, but there is good reason why we may see it return
in the future.
- microsequencer
- the mechanism in the microarchitecture that figures out
what control signals will be needed to process during the next clock cycle.
- control store
- a private hardware structure for storing the set of control
signals needed in a particular clock cycle. In a microprogrammed machine,
the job of the microsequencer is to identify the location in the control
store that contains the control signals for the next clock cycle.
- microinstruction
- the set of control signals needed in one clock cycle.
In a microarchitecture with a control store: the contents of one location
in the control store.
- datapath
- the part of the microarchitecture that processes the information.
The register file, alu, and various muxes are all part of the data path.
- state machine
- the structure that specifies, for each state of the machine,
and each set of relevant values (opcode, interrupt present, memory ready, etc.)
what set of control signals should be invoked during the next clock cycle.
- VLIW
- Very Long Instruction Word. A processing paradigm wherein the compiler
generates a long word, consisting of more than one instruction, all stored
at the same PC address. At instruction fetch, all instructions in that long
word are fetched, decoded, and executed concurrently in lock step. To work,
it is necessary that there be no flow dependencies among the instructions in
a single long word. This paradigm provides benefit if the compiler can find
at compile time multiple instructions that have no flow dependencies, and
therefore, can execute concurrently. The number of instructions n in the
long word (VLIW instruction) is fixed. Therefore, if the compiler can not
find n independent instructions to put into one VLIW instruction, it must
fill the remaining slots with no-ops.
- compatibility
- a term used to describe the ability of new implementations of
an ISA to execute code written for an older implementation. Intel has been a
master at making sure that every new x86 implementation is able to execute
correctly all code written for all older x86 implementations. As time
passes, it becomes clear that some old opcodes, addressing modes, data types
are no longer useful, and some new opcodes, addressing modes, and data types
are necessary. In fact, the addition of the SSE1 and SSE2 instruction set
to the x86 ISA is a reflection of the new uses of that ISA. However, Intel's
insistence on compatibility means that in addition to the new SSE extensions,
new implementations also have to include in their microarchitecture the
capability to execute the old opcodes, addressing modes, data types, etc.
since some customers may want to buy a new laptop but still use their old
software.
- pipelining
- the process of carrying out instructions in an assembly line
fashion. Suppose instructions are fetched one per cycle. Then in a pipelined
implementation, in one cycle the fetch logic fetches an instruction, and
hands it off to the decode unit to decode in the next cycle. In that next
cycle, the fetch logic will be fetching the next instruction.
- interrupt
- an external event more urgent than the program running that causes
the machine to stop execution of the program that is running and turn its
attention to service the external event. Keyboard interrupt is an example of
a low priority interrupt. Machine check is an example of a high priority
interrupt.
- exception
- an event internal to the program that is running that must be
handled to properly deal with the instruction. Page fault, access control
violation are two examples of exceptions.
- precise exceptions
- if an instruction causes an exception, it is often
important to recover the machine state that corresponds to the situation
where all instructions preceding the exception-causing instruction have
executed correctly and no instruction following the exception-causing
instruction has executed. It is also useful to identify the instruction
that caused the exception. A microarchitecture that can do both is said
to support precise exceptions. The Tomasulo Algorithm running on the
IBM 360/91 did not support precise exceptions.
- critical path design
- the cycle time of a processor is one of the key
items that specifies the performance of the processor. Recall that
performance is to a first approximation determined by
1/(length * CPI * cycle time)
Decreasing cycle time has a major impact on performance. Cycle time is
the maximum of the various logic paths in the data path and its control.
Critical path design instructs one to examine the longest path in the
machine and attempt to reduce it, either by more clever logic design or
by latching a path in the middle and turning a long single-cycle operation
into a two-cycle operation.
- bread and butter design
- neither the supply of transistors on the chip nor
the design time of the designer are infinite. With finite resources, bread
and butter design suggests applying both to the chip's bread and butter, that
is, those opcodes, addressing modes, and data types that are important to
optimize for the expected usage of the chip.
- balanced design
- the whole point of a pipelined implementation is to allow
the various parts of the microarchitecture to do their respective things
concurrently. It makes no sense to provide capability in one stage of the
pipeline that goes unused in the next stage. Better to use the extra
transistors required to provide that capability for something more useful.
Example: a fetch engine that can fetch and decode 6 instructions each cycle,
but only has a single ALU for executing these instructions is not a balanced
design. There is no point wasting resources to be able to fetch and decode
6 instructions concurrently if the instructions just sit in a queue downstream
waiting for the single ALU.
- SRAM
- Static random access memory. An SRAM cell stores a single bit of
information with two cross-coupled inverters. The feedback path maintains
the state of the bit. As long as power is supplied to the two inverters,
the SRAM cell maintains its value. The SRAM cell is also referred to as
volatile memory, since if power is removed from the inverters, the bit that
is stored is lost. [Cross-reference, non-volatile, where the bits retain
their values even when power is removed, as in a disk or tape.] The minimum
number of transistors for an SRAM cell is 4 -- two inverters, with 2 transistors
per inverter. However, most SRAM cells use more transistors to provide for
other features, such as switching speed, gating, noise immunity, for example.
- DRAM
- Dynamic random access memory. A DRAM cell stores a single bit of
information as the absence or presence of charge on a capacitor. The
capacitor charged is one value, the capacitor uncharged is the other value.
A single transistor is required to read (sample the charge) and write (store
the charge) the bit. Unfortunately, the capacitor will not maintain its
charge indefinitely. There are RC paths to ground that will leak the charge.
Therefore, a DRAM cell requires periodic refresh. That is, periodically,
capacitors must be sampled and returned to their fully charged or uncharged
state.
- refresh
- See DRAM cell.
- page mode
- Typically DRAM chips are addressed in two steps. Consider the
DRAM chip as containing 2^n addresses. The high order n/2 bits (Row) come in
first, using RAS. This effectively preselects the contents of all addresses
having the same row address bits. Then the low order n/2 bits (column) of the
address comes in, using CAS. This selects the exact address from the 2^(n/2)
that have identical row address bits. Access time is the sum of RAS + CAS
times. If the next address has the same row address bits as the previous one,
the memory controller, which keeps track of this, only needs to send the
column address bits to the DRAM chip, thereby saving the time needed to latch
the row address bits. We say that all addresses having the same row address
bits are on the same "page," and the act of the memory controller exploiting
this by only sending subsequent column address bits, we refer to as
"page mode."
- row address strobe (RAS)
- see page mode.
- column address strobe (CAS)
- see page mode.
- chip enable
- output pins of a chip float and have no effect if the chip
enable bit is not asserted.
- parity
- a coding scheme that requires only one extra bit to detect a single
bit error, provided that all single bit errors are statistically independent.
"Even" parity dictates that this extra bit must be set to 1 or 0 such that the
entire n+1 bit transfer contains exactly an even number of 1s. In that way,
if the received n+1 bits contain a single error, the number of 1s will be
odd, so the fact that an error has occurred will be detected. Statistical
independence of the bit errors is necessary, since if two bits are in error,
parity will not detect it.
- ECC
- error correction code. a code which allows a single bit error to be
not only detected, but also corrected without requiring the bits to be
retransmitted. See my handout on the subject.
- Hamming Code
- the mechanism for performing ECC correction. See my handout.
- CRC check
- cyclic redundancy check. A mechanism for detecting burst errors,
wherein the individual bit errors are not statistically independent.
- memory bank
- a unit of memory that can process one memory access at a time.
- interleaving
- allocating addresses to memory locations "horizontally,"
rather than "vertically." That is, if address n is in bank 0, address n+k is
in bank 1, address n+2k in bank 2, address n+3k in bank 3, etc. where k is
the size of the access in a bank in bytes (assuming byte-addressable memory).
- unaligned access
- an access to memory that is not aligned on an access
boundary, thereby requiring two accesses to perform the desired operation.
See my handout on unaligned accesses.
- working set
- the set of pages belonging to a process that is resident in
memory at an instant of time is called the process' working set at that
instant of time.
- balance set
- the set of processes that have their working sets in memory.
- swap in/swap out
- the act of bringing in or out the working set of a process.
If the working set of a process is stored back to disk, we say the process is
swapped out.
- page
- the unit of virtual memory that is transferred between disk and
physical memory. That is, the whole page is brought in or out, or none
of it.
- frame
- the physical space that a virtual page occupies in physical memory.
- resident
- a term used to describe the fact that a virtual page is actually
occupying a frame of physical memory.
- page in/page out
- the act of moving a virtual page from the disk to physical
memory (page in), or from physical memory to the disk (page out).
- thrashing
- the condition whereby there is not enough frames of physical
memory to hold all the virtual pages that a process needs, resulting in
an inordinate amount of time being wasted by pages coming in and going out,
only to come in again, etc.
- PTE
- page table entry. A descriptor for each page of virtual memory,
containing information, including the frame of physical memory where
it resides, whether it has been modified since it has been loaded from
the disk, and protection information as to what levels of privilege are
required to access all information stored on that page. Some ISAs include
other information in each page's PTE.
- page table
- a data structure containing the PTEs of all pages in the virtual
image of the process.
- mapping virtual to physical addresses
- the transformation of a virtual address
to the physical location in memory where the information at that virtual
address is actually stored.
- PSR
- processor status register. A processor register that contains status
information about the running process, including its privilege level and
priority.
- privilege level
- every ISA defines a set of privilege levels, from unprivileged
to most privileged to distinguish what level of privilege is required to make
certain accesses. The simplest model has two levels of privileged: privileged
and unprivileged.
- protection
- the concept of protecting a computer resource from being utilized
by a process that does not have the desired level of privilege. For example,
a page can be protected so that only the highest privileged processes can
write to that page. Another page may be protected so that any process can
write to it. Some instructions can only be executed by processes that have
a certain level of privilege. For example, on machines that have a HALT
instruction, unprivileged processes are generally not allowed to execute
the HALT instruction ...and allowed to remotely stop the computer.
- virtual memory
- memory that a process thinks it has. Generally, the size of
the virtual address space.
- physical memory
- memory that actually exists.
- address translation
- conversion of a virtual address to the physical address
where the virtual location actually exists, so that the processor can
read/write values into that location.
- VPN
- virtual page number. The high bits of a virtual address; it specifies
the page number.
- PFN
- page frame number. The high bits of a physical address; it specifies the
frame number within physcial memory.
- TLB
- translation lookaside buffer. It is a cache of PTEs. It allows
translation to take place without going through the involved process of
"walking" through the page tables. That is, within a fraction of a cycle
instead of tens of cycles.
- valid bit
- a bit determining whether other bits are accurate. In the case of
the PTE, the valid bit tells whether the hardware can believe the PFN bits,
which really says whether the page is resident or not. If V=0, the page is not
resident, forcing the hardware to take a page fault.
- reference bit
- a mechanism used for page replacement. If the reference bit
mechanism is used for page replacement, each PTE contains a reference bit.
Periodically, all reference bits in the page table are set to 1. When a page
is accessed, its reference bit is cleared. When a page fault occurs, the
page chosen to be kicked out to make room is taken from the set of
pages whose reference bits are still 1.
- modify bit
- a bit in the PTE which is set when a page is written to. If the
Modify bit is clear, the page need not be written back to disk on replacement
since it has not changed since being read from the disk.
- page table base register
- the starting address of the page table. Used to
compute the address of a PTE.
- limit register
- the number of PTEs in the page table. Used to determine if
a virtual address actually corresponds to an address in the virtual image.
- system space
- virtual address space that is common to all processes, and is
used to house the data structures and code of system related functions.
- user space
- virtual address space specific to a process. Sometimes called
process space. Each process has its own version of process space.
- access control violation (ACV)
- an exception caused by one of two things.
Either the address was ill-formed and does not belong to the virtual address
space of that process (limit violation) or the privilege of the process does
not allow the access specified (protection violation).
- page fault (TNV)
- Translation not valid. the virtual address being
translated is not resident in physical memory.
- segmentation
- a protection mechanism to protect regions of memory for different
kinds of accesses.
- segment register
- a register used in conjunction with the address provided by
the program to produce an address which is either a physical address that can
be accessed directly or a virtual address which needs to be further translated
to a physical address.
- linear address
- Intel's name for virtual address. That is, a segment register
provides a base address which is combined with the address provided in the
program to form the linear address. The linear address is further translated
to a physical address using PTEs.
- cache memory
- a small, fast-access memory structure that stores a subset of
the physical memory, thereby reducing memory access time if the desired location
is present in the smaller structure. Since the cache only stores a subset of
the physical memory, it must store both the addresses of the locations it
contains as well as the data at those locations. The granularity of a cache
entry is a called a cache line (or synonymously, a cache block).
- spatial locality
- the property that if a location in memory is accessed, it is
likely that locations close by will be accessed soon.
- temporal locality
- the property that if a location in memory is accessed, it is
likely that location will be accessed again soon.
- cache line (block)
- the granularity of storage, see cache memory. This size is
generally greater than the size of a single access, due to our faith that
spatial locality exists.
- tag, index, offset bits
- A memory address is partitioned into three parts: its tag bits, its index bits
and its offset bits. The offset bits define which location within the cache
line is being accessed. The index bits are used to index into the tag store,
and specify the set of locations in the cache available for storing the
particular line. The tag bits are stored at the corresponding index in the
tag store to verify that the actual line stored in the cache is the one
desired. That is, all blocks/lines from memory having identical index bits
in their addresses compete for the same entries in the cache. The tag bits
are used to specify which of these many entries are actually stored in the
cache.
- direct-mapped cache
- a cache characterized by the property that if a line of
memory is stored in the cache, it can only be stored in one place.
- fully-associative cache
- a cache characterized by the property that if a line
of memory is stored in the cache, it can be stored in anyplace in the cache.
- set-associative cache
- a cache characterized by the property that a line of
memory can be stored in any of k entries in the cache. We say such a cache
is k-way set-associative, and the set size is k.
- tag/data store
- The tag store is the part of the cache that keeps all the
bookkeeping information associated with the blocks of memory that are
actually stored in the cache. Most important of these is the identity of
the actual block. The data store is the storage in the cache allocated for
storing the actual information that is contained in the corresponding lines
of memory. When we say a cache is of a certain size, we are referring to
the size of the data store. [A system having 16MB of memory and a 64KB cache
stores 64KB out of the 16MB in the cache's data store. We do not say how much
tag store is required. The tag store is just management overhead to allow the
cache to function properly.]
- cache hit/miss
- on a load or a store, the cache is queried to determine if
the location's information is contained in the cache. If yes, we say it is
a cache hit. If no, we say it is a cache miss.
- compulsory miss, conflict miss, capacity miss
- The three common causes of a
cache miss. A compulsory miss occurs the first time the line is accessed,
unless the line has been prefetched into the cache. No other way the line
could have gotten into the cache. A capacity miss occurs if a line that used
to be in the cache had been kicked out simply because there is not enough room
in the cache. A conflict miss occurs if a line was kicked out, not because of
the size of the cache, but rather because of the size of the set. That is,
the cache was not full, but the set for which the line competes was full.
Fully associative caches do not have conflict misses.
- hit ratio
- the percentage of accesses that are hits. That is
cache hit ratio = hits/(hits + misses)
- writeback/write through
- a property of caches. A write through cache is one
in which every write to the cache is also written to memory. In this case,
memory and the cache are always consistent. On replacing a block, it is
unnecessary to write that line back to memory. A write back cache is one
in which writes to the cache do NOT also include writes to memory. Therefore,
if a cache line is to replaced, its dirty bit is checked to see whether any
writes to that line have occurred since it was brought in from memory. If the
dirty bit is set, indicating that writes have occurred, then before the line
can be removed from the cache. it has to be written back to memory first.
- dirty bit
- in a write back cache, a bit that specifies whether the information
in the cache line has been changed since it was read in from memory. If not,
then on replacement, this line need not be written back to memory.
- allocate on write miss
- the act of setting aside space in the cache for data
to be stored, if the line written to is not in the cache.
- sector cache
- the characteristic of a cache whereby each cache line is
partitioned into sectors, whereby each sector is individually valid. Thus,
if a cache line is partitioned into four sectors, there are four valid bits
in the tag store entry for that line. Useful in those cases where one wishes
to allocate on a write miss, but not read in the entire line from memory
because it is likely that the entire line will be completely overwritten
before it is ever read.
- replacement policy (LRU, FIFO, random, victim/next-victim)
- algorithm to
determine which line within a set is to be evicted if a cache miss occurs, and
one of the lines in the set must go. The line to be evicted is often called
the victim. The line that will become the victim after the victim is evicted
is the next-victim.
- key transformation
- a fancy synonym for hashing.
- virtually-indexed, virtually-tagged cache
- a virtually addressed cache.
- physically-indexed, physically-tagged cache
- a physically addressed cache.
- virtually-indexed, physically-tagged cache
- a cache wherein the index bits
used to access the tag store are obtained from the virtual address, but the
tag bits actually stored in the tag store are physical address bits. This
allows TLB and tag store accesses to be made concurrently.
- cold start effect
- in general, the transient effect that occurs in most things
before a system is in service long enough to be operating in a stable mode.
In the case of a cache, this refers to the case where the cache is completely
empty and the first several accesses are compulsory misses.
- RAM
- Random Access storage. Given an access, the time to perform the next
access is independent of the location of that access relative to the
preceding access.
- direct access storage
- access is random to a starting address, and then
sequential locations are accessed from that point on. Most common type
is a disk.
- sequential access storage
- access is always sequential to the next location.
Most common type is a tape.
- associative access
- access is based in part at least on the contents of the
storage location. That is, part of the address is actually stored in the
storage location. Two common examples that you have already seen of this are
(1) the tag store of the cache, where the entry contains the tag bits, which are
part of the address, and (2) the TLB where each entry is a page number and its
corresponding PTE. The page number is the associative part of the access.
- CAM
- Content addressable memory. Memory that is accessed via associative
access.
- context switch
- the act of stopping the execution of instructions from one
process and starting the execution of instructions from another process.
Before the processor can do that, it needs to save the registers associated
with the process being stopped, and load the registers associated with the
process being started. This process of saving and loading the context is
referred to as a context switch.
- cache consistency/coherency
- in a shared memory multiprocessor system, where
each processor has its own cache, it is important that the contents of a
single location that is stored in multiple caches has the same information in
all caches that store that location. If yes, we say the cache is coherent
with respect to that location.
- multiple levels of caches
- A vast disparity exists between the fast access
needed by the cache that is dealing with the processor (say, one or two cycles)
and the very slow access that is available from memory (say, hundreds of
cycles). It is also the case that in order for the cache to be accessible in
one or two cycles, it must necessarily be small. Caches are either fast or
large, they can not be both. This model would result in too many misses,
resulting in too many long accesses to memory. The solution is to have
multiple levels of cache: L1 is small, but takes one or two cycles, L2 might
take 6 or 8 cycles but can be much, much larger. Sometimes there is an L3
cache, which is slower still and larger still. Finally, there is memory --
largest, but slowest. The notion is to extend the notion of the memory
hierarchy to within the levels of cache.
- inclusion property
- In a shared memory multiprocessor, it is important to have
cache coherency across all the caches. That could mean each processor's caches
monitoring the traffic associated with all the other processors' caches. It
would degrade performance to have the L1 cache involved in this monitoring
activity. One solution is to obey the inclusion property, which states that
any location present in a processor's L1 cache is also present in that
processor's L2 cache. In such cases, the L2 caches can do the monitoring
for cache coherency, freeing the L1 caches from that task.
- science of tradeoff
- My tongue in cheek phrase to emphasize the importance
of tradeoffs to the discipline of computer architecture. Clearly, computer
architecture is more art than science. Science, we like to think, involves
a coherent body of knowledge, even though we have yet to figure out all the
connections. Art, on the other hand, is the result of individual expressions
of the various artists. Since each computer architecture is the result of the
individual(s) who specified it, there is no such completely coherent structure.
So, I opined if computer architecture is a science at all, it is a science of
tradeoffs. In class, we keep coming up with design choices that involve
tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture.
- latency
- how long it takes something to happen. memory latency -- the time it
takes from when the address is provided until the information stored at that
address is available. We will talk about interrupt latency after the spring
break -- the time it takes from the time some external device requests
service until the time that device starts getting serviced.
- PLA
- Programming Logic Array. A combinational structure consisting of an
array of AND gates all producing outputs which are fed to an array of OR gates.
The general structure of a PLA sources all external inputs to each AND gate,
making each AND gate able to identify one input combination (row of a truth
table). Also, in the general structure, all AND gate outputs are available
as input to every OR gate. Those actually connected as inputs to an OR gate
are those that produce a 1 in the output column of the truth table of the
relevant logic function. Thus, in principal, any arbitrary logic function
of the external inputs can be implemented as the output of one of the OR gates.
In practical structures, the fan-in (maximum number of inputs) to an OR gate
is fixed, thereby limiting the ability of the PLA to implement all
combinational logic functions.
- shift and add multiplication algorithm
- Recall your experience with the
multiplication of decimal numbers. We systematically multiply the multiplicand
by each digit of the multiplier and write each corresponding temporary result
right-aligned with the corresponding multiplier digit. Then we add up these
temporary results, producing the product. In multiplying binary numbers, the
only digits are 1 and 0 so the multiplication step consists of simply copying
the multiplicand or writing all zeroes as each temporary result. If we are
multiplying n bit numbers, we get n temporary results. To add these temporary
results properly, we need to take into account their relative alignment.
This can be done by n-1 steps where each step includes the sum of all
temporary steps before it. That is, after step 1, we have correctly added the
temporary results for bits 1 and 2. After step i, we have correctly added
the temporary results for bits 1, 2, ... i+1. The (i+1)th step consists of
SHIFTING right the (i+2)th temporary result one bit so as to align it with the
sum of the first i steps and then ADDING the two quantities. Thus, the name
shift and add to describe binary multiplication.
- static instruction stream
- The program as it sits in memory, or rather out
on the disk. Each instruction generated by the compiler exists once in the
static instruction stream.
- dynamic instruction stream
- The *sequence* of instructions as they are
executed. A single instruction (located in a single address of memory)
counts once in the static instruction stream. Its count vis-a-vis the
dynamic instruction stream is the number of times it is executed. The
dynamic instruction stream (often called a *trace*) gives a record of the
order that the intructions were carried out.
- state of machine
- Computer processing is all about systematically changing
the state of the machine in response to each instruction in the dynamic
execution of the program. That is a program executes by the machine being
in an initial state, and one by one each instruction execution changes the
state in accordance with the semantics (specification) of the instruction.
The machine state consists of the PC, the PSR, the general purpose registers,
the condition codes (if the ISA specifies condition codes), the contents of
each memory location, and the mapping registers that enable the memory
management process. More sophisticated ISAs provide additional state.
- design point
- a specification of the criteria to be optimized in the
design process. If we are interested in performance, or reliability, or
availability, or low cost, we colloquially say our design point is
performance, or reliability, or ...etc.
- assembly language
- a language very close in semantics to the ISA of the
machine, with the advantage that bit patterns are replaced by mnemonic
devices, and some additional useful features are often present. Nonetheless,
the defining characteristic of an assembly language is that the level of
control over changing the machine state is very much like what one would
have if one programmed in 0s and 1s. Also, sometimes called a "low level"
language to distinguish it from "high level" languages that are more tuned
to the expressiveness of humans. High level languages (more than 1000 out
there): C#, C++, Java, C, all the way back to antiquity, when Fortran and
Cobol showed up. Assembly language programs are translated by an assembler
which does little more than replace all the mnemonic devices with 0s and 1s.
It would be a tough job to try to assemble a program written in the assembly
language for ISA A into the 0s and 1s of ISA B. On the other hand, high level
language programs are generally translated by a compiler. Since the
instructions in the high level language are quite a bit removed from the
intricasies of any ISA, it is not difficult to change the ISA that a compiler
translates the high level language program into.
- instruction level simulator
- A simulator that changes its state at the
granularity of the instruction. That is, given the machine state and an
instruction, the instruction level simulator produces the state required by
the instruction. No within-instruction states are simulated.
- cycle level simulator
- A simulator that changes its state at the granularity
of the clock cycle. That is, given the machine state and the control signals
asserted during a clock cycle, the cycle level simulator produces the state
required by those control signals.
- steering bit
- A bit in an instruction that specifies the interpretation of
other bits in the instruction, depending on the value of the steering bit.
For example, bit[5] in the LC-3b is a steering bit, since it determines the
usage of bits[4:0] of the instruction.
- vector operation
- an operation that is performed on a set (or sets) of values
(referred to as vectors). For example, a vector A consists of n values A1,
A2, A3, ..An. And a vector B consists of n values B1, B2, ...Bn. A vector
add of vectors A and B would produce the vector consisting of A1+B1, A2+B2,
... An+Bn. We often refer to this mode as data-parallel, since there are
no dependences between the component adds, and they can all be executed in
parallel.
- prefetching
- Fetching instructions or data early so that they are available
quickly when they are needed. For example, most microprocessors have units
of storage called L1 caches that can be accessed in one or two cycles. It is
useful during instruction fetch or data fetch to be able to access this element
in one or two cycles. This can be accomplished by obtaining the element from
a much slower unit of storage before it is really needed. We call that access
in advance of when needed *prefetching.*
- natural word length
- the generic size of elements processed in the machine.
Generally, the size of the registers. Generally, the size of values processed
by the ALU in a single cycle.
- multi-dimensional array
- a classical mathematical structure, having multiple
dimensions.
- dynamic/static interface
- the interface between when the compilation process
ends, and the run time process begins. The compiler translates the source
program producing 0s and 1s that specify the work to be done. The hardware
carries out the work of those 0s and 1s at run time. Although it is now
changing, the compiler never used to have run-time behavior available to it
to make its decisions. It did not know which cache accesses missed, which
branches were actually taken, how long certain operate instructions took.
It is true that compilers have for a long time, as part of the compilation
process, observed what happened when the program ran on some sample data.
This is called *profiling.* But that data is not the *real* data, and in
fact can lead to decisions that could be counterproductive, depending on how
close to the sample data the real data is. We refer to the DSI as the
interface between when all activity is done without real knowledge of real
run-time data, and when all activity has that knowledge.
- indirect addressing
- An addressing mode, where the address (A) of the source
data is contained in a memory location (B), the address of B is computed by
the addressing mode. We say that B is a pointer variable because its contents
is not data but rather the address A of the data.
- virtual machine (emulation)
-
An ISA is implemented in hardware. Each
instruction in that ISA is fetched, decoded, etc. and at each step of the
instruction cycle, control signals are generated to carry out the phases of
that instruction. What if you wanted a program compiled into the ISA of
machine A to be executed on a machine that has a different ISA, say ISA B.
Clearly, the 0s and 1s of the ISA of A have no meaning with respect to the
machine of ISA B. We call the ISA of machine A a virtual machine because
it does not correspond to real hardware in our situation. What is needed
is a program (usually called an emulator) that translates each instruction
of ISA A into a sequence of instructions from the ISA B that can then be
carried out on the machine of ISA B and produce the same result as it would
have if the ISA A instruction had been carried out on a machine of ISA A.
We often refer to ISA B as the host machine since that is where the results
are actually being produced. And ISA A the target machine since that is the
code produced by the compiler from the original program. The concept has been
around for a very long time, although it is lately gaining a lot of popularity
both in research universities and product development centers. An early
example was the Burroughs 1700 of the 1970s. Programs were compiled to
three different virtual machines, depending on whether the nature of the code
was scientific, business, or system level in nature. The resulting three
different ISAs were then emulated on the host B1700 machine. More recently,
there have been plenty of Java programs that have been compiled into an ISA
referred to as the Java Virtual Machine. The result is that other companies
have produced software emulator to translate each JVM instruction into the ISA
of its own host machine. So, for example, a Pentium M can run JVM instructions
if it wishes by emulating each JVM instruction into a sequence of x86
instructions that does that job.
An analogy to virtual memory is in order. Like virtual memory, a virtual
machine exists in the eye of the beholder. Recall a virtual address has to
be translated to a physical address before it can be accessed. A virtual
machine instruction has to be translated to a sequence of instructions
in the host ISA before it can be carried out.
- flow dependency
- the dependency between a result produced by one instruction
that is needed by another instruction. For example the expression A(B+C).
We can not do the multiply until we do the add. We say there is a flow
dependency between the add and the multiply. A popular well-meaning textbook
refers to this as a RAW hazard, meaning that we can not read the source of the
multiply until after we write the result of the add. I personally do not like
this terminology because it totally distorts the meaning of the word hazard.
We will deal with this after Spring break, so you can ignore this for now.
- CISC/RISC
- In 1980, the term RISC was coined to characterize simple ISAs.
The term was an acronym for Reduced Instruction Set Computer, but was
ill-advised on a lot of levels, which we may go into late in the semester
if there is time. Suffice it to say right now, that the intent was to have
each instruction do a single operation in a very streamlined way. SPARC,
MIPS, HPPA, Alpha, Motorola 88000 Power-PC are all examples of ISAs that
attempted more or less to follow that mantra. x86, Motorola 68000, VAX
IBM Zseries are all examples of ISAs that did not. The acronym czars referred
to the "other" ISAs as Complicated Instruction Set Computers. For example,
the x86 which can load from memory, add to a register and store back to memory
in a single x86 instruction -- three distinct operations. ...which led their
advocates to shoot back: CISC stood for Comprehensive Instruction Set Computer.
- Load/Store architecture
- The term refers to ISAs that require separate load
and store instructions to move data to and from memory. That is a load/store
ISA does not allow operate instructions to operate on data loaded from memory
in the same instruction. Nor, does it allow the result of an operate
instruction to be written to memory in the same instruction. So-called RISC
ISAs were in general Load/Store. So-called CISC ISAs were in general not.
- control signals
- signals that control the processing within the data path, the
memory, and the microsequencer. The set of control signals that are effective
during a single clock cycle are often characterized as a microinstruction.
- atomic unit
- a unit that can not be further partitioned without losing its
intrinsic identity. It takes its meaning from chemistry where an atom consists
of protons, neutrons, and electrons, but once partitioned into these particles,
ceases to have the identifying characteristics of the atom. Oxygen is no
longer oxygen when the 6 electrons and 6 protons (if I remember my high school
chemistry correctly) are lopped off. We say an instruction is the atomic unit
of processing since we execute code at that lowest granularity. That is, we
never execute a piece of an instruction. Either the whole instruction or none
of it. That is, we do not change the machine state corresponding to a piece
of an instruction and leave the machine state corresponding to another piece
of the same instruction unchanged.
- byte write
- Although memory is usually byte addressable, the memory bus is
usually multiple bytes wide. A byte write is a memory access which stores
one byte of information into a designated byte-addressable location while
ignoring the rest of the bytes on the memory bus. Implementation requires
write enable signals of byte granularity.
- soft error (Updated: 03/29/2007)
-
A soft error is an error that is not hard. Duh.
So, what does that mean? We distinguish errors as transients errors
and permanent errors. A transient error is one that occurs due to some
event that for purposes of the computer is random. Someone started up
a washing machine in the same room, and there is not enough noise margin.
Or, some alpha particle got in the way, which gets more and more common
as the clock gets more and more into the GHz range. The idea is that if
you replay the program, the error will not show up the second time.
The other kind of error is the permanent error. Here the device is broken
and always give you the bad result.
Permanent errors are called hard errors.
Transient errors are called soft errors.
- checksum
- the bits used to detect whether a burst error has occurred within
a large transfer of information, where the errors are not statistically
independent. Suppose I wish to transfer N bits of information from X to Y.
A common implemenation of checksum error detection is to input the N bits
sequentially into an n-bit linear feedback shift register at X concurrently
with transferring them to Y. After the transfer is complete, the n bits
remaining in the shift register are then sequentially transferred to Y as
well. At the destination Y, the process is repeated and the contents of
the n-bit shift register after N bits have been input is compared with the
last n bits received at Y. If they match, the transfer is assumed to have
gone error free. Typical values for N is in the thousands of bits, typical
values for n is 16 or 32.
- barrel shifter/rotator (shift matrix rotator)
- a large combinational network
that allows an n-bit value to be rotated (or simply shifted) an arbitrary
number of bit positions. Example, if n is 32, the 32-bit value can be shifted
0, 1, 2, ... 31 bits. The combinational structure could be implemented with
32 32-to-1 muxes, but is more efficiently implemented in multiple layers as
a shift tree, as described in class. Inputs to a barrel shifter is the n-bit
source and a log n bit encoding of the shift amount. Output is the rotated
result.
- write enable
- A signal that must be asserted if a storage structure
is to be written. In the technology we use in 360N, if the signal is
asserted throughout the cycle, the storage device is written at the
end of the cycle.
- length register
- A register that is part of the virtual memory management
structure, identifying how many pages are described by the corresponding
page table. This is needed to protect against false translations wherein
the page number computed from the virtual address may correspond to a page
that is not part of the virtual address space that is mapped. For example,
if an executable image consists of 6 pages, the page table will contain 6
PTEs, and the length register will contain the value 5 (pages 0, 1, ...5).
Any virtual address whose bits that identify a page greater than 5 must
therefore be a bad address. This is easy to detect by simply comparing the
page- number bits to the contents of the length register.
- interrupts
- external events that cause the processor to stop processing the
current process, and after putting the machine in a consistent state, vector
to the starting address of the service routine specified by the specific
vector.
- exceptions
- events that are internal to the process running, such as illegal
opcode, page fault, floating point underflow, etc., that cause the processor
to stop processing the current program and deal with the exceptional event.
Like interrupts, it is necessary to first put the machine in a consistent state
and then vector to the starting address of the exception service routine. The
service routine that is executed depends on the exception which has occurred.
Each exception (or set of exceptions) has its own vector to tell the machine
which service routine.
- service routine (handler)
- see above.
- interrupt vector
- the bit pattern identifying the particular interrupt,
which is used to initiate the correct service routine.
- exception vector
- see interrupt vector
- consistent state
- the state of the machine between the execution of
instructions. In the middle of execution of an instruction, the internal
state of the processor can be whatever the microarchitecture wishes. Even
general purpose registers can be clobbered if the microarchitecture wishes.
But, at the end of each instruction, the contract with the software requires
that the state of the machine be recoverable to the state produced by the
program up to that point. That point is the consistent state.
- power failure
- loss of power causing a very high priority interrupt.
- machine check
- the highest priority interrupt, indicating that the processor
is generating bogus information. Cause: a hardware malfunction.
- sticky bit
- a condition code that once set retains its condition until
interrogated under program control.
- interrupt/exception priority level
- priority at which an event requests
interruption of the running program. Usually, the priority at which the
service routine runs. Priority depends on urgency of the event.
- maskable
- the ability to ignore an event so as to defer handling, or even
deny handling altogether.
- medium
- the material containing the information – the hole in a punched
card, the magnetic material on the disk, etc.
- device
- the I/O unit that houses all the stuff that causes the information
to be stored and transformed from one form to another. Disk drive, for example
- transducer
- the thing that tranduces from one medium to another. From light
passing through a punched card to the electrical signals. From the magnetized
material on the disk track to the electrical signals.
- polling
- I/O activity controlled by the processor executing its program.
- interrupt driven I/O
- I/O activity controlled by the device requesting the
processor interrupt what it is doing.
- DMA
- direct memory access. Allows memory and the disk (for example) to
transfer information without tying up the processor.
- I/O processor
- a special purpose processor that manages I/O activity, including
doing some processing on that information which is being transferred.
- dedicated bus
- bus dedicated to a single purpose.
- platter
- colloquial word to describe one of the surfaces of a disk.
- track
- on a platter, a complete circle, containing bits of information.
- aerial density
- the number of bits per unit length along a track.
- disk head
- that part of the mechanical arm that contains the ability
to read the magnetic material stored on the track, or write to the track.
- cylinder
- the set of tracks that can be read simultaneously.
- seek time
- the time it takes the head to get to the desired track to read or
write the information.
- rotation time
- the time it takes the correct starting point on the track to
come under the head where it can be read/written.
- disk crash
- when the head actually hits the spinning platter. …and scores
the surface, usually. BAD.
- disk block
- the unit of storage accessed from the disk. 512 bytes, for
example.
- redundancy
- the concept of using more than n bits to store n bits of
information in order to recover in case one or more of the bits get
corrupted. Parity provides one extra bit of redundancy. ECCL provides
log n + 1 extra bits of redundancy.
- RAID-0
- striped disk array with no redundancy.
- mirroring
- aka RAID-1. Two disks to store the information of one. Each bit
has its mirror bit.
- striping
- interleaving, when applied to disk.
- RAID-1
- see mirroring
- RAID-2
- array with ECC coding across the disks
- RAID-3
- array with parity coding. All parity bits in the same disk.
- RAID-4
- array with larger units on interleaving, so an entire file can
be accessed on the same disk.
- RAID-5
- RAID-3, except the parity bits are stored across all the disks,
getting rid of the hot spot of a parity disk.
- data lines
- lines containing data bits.
- address lines
- ditto, except address bits.
- control lines
- ditto, except control bits.
- multiplexed bus
- a bus that is used to store address bits at some times and
data at other times.
- pending bus
- a bus with the property that once the bus is grabbed for a
transaction, the bus is tied up (often twiddling its thumbs) until the
transaction has completed.
- split-transaction bus
- a bus that, instead of twiddling its thumbs, releases
the bus to allow other transactions to use it while it is waiting for the
second half of the transaction.
- pipelined bus
- split transaction bus where the order of the second half
maintains the order of the first half of each bus transaction.
- synchronous bus
- a clocked bus.
- asynchronous bus
- no clock, bus cycles take as long as they need to take.
Since no clock, handshaking is required.
- handshaking
- the interaction that starts off each bus cycle that is required
since there is no clock.
- arbitration
- the act of determining who gets the bus for the next bus cycle.
- central arbitration
- arbitration is done by one central arbiter. (e.g., PAU)
- distributed arbitration
- each device controller has enough information to know
whether or not it gets the bus the next cycle.
- priority arbitration unit (PAU)
- the central arbiter.
- bus cycle
- one unit of bus transaction time.
- bus master
- device controller in charge of the current bus cycle.
- slave
- the other controller that participates in the bus transaction, under
the control of the bus master.
- vertical priority
- priority determined by the particular BR line.
- horizontal priority
- priority determined by proximity to the PAU, among all
controllers having the same vertical priority.
- device controller
- that which interfaces the device to the bus, and manages
all accesses to/from the device.
- bus request
- signal from a controller requesting a bus cycle
- bus grant
- signal from the PAU granting the bus
- daisy chaining
- mechanism by which all controllers with the same vertical
priority receive the bus grant in order, and pass it on to the next if
it does not want it.
- burst mode
- successive data transfers in the same bus cycle.
- SACK
- signal output by the bus grantee signifying it will be the next bus
master.
- MSYN
- signal used by the bus master in synchronizing the transaction.
- SSYN
- signal used by the slave in synchronizing the transaction.
- fundamental mode
- asynchronous operation of a sequential machine wherein
at most one input signal changes its value at a time.
- pipelining
- the assembly lining of instruction processing. Each instruction
goes through multiple steps (stages), from Fetch to Retirement/completion
of the instruction.
- pipeline stages
- the set of steps each instruction goes through in the
pipeline.
- pipeline bubble
- a hole in pipeline, caused by the inability of instructions
to flow through the pipeline. For example, following a conditional branch,
it may not be possible to know which instruction to fetch next. This causes
a a pipeline bubble. …or a load followed by the use of that load can cause
a pipeline bubble.
- branch misprediction penalty
- the number of cycles wasted due to branch
misprediction where everything speculatively fetched needs to be thrown away
before we can start along the correct instruction path.
- data dependency
- one of the three data dependencies: flow, anti, write
- flow dependency
- often called read after write hazard. Example
a + (b × c).
Before we can do the add, we must do the multiply. That is the result of
the multiply must be written before it can be read as a source for the add.
- anti-dependency
- using a register to store more than one value. Before we
write to that register, all instructions which need to read the old value
have to read it. Sometimes called write after read hazard.
- output dependency
- values output from the same register must be output in
program order.
- control dependency
- dependency caused by control instructions, such as
branches. Which instruction is executed next depends on which way a branch
goes is an example of a control dependency.
- data forwarding
- information is sent from where it is produced to where it
is needed without going first to a register. Speeds up processing.
- sequential programming model
- instructions are expected to execute in
the order specified by the programmer.
- stale data
- data that was stored in a location but the location has
since been reassigned so that the old data is no longer valid for new
instructions that come along.
- in-order execution
- a microarchitecture that executes instructions in
program order.
- scoreboard
- a hardware interlock that prevent an instruction from
sourcing stale data.
- out-of-order execution
- a microarchitecture that executes instructions
when they are ready for execution, regardless of their order in the program.
- speculative execution
- activity that is carried out before the hardware
knows it must be (mandatory) carried out. For example, instructions
fetched as a result of a branch prediction.
- advantages/disadvantages of condition codes
- branch prediction
- rather than wait for the condition code to be
resolved upon which a branch instruction is based (taken or not not taken)
and creating a hole in the pipeline, the hardware guesses and fetches
based on that guess. The guess is called a branch prediction.
- always taken/always not taken
- the prediction is determined by the
designers of the chip.
- compile-time predictor
- the guess is determined by the compiler, usually
as result of profiling.
- run-time predictor
- the guess is determined by the hardware on the basis of
what has been happening with the running of the program.
- BTFN
- guess taken if the branch is to an earlier location in the code,
not taken if the branch is to a location later in the program. This is
consistent with the behavior of for and while constructs in high level
languages. Best example of a compile time predictor.
- last time predictor
- predict whatever the branch did last time it will
do this time.
- 2-bit saturating counter
- advance the counter depending on what the branch
does, predict according to the current high bit of the counter.
- saturating arithmetic
- arithmetic which does not wrap around. That is, the
highest value serves the role of infinity. i.e.,
infinity + 5 = infinity,
therefore, highest value + 5 = highest value.
- register renaming
- also called aliasing. Assigning unique register names to
independent uses of the same register.
- reservation station
- storage location for an instruction awaiting operands
before it can execute.
- Tomasulo's algorithm
- the algorithm which assigns register aliases that
allows an instruction to move to a reservation station, while maintaining all
producer/consumer relationships. This maintenance of the producer/
consumer relationships is the most important single thing needed to allow
instructions to actually execute out of order.
- tag
- another word for the alias or distinct identifier assigned to each
instance of a register.
- in-order retirement
- retirement of instructions occur in the order of
their location in the program. Necessary to obey the sequent programming
model.
-
register alias table (RAT)
- the table of aliases used by the Tomasulo
algorithm to assign tags. Each register has a valid bit, a tag and a
value. Depending on the valid bit, either the tag or the value is correct.
- common data bus (CDB)
- in the data path of the IBM 360/91, the bus used
to distribute values and corresponding tags to registers in the RAT and
in the reservation stations that are waiting for the value uniquely
assigned to that tag.
- scheduling algorithm
- an algorithm used to decide which of several ready
instructions should be executed next.
- fire when ready
- schedule the instruction for execution (that is, send
it to a functional unit) when all data that it needs is available.
- data flow
- the paradigm wherein instructions are executed when their data
is available, rather than when they show up in program order.
- data flow graph
- a directed graph showing all the instructions as nodes,
and the producer/consumer relationships as arcs.
- nodes
- instructions
- edges
- or arcs. Shows result produced by one instruction that is sourced
by the other.
- producer
- the instruction producing the result that is sent to various
consumers.
- consumer
- an instruction that requires the result as a source.
- restricted data flow
- data flow in the Tomasulo sense, wherein only that
subset of the data flow graph that corresponds to instructions that have
been decoded but not retired is contained. I coined the term in 1984
to differentiate it from the data flow graph of the entire program
- vectorizable
- a loop instruction whereby each iteration of the loop
is completely independent of the other iterations.
- vector processing
- the paradigm we discussed in class for handling
vectorizable loops.
- SIMD
- single instruction stream, multiple data sets. Sometimes called
data parallel. Two forms: vector processors and array processors.
- vector load
- multiple values loaded as a result of a single load instruction.
- vector register
- a register consisting of multiple components, useful for
storing the result of a vector load, vector add, etc.
- vector length register
- identifies how many of the components of a vector
register, or vector operation, are operative.
- row-major order
- an array stored in memory in the order: first the components
of the first row, then the components of the second row, etc.
- column-major order
- an array stored in memory in the order: first the
components of the first column, then the second column, etc.
- stride
- the distance in memory between locations to be accessed vis-a-vis
successive registers.
- vector stride register
- register containing the current stride value.
- vector chaining
- instruction processing wherein the results of one
vector operation can be forwarded to the next vector operation without
waiting for all components of a vector instruction to be completed.
- scalar computer
- what you are used to. No vector registers, no vector
loads, stores, adds, etc.
- reciprocal approximate
- a quick way to do division. a divided by b is
replaced by a times 1 divided by b. The reciprocal approximate unit
takes b as input and produces 1 divided by b.
- Backup registers
- Cray program controlled cache, sort of. Eight S registers
are backed up by 64 T registers. Instead of going to memory again and
again, one can do a vector load to the T registers.
- Software managed cache
- a tongue in cheek description of Cray T registers.
- Loop buffers
- in the Cray machines, an instruction buffer containing
enough storage to accommodate an entire loop.
- Stripmining
- The process of handling more than 64 iterations of a loop,
64 at a time.
- Fixed point arithmetic
- the generalization of integer arithmetic. The
decimal point (or binary point) is always in the same place, in the case
of integer arithmetic, at the far right. No bits of a data element are
wasted identifying where the binary point is.
- 2's complement
- one representation of integers
- 1's complement
- ditto
- signed-magnitude
- ditto
- word length
- the size of the ALU, the size of the registers, the default
value of sizes.
- short integer
- an integer of the size of the word length.
- long integer
- integers of size that is a multiple of the word length.
Not supported by the ISA. However, software can be written to deal with
long integers. Example in class: with a word length of 32 bits, the x86
can operate on 320 bit integers by means of a procedure that processes
the values 32 bits at a time. Takes a loop that will iterate 10 times.
- Binary Coded Decimal (BCD)
- an encoding of integers, such that each decimal
digit is encoded in 4 bits. Number of digits is usually arbitrary, which
means one needs two pieces of information to specify a BCD number: starting
address and number of digits.
- Shift and Add multiplier
- classical way to do binary multiplication.
Since multiplying by 1 simply means adding in the multiplicand, and
multiplying by 0 means adding zero, both after aligning (shifting)
properly, multiplication is usually described as a series of shifts and
adds, one for each bit in the multiplier.
- priority/privilege
- Priority is about urgency, privilege is about the right
to do something. They are orthogonal issues. We sometimes confuse them in
our human experiences because sometimes they go hand-in-hand. Example, the
fire truck runs a red light to get to the fire. Running the red light is
about privilege. We give it that privilege because of the urgency of getting
to the fire. There are examples where they do not go hand-in-hand. Recall
my example in class of the mail clerk who had to go to the bath room while
waiting for the elevator on the 55th floor. In computer technology, they
really are orthogonal. Real-time experiments may run at higher priority,
because they need to sample the sensors at a precise moment in time. But
they can not execute RTI or turn off the computer, or access certain files.
System administrators can do that, even though they may run at a low priority
level because it is not urgent when they do their job, just that they do it.
- post increment (auto increment)
- This is an addressing mode that is common
on many older machines and a few more recent machines. The address of the
memory location containing the operand is in a register. After the address
is read, the register is incremented, so the next time that instruction is
executed, the register will point to the next data element. Side note: since
data usually occupies some number of bytes, the register is incremented by
the number of bytes of the operand.