Annotated Buzzwords

From Fall 2005 and Spring 2007


profiling
the act, usually as part of the compilation process, of executing a program on a sample input data set to get a feel for the behavior of branches and other run-time activities. The intent is to have the compiler generate a better object file as a result of this information. Whether profiling provides value or not is highly dependent on how closely the sample input data set resembles the actual input data set when the program is executed for real. We call this "closeness" representativeness.
representativeness (of data)
see profiling.
compile-time (static)
that which is done before the program is actually run. For example, a compile-time branch predictor makes its prediction based on information that is available before the program starts execution.
runtime (dynamic)
that which is done while the program is running. For example, a run-time branch predictor makes its prediction based on information that is collected while the program is being executed. It is important to note that in the old days, there was a clear distinction between compile time and run time. That is, the program was compiled. Then the program was executed. Today, more and more research has gone into the notion that the program should be amenable to re-compiling while it is running. That is, once the hardware knows that a decision that was made at compile time was a bad decision, the hardware ought to be able to reinvoke the compiler with this information and generate new object code that takes advantage of this new information. This concept is still in its infancy.
levels of transformation
my term for the transformations that must take place in order to get a problem stated in some natural language like English, Spanish, Gujarati or Tagalog to be solved by the electrons that do the work. Problem --> algorithm --> high level or assembly language program --> object code in the ISA of the machine --> control signals that drive the microarchitecture --> logic --> circuits --> electrons.
algorithm
a step by step procedure, wherein each step is definite (precisely defined), effectively computable (able to be carried out by a computer), and wherein the procedure is guaranteed to terminate (finite). For a problem to be solved with a computer, one must be able to construct an algorithm to solve it.
ISA
Instruction Set Architecture. The specification of all aspects of the interface between the software and the hardware; for example, instruction formats, opcodes, data types, addressing modes, memory address space, memory addressability, etc. This allows the software to specify a sequence of operations for the hardware to carry out (the program) and for the hardware to know what each of those instructions is asking it (the hardware) to do. For example: x86, Power-PC, SPARC, MIPS, VAX, Alpha are all ISAs.
microarchitecture
a particular data path and its control that carries out the specification of an ISA. For example, the x86 ISA has been implemented by Intel by the 8086, 386, Pentium, Pentium-Pro, Pentium 4, Pentium M, etc. and by AMD by the Athlon, Opteron, etc.
opcode
the part of an instruction that specifies the operation(s) that the hardware is expected to perform in carrying out this instruction.
prefix property
the property that a bit pattern is a prefix of a longer bit pattern. For example, 1001 is a prefix of 1001000001111.
addressing modes
mechanisms for obtaining operands required by an instruction. For example, the LDW instruction obtains the address of the memory location to be accessed by adding an offset to a Base Register. We say the addressing mode is Base+offset. The JSR instruction computes the starting address of the subroutine by adding an offset to the PC. We say the addressing mode is PC-relative.
data type
a representation of information for which the ISA provides instructions that operate on that representation. For example, in the LC-3b, the ADD instruction operates on values represented as 2's complement integers. We say the 2's complement integer is a data type. There is no floating point data type in the LC-3b ISA since there is no instruction in the LC-3b that operates on values represented as floating point numbers. Note that we can still operate on floating point representations by writing a subroutine to operate on these representations. If the LC-3b had a FLOATADD instruction, it could compute a floating point add in ONE instruction. Since it does not, one could write a procedure that isolates the exponent and fractional part of each value, compares the exponents, shifts right the smaller fraction accordingly, does the add, and renormalizes the result. However, such a procedure would take tens of LC-3b instructions, as sketched below.
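Here is a minimal sketch of that procedure in C, using a made-up unpacked format rather than a real IEEE encoding; the type name and the 24-bit fraction width are illustrative assumptions, and signs, rounding, and special cases are ignored.

    #include <stdint.h>

    /* toy unpacked format: value = frac * 2^exp, frac normalized to 24 bits */
    typedef struct { int exp; uint32_t frac; } toyfloat;

    toyfloat toy_float_add(toyfloat a, toyfloat b) {
        /* compare exponents; shift right the smaller fraction to align */
        while (a.exp < b.exp) { a.frac >>= 1; a.exp++; }
        while (b.exp < a.exp) { b.frac >>= 1; b.exp++; }

        toyfloat r = { a.exp, a.frac + b.frac };   /* do the add */

        /* renormalize: keep the fraction within 24 bits */
        while (r.frac >= (1u << 24)) { r.frac >>= 1; r.exp++; }
        return r;
    }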
addressability
the number of bits stored at each memory location. Most modern computers are byte addressable. That is, 8 bits are stored at each location.
byte-addressable
see addressability.
address space
the number of distinct memory addresses. That is, the maximum number of memory locations possible such that each can be specified by a unique address. Older machines, the original x86, for example, had an address space of 2^16. We sometimes say: the x86 had 16 bits of address space. The original IBM 360, I believe, had an address space of 2^24. Most ISAs today have an address space of 2^32, and some have an address space of 2^64.
endianness (big-endian/little-endian)
Suppose a 32-bit value is to be stored in a memory having byte-addressability. Four memory locations are required. We store the 32-bit value in four consecutive memory locations, say those having addresses A, A+1, A+2, and A+3. The question still remains: which 8 bits of the value get stored in which location? If we store bits 31:24 into location A, 23:16 into A+1, etc., we say the memory system is big-endian. If we store bits 7:0 into location A, 15:8 into A+1, etc., we say the memory system is little-endian. That is, if the number of bits required to represent a value is a multiple of the addressability of the memory, we store the value in contiguous locations of memory, starting with the low-numbered address. If we store the most significant bits at that address, we say we are storing according to the big-endian convention. If we store the least significant bits there, we say we are storing according to the little-endian convention.
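A small C program makes the convention visible; the value 0x11223344 is just an example, and what gets printed depends on the machine it runs on.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t value = 0x11223344;       /* bits 31:24 are 0x11 */
        uint8_t *p = (uint8_t *)&value;    /* p points to location A */

        /* big-endian prints 11 22 33 44 (bits 31:24 at address A);
           little-endian prints 44 33 22 11 (bits 7:0 at address A) */
        printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
        return 0;
    }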
fixed length vs. variable length
used to describe the number of bits used to encode an entity, usually used to refer to instructions, although data can also be fixed or variable length. If fixed length, the entity is always the same size. If variable length, the size is determined by the specific entity involved. LC-3b instructions are fixed length; every one consists of 16 bits. x86 instructions are variable length; depending on the instruction, anywhere from one to 15 bytes may constitute a single instruction.
uniform vs. non-uniform decode
fixed length instructions may have the property that the various fields of the instruction always occur in the same place, simplifying the decode process. We call such an encoding of the instruction format uniform decode. The LC-3b, for example, has the opcode ALWAYS in bits 15:12 and the result operand always in bits 11:9.
memory-mapped I/O
I/O activity can be specified by special instructions in the ISA, or by using the load and store instructions and assigning each I/O device register an address from the memory address space. The latter is referred to as memory-mapped I/O. For example, with memory-mapped I/O, keyboard input is specified by a load instruction where the address is the one assigned to the keyboard data register.
operate instructions
instructions that operate on data: ADD, SUB, XOR, SHR, etc.
logically complete
a set of logical primitives that allows one to specify all logical functions. NAND, for example, is by itself logically complete. AND is not by itself logically complete. The set {AND, NOT}, however, is logically complete.
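As a sketch of why NAND suffices, here is C code (on 1-bit values) that builds NOT, AND, and OR out of NAND alone; the function names are ours, not standard.

    int NAND(int a, int b) { return !(a && b); }

    int NOT(int a)        { return NAND(a, a); }          /* NOT from NAND   */
    int AND(int a, int b) { return NOT(NAND(a, b)); }     /* invert the NAND */
    int OR(int a, int b)  { return NAND(NOT(a), NOT(b)); }/* DeMorgan's law  */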
arithmetic/logical shift
A right shift is conveniently used to perform an integer divide by 2, where the value being divided is expressed in 2's complement representation. Thus, we need to insert into the high bit position the sign bit of the value. On the other hand, bit patterns are often right-shifted for other (logical) purposes, in which case we need to insert a 0 into the high (left-most) bit position. Given these two uses, we need to differentiate an arithmetic shift, where we insert the sign bit into the left-most position, from a logical shift, where we insert a 0 into the left-most position. [Note that an arithmetic left-shift, corresponding to multiply by 2, requires inserting a 0 into the right-most bit position, exactly the same as what we would do for a logical left-shift. Therefore, we generally do not differentiate arithmetic left shift from logical left shift.]
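A sketch of the two right shifts in C, on 16-bit values; the function names are illustrative.

    #include <stdint.h>

    /* logical right shift: insert 0s at the left.  In C, >> on an
       unsigned type is already a logical shift. */
    uint16_t rshf_logical(uint16_t x, int n) {
        return x >> n;
    }

    /* arithmetic right shift: replicate the sign bit at the left.
       (>> on a negative signed value is implementation-defined in C,
       so we build the shift explicitly.) */
    int16_t rshf_arith(int16_t x, int n) {
        uint16_t u = (uint16_t)x;
        uint16_t sign = u & 0x8000;        /* the sign bit of x */
        for (int i = 0; i < n; i++)
            u = (u >> 1) | sign;
        return (int16_t)u;
    }

    /* e.g., x = 0xFFF6 (-10): arithmetic shift by 1 gives 0xFFFB (-5),
       while logical shift by 1 gives 0x7FFB. */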
2-pass assembler
A simple assembler that makes two passes at an assembly language program, first to identify all the symbols and generate the symbol table, and second, to generate the actual bit patterns of the code.
pseudo-op
a directive provided by the assembly language programmer to the assembler to help the assembler translate the assembly language program to the 0s and 1s of the machine's ISA. In LC-3b assembly language, .ORIG, .FILL, and .STRINGZ are examples of pseudo-ops. Pseudo-ops are not executable at run-time. They are strictly messages to help the assembler assemble the program.
label
a symbol that identifies a location in the program. Necessary if that location contains the target of a branch or data that will be accessed by a load or a store instruction.
data movement instructions
instructions that move data. Loads/stores to/from memory and I/O instructions.
control instructions
instructions that redirect the instruction stream. The default in instruction processing is to increment the PC to point to the next instruction. If the next instruction is other than the next sequential instruction, a control instruction is needed to provide the new PC. The LC-3b control instructions are BR, JMP, JSR(R), TRAP, and RTI.
condition codes
single-bit registers (often referred to as "flags") which provide additional information about the result of the last instruction. The LC-3b has three condition codes (N,Z,P). Upon completion of an operate instruction, the condition codes specify whether the result of that operate was negative (N=1,Z=0,P=0), zero (N=0,Z=1,P=0), or positive (N=0,Z=0,P=1). Similarly, upon completion of a load, the condition codes specify whether the value loaded into the destination register was negative, zero, or positive. Most real ISAs also have C and V condition codes. C designates the carry out of an add instruction. V designates whether the previous integer arithmetic instruction resulted in an overflow condition. One example is the case where the source operands of an ADD are both positive 2's complement numbers, but the result produced by the ALU is the representation of a negative number. [Say, I have 4-bit integers. +6 = 0110, +5 = 0101. If I tell the ALU to add them, the output would be 1011, which is the representation for -5. The ALU is saying +6 plus +5 = -5. Ergo, set V.]
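A sketch in C of how the flag-setting logic might look for a 16-bit add; the struct and function names are made up for illustration.

    #include <stdint.h>

    struct cc { int n, z, p, v; };

    struct cc cc_after_add(int16_t a, int16_t b) {
        /* wraparound (modulo 2^16) add, as the ALU would produce */
        int16_t r = (int16_t)((uint16_t)a + (uint16_t)b);
        struct cc f;
        f.n = (r < 0);
        f.z = (r == 0);
        f.p = (r > 0);
        /* V: both sources have the same sign but the result's sign
           differs, e.g., 0x7FFF + 1 at 16 bits (the +6 plus +5 case
           above, at 4 bits) */
        f.v = ((a < 0) == (b < 0)) && ((r < 0) != (a < 0));
        return f;
    }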
0, 1, 2, 3 address machine
An n address machine is an ISA where n indicates the number of operands that need to be specified EXPLICITLY for binary operations in the ISA. The ADD is a good example of a binary operate instruction. In a 0 address machine, none need to be specified, since the microarchitecture obtains both sources from the top of the stack, and pushes the result onto the stack. In a 3 address machine, like the LC-3b, for example, both sources and the destination ALL need to be specified explicitly (ADD R1,R2,R3). The x86 is a 2 address machine.
load/store ISA
An ISA, where operate instructions can only operate on operands that are stored in the registers. That is, one can not load a value from memory and operate on that value in the SAME instruction, for example. The LC-3b is an example of a load/store ISA. The x86 is not.
fetch
the process of accessing an instruction from the instruction memory.
hardware interlock
a hardware signal that prevents improper behavior from occurring in the execution of instructions. For example, consider the pipelined processing of a MUL that stores to R3, immediately followed by an ADD that sources R3. At the time the ADD wishes to source R3, the value stored there is the old value, not the value that will be produced in a few cycles by the MUL instruction. A hardware interlock (called a scoreboard bit) prevents the ADD from sourcing R3 until after the MUL stores its result there.
semantic gap
an old term, used to describe the closeness of fit between the instructions in a high level language and the opcodes in an ISA. For example, the VAX had an opcode AOBLEQ which added 1 to an operand, and branched if the result of the add was less than or equal to some limit operand. This instruction was clearly in the ISA to accommodate "for" loops. The semantic gap between the "for" and AOBLEQ was tiny. In the old days, there was strong interest in specifying opcodes that had a small semantic gap relative to the high level languages commonly in use at that time. That interest has waned during the past 20 years, but there is good reason why we may see it return in the future.
microsequencer
the mechanism in the microarchitecture that figures out what control signals will be needed during the next clock cycle.
control store
a private hardware structure for storing the set of control signals needed in a particular clock cycle. In a microprogrammed machine, the job of the microsequencer is to identify the location in the control store that contains the control signals for the next clock cycle.
microinstruction
the set of control signals needed in one clock cycle. In a microarchitecture with a control store: the contents of one location in the control store.
datapath
the part of the microarchitecture that processes the information. The register file, ALU, and various muxes are all part of the data path.
state machine
the structure that specifies, for each state of the machine, and each set of relevant values (opcode, interrupt present, memory ready, etc.) what set of control signals should be invoked during the next clock cycle.
VLIW
Very Long Instruction Word. A processing paradigm wherein the compiler generates a long word, consisting of more than one instruction, all stored at the same PC address. At instruction fetch, all instructions in that long word are fetched, decoded, and executed concurrently in lock step. To work, it is necessary that there be no flow dependencies among the instructions in a single long word. This paradigm provides benefit if the compiler can find at compile time multiple instructions that have no flow dependencies, and therefore, can execute concurrently. The number of instructions n in the long word (VLIW instruction) is fixed. Therefore, if the compiler can not find n independent instructions to put into one VLIW instruction, it must fill the remaining slots with no-ops.
compatibility
a term used to describe the ability of new implementations of an ISA to execute code written for an older implementation. Intel has been a master at making sure that every new x86 implementation is able to execute correctly all code written for all older x86 implementations. As time passes, it becomes clear that some old opcodes, addressing modes, data types are no longer useful, and some new opcodes, addressing modes, and data types are necessary. In fact, the addition of the SSE1 and SSE2 instruction set to the x86 ISA is a reflection of the new uses of that ISA. However, Intel's insistence on compatibility means that in addition to the new SSE extensions, new implementations also have to include in their microarchitecture the capability to execute the old opcodes, addressing modes, data types, etc. since some customers may want to buy a new laptop but still use their old software.
pipelining
the process of carrying out instructions in an assembly line fashion. Suppose instructions are fetched one per cycle. Then in a pipelined implementation, in one cycle the fetch logic fetches an instruction, and hands it off to the decode unit to decode in the next cycle. In that next cycle, the fetch logic will be fetching the next instruction.
interrupt
an external event more urgent than the program running that causes the machine to stop execution of the program that is running and turn its attention to service the external event. Keyboard interrupt is an example of a low priority interrupt. Machine check is an example of a high priority interrupt.
exception
an event internal to the program that is running that must be handled to properly deal with the instruction. Page fault and access control violation are two examples of exceptions.
precise exceptions
if an instruction causes an exception, it is often important to recover the machine state that corresponds to the situation where all instructions preceding the exception-causing instruction have executed correctly and no instruction following the exception-causing instruction has executed. It is also useful to identify the instruction that caused the exception. A microarchitecture that can do both is said to support precise exceptions. The Tomasulo Algorithm running on the IBM 360/91 did not support precise exceptions.
critical path design
the cycle time of a processor is one of the key items that specifies the performance of the processor. Recall that performance is to a first approximation determined by
                        1/(length * CPI * cycle time)
where length is the number of instructions executed (the dynamic instruction count). Decreasing cycle time has a major impact on performance. Cycle time is set by the longest of the various logic paths in the data path and its control. Critical path design instructs one to examine the longest path in the machine and attempt to reduce it, either by more clever logic design or by latching a path in the middle and turning a long single-cycle operation into a two-cycle operation.
bread and butter design
neither the supply of transistors on the chip nor the design time of the designer are infinite. With finite resources, bread and butter design suggests applying both to the chip's bread and butter, that is, those opcodes, addressing modes, and data types that are important to optimize for the expected usage of the chip.
balanced design
the whole point of a pipelined implementation is to allow the various parts of the microarchitecture to do their respective things concurrently. It makes no sense to provide capability in one stage of the pipeline that goes unused in the next stage. Better to use the extra transistors required to provide that capability for something more useful. Example: a fetch engine that can fetch and decode 6 instructions each cycle, but only has a single ALU for executing these instructions is not a balanced design. There is no point wasting resources to be able to fetch and decode 6 instructions concurrently if the instructions just sit in a queue downstream waiting for the single ALU.
SRAM
Static random access memory. An SRAM cell stores a single bit of information with two cross-coupled inverters. The feedback path maintains the state of the bit. As long as power is supplied to the two inverters, the SRAM cell maintains its value. The SRAM cell is also referred to as volatile memory, since if power is removed from the inverters, the bit that is stored is lost. [Cross-reference, non-volatile, where the bits retain their values even when power is removed, as in a disk or tape.] The minimum number of transistors for an SRAM cell is 4 -- two inverters, with 2 transistors per inverter. However, most SRAM cells use more transistors to provide for other features, such as switching speed, gating, and noise immunity.
DRAM
Dynamic random access memory. A DRAM cell stores a single bit of information as the absence or presence of charge on a capacitor. A charged capacitor represents one value; an uncharged capacitor represents the other. A single transistor is required to read (sample the charge) and write (store the charge) the bit. Unfortunately, the capacitor will not maintain its charge indefinitely. There are RC paths to ground that will leak the charge. Therefore, a DRAM cell requires periodic refresh. That is, periodically, capacitors must be sampled and returned to their fully charged or uncharged state.
refresh
See DRAM cell.
page mode
Typically DRAM chips are addressed in two steps. Consider the DRAM chip as containing 2^n addresses. The high order n/2 bits (row) come in first, using RAS. This effectively preselects the contents of all addresses having the same row address bits. Then the low order n/2 bits (column) come in, using CAS. This selects the exact address from the 2^(n/2) that have identical row address bits. Access time is the sum of RAS + CAS times. If the next address has the same row address bits as the previous one, the memory controller, which keeps track of this, only needs to send the column address bits to the DRAM chip, thereby saving the time needed to latch the row address bits. We say that all addresses having the same row address bits are on the same "page," and the act of the memory controller exploiting this by only sending subsequent column address bits, we refer to as "page mode."
row address strobe (RAS)
see page mode.
column address strobe (CAS)
see page mode.
chip enable
Output pins of a chip float and have no effect if the chip enable signal is not asserted.
parity
a coding scheme that requires only one extra bit to detect a single bit error, provided that all single bit errors are statistically independent. "Even" parity dictates that this extra bit must be set to 1 or 0 such that the entire n+1 bit transfer contains exactly an even number of 1s. In that way, if the received n+1 bits contain a single error, the number of 1s will be odd, so the fact that an error has occurred will be detected. Statistical independence of the bit errors is necessary, since if two bits are in error, parity will not detect it.
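A sketch in C of both ends of the scheme for a single byte; the function names are illustrative.

    #include <stdint.h>

    /* the extra (even-parity) bit for an 8-bit byte: chosen so that
       the total number of 1s across all 9 bits is even */
    int even_parity_bit(uint8_t data) {
        int ones = 0;
        for (int i = 0; i < 8; i++)
            ones += (data >> i) & 1;
        return ones & 1;    /* 1 if data by itself has an odd count of 1s */
    }

    /* the receiver recomputes; a mismatch means an odd number of bits
       flipped in transit */
    int parity_error(uint8_t data, int pbit) {
        return even_parity_bit(data) != pbit;
    }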
ECC
error correction code. A code which allows a single bit error to be not only detected, but also corrected without requiring the bits to be retransmitted. See my handout on the subject.
Hamming Code
the mechanism for performing ECC correction. See my handout.
CRC check
cyclic redundancy check. A mechanism for detecting burst errors, wherein the individual bit errors are not statistically independent.
memory bank
a unit of memory that can process one memory access at a time.
interleaving
allocating addresses to memory locations "horizontally," rather than "vertically." That is, if address n is in bank 0, address n+k is in bank 1, address n+2k in bank 2, address n+3k in bank 3, etc. where k is the size of the access in a bank in bytes (assuming byte-addressable memory).
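A sketch in C of the address-to-bank mapping just described, assuming 4 banks and k = 8 bytes per bank access (both numbers are arbitrary examples).

    #include <stdint.h>

    #define NUM_BANKS 4
    #define K         8   /* bytes per bank access */

    unsigned bank_of(uint32_t addr)        { return (addr / K) % NUM_BANKS; }
    unsigned offset_in_bank(uint32_t addr) { return addr % K; }

    /* e.g., with K = 8: addresses 0..7 fall in bank 0, 8..15 in bank 1,
       16..23 in bank 2, 24..31 in bank 3, and 32..39 back in bank 0. */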
unaligned access
an access to memory that is not aligned on an access boundary, thereby requiring two accesses to perform the desired operation. See my handout on unaligned accesses.
working set
the set of pages belonging to a process that is resident in memory at an instant of time is called the process' working set at that instant of time.
balance set
the set of processes that have their working sets in memory.
swap in/swap out
the act of bringing in or out the working set of a process. If the working set of a process is stored back to disk, we say the process is swapped out.
page
the unit of virtual memory that is transferred between disk and physical memory. That is, the whole page is brought in or out, or none of it.
frame
the physical space that a virtual page occupies in physical memory.
resident
a term used to describe the fact that a virtual page is actually occupying a frame of physical memory.
page in/page out
the act of moving a virtual page from the disk to physical memory (page in), or from physical memory to the disk (page out).
thrashing
the condition whereby there are not enough frames of physical memory to hold all the virtual pages that a process needs, resulting in an inordinate amount of time being wasted by pages coming in and going out, only to come in again, etc.
PTE
page table entry. A descriptor for each page of virtual memory, containing information, including the frame of physical memory where it resides, whether it has been modified since it has been loaded from the disk, and protection information as to what levels of privilege are required to access all information stored on that page. Some ISAs include other information in each page's PTE.
page table
a data structure containing the PTEs of all pages in the virtual image of the process.
mapping virtual to physical addresses
the transformation of a virtual address to the physical location in memory where the information at that virtual address is actually stored.
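A sketch of the mapping in C, assuming a one-level page table and an 8KB page size; the PTE layout and names here are invented for illustration (the limit check and the valid bit each have their own entries below).

    #include <stdint.h>

    #define PAGE_BITS 13                   /* 8KB pages: 13 offset bits */
    #define PAGE_SIZE (1u << PAGE_BITS)

    typedef struct { unsigned valid : 1; uint32_t pfn; } pte_t;

    /* page_table and num_ptes play the roles of the page table base
       register and the limit register */
    int translate(uint32_t vaddr, const pte_t *page_table,
                  uint32_t num_ptes, uint32_t *paddr) {
        uint32_t vpn    = vaddr >> PAGE_BITS;     /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);

        if (vpn >= num_ptes)        return -1;    /* limit violation: ACV   */
        if (!page_table[vpn].valid) return -2;    /* not resident: page fault */

        *paddr = (page_table[vpn].pfn << PAGE_BITS) | offset;
        return 0;
    }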
PSR
processor status register. A processor register that contains status information about the running process, including its privilege level and priority.
privilege level
every ISA defines a set of privilege levels, from unprivileged to most privileged, to distinguish what level of privilege is required to make certain accesses. The simplest model has two levels of privilege: privileged and unprivileged.
protection
the concept of protecting a computer resource from being utilized by a process that does not have the desired level of privilege. For example, a page can be protected so that only the highest privileged processes can write to that page. Another page may be protected so that any process can write to it. Some instructions can only be executed by processes that have a certain level of privilege. For example, on machines that have a HALT instruction, unprivileged processes are generally not allowed to execute the HALT instruction, and thereby are not allowed to stop the computer.
virtual memory
memory that a process thinks it has. Generally, the size of the virtual address space.
physical memory
memory that actually exists.
address translation
conversion of a virtual address to the physical address where the virtual location actually exists, so that the processor can read/write values into that location.
VPN
virtual page number. The high bits of a virtual address; they specify the page number.
PFN
page frame number. The high bits of a physical address; they specify the frame number within physical memory.
TLB
translation lookaside buffer. It is a cache of PTEs. It allows translation to take place without going through the involved process of "walking" through the page tables. That is, within a fraction of a cycle instead of tens of cycles.
valid bit
a bit determining whether other bits are accurate. In the case of the PTE, the valid bit tells whether the hardware can believe the PFN bits, which really says whether the page is resident or not. If V=0, the page is not resident, forcing the hardware to take a page fault.
reference bit
a mechanism used for page replacement. If the reference bit mechanism is used for page replacement, each PTE contains a reference bit. Periodically, all reference bits in the page table are set to 1. When a page is accessed, its reference bit is cleared. When a page fault occurs, the page chosen to be kicked out to make room is taken from the set of pages whose reference bits are still 1.
modify bit
a bit in the PTE which is set when a page is written to. If the Modify bit is clear, the page need not be written back to disk on replacement since it has not changed since being read from the disk.
page table base register
the starting address of the page table. Used to compute the address of a PTE.
limit register
the number of PTEs in the page table. Used to determine if a virtual address actually corresponds to an address in the virtual image.
system space
virtual address space that is common to all processes, and is used to house the data structures and code of system related functions.
user space
virtual address space specific to a process. Sometimes called process space. Each process has its own version of process space.
access control violation (ACV)
an exception caused by one of two things. Either the address was ill-formed and does not belong to the virtual address space of that process (limit violation) or the privilege of the process does not allow the access specified (protection violation).
page fault (TNV)
Translation not valid. The virtual address being translated is not resident in physical memory.
segmentation
a protection mechanism to protect regions of memory for different kinds of accesses.
segment register
a register used in conjunction with the address provided by the program to produce an address which is either a physical address that can be accessed directly or a virtual address which needs to be further translated to a physical address.
linear address
Intel's name for virtual address. That is, a segment register provides a base address which is combined with the address provided in the program to form the linear address. The linear address is further translated to a physical address using PTEs.
cache memory
a small, fast-access memory structure that stores a subset of the physical memory, thereby reducing memory access time if the desired location is present in the smaller structure. Since the cache only stores a subset of the physical memory, it must store both the addresses of the locations it contains as well as the data at those locations. The granularity of a cache entry is called a cache line (or synonymously, a cache block).
spatial locality
the property that if a location in memory is accessed, it is likely that locations close by will be accessed soon.
temporal locality
the property that if a location in memory is accessed, it is likely that location will be accessed again soon.
cache line (block)
the granularity of storage, see cache memory. This size is generally greater than the size of a single access, due to our faith that spatial locality exists.
tag, index, offset bits
A memory address is partitioned into three parts: its tag bits, its index bits and its offset bits. The offset bits define which location within the cache line is being accessed. The index bits are used to index into the tag store, and specify the set of locations in the cache available for storing the particular line. The tag bits are stored at the corresponding index in the tag store to verify that the actual line stored in the cache is the one desired. That is, all blocks/lines from memory having identical index bits in their addresses compete for the same entries in the cache. The tag bits are used to specify which of these many entries are actually stored in the cache.
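A sketch in C of the partitioning for one example geometry: 64-byte lines (6 offset bits) and 64 sets (6 index bits); the parameters are arbitrary.

    #include <stdint.h>

    #define OFFSET_BITS 6    /* 64-byte cache line */
    #define INDEX_BITS  6    /* 64 sets            */

    uint32_t offset_of(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1); }
    uint32_t index_of(uint32_t a)  { return (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag_of(uint32_t a)    { return a >> (OFFSET_BITS + INDEX_BITS); }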
direct-mapped cache
a cache characterized by the property that if a line of memory is stored in the cache, it can only be stored in one place.
fully-associative cache
a cache characterized by the property that if a line of memory is stored in the cache, it can be stored in any place in the cache.
set-associative cache
a cache characterized by the property that a line of memory can be stored in any of k entries in the cache. We say such a cache is k-way set-associative, and the set size is k.
tag/data store
The tag store is the part of the cache that keeps all the bookkeeping information associated with the blocks of memory that are actually stored in the cache. Most important of these is the identity of the actual block. The data store is the storage in the cache allocated for storing the actual information that is contained in the corresponding lines of memory. When we say a cache is of a certain size, we are referring to the size of the data store. [A system having 16MB of memory and a 64KB cache stores 64KB out of the 16MB in the cache's data store. We do not say how much tag store is required. The tag store is just management overhead to allow the cache to function properly.]
cache hit/miss
on a load or a store, the cache is queried to determine if the location's information is contained in the cache. If yes, we say it is a cache hit. If no, we say it is a cache miss.
compulsory miss, conflict miss, capacity miss
The three common causes of a cache miss. A compulsory miss occurs the first time the line is accessed, unless the line has been prefetched into the cache. There is no other way the line could have gotten into the cache. A capacity miss occurs if a line that used to be in the cache had been kicked out simply because there is not enough room in the cache. A conflict miss occurs if a line was kicked out, not because of the size of the cache, but rather because of the size of the set. That is, the cache was not full, but the set for which the line competes was full. Fully associative caches do not have conflict misses.
hit ratio
the percentage of accesses that are hits. That is
        cache hit ratio = hits/(hits + misses)
writeback/write through
a property of caches. A write through cache is one in which every write to the cache is also written to memory. In this case, memory and the cache are always consistent. On replacing a block, it is unnecessary to write that line back to memory. A write back cache is one in which writes to the cache do NOT also include writes to memory. Therefore, if a cache line is to be replaced, its dirty bit is checked to see whether any writes to that line have occurred since it was brought in from memory. If the dirty bit is set, indicating that writes have occurred, then before the line can be removed from the cache, it has to be written back to memory first.
dirty bit
in a write back cache, a bit that specifies whether the information in the cache line has been changed since it was read in from memory. If not, then on replacement, this line need not be written back to memory.
allocate on write miss
the act of setting aside space in the cache for data to be stored, if the line written to is not in the cache.
sector cache
the characteristic of a cache whereby each cache line is partitioned into sectors, each of which is individually valid. Thus, if a cache line is partitioned into four sectors, there are four valid bits in the tag store entry for that line. Useful in those cases where one wishes to allocate on a write miss, but not read in the entire line from memory because it is likely that the entire line will be completely overwritten before it is ever read.
replacement policy (LRU, FIFO, random, victim/next-victim)
algorithm to determine which line within a set is to be evicted if a cache miss occurs, and one of the lines in the set must go. The line to be evicted is often called the victim. The line that will become the victim after the victim is evicted is the next-victim.
key transformation
a fancy synonym for hashing.
virtually-indexed, virtually-tagged cache
a virtually addressed cache.
physically-indexed, physically-tagged cache
a physically addressed cache.
virtually-indexed, physically-tagged cache
a cache wherein the index bits used to access the tag store are obtained from the virtual address, but the tag bits actually stored in the tag store are physical address bits. This allows TLB and tag store accesses to be made concurrently.
cold start effect
in general, the transient effect that occurs in most things before a system is in service long enough to be operating in a stable mode. In the case of a cache, this refers to the case where the cache is completely empty and the first several accesses are compulsory misses.
RAM
Random access memory. Given an access, the time to perform the next access is independent of the location of that access relative to the preceding access.
direct access storage
access is random to a starting address, and then sequential locations are accessed from that point on. Most common type is a disk.
sequential access storage
access is always sequential to the next location. Most common type is a tape.
associative access
access is based, at least in part, on the contents of the storage location. That is, part of the address is actually stored in the storage location. Two common examples that you have already seen of this are (1) the tag store of the cache, where the entry contains the tag bits, which are part of the address, and (2) the TLB, where each entry is a page number and its corresponding PTE. The page number is the associative part of the access.
CAM
Content addressable memory. Memory that is accessed via associative access.
context switch
the act of stopping the execution of instructions from one process and starting the execution of instructions from another process. Before the processor can do that, it needs to save the registers associated with the process being stopped, and load the registers associated with the process being started. This process of saving and loading the context is referred to as a context switch.
cache consistency/coherency
in a shared memory multiprocessor system, where each processor has its own cache, it is important that the copies of a single location stored in multiple caches all contain the same information. If yes, we say the caches are coherent with respect to that location.
multiple levels of caches
A vast disparity exists between the fast access needed by the cache that is dealing with the processor (say, one or two cycles) and the very slow access that is available from memory (say, hundreds of cycles). It is also the case that in order for the cache to be accessible in one or two cycles, it must necessarily be small. A cache can be fast or large; it can not be both. A single small cache would result in too many misses, resulting in too many long accesses to memory. The solution is to have multiple levels of cache: L1 is small, but takes one or two cycles; L2 might take 6 or 8 cycles but can be much, much larger. Sometimes there is an L3 cache, which is slower still and larger still. Finally, there is memory -- largest, but slowest. The notion is to extend the notion of the memory hierarchy to within the levels of cache.
inclusion property
In a shared memory multiprocessor, it is important to have cache coherency across all the caches. That could mean each processor's caches monitoring the traffic associated with all the other processors' caches. It would degrade performance to have the L1 cache involved in this monitoring activity. One solution is to obey the inclusion property, which states that any location present in a processor's L1 cache is also present in that processor's L2 cache. In such cases, the L2 caches can do the monitoring for cache coherency, freeing the L1 caches from that task.
science of tradeoff
My tongue-in-cheek phrase to emphasize the importance of tradeoffs to the discipline of computer architecture. Clearly, computer architecture is more art than science. Science, we like to think, involves a coherent body of knowledge, even though we have yet to figure out all the connections. Art, on the other hand, is the result of individual expressions of the various artists. Since each computer architecture is the result of the individual(s) who specified it, there is no such completely coherent structure. So, I opined that if computer architecture is a science at all, it is a science of tradeoffs. In class, we keep coming up with design choices that involve tradeoffs. In my view, "tradeoffs" is at the heart of computer architecture.
latency
how long it takes something to happen. Memory latency -- the time from when the address is provided until the information stored at that address is available. We will talk about interrupt latency after the spring break -- the time from when some external device requests service until that device starts getting serviced.
PLA
Programmable Logic Array. A combinational structure consisting of an array of AND gates all producing outputs which are fed to an array of OR gates. The general structure of a PLA sources all external inputs to each AND gate, making each AND gate able to identify one input combination (row of a truth table). Also, in the general structure, all AND gate outputs are available as input to every OR gate. Those actually connected as inputs to an OR gate are those that produce a 1 in the output column of the truth table of the relevant logic function. Thus, in principle, any arbitrary logic function of the external inputs can be implemented as the output of one of the OR gates. In practical structures, the fan-in (maximum number of inputs) to an OR gate is fixed, thereby limiting the ability of the PLA to implement all combinational logic functions.
shift and add multiplication algorithm
Recall your experience with the multiplication of decimal numbers. We systematically multiply the multiplicand by each digit of the multiplier and write each corresponding temporary result right-aligned with the corresponding multiplier digit. Then we add up these temporary results, producing the product. In multiplying binary numbers, the only digits are 1 and 0, so the multiplication step consists of simply copying the multiplicand or writing all zeroes as each temporary result. If we are multiplying n bit numbers, we get n temporary results. To add these temporary results properly, we need to take into account their relative alignment. This can be done in n-1 steps, where each step includes the sum of all temporary results before it. That is, after step 1, we have correctly added the temporary results for bits 1 and 2. After step i, we have correctly added the temporary results for bits 1, 2, ... i+1. The (i+1)th step consists of SHIFTING right the (i+2)th temporary result one bit so as to align it with the sum of the first i steps and then ADDING the two quantities. Thus, the name shift and add to describe binary multiplication.
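A sketch in C of 16x16 -> 32 bit unsigned multiplication in this style. For convenience the code shifts the multiplicand left instead of shifting each temporary result right; the relative alignment achieved is the same.

    #include <stdint.h>

    uint32_t shift_and_add_mul(uint16_t multiplicand, uint16_t multiplier) {
        uint32_t product = 0;
        uint32_t mcand   = multiplicand;   /* shifted left each step */
        for (int i = 0; i < 16; i++) {
            if ((multiplier >> i) & 1)     /* multiplier digit is 1:     */
                product += mcand;          /*   ADD the temporary result */
            mcand <<= 1;                   /* SHIFT for the next digit   */
        }
        return product;
    }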
static instruction stream
The program as it sits in memory, or rather out on the disk. Each instruction generated by the compiler exists once in the static instruction stream.
dynamic instruction stream
The *sequence* of instructions as they are executed. A single instruction (located in a single address of memory) counts once in the static instruction stream. Its count vis-a-vis the dynamic instruction stream is the number of times it is executed. The dynamic instruction stream (often called a *trace*) gives a record of the order in which the instructions were carried out.
state of machine
Computer processing is all about systematically changing the state of the machine in response to each instruction in the dynamic execution of the program. That is, a program executes by the machine being in an initial state, and one by one each instruction execution changes the state in accordance with the semantics (specification) of the instruction. The machine state consists of the PC, the PSR, the general purpose registers, the condition codes (if the ISA specifies condition codes), the contents of each memory location, and the mapping registers that enable the memory management process. More sophisticated ISAs provide additional state.
design point
a specification of the criteria to be optimized in the design process. If we are interested in performance, or reliability, or availability, or low cost, we colloquially say our design point is performance, or reliability, or ...etc.
assembly language
a language very close in semantics to the ISA of the machine, with the advantage that bit patterns are replaced by mnemonic devices, and some additional useful features are often present. Nonetheless, the defining characteristic of an assembly language is that the level of control over changing the machine state is very much like what one would have if one programmed in 0s and 1s. Also sometimes called a "low level" language to distinguish it from "high level" languages that are more tuned to the expressiveness of humans. High level languages (more than 1000 out there): C#, C++, Java, C, all the way back to antiquity, when Fortran and Cobol showed up. Assembly language programs are translated by an assembler, which does little more than replace all the mnemonic devices with 0s and 1s. It would be a tough job to try to assemble a program written in the assembly language for ISA A into the 0s and 1s of ISA B. On the other hand, high level language programs are generally translated by a compiler. Since the instructions in the high level language are quite a bit removed from the intricacies of any ISA, it is not difficult to change the ISA that a compiler translates the high level language program into.
instruction level simulator
A simulator that changes its state at the granularity of the instruction. That is, given the machine state and an instruction, the instruction level simulator produces the state required by the instruction. No within-instruction states are simulated.
cycle level simulator
A simulator that changes its state at the granularity of the clock cycle. That is, given the machine state and the control signals asserted during a clock cycle, the cycle level simulator produces the state required by those control signals.
steering bit
A bit in an instruction that specifies the interpretation of other bits in the instruction, depending on the value of the steering bit. For example, bit[5] in the LC-3b is a steering bit, since it determines the usage of bits[4:0] of the instruction.
vector operation
an operation that is performed on a set (or sets) of values (referred to as vectors). For example, a vector A consists of n values A1, A2, A3, ..., An, and a vector B consists of n values B1, B2, ..., Bn. A vector add of vectors A and B would produce the vector consisting of A1+B1, A2+B2, ..., An+Bn. We often refer to this mode as data-parallel, since there are no dependences between the component adds, and they can all be executed in parallel.
prefetching
Fetching instructions or data early so that they are available quickly when they are needed. For example, most microprocessors have units of storage called L1 caches that can be accessed in one or two cycles. It is useful during instruction fetch or data fetch to be able to access this element in one or two cycles. This can be accomplished by obtaining the element from a much slower unit of storage before it is really needed. We call that access in advance of when needed *prefetching.*
natural word length
the generic size of elements processed in the machine. Generally, the size of the registers. Generally, the size of values processed by the ALU in a single cycle.
multi-dimensional array
a classical mathematical structure, having multiple dimensions.
dynamic/static interface
the interface between when the compilation process ends and the run time process begins. The compiler translates the source program, producing 0s and 1s that specify the work to be done. The hardware carries out the work of those 0s and 1s at run time. Although it is now changing, the compiler never used to have run-time behavior available to it to make its decisions. It did not know which cache accesses missed, which branches were actually taken, how long certain operate instructions took. It is true that compilers have for a long time, as part of the compilation process, observed what happened when the program ran on some sample data. This is called *profiling.* But that data is not the *real* data, and in fact can lead to decisions that could be counterproductive, depending on how close the sample data is to the real data. We refer to the DSI as the interface between when all activity is done without real knowledge of real run-time data, and when all activity has that knowledge.
indirect addressing
An addressing mode where the address (A) of the source data is contained in a memory location (B), and the address of B is computed by the addressing mode. We say that B is a pointer variable because its contents are not data but rather the address A of the data.
virtual machine (emulation)

An ISA is implemented in hardware. Each instruction in that ISA is fetched, decoded, etc., and at each step of the instruction cycle, control signals are generated to carry out the phases of that instruction. What if you wanted a program compiled into the ISA of machine A to be executed on a machine that has a different ISA, say ISA B? Clearly, the 0s and 1s of the ISA of A have no meaning with respect to the machine of ISA B. We call the ISA of machine A a virtual machine because it does not correspond to real hardware in our situation. What is needed is a program (usually called an emulator) that translates each instruction of ISA A into a sequence of instructions from ISA B that can then be carried out on the machine of ISA B and produce the same result as if the ISA A instruction had been carried out on a machine of ISA A. We often refer to ISA B as the host machine, since that is where the results are actually being produced, and ISA A as the target machine, since that is the ISA the compiler targeted when it translated the original program. The concept has been around for a very long time, although it is lately gaining a lot of popularity both in research universities and product development centers. An early example was the Burroughs B1700 of the 1970s. Programs were compiled to three different virtual machines, depending on whether the nature of the code was scientific, business, or system level in nature. The resulting three different ISAs were then emulated on the host B1700 machine. More recently, there have been plenty of Java programs that have been compiled into an ISA referred to as the Java Virtual Machine. The result is that other companies have produced software emulators to translate each JVM instruction into the ISA of their own host machines. So, for example, a Pentium M can run JVM instructions if it wishes by emulating each JVM instruction into a sequence of x86 instructions that does that job.

An analogy to virtual memory is in order. Like virtual memory, a virtual machine exists in the eye of the beholder. Recall a virtual address has to be translated to a physical address before it can be accessed. A virtual machine instruction has to be translated to a sequence of instructions in the host ISA before it can be carried out.
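
As a sketch, here is the core of such an emulator in C: a loop that fetches each target-ISA instruction and carries it out with host instructions. The two opcodes and their 16-bit encodings are invented for illustration (they happen to resemble the LC-3b's).

    #include <stdint.h>

    #define OP_ADD  0x1    /* invented encoding: opcode in bits 15:12 */
    #define OP_HALT 0xF

    void emulate(uint16_t *mem, uint16_t *reg, uint16_t pc) {
        for (;;) {
            uint16_t inst = mem[pc++];            /* fetch               */
            switch (inst >> 12) {                 /* decode              */
            case OP_ADD: {                        /* execute on the host */
                uint16_t dr  = (inst >> 9) & 7;
                uint16_t sr1 = (inst >> 6) & 7;
                uint16_t sr2 = inst & 7;
                reg[dr] = (uint16_t)(reg[sr1] + reg[sr2]);
                break;
            }
            case OP_HALT:
            default:
                return;
            }
        }
    }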

flow dependency
the dependency between a result produced by one instruction that is needed by another instruction. For example, consider the expression A*(B+C). We can not do the multiply until we do the add. We say there is a flow dependency between the add and the multiply. A popular well-meaning textbook refers to this as a RAW hazard, meaning that we can not read the source of the multiply until after we write the result of the add. I personally do not like this terminology because it totally distorts the meaning of the word hazard. We will deal with this after Spring break, so you can ignore this for now.
CISC/RISC
In 1980, the term RISC was coined to characterize simple ISAs. The term was an acronym for Reduced Instruction Set Computer, but was ill-advised on a lot of levels, which we may go into late in the semester if there is time. Suffice it to say right now that the intent was to have each instruction do a single operation in a very streamlined way. SPARC, MIPS, HPPA, Alpha, Motorola 88000, and Power-PC are all examples of ISAs that attempted more or less to follow that mantra. x86, Motorola 68000, VAX, and IBM zSeries are all examples of ISAs that did not. The acronym czars referred to the "other" ISAs as Complex Instruction Set Computers. For example, the x86 can load from memory, add to a register, and store back to memory in a single x86 instruction -- three distinct operations. ...which led their advocates to shoot back: CISC stood for Comprehensive Instruction Set Computer.
Load/Store architecture
The term refers to ISAs that require separate load and store instructions to move data to and from memory. That is, a load/store ISA does not allow operate instructions to operate on data loaded from memory in the same instruction. Nor does it allow the result of an operate instruction to be written to memory in the same instruction. So-called RISC ISAs were in general Load/Store. So-called CISC ISAs were in general not.
control signals
signals that control the processing within the data path, the memory, and the microsequencer. The set of control signals that are effective during a single clock cycle are often characterized as a microinstruction.
atomic unit
a unit that can not be further partitioned without losing its intrinsic identity. It takes its meaning from chemistry, where an atom consists of protons, neutrons, and electrons, but once partitioned into these particles, ceases to have the identifying characteristics of the atom. Oxygen is no longer oxygen when its 8 electrons and 8 protons are lopped off. We say an instruction is the atomic unit of processing since we execute code at that lowest granularity. That is, we never execute a piece of an instruction. Either the whole instruction or none of it. That is, we do not change the machine state corresponding to a piece of an instruction and leave the machine state corresponding to another piece of the same instruction unchanged.
byte write
Although memory is usually byte addressable, the memory bus is usually multiple bytes wide. A byte write is a memory access which stores one byte of information into a designated byte-addressable location while ignoring the rest of the bytes on the memory bus. Implementation requires write enable signals of byte granularity.
soft error (Updated: 03/29/2007)

A soft error is an error that is not hard. Duh.

So, what does that mean? We distinguish errors as transient errors and permanent errors. A transient error is one that occurs due to some event that for purposes of the computer is random. Someone started up a washing machine in the same room, and there is not enough noise margin. Or, some alpha particle got in the way, which gets more and more common as the clock gets more and more into the GHz range. The idea is that if you replay the program, the error will not show up the second time.

The other kind of error is the permanent error. Here the device is broken and always gives you the bad result.

Permanent errors are called hard errors. Transient errors are called soft errors.

checksum
the bits used to detect whether a burst error has occurred within a large transfer of information, where the errors are not statistically independent. Suppose I wish to transfer N bits of information from X to Y. A common implementation of checksum error detection is to input the N bits sequentially into an n-bit linear feedback shift register at X concurrently with transferring them to Y. After the transfer is complete, the n bits remaining in the shift register are then sequentially transferred to Y as well. At the destination Y, the process is repeated and the contents of the n-bit shift register after N bits have been input is compared with the last n bits received at Y. If they match, the transfer is assumed to have gone error free. Typical values for N are in the thousands of bits; typical values for n are 16 or 32.
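A sketch in C of the shift-register computation, done bit-serially in software; n = 16 here, and the feedback taps (polynomial 0x1021, the CRC-16-CCITT choice) are one common example.

    #include <stdint.h>

    uint16_t crc16(const uint8_t *msg, int nbytes) {
        uint16_t sr = 0;                        /* the n-bit shift register */
        for (int i = 0; i < nbytes; i++) {
            for (int b = 7; b >= 0; b--) {
                int in = (msg[i] >> b) & 1;     /* next message bit */
                int fb = ((sr >> 15) & 1) ^ in; /* feedback bit     */
                sr = (uint16_t)(sr << 1);
                if (fb) sr ^= 0x1021;           /* the LFSR taps    */
            }
        }
        return sr;  /* X appends these n bits; Y recomputes and compares */
    }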
barrel shifter/rotator (shift matrix rotator)
a large combinational network that allows an n-bit value to be rotated (or simply shifted) an arbitrary number of bit positions. For example, if n is 32, the 32-bit value can be shifted 0, 1, 2, ... 31 bits. The combinational structure could be implemented with 32 32-to-1 muxes, but is more efficiently implemented in multiple layers as a shift tree, as described in class. Inputs to a barrel shifter are the n-bit source and a log n bit encoding of the shift amount. Output is the rotated result.
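A sketch in C of the shift-tree idea for n = 32: five levels, where level i rotates by 2^i positions exactly when bit i of the shift amount is set.

    #include <stdint.h>

    uint32_t rotate_right32(uint32_t x, unsigned amount) {
        for (unsigned i = 0; i < 5; i++) {   /* levels rotate by 1,2,4,8,16 */
            unsigned step = 1u << i;
            if ((amount >> i) & 1)
                x = (x >> step) | (x << (32 - step));
        }
        return x;
    }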
write enable
A signal that must be asserted if a storage structure is to be written. In the technology we use in 360N, if the signal is asserted throughout the cycle, the storage device is written at the end of the cycle.
length register
A register that is part of the virtual memory management structure, identifying how many pages are described by the corresponding page table. This is needed to protect against false translations wherein the page number computed from the virtual address may correspond to a page that is not part of the virtual address space that is mapped. For example, if an executable image consists of 6 pages, the page table will contain 6 PTEs, and the length register will contain the value 5 (pages 0, 1, ... 5). Any virtual address whose bits identify a page greater than 5 must therefore be a bad address. This is easy to detect by simply comparing the page-number bits to the contents of the length register.
interrupts
external events that cause the processor to stop processing the current process, and after putting the machine in a consistent state, vector to the starting address of the service routine specified by the specific vector.
exceptions
events that are internal to the process running, such as illegal opcode, page fault, floating point underflow, etc., that cause the processor to stop processing the current program and deal with the exceptional event. Like interrupts, it is necessary to first put the machine in a consistent state and then vector to the starting address of the exception service routine. The service routine that is executed depends on the exception which has occurred. Each exception (or set of exceptions) has its own vector to tell the machine which service routine to run.
service routine (handler)
see above.
interrupt vector
the bit pattern identifying the particular interrupt, which is used to initiate the correct service routine.
exception vector
see interrupt vector
consistent state
the state of the machine between the execution of instructions. In the middle of execution of an instruction, the internal state of the processor can be whatever the microarchitecture wishes. Even general purpose registers can be clobbered if the microarchitecture wishes. But, at the end of each instruction, the contract with the software requires that the state of the machine be recoverable to the state produced by the program up to that point. That point is the consistent state.
power failure
loss of power causing a very high priority interrupt.
machine check
the highest priority interrupt, indicating that the processor is generating bogus information. Cause: a hardware malfunction.
sticky bit
a condition code that once set retains its condition until interrogated under program control.
interrupt/exception priority level
priority at which an event requests interruption of the running program. Usually, the priority at which the service routine runs. Priority depends on urgency of the event.
maskable
the ability to ignore an event so as to defer handling, or even deny handling altogether.
medium
the material containing the information – the hole in a punched card, the magnetic material on the disk, etc.
device
the I/O unit that houses all the stuff that causes the information to be stored and transformed from one form to another. Disk drive, for example
transducer
the thing that transduces from one medium to another. From light passing through a punched card to the electrical signals. From the magnetized material on the disk track to the electrical signals.
polling
I/O activity controlled by the processor executing its program.
interrupt driven I/O
I/O activity controlled by the device requesting that the processor interrupt what it is doing.
DMA
direct memory access. Allows memory and the disk (for example) to transfer information without tying up the processor.
I/O processor
a special purpose processor that manages I/O activity, including doing some processing on that information which is being transferred.
dedicated bus
bus dedicated to a single purpose.
platter
colloquial word to describe one of the surfaces of a disk.
track
on a platter, a complete circle, containing bits of information.
areal density
the number of bits per unit length along a track.
disk head
that part of the mechanical arm that contains the ability to read the magnetic material stored on the track, or write to the track.
cylinder
the set of tracks, one per surface, that the heads are over simultaneously (the heads all move together on a common arm).
seek time
the time it takes the head to get to the desired track to read or write the information.
rotation time
the time it takes the correct starting point on the track to come under the head where it can be read/written.
disk crash
when the head actually hits the spinning platter. …and scores the surface, usually. BAD.
disk block
the unit of storage accessed from the disk. 512 bytes, for example.
redundancy
the concept of using more than n bits to store n bits of information in order to recover in case one or more of the bits get corrupted. Parity provides one extra bit of redundancy. ECC provides log n + 1 extra bits of redundancy.
RAID-0
striped disk array with no redundancy.
mirroring
aka RAID-1. Two disks to store the information of one. Each bit has its mirror bit.
striping
interleaving, when applied to disk.
RAID-1
see mirroring
RAID-2
array with ECC coding across the disks
RAID-3
array with parity coding. All parity bits in the same disk.
RAID-4
array with larger units of interleaving, so an entire file can be accessed on the same disk.
RAID-5
RAID-3, except the parity bits are stored across all the disks, getting rid of the hot spot of a parity disk.
data lines
lines containing data bits.
address lines
ditto, except address bits.
control lines
ditto, except control bits.
multiplexed bus
a bus that carries address bits at some times and data bits at other times.
pending bus
a bus with the property that once the bus is grabbed for a transaction, the bus is tied up (often twiddling its thumbs) until the transaction has completed.
split-transaction bus
a bus that, instead of twiddling its thumbs, releases the bus to allow other transactions to use it while it is waiting for the second half of the transaction.
pipelined bus
split transaction bus where the order of the second half maintains the order of the first half of each bus transaction.
synchronous bus
a clocked bus.
asynchronous bus
no clock; bus cycles take as long as they need to take. Since there is no clock, handshaking is required.
handshaking
the interaction that starts off each bus cycle that is required since there is no clock.
arbitration
the act of determining who gets the bus for the next bus cycle.
central arbitration
arbitration is done by one central arbiter. (e.g., PAU)
distributed arbitration
each device controller has enough information to know whether or not it gets the bus the next cycle.
priority arbitration unit (PAU)
the central arbiter.
bus cycle
one unit of bus transaction time.
bus master
device controller in charge of the current bus cycle.
slave
the other controller that participates in the bus transaction, under the control of the bus master.
vertical priority
priority determined by the particular BR line.
horizontal priority
priority determined by proximity to the PAU, among all controllers having the same vertical priority.
device controller
that which interfaces the device to the bus, and manages all accesses to/from the device.
bus request
signal from a controller requesting a bus cycle
bus grant
signal from the PAU granting the bus
daisy chaining
mechanism by which all controllers with the same vertical priority receive the bus grant in order, and pass it on to the next if it does not want it.
burst mode
successive data transfers in the same bus cycle.
SACK
signal output by the bus grantee signifying it will be the next bus master.
MSYN
signal used by the bus master in synchronizing the transaction.
SSYN
signal used by the slave in synchronizing the transaction.
fundamental mode
asynchronous operation of a sequential machine wherein at most one input signal changes its value at a time.
pipelining
the assembly lining of instruction processing. Each instruction goes through multiple steps (stages), from Fetch to Retirement/completion of the instruction.
pipeline stages
the set of steps each instruction goes through in the pipeline.
pipeline bubble
a hole in the pipeline, caused by the inability of instructions to flow through the pipeline. For example, following a conditional branch, it may not be possible to know which instruction to fetch next. This causes a pipeline bubble. …or a load followed by the use of that load can cause a pipeline bubble.
branch misprediction penalty
the number of cycles wasted due to branch misprediction where everything speculatively fetched needs to be thrown away before we can start along the correct instruction path.
data dependency
one of the three data dependencies: flow, anti, output.
flow dependency
often called a read-after-write hazard. Example: a + (b × c). Before we can do the add, we must do the multiply. That is, the result of the multiply must be written before it can be read as a source for the add.
anti-dependency
using a register to store more than one value. Before we write the new value to that register, all instructions which need to read the old value have to read it. Sometimes called a write-after-read hazard.
output dependency
writes to the same register must occur in program order. Sometimes called a write-after-write hazard.
control dependency
dependency caused by control instructions, such as branches. That which instruction is executed next depends on which way a branch goes is an example of a control dependency.
data forwarding
information is sent from where it is produced to where it is needed without going first to a register. Speeds up processing.
sequential programming model
instructions are expected to execute in the order specified by the programmer.
stale data
data that was stored in a location but the location has since been reassigned so that the old data is no longer valid for new instructions that come along.
in-order execution
a microarchitecture that executes instructions in program order.
scoreboard
a hardware interlock that prevents an instruction from sourcing stale data.
out-of-order execution
a microarchitecture that executes instructions when they are ready for execution, regardless of their order in the program.
speculative execution
activity that is carried out before the hardware knows it must be (mandatory) carried out. For example, instructions fetched as a result of a branch prediction.
advantages/disadvantages of condition codes
branch prediction
rather than wait for the condition code upon which a branch instruction is based to be resolved (taken or not taken), creating a hole in the pipeline, the hardware guesses and fetches based on that guess. The guess is called a branch prediction.
always taken/always not taken
the prediction is determined by the designers of the chip.
compile-time predictor
the guess is determined by the compiler, usually as result of profiling.
run-time predictor
the guess is determined by the hardware on the basis of what has been happening with the running of the program.
BTFN
guess taken if the branch is to an earlier location in the code, not taken if the branch is to a location later in the program. This is consistent with the behavior of for and while constructs in high level languages. Best example of a compile time predictor.
last time predictor
predict whatever the branch did last time it will do this time.
2-bit saturating counter
advance the counter depending on what the branch does, predict according to the current high bit of the counter.
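A C sketch of one such counter (values 0..3; the high bit being set is the same as the counter being 2 or 3):

    typedef unsigned ctr2_t;                         /* holds 0..3 */

    int predict_taken(ctr2_t c) { return c >= 2; }   /* high bit of counter */

    ctr2_t update(ctr2_t c, int taken)
    {
        if (taken)  return (c < 3) ? c + 1 : c;      /* saturate at 3 */
        else        return (c > 0) ? c - 1 : c;      /* saturate at 0 */
    }

Note that the counter saturates rather than wrapping around, which is the subject of the next entry.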
saturating arithmetic
arithmetic which does not wrap around. That is, the highest value serves the role of infinity. i.e., infinity + 5 = infinity, therefore, highest value + 5 = highest value.
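For example, a saturating add in C (8-bit unsigned values chosen only for illustration):

    #include <stdint.h>

    uint8_t sat_add8(uint8_t a, uint8_t b)
    {
        unsigned t = (unsigned)a + b;
        return (t > 255) ? 255 : (uint8_t)t;   /* 255 plays the role of infinity */
    }

With ordinary wraparound arithmetic, 255 + 5 would give 4; with saturating arithmetic it gives 255.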
register renaming
also called aliasing. Assigning unique register names to independent uses of the same register.
reservation station
storage location for an instruction awaiting operands before it can execute.
Tomasulo's algorithm
the algorithm which assigns register aliases that allows an instruction to move to a reservation station, while maintaining all producer/consumer relationships. This maintenance of the producer/consumer relationships is the most important single thing needed to allow instructions to actually execute out of order.
tag
another word for the alias or distinct identifier assigned to each instance of a register.
in-order retirement
retirement of instructions occurs in the order of their location in the program. Necessary to obey the sequential programming model.
register alias table (RAT)
the table of aliases used by the Tomasulo algorithm to assign tags. Each register has a valid bit, a tag and a value. Depending on the valid bit, either the tag or the value is correct.
common data bus (CDB)
in the data path of the IBM 360/91, the bus used to distribute values and corresponding tags to registers in the RAT and in the reservation stations that are waiting for the value uniquely assigned to that tag.
scheduling algorithm
an algorithm used to decide which of several ready instructions should be executed next.
fire when ready
schedule the instruction for execution (that is, send it to a functional unit) when all data that it needs is available.
data flow
the paradigm wherein instructions are executed when their data is available, rather than when they show up in program order.
data flow graph
a directed graph showing all the instructions as nodes, and the producer/consumer relationships as arcs.
nodes
instructions
edges
or arcs. Shows result produced by one instruction that is sourced by the other.
producer
the instruction producing the result that is sent to various consumers.
consumer
an instruction that requires the result as a source.
restricted data flow
data flow in the Tomasulo sense, wherein the data flow graph contains only that subset of the full graph corresponding to instructions that have been decoded but not yet retired. I coined the term in 1984 to differentiate it from the data flow graph of the entire program.
vectorizable
a loop in which each iteration of the loop is completely independent of the other iterations.
vector processing
the paradigm we discussed in class for handling vectorizable loops.
SIMD
single instruction stream, multiple data sets. Sometimes called data parallel. Two forms: vector processors and array processors.
vector load
multiple values loaded as a result of a single load instruction.
vector register
a register consisting of multiple components, useful for storing the result of a vector load, vector add, etc.
vector length register
identifies how many of the components of a vector register, or vector operation, are operative.
row-major order
an array stored in memory in the order: first the components of the first row, then the components of the second row, etc.
column-major order
an array stored in memory in the order: first the components of the first column, then the second column, etc.
stride
the distance in memory between successively accessed elements, i.e., between the elements that end up in successive components of a vector register.
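A C sketch: walking down one column of a row-major array makes the stride visible (the dimensions are my choices for illustration):

    #define M 4
    #define N 8

    double column_sum(double a[M][N], int col)
    {
        double sum = 0.0;
        for (int row = 0; row < M; row++)
            sum += a[row][col];   /* consecutive accesses are N elements
                                     apart: the stride is N */
        return sum;
    }

Walking along a row instead would have stride 1.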
vector stride register
register containing the current stride value.
vector chaining
instruction processing wherein the results of one vector operation can be forwarded to the next vector operation without waiting for all components of a vector instruction to be completed.
scalar computer
what you are used to. No vector registers, no vector loads, stores, adds, etc.
reciprocal approximate
a quick way to do division. a divided by b is replaced by a times (1 divided by b). The reciprocal approximate unit takes b as input and produces an approximation to 1 divided by b.
Backup registers
Cray program controlled cache, sort of. Eight S registers are backed up by 64 T registers. Instead of going to memory again and again, one can do a vector load to the T registers.
Software managed cache
a tongue in cheek description of Cray T registers.
Loop buffers
in the Cray machines, an instruction buffer containing enough storage to accommodate an entire loop.
Stripmining
The process of handling more than 64 iterations of a loop, 64 at a time.
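A C sketch of stripmining a vector add (64 is the vector register length; the inner loop stands in for a single vector instruction executed with the vector length register set to vl):

    void vadd(double *c, const double *a, const double *b, long n)
    {
        for (long i = 0; i < n; i += 64) {
            long vl = (n - i < 64) ? n - i : 64;   /* vector length register */
            for (long j = 0; j < vl; j++)          /* one vector add */
                c[i + j] = a[i + j] + b[i + j];
        }
    }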
Fixed point arithmetic
the generalization of integer arithmetic. The decimal point (or binary point) is always in the same place; in the case of integer arithmetic, at the far right. No bits of a data element are wasted identifying where the binary point is.
2's complement
one representation of integers
1's complement
ditto
signed-magnitude
ditto
word length
the size of the ALU, the size of the registers, the default size of data elements.
short integer
an integer of the size of the word length.
long integer
integers of a size that is a multiple of the word length. Not supported by the ISA. However, software can be written to deal with long integers. Example in class: with a word length of 32 bits, the x86 can operate on 320-bit integers by means of a procedure that processes the values 32 bits at a time. It takes a loop that will iterate 10 times.
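A C sketch of the class example, assuming the least significant word is stored first (the word order is my choice for illustration):

    #include <stdint.h>

    /* 320-bit add, 32 bits at a time: ten iterations, with the carry
       propagated from each word into the next. */
    void add320(uint32_t sum[10], const uint32_t a[10], const uint32_t b[10])
    {
        uint32_t carry = 0;
        for (int i = 0; i < 10; i++) {
            uint64_t t = (uint64_t)a[i] + b[i] + carry;
            sum[i] = (uint32_t)t;
            carry  = (uint32_t)(t >> 32);
        }
    }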
Binary Coded Decimal (BCD)
an encoding of integers, such that each decimal digit is encoded in 4 bits. Number of digits is usually arbitrary, which means one needs two pieces of information to specify a BCD number: starting address and number of digits.
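A C sketch of packing a value into BCD, least significant digit first, two digits per byte (the packing order and the helper name are my choices for illustration):

    #include <stdint.h>

    void to_bcd(uint8_t *out, unsigned v, int ndigits)
    {
        for (int i = 0; i < ndigits; i++) {
            int shift = (i % 2) ? 4 : 0;              /* low nibble, then high */
            out[i / 2] &= (uint8_t)~(0xF << shift);   /* clear target nibble */
            out[i / 2] |= (uint8_t)((v % 10) << shift);
            v /= 10;
        }
    }

So 1234 with ndigits = 4 packs as the bytes 0x34, 0x12: digit 4 in the low nibble of byte 0, digit 3 in its high nibble, and so on.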
Shift and Add multiplier
classical way to do binary multiplication. Since multiplying by 1 simply means adding in the multiplicand, and multiplying by 0 means adding zero, both after aligning (shifting) properly, multiplication is usually described as a series of shifts and adds, one for each bit in the multiplier.
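A C sketch, one add per multiplier bit (unsigned operands chosen to keep the sketch simple):

    #include <stdint.h>

    uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier)
    {
        uint64_t product = 0;
        uint64_t addend  = multiplicand;    /* aligned by shifting each step */
        for (int i = 0; i < 32; i++) {
            if ((multiplier >> i) & 1)      /* multiplier bit 1: add it in */
                product += addend;          /* bit 0: add zero, i.e., skip */
            addend <<= 1;
        }
        return product;
    }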
priority/privilege
Priority is about urgency, privilege is about the right to do something. They are orthogonal issues. We sometimes confuse them in our human experiences because sometimes they go hand-in-hand. Example: the fire truck runs a red light to get to the fire. Running the red light is about privilege. We give it that privilege because of the urgency of getting to the fire. There are examples where they do not go hand-in-hand. Recall my example in class of the mail clerk who had to go to the bathroom while waiting for the elevator on the 55th floor. In computer technology, they really are orthogonal. Real-time experiments may run at higher priority, because they need to sample the sensors at a precise moment in time. But they can not execute RTI or turn off the computer, or access certain files. System administrators can do that, even though they may run at a low priority level because it is not urgent when they do their job, just that they do it.
post increment (auto increment)
This is an addressing mode that is common on many older machines and a few more recent machines. The address of the memory location containing the operand is in a register. After the address is read, the register is incremented, so the next time that instruction is executed, the register will point to the next data element. Side note: since data usually occupies some number of bytes, the register is incremented by the number of bytes of the operand.

Memory Hierarchy (Caches)

register spill
Registers are useful for storing values that will need to be sourced again, rather than writing them to memory and having to read them from memory again. More registers mean more values can be kept in registers. Unfortunately, there are never enough registers, so at some point we need to write the values of registers to memory in order to make the registers available to hold new temporary values. We call that process of writing the registers to memory to temporarily get them out of the way and make room: register spill, and the compiler output that does this “register spill code.”
write back buffer (store buffer)
Loads must be serviced as soon as possible, since subsequent instructions are usually dependent on the data being loaded. Stores, on the other hand, can take place when convenient. [Note: in a multiprocessor, that may not be quite so simplistically stated, but that is an issue we will deal with later this semester.] A store buffer allows the processor to put values to be written to the side and not tie up bus bandwidth unnecessarily. For example, on a load miss to a write back cache, put aside the dirty cache line to write to memory until after the new cache line is read in. For example, on a write through cache hit, we may want to buffer multiple writes to the same cache line, so as to only have to do one memory write and tie up bus bandwidth once. The write buffer is an extension of the memory system and needs to be considered as such. That is, an access to the cache also accesses the write buffer.
uniprocessor cache consistency
We probably also ought to note here that the rest of the world would call this “coherency” not “consistency.” Normally, we think of cache coherency only when we have more than one processor that can write to its own private cache. We will deal with this more after the exam. It is worth noting however that in a uniprocessor environment with intelligent I/O devices, the I/O device (e.g., DMA controller) can independently write to memory, in which case data in the cache may no longer be valid. Ergo, in an environment with intelligent I/O devices, the cache must monitor independent accesses to memory from the I/O subsystem in order to be sure that data in the cache has not been invalidated by what the I/O device wrote to memory.

Input/Output

volatile/non-volatile
Volatile storage loses its data when power is turned off. Memory is volatile. Non-volatile storage retains its data when power is turned off. CD-ROMs, disks, tapes are all non-volatile.
carrier sense multiple access/collision detect (CSMA/CD)
The way the Ethernet works. The ethernet is an asynchronous bus. Device controllers are attached to the bus. Any controller can access the bus at any time. How do we keep from transmitting garbage? Answer: CSMA/CD. Each controller senses the carrier to see if anyone else is accessing the bus. If yes, the controller does not access the bus. All (multiple) the controllers are sensing the carrier all the time. If no one else is using the bus, the controller is free to put signals on the bus. However, it may be, due to the nature of the asynchronous bus and propagation time of those signals, that controller A may think there is no one else using the bus because the signals being propagated by controller B may not have propagated all the way to A yet. Thus, controller A must continue to monitor the bus to be sure it is always the case that the signals controller A is putting on the bus are the same as the signals controller A is monitoring. If they differ, controller A has detected a collision (CD). In this case, controller A stops transmitting, waits a random amount of time, and then tries again. Meanwhile, controller B will have detected the collision due to A, so it stops transmitting, waits a random amount of time, and then tries again. The hope is that randomA is not equal to randomB. If they are equal, we repeat.
protocol
Taken straight from our embassies. The sequence of actions that must take place in order to transact business. In the case of an asynchronous bus, business is the transmission of a value from A to B. Also called handshaking.
race condition
A condition where the result depends on the inadvertent propagation delay of signals from one device to another.
starvation
A condition in which a controller may never get the bus it is requesting. Recall the example of the daisy chain grant of the BG signal.
bus busy
A control signal indicating that a bus cycle is in progress.

Pipelining and Branch Prediction

dual fetch
delayed branch
A branch instruction whose effect does not take place immediately. That is, the semantics of the branch instruction is: execute n more instructions and then branch.
delay slot
The number of instructions between the branch and when it takes effect. Also refers to the instructions that are contained in those n locations after the branch that will execute before the branch redirects the instruction stream.
fall through problem
performance metric
The basic equation for performance is:
1 ÷ (N × CPI × t)
N is the number of instructions in the program being measured. CPI is the average number of cycles it takes to complete (retire) each instruction. t is the cycle time of the clock. Note that if I make the cycle time t smaller, I probably increase CPI. CPI vs. t is a fundamental tradeoff. Note also that t is 1/freq, and CPI is 1/IPC. Most machines quote their IPC values (bigger is better), although the microarchitect is more interested in CPI, since the microarchitect is interested in eliminating some of the cycles that contribute to the CPI of an instruction. Recall, when we talked about virtual memory, I mentioned that the VAX 11-780 had a CPI of 10.5, that each VAX instruction on average made one memory reference, and that the page table translation mechanism took 22 cycles. The reason the CPI is only 10.5 is because there is a TLB on that machine which takes no extra time when it hits. Since it hits 95% of the time, only 5% of the time does the 22 cycle translation take place. 5% of 22 cycles is approximately one cycle. That is, approximately one cycle of the 10.5 CPI is due to virtual to physical memory translation.
branch target buffer (branch history table)
A mechanism that is helpful in predicting branches. It generally contains the target address of a taken branch plus a bit to tell the instruction fetch mechanism whether to predict taken or the fall-through path. The prediction can be used to select the source of a mux that will get loaded into the PC for the next cycle. If the prediction is taken, we load the address from the BTB. If we predict not taken, we load the existing PC plus the instruction size in bytes.
2-level adaptive branch predictor
The branch predictor we invented in 1990, where the branch decision is based on what the processor did in previous cycles where its history was the same as the current history. We need a history register to store the current history, and a table of 2-bit counters to store what the processor did in the past when it had exactly the same history.
history register
See above.
pattern history table
The table of 2-bit counters. If the history register contains n bits, the table contains 2^n 2-bit counters.
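A C sketch with a global history register (n = 8 and the use of a single global history, rather than per-branch history, are my choices for illustration):

    #include <stdint.h>

    #define HBITS 8
    static uint8_t  pht[1 << HBITS];    /* 2^n two-bit counters, values 0..3 */
    static unsigned history;            /* last HBITS branch outcomes */

    int predict_branch(void)
    {
        return pht[history & ((1u << HBITS) - 1)] >= 2;
    }

    void train_branch(int taken)
    {
        unsigned idx = history & ((1u << HBITS) - 1);
        if (taken)  { if (pht[idx] < 3) pht[idx]++; }   /* saturating update */
        else        { if (pht[idx] > 0) pht[idx]--; }
        history = (history << 1) | (unsigned)(taken != 0);
    }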

Out-of-Order Execution

register renaming (aliasing)
Recall that because we do not have an infinite number of registers, we need to use the same register to hold different values at different times in the flow of the program. Due to the nature of out-of-order execution processing, it is likely that two instructions that write to the same register could be active at the same time. Thus it is important to separately identify each INSTANCE of each destination write. We do this by assigning a tag (usually a physical register number) to each destination operand. In Tomasulo's time, that tag was referred to as an alias. Recall our figure in class – the Greek letters represented tags. Recall the data flow example – the tags correspond to arcs.
retirement (commit, completion)
Important in an out-of-order execution machine, where precise exceptions must be maintained. Instructions execute when they are ready, but they must commit their results in the order of the program (in program order) – hence the term “in-order retirement.” We use the word retire or commit to designate that act of permanently changing the destination register.
reorder buffer (ROB)
A convenient register file that allows destination writes to take place in any order, while reads for the purpose of retirement are done in program order.
architectural/physical register
Architectural registers are what the software knows about. Physical registers are internal storage that the microarchitecture uses to maintain the correctness of the execution. The re-order buffer for example is a file of physical registers.
address unknown problem (memory disambiguation problem)
A ST instruction has an addressing mode that requires the contents of a register. A younger LD instruction has a known address. Can the load take place before the store? It depends on whether the addresses they are accessing are different. What if the address associated with the store is not yet known? We have a problem. Maybe yes, maybe no. I called it the unknown address problem. Josh Fisher carried the day with his name for it: memory disambiguation, the requirement to disambiguate these two addresses. That is what everyone calls it today. What to do about it:
memory order buffer (MOB)
The structure that keeps track of the memory accesses so as to be able to quickly tell if there is a disambiguation problem.

Vector Processors

load/store pipe
superscalar processor
A processor that fetches more than one instruction at a time and by virtue of the logic it entails (multiple decode, branch prediction, out-of-order execution) is able to obtain (hopefully) an IPC greater than 1. Note: it does not have to obtain an IPC greater than 1. Rather, it has the enablements to accomplish that. Some have argued that as long as more than one instruction is being fetched and decoded each cycle, and that number is determined at run-time, it is a superscalar processor. The name derives from the fact that before these multiple fetch processors came around, the only processors that had an IPC greater than one were vector processors. Ergo, these scalar processors that provide, in some sense, capability previously reserved for vector processors were not simply scalar processors; they were super scalar processors.
Flynn's bottleneck
As long as we fetch one instruction at a time into our pipeline, we can never get an IPC at the other end of the pipeline greater than one.
loop unrolling (software pipelining)
The name given to the process of executing multiple loop iterations before encountering a branch. For example, a loop that iterates ten times could be unrolled once, such that each iteration now consists of what was before two iterations. That is, the code would do iteration one, iteration two, and then test to see whether to branch back. The technique is used in those cases where results produced in one iteration are needed by the next iteration, wherein if the loop is unrolled the whole second iteration does not have to wait for the completion of the whole first iteration.
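A C sketch of a loop unrolled once, so each trip through the loop does the work of two original iterations before the branch test (independent iterations, and n even, assumed for simplicity):

    void scale(double *a, int n)
    {
        for (int i = 0; i < n; i += 2) {
            a[i]     *= 2.0;    /* original iteration i   */
            a[i + 1] *= 2.0;    /* original iteration i+1 */
        }
    }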
Booth's algorithm
an improvement over the standard shift and add algorithm for multiplication, enabling each iteration to consume two bits of the multiplier rather than one.
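A C sketch of the radix-4 (two bits per step) recoding; a 3-bit window of multiplier bits selects a digit in {-2, -1, 0, +1, +2}:

    #include <stdint.h>

    int64_t booth_mul(int32_t multiplicand, int32_t multiplier)
    {
        /* digit for the window b(i+1) b(i) b(i-1) */
        static const int digit[8] = { 0, 1, 1, 2, -2, -1, -1, 0 };
        uint32_t m = (uint32_t)multiplier;
        int64_t product = 0;
        unsigned prev = 0;                   /* b(-1) = 0 */
        for (int i = 0; i < 32; i += 2) {    /* two multiplier bits per step */
            unsigned w = (((m >> (i + 1)) & 1) << 2)
                       | (((m >> i) & 1) << 1) | prev;
            product += (int64_t)digit[w] * multiplicand * ((int64_t)1 << i);
            prev = (m >> (i + 1)) & 1;
        }
        return product;
    }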
residue
another word for remainder
modulus
In a modulo scheme of any sort, the number of distinct values. In practical matters, if I say “A modulo p,” I am asking what is the remainder when I divide the value A by the modulus p. Example: 10 modulo 3 = 1.
pairwise relatively prime
A number of values are pairwise relatively prime if the greatest common divisor of every pair equals 1. For example, 7, 8, and 9 are pairwise relatively prime since the gcd of 7 and 8 is 1, the gcd of 7 and 9 is 1, and the gcd of 8 and 9 is 1. Note that 8 and 9 are not prime.
Chinese remainder theorem
That if I have a set of k moduli, p1, p2, … pk, that are pairwise relatively prime, and if I represent each value as the set of k residues modulo the corresponding moduli, then the residue representations of the values 1 to the product of the k moduli are all distinct.
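A small C demonstration with moduli 7, 8, 9 (pairwise relatively prime, product 504), printing the residue triples for a few sample values; by the theorem, all 504 values in one full range get distinct triples:

    #include <stdio.h>

    int main(void)
    {
        const int m[3] = { 7, 8, 9 };
        for (int v = 0; v < 504; v += 100)      /* a few sample values */
            printf("%3d -> (%d, %d, %d)\n",
                   v, v % m[0], v % m[1], v % m[2]);
        return 0;
    }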
floating point
a data type supported by the ISAs of most computers. It allows larger and tinier numbers than are possible with an equal number of bits using a fixed point representation. This is accomplished by using some bits to express the size of the value (exponent bits) at the expense of the number of bits of precision. A magnitude is represented by a fractional part and an exponent part, where the fractional part contains the significant bits of precision, and the exponent part specifies the size. A value can be expressed in floating point as either a normalized number or, if it is too small to be expressed as a normalized number, as a “subnormal number.” Normalized numbers are of the form (-1)^sign × 1.fraction × radix^exp, where the 64-bit format in IEEE arithmetic specifies the bits as follows: [63:63] is the sign bit, [62:52] is the exponent field, and [51:0] is the fraction field. If the value is too small to be represented as a normalized number, it can still be represented in the form (-1)^sign × 0.fraction × radix^smallest_exp, where [63:63] is the sign bit, [62:52] contains the smallest exponent, and [51:0] is the fraction field.
signed magnitude
the representation of a value that uses one bit for sign and the remaining bits for magnitude. Used in floating point data type.
exponent
the bits used to describe the size of the datum. See floating point.
mantissa (aka significand or fraction)
the part of the representation that contains the significant bits of the value.
normalize
to put into a standard (i.e., “normal”) form. In the case of IEEE Floating Point arithmetic, that form is 1.fraction × 2^exponent.
radix
the base, as in 2^exponent ---> 2 is the radix.
excess-code (bias)
A code that uses an excess or bias to compute its code words. To obtain a code word, we add the bias to the value to be encoded. For example, with an excess-7 code, the value -1 would be expressed in eight bits (for example) as 00000110.
binade
the interval from 2^n to 2^(n+1).
wobble
the ratio of error introduced by rounding when the result of rounding is a power of the radix. For radix 2, one ULP larger than 2^n is twice as large as one ULP smaller than 2^n. Therefore the wobble is 2.
precision
the number of bits of significance in the representation of a value. For 52 bits of fraction, the value has 53 bits of precision.
ULP (unit in the last place)
the value of 0.000…001 × 2^exponent. For example, with 4 bits of fraction and exponent = -2, one ULP is 1/64.
underflow
an exception caused by a computed value being too small in magnitude to be expressed other than by 0.
overflow
an exception caused by a computed value being too big to be expressed by any normalized finite value. Rounding will result in the value being represented by infinity or by the largest number that can be represented exactly, depending on the rounding mode used.
chopping
one of the four rounding modes. Rounding is accomplished by removing (chopping) the least significant bits. Both positive and negative values are rounded toward zero.
round up
one of the four rounding modes. Rounding is accomplished by rounding toward the next larger value. All values are rounded toward +infinity.
round down
one of the four rounding modes. Rounding is accomplished by rounding toward the next smaller value. All values are rounded toward -infinity.
round to unbiased nearest (beautiful rounding)
one of the four rounding modes. Rounding is accomplished by rounding to the next value (greater or smaller) that is closer in magnitude to the value being rounded. If equidistant, then we round to the value (of the two) that has a 0 in its ULP.
error due to underflow
amount of error caused by representing a value by 0.
error due to inexact
amount of error caused by rounding.
gradual underflow
the result of introducing subnormal numbers into the representation scheme so that the error due to underflow is equal to the error due to rounding as values approach 0. Subnormal numbers are numbers too small to represent as normalized numbers. Consider the smallest normalized number; we call it 1.000…000 × 2^n. The binade delimited by 2^n and 2^(n+1) contains 2^k exact values, where k is the number of bits of fraction. This binade is exactly the same size as the interval between 0 and 2^n. The subnormal numbers are those values 0.fraction × 2^n. Since the fraction consists of k bits, there are exactly 2^k exact values in this interval. Since this interval is the same size as the smallest binade, the error due to underflow is the same as the error due to rounding. Professor W. Kahan (the inventor of IEEE Floating Point Arithmetic) gave this notion the name “gradual underflow.”
subnormal
see gradual underflow.
floating point exception
there are five things that can cause an exception: underflow, overflow, inexact, divide by 0, and invalid. Exceptions can be quiet, in which case a sticky bit is set which can only be cleared by testing for it (usually used for inexact) or signalling, in which case the exception is taken immediately (always used for invalid).
divide by 0 exception
an exception caused by a computation that uses finite arguments and produces infinite results. Called “divide by 0” because the simplest example of this is a finite number divided by 0. Another example of this is the tangent function where the argument is 90 degrees.
invalid exception
an exception caused by an operation that produces nonsense results. Examples include 0/0, infinity/infinity, sqrt(-n), arcsin(2).
NaN (not a number)
the result of an operation that creates nonsense. see invalid exception.
SIMD (data parallel)
processing that involves a single instruction stream such that each instruction operates on multiple data sets. Two forms of SIMD processing exist: array processors and vector processors.
SPMD (single procedure multiple data set)
a variation of SIMD, wherein the level of synchronization is a procedure rather than each instruction.
SISD
single instruction stream operating on a single data set.
MIMD
multiple instruction streams each operating on its own data set. The classic form of a multiprocessor or multicomputer.
vector processor
a processor containing vector instructions, i.e., instructions that operate, in turn, on multiple values.
array processor
a set of processors such that all execute the same program, each executing the identical instruction in lock step, each step of the way.
granularity of concurrency
the scope of what is being processed simultaneously. Granularity can be at the intrainstruction level, as in pipelining, or at the interinstruction level, as in VLIW, or at the processor level, as in a multiprocessor.
multiprocessor
more than one processor operating concurrently. The pure definition of a multiprocessor requires that each processor carry on substantial work by itself (to differentiate it from a co-processor), and that all the processors share the same memory space. Sometimes this form of multiprocessor is referred to as a tightly-coupled multiprocessor …as compared to a loosely-coupled multiprocessor where there is no shared memory. The more proper term for a loosely-coupled multiprocessor is multicomputer network. But most people refer to the two paradigms as tightly and loosely coupled multiprocessors. In a shared memory multiprocessor, values can be passed from one to the other via shared memory locations. In a loosely-coupled multiprocessor, they are passed via message passing.
tightly coupled MP
see multiprocessor.
shared global memory
see multiprocessor.
loosely coupled MP
see multiprocessor.
message passing
see multiprocessor.
superscalar
a microarchitecture paradigm where more than one instruction is fetched and decoded each cycle, the exact number determined by the fetch/decode engine, that number dependent on the actual instructions fetched.
IPC (instructions per cycle)
a common metric of performance. The number of instructions completing execution on average per cycle.
Flynn's bottleneck
Michael J. Flynn's observation that as long as one fetches instructions one per cycle into the pipeline, the machine will never obtain an IPC greater than 1.
superpipelined
a term used to describe a pipeline wherein the instruction fetch rate is a multiple of the rate at which instructions can be processed by one of the functional units in the pipeline.
scalability
a measure of the ability of a design to produce linearly improving benefits as the number of elements in the design continues to increase linearly. For example, if we add k ALUs, do we get k times the performance? If we increase the frequency by x, do we compute the answer x times faster?
speedup
a measure of the improvement in performance achieved by adding processing capability to a problem. In its pure form,
Speedup(p) = Time(1) ÷ Time(p)
where Time(1) is the time a single processor requires to compute the answer using the best available algorithm for a single processor, and Time(p) is the time the processor requires to compute the answer using p processors.
baseline
in any measurement where we are trying to evaluate the benefit of a feature, the baseline is the system that does not include the feature. Note: some less than honest purveyors will quote benefits due to a feature by selecting a very wimpy baseline. In the case above which discusses Speedup with p processors, a poor baseline would be to use an algorithm for one processor that is a particularly bad algorithm if we were restricted to one processor. Thus we would have a large Time(1), which would make Speedup(p) look better than it really is.
Amdahl's Law
an observation by Gene Amdahl that most problems have a parallel part and a sequential part, such that the addition of p processors reduces the execution time of the parallel part by a factor of p, but does nothing to improve the execution of the sequential part. We call the sequential part, for which the p processors can do nothing, the sequential bottleneck. If we let a equal the fraction of the problem that is the parallel part, then we can state Amdahl's Law in terms of speedup:
Speedup(p) = Time(1) ÷ ((a × Time(1)) ÷ p + (1-a) × Time(1))
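For example, suppose a = 0.95 (95% of the work is parallelizable) and p = 8. Dividing through by Time(1), Speedup(8) = 1 ÷ (0.95 ÷ 8 + 0.05) ≈ 5.9, not 8. And no matter how large p grows, Speedup(p) can never exceed 1 ÷ (1-a) = 20. That limit is the sequential bottleneck.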
sequential (serial) bottleneck
see Amdahl's Law.
parallelizable
see Amdahl's Law. That part that can be executed in parallel.
NUMA (non-uniform memory access)
a distributed shared memory multiprocessor, wherein processors can access all of memory, but the access time varies depending on which part of the distributed memory a processor is accessing.
memory contention
the result of more than one processor attempting to access the same memory bank at the same time.
cache coherence problem (cache consistency)
in a shared memory multiprocessor, it is most common for each processor to have its own private cache. If more than one processor has in its cache the contents of the same memory location, it is important that those respective contents be identical. We say that the caches are coherent when this is the case. Caches are not coherent if, for example, both P1 and P2 cache location A, but P1 has one value for A in its cache and P2 has a different value for A in its cache.
fault tolerance
the ability of a processor to compute correctly even though there are some faults in the hardware.
high availability
a more recent term for fault tolerance. Availability refers to the fraction of time a computer is processing correctly, and not “down.” A high availability machine is one that is “down” for no more than a very few minutes a year. To accomplish such, the computer system must be able to operate in the presence of faults; therefore, the term: fault tolerant.
TMR (triple modular redundancy)
one mechanism for achieving fault tolerance. Three machines execute the program in lock step, and continually vote on the results they produce. The probability that two are both wrong and give the same wrong answer is sufficiently remote that processing continues correctly as long as two give the same answer. When one of the three gives a different answer from the other two, it is assumed to be broken. It is replaced with a known good machine, and the system continues. The system is continually available as long as the other two machines give identical answers during the period where the broken machine is replaced. We call the frequency of faults in a processor the MTTF (mean time to failure), and the time required to replace the faulty machine with a good one MTTR (mean time to repair). As long as MTTF is greater than MTTR, we should do just fine.
pair and spare
another scheme for maintaining error-free operation in the presence of single processor failures. Two pairs of processors, all four executing. Each pair compares results with its partner. As long as all four give the same answer, we are fine. When the partners of one pair disagree, we enter vulnerable mode, where we use the other pair for computation while we fix the failing pair. Again, MTTF and MTTR tell us how good our scheme is.
MTTR (mean time to repair)
see TMR
snoopy cache
a mechanism for maintaining cache coherency. This is done by allowing each cache to monitor the accesses of other processors to memory so that it knows whether a particular cache line is exclusive to itself or shared with other caches. …or is being written by another processor and so it (this cache) must either invalidate its copy or update it with the value being written by the other processor.
VLIW (very long instruction word)
a paradigm wherein the fetch/decode mechanism fetches n instructions from the instruction memory each cycle, and executes the n instructions concurrently. The value n is a fixed value for the machine. The set of n instructions fetched in one cycle, that is the contents of one VLIW word, is specified by the compiler. In order for the n instructions to execute concurrently, they must all be independent with respect to flow dependencies. It is the job of the compiler to identify n instructions that can be packaged together as a single VLIW word.
lock step
a characteristic of a VLIW machine is that the n instructions are executed by n processing elements in lock step, that is, at the same time. Further, if one processor finishes before the others, it can not fetch its next instructions until all have finished and the next fetch is done at the same time for all n processing elements.
decoupled access/execute (DAE)
a relaxation of the VLIW paradigm wherein the processors do not have to execute their instructions in lock step.
EPIC (explicitly parallel instruction computer)
a modern variation of VLIW, wherein additional bits (called template bits) are provided to specify which instructions in a very long word can be operated in lock step, and which have interinstruction dependencies, and therefore, can not be.
template bits
see EPIC.
trace scheduling
a profiling technique used with VLIW machines to find branches that are highly skewed so as to schedule some of the instructions in a VLIW word by violating control dependencies. Since 98% taken (a highly skewed branch) is not the same as 100% taken, additional instructions have to be provided to prepare the machine to undo instructions that should not have been executed if the branch behaves according to the other 2%, and still additional instructions to actually do the undo when that “bad” branch behavior occurs. The benefit of this technique is that most of the time the branch behaves according to the 98% behavior, and so additional parallelism can be exploited. The idea is that the undo time is small and is more than compensated for by the extra parallelism 98% of the time.
restricted data flow
my word for the data flow graph produced by generating it (a la Tomasulo) at run time. A compile time data flow representation of a program would be an enormous data flow graph. Generation at run time restricts the data flow graph to only those instructions in the active window. The active window is the set of instructions that have been fetched but not retired yet.
active window
see restricted data flow.
dataflow node
an operation in a data flow graph.
fine-grain dataflow
a dataflow graph wherein each node performs a small granularity operation, like a single multiply or a single add.
coarse-grain dataflow
a dataflow graph wherein each node performs a larger granularity function, like perhaps an entire high level assignment statement, or an iteration of a loop body.
copy node
a node which takes a value as input and produces two copies of that value as output. A consequence of pure data flow as an abstraction wherein each input value to a data flow node is consumed when the node executes.
tokens
another name for values that are input to data flow nodes or results that are output from data flow nodes.
firing rule
the rule for when a data flow node can be scheduled for execution. Safe vs. queued (see the next two entries).
queue dataflow
wherein a node fires when all inputs have tokens. The outputs produced are queued, waiting for their subsequent use.
safe dataflow
wherein a node fires when all inputs have tokens, and there is an available slot for the result. Pro: queues are not required. Con: the node is kept from firing until an output slot is available, even though it possibly could have fired earlier.
conditional nodes
a node having two inputs and two outputs, the outputs being labeled true and false. This node takes two tokens (a value x and a boolean value) and passes the value x along the output path corresponding to whether the boolean token is true or false.
relational nodes
a node having two inputs and one output, the output being a boolean. The node performs the relation operation on its two inputs and produces the output token TRUE if the relation is true, or FALSE if the relation is false.
barrier synch
a node having k inputs and k outputs, such that when all inputs have tokens, the node fires, producing on each output line the token that was on its corresponding input line.
register spill code
that part of an object program generated by a compiler that writes values back to memory because there are not enough registers in the microarchitecture to accommodate all the temporaries required of the high level program.
live-in
a value that exists prior to the entity under consideration. With respect to a unit of code, a live-in is a value that is a source for one of the instructions in the unit of code.
live-out
a value that is produced in the unit of code and is required as a source by code subsequent to the unit of code.
in-order retirement
instructions complete their execution (as far as the software knows) in program order – that is, the same order that the program dictates.
re-order buffer (result buffer)
a structure in the microarchitecture that can hold results that are executed out-of-order so these results are still returned to the software in program order.
multithreading
a processor that allows multiple threads to be in various stages of execution at the same time. A thread is another word for an instruction stream. The HEP (circa 1978) is the first machine to implement multithreading. Each cycle an instruction is fetched from a different thread in a round-robin fashion. See my handout on the HEP.
simultaneous multithreading (SMT)
a variation of multithreading in which the execution stage of the pipeline executes out-of-order. The result is that multiple functional units can be simultaneously executing instructions from multiple threads.
interconnection network
a structure that interconnects various elements (usually processors and units of memory) of a computer system.
hypercube (cosmic cube)
An example of an interconnection network. The structure is a boolean hypercube. That is, if n is the number of dimensions, and each dimension has two values (0 and 1), there are 2^n connection points in the network. Each connection point is connected to n other connection points. Each connection point is at most n hops from every other connection point. See my handout on interconnection networks.
bus
an interconnection network where every element is connected to the same wire. Only one transaction can occur on the bus at any point in time. This is the structure of worst case contention. See my handout on interconnection networks.
full cross-bar switch
a set of buses, such that there are multiple transactions that can occur, one on each bus, at the same time. See my handout on interconnection networks.
tree
a connection network that forms a binary tree. See my handout on interconnection networks.
mesh
a connection network that forms a two dimensional array. See my handout on interconnection networks.
contention
one of the three common metrics for goodness of an interconnection network. Contention is a measure of how many independent transfers can be made on the network simultaneously.
cost
one of the three common metrics for goodness of an interconnection network. Cost is a measure of the number of connection points in the network.
latency
one of the three common metrics for goodness of an interconnection network. Latency is a measure of the number of hops through the network a transfer has to make from source to destination.