# Fall 2005

Booth's algorithm
an improvement over the standard shift and add algorithm for multiplication by enabling each iteration to consume two bits of multiplier at each step, rather than one.
residue
another word for remainder
modulus
In a modulo scheme of any sort, the number of distinct values. In practical matters, if I say “A modulo p,” I am asking what is the remainder when I divide the value A by the modulus p. Example: 10 modulus 3 = 1.
pairwise relatively prime
A number of values are pairwise relatively prime if the greatest common diviser of every pair equals 1. For example, 7,8, and 9 are relatively prime since the gcd of 7,8 is 1, gcd of 7,9 is 1, and gcd of 8,9 is 1. Note that 8 and 9 are not prime.
Chinese remainder theorem
That if I have a set of k moduli, p1, p2, …pk that are pairwise relatively prime, and if I represent each value as the set of k residues, modulo their corresponding moduli, then the residues of the values 1 to the product of the k moduli are all distinct.
floating point
a data type supported by the ISAs of most computers. It allows larger and tinier numbers than is possible with an equal number of bits using a fixed point representation. This is accomplished by using some bits to express the size of the value (exponent bits) at the expense of the number of bits of precision. A magnitude is represented by a fractional part and an exponent part, where the fractional part contains the significant bits of precision, and the exponent part specifies the size. A value can be expressed in floating point as either a normalized number or, if it is too small to be expressed as a normalized number, it can be expressed as a “subnormal number.” Normalized numbers are of the form (-1)sign × 1.fraction × radixexp, where the 64 bit format in IEEE arithmetic specifies the bits as follows: [63:63] is the sign bit, [62:52] is the exponent field, and [51:0] is the fraction field. If the value is too small to be represented as a normalized number, it can still be represented in the form (-1)sign × 0.fraction × radixsmallest_exp, where [63:63] is the sign bit, [62:52] contains the smallest exponent, and [51:0] is the fraction field.
signed magnitude
the representation of a value that uses one bit for sign and the remaining bits for magnitude. Used in floating point data type.
exponent
the bits used to describe the size of the datum. See floating point.
mantissa (aka significand or fraction)
the part of the representation that contains the significant bits of the value.
normalize
to put into a standard (i.e., “normal”) form. In the case of IEEE Floating Point arithmetic, that form is 1.fraction × 2exponent.
the base, as in 2exponent ---> 2 is the radix.
excess-code (bias)
A code that uses an excess or bias to compute its code words. To obtain a code word, we add the bias to the value to be encoded. For example, with an excess-7 code, the value -1 would be expressed in eight bits (for example) as 00000110.
the interval from 2n to 2(n+1).
wobble
the ratio of error introduced by rounding when the result of rounding is a power of the radix. For radix 2, one ULP larger than 2n is twice as large as one ULP smaller than 2n. Therefore the wobble is 2.
precision
the number of bits of significance in the representation of a value. For 52 bits of fraction, the value has 53 bits of precision.
ULP (unit in the last place)
the value of 0.000…001 × 2exponent. For example, with 4 bits of fraction, exponent = -2, one ULP is 1/64.
underflow
an exception caused by a computed value being too small in magnitude to be expressed other than by 0.
overflow
an exception caused by a computed value being too big to be expressed by any normalized finite value. Rounding will result in the value being represented by infinity or by the largest number to be represented exactly, depending on the rounding mode used.
chopping
one of the four rounding modes. Rounding is accomplished by removing (chopping) the least significant bits. Both positive and negative values are rounded toward zero.
round up
one of the four rounding modes. Rounding is accomplished by rounding toward the next larger value. All values are rounded toward +infinity.
round down
one of the four rounding modes. Rounding is accomplished by rounding toward the next smaller value. All values are rounded toward -infinity.
round to unbiased nearest (beautiful rounding)
one of the four rounding modes. Rounding is accomplished by rounding to the next value (greater or smaller) that is closer in magnitude to the value being rounded. If equidistant, then we round to the value (of the two) that has a 0 in its ULP.
error due to underflow
amount of error caused by representing a value by 0.
error due to inexact
amount of error caused by rounding.
the result of introducing subnormal numbers into the representation scheme so that the error due to underflow is equal to the error due to rounding as values approach 0. Subnormal numbers are numbers too small to represent as normalized numbers. Consider the smallest normalized number, we call it 1.000…000 × 2n. The binade delimited by 2n and 2(n+1) contains 2k exact values, where k is the number of bits of fraction. This binade is exactly the same size as the interval between 0 and 2n. The subnormal numbers are those values 0.fraction × 2n. Since the fraction consists of k bits, there are exactly 2k exact values in this interval. Since this interval is the same size as the smallest binade, the error due to underflow is the same as the error due to rounding. Professor W. Kahan (the inventor of IEEE Floating Point Arithmetic) gave this notion the name “gradual underflow.”
subnormal
floating point exception
there are five things that can cause an exception: underflow, overflow, inexact, divide by 0, and invalid. Exceptions can be quiet, in which case a sticky bit is set which can only be cleared by testing for it (usually used for inexact) or signalling, in which case the exception is taken immediately (always used for invalid).
divide by 0 exception
an exception caused by a computation that uses finite arguments and produces infinite results. Called “divide by 0” because the simplest example of this is a finite number divided by 0. Another example of this is the tangent function where the argument is 90 degrees.
invalid exception
an exception caused by an operation that produces nonsense results. Examples include 0/0, infinity/infinity, sqrt(-n), arcsin(2).
NaN (not a number)
the result of an operation that creates nonsense. see invalid exception.
SIMD (data parallel)
processing that involves a single instruction stream such that each instruction operates on multiple data sets. Two forms of SIMD processing exist: array processors and vector processors.
SPMD (single procedure multiple data set)
a variation of SIMD, wherein the level of synchronization is a procedure rather than each instruction.
SISD
single instruction stream operating on a single data set.
MIMD
multiple instruction streams each operating on its own data set. The classic form of a multiprocessor or multicomputer.
vector processor
a processor containing vector instructions, i.e., instructions that operate, in turn, on multiple values.
array processor
a set of processors such that all execute the same program, each executing the identical instruction in lock step, each step of the way.
granularity of concurrency
the scope of what is being processed simultaneously. Granularity can be at the intrainstruction level, as in pipelining, or at the interinstruction level, as in VLIW, or at the processor level, as in a multiprocessor.
multiprocessor
more than one processor operating concurrently. The pure definition of a multiprocessor requires that each processor carry on substantial work by itself (to differentiate it from a co-processor), and that all the processors share the same memory space. Sometimes this form of multiprocessor is referred to as a tightly-coupled multiprocessor …as compared to a loosely-coupled multiprocessor where there is no shared memory. The more proper term for a loosely-coupled multiprocessor is multicomputer network. But most people refer to the two paradigms as tightly and loosely coupled multiprocessors. In a shared memory multiprocessor, values can be passed from one to the other via shared memory locations. In a loosely-coupled multiprocessor, they are passed via message passing.
tightly coupled MP
see multiprocessor.
shared global memory
see multiprocessor.
loosely coupled MP
see multiprocessor.
message passing
see multiprocessor.
superscalar
a microarchitecture paradigm where more than one instruction is fetched and decode each cycle, the exact number determined by the fetch/decode engine, that number dependent on the actual instructions fetched.
IPC (instructions per cycle)
a common metric of performance. The number of instructions completing execution on average per cycle.
Flynn's bottleneck
Michael J. Flynn's observation that as long as one fetches instructions one per cycle into the pipeline, the machine will never obtain an IPC greater than 1.
superpipelined
a term used to describe a pipeline wherein the instruction fetch rate is a multiple of the rate at which instructions can be processed by one of the functional units in the pipeline.
scalability
a measure of the ability of a design to produce linearly improving benefits as the number of elements in the design continues to increase linearly. For example, if we add k ALUs, do we get k times the performance. If we increase the frequency by x, do we compute the answer x times faster.
speedup
a measure of the improvement in performance achieved by adding processing capability to a problem. In its pure form,
Speedup(p) = Time(1) ÷ Time(p)
where Time(1) is the time a single processor requires to compute the answer using the best available algorithm for a single processor, and Time(p) is the time the processor requires to compute the answer using p processors.
baseline
in any measurement where we are trying to evaluate the benefit of a feature, the baseline is the system that does not include the feature. Note: Some less than honest purveyers will quote benefits due to a feature by selecting a very wimpy baseline. In the case above which discusses Speedup with p processors, a poor baseline would be to use an algorithm for one processor that is a particularly bad algorithm if we were restricted to one processor. Thus we would have a large Time(1), which would make Time(p) look better than it really is.
Amdahl's Law
an observation by Gene Amdahl that most problems have a parallel part and a sequential part, such that the addition of p processors reduces the execution time of the parallel part by p, but does nothing to improve the execution of the sequential part. We call the sequential part, for which the p processors can do nothing, the sequential bottleneck. If we let a equal the part of the problem that is the parallel part, then we can state Amdahl's Law in terms of speed up:
Speedup(p) = Time(1) ÷ ((a × Time(1)) ÷ p + (1-a) × Time(1))
sequential (serial) bottleneck
see Amdahl's Law.
parallelizable
see Amdahl's Law. That part that can be executed in parallel.
NUMA (non-uniform memory access)
a distributed shared memory multiprocessor, wherein processors can access all of memory, but the access time varies depending on which part of the distributed memory a processor is accessing.
memory contention
the result of more than one processor attempting to access the same memory bank at the same time.
cache coherence problem (cache consistency)
in a shared memory multiprocessor, it is most common for each processor to have its own private cache. If more than one processor have in their respective caches the contents of the same memory location, it is important that those respective contents be identical. We say that caches are coherent when this is the case. Caches are not coherent, for example if both P1 and P2 cache location A, but P1 has one value for A in its cache, and P2 has a different value for A in its cache.
fault tolerance
the ability of a processor to compute correctly even though there are some faults in the hardware.
high availability
a more recent term for fault tolerance. Availability refers to the fraction of time a computer is processing correctly, and not “down.” A high availability machine is one that is “down” for no more than a very few minutes a year. To accomplish such, the computer system must be able to operate in the presence of faults; therefore, the term: fault tolerant.
TMR (triple modular redundancy)
one mechanism for achieving fault tolerance. Three machines execute the program in lock step, and continually vote on the results they produce. The probability that two are both wrong and give the same wrong answer is sufficiently remote that processing continues correctly as long as two give the same answer. When one of the three gives a different answer from the other two, it is assumed to be broken. It is replaced with a known good machine, and the system continues. The system is continually available as long as the other two machines give identical answers during the period where the broken machine is replaced. We call the frequency of faults in a processor the MTTF (mean time to failure), and the time required to replace the faulty machine with a good one MTTR (mean time to repair). As long as MTTF is greater than MTTR, we should do just fine.
pair and spare
another scheme for maintaining error-free operation in the presence of single processor failures. Two pairs of processors, all four executing. Each pair compares results with its partner. As long as all four give the same answer, we are fine. When the partners of one pair disagree, we enter vulnerable mode, where we use the other pair for computation while we fix the failing pair. Again, MTTF and MTTR tell us how good our scheme is.
MTTR (mean time to repair)
see TMR
snoopy cache
a mechanism for maintaining cache coherency. This is done by allowing each cache to monitor the accesses of other processors to memory so that it knows whether a particular cache line is exclusive to itself or shared with other caches. …or is being written by another processor and so it (this cache) must either invalidate its copy or update it with the value being written by the other processor.
VLIW (very large instruction word)
a paradigm wherein the fetch/decode mechanism fetches n instructions from the instruction memory each cycle, and executes the n instructions concurrently. The value n is a fixed value for the machine. The set of n instructions fetched in one cycle, that is the contents of one VLIW word, is specified by the compiler. In order for the n instructions to execute concurrently, they must all be independent with respect to flow dependencies. It is the job of the compiler to identify n instructions that can be packaged together as a single VLIW word.
lock step
a characteristic of a VLIW machine is that the n instructions are executed by n processing elements in lock step, that is, at the same time. Further, if one processor finishes before the others, it can not fetch its next instructions until all have finished and the next fetch is done at the same time for all n processing elements.
decoupled access/execute (DAE)
a relaxation of the VLIW paradigm wherein the processors do not have to execute their instructions in lock step.
EPIC (explicitly parallel instruction computer)
a modern variation of VLIW, wherein additional bits (called template bits) are provided to specify which instructions in a very long word can be operated in lock step, and which have interinstruction dependencies, and therefore, can not be.
template bits
see EPIC.
trace scheduling
a profiling technique used with VLIW machines to find branches that are highly skewed so as to schedule some of the instructions in a vliw word by violating control dependencies. Since 98% taken (a highly skewed branch) is not the same as 100% taken, additional instructions have to be provided to prepare the machine to undo instructions that should not have been executed if the branch behaved according to the other 2%, and still additional instructions to actually do the undo when that “bad” branch behavior occurs. The benefit of this technique is that most of the time the branch behaves according to the 98% behavior and so additional paralllelism can be exploited. The idea is that the undo time is small and is more than compensated for the extra parallelism 98% of the time.
restricted data flow
my word for the data flow graph produced by generating it (a la Tomasulo) at run time. A compile time data flow representation of a program would be an enormous data flow graph. Generation at run time restricts the data flow graph to only those instructions in the active window. The active window is the set of instructions that have been fetched but not retired yet.
active window
see restricted data flow.
dataflow node
an operation in a data flow graph.
fine-grain dataflow
a dataflow graph wherein each node performs a small granularity operation, like a single multiply or a single add.
coarse-grain dataflow
a dataflow graph wherein each node performs a larger granularity function, like perhaps an entire high level assignment statement, or an iteration of a loop body.
copy node
a node which takes a value as input and produces two copies of that value as output. A consequence of pure data flow as an abstraction wherein each input value to a data flow node is consumed when the node executes.
tokens
another name for values that are input to data flow nodes or results that are output from data flow nodes.
firing rule
the rule for when a data flow node can be scheduled for execution. Safe vs. queues.
queue dataflow
wherein a node fires when all inputs have tokens. The outputs produced are queued, waiting for their subsequent use.
safe dataflow
wherein a node fires when all inputs have tokens, and there is an available slot for the result. Pro: Queues are not required. Minus: the node is kept from firing until an output slot is available, even though it possibly could have fired earlier.
conditional nodes
a node having two inputs and two outputs, the outputs being labeled true and false. This node takes two tokens (a value x and a boolean value) and passes the value x along the output path corresponding to whether the boolean token is true or false.
relational nodes
a node having two inputs and one output, the output being a boolean. The node performs the relation operation on its two inputs and produces the output token TRUE if the relation is true, or FALSE if the relation is false.
barrier synch
a node having k inputs and k outputs, such that when all inputs have tokens, the node fires, producing on each output line the token that was on its corresponding input line.
register spill code
that part of an object program generated by a compiler that writes values back to memory because there are not enough registers in the microarchitecture to accomodate all the temporaries required of the high level program.
live-in
a value that exists prior to the entity under consideration. With respect to a unit of code, a live-in is a value that is a source for one of the instructions in the unit of code.
live-out
a value that is produced in the unit of code and is required as a source by code subsequent to the unit of code.
in-order retirement
instructions complete their execution (as far as the software knows) in program order – that is, the same order that the program dictates.
re-order buffer (result buffer)
a structure in the microarchitecture that can hold results that are executed out-of-order so these results are still returned to the software in program order.
a processor that allows multiple threads to be in various stages of execution at the same time. A thread is another word for an instruction stream. The HEP (circa 1978) is the first machine to implement multithreading. Each cycle an instruction is fetched from a different thread in a round-robin fashion. See my handout on the HEP.
a variation of multithreading in which the execution stage of the pipeline executes out-of-order. The result is that multiple functional units can be simultaneously executing instructions from multiple threads.
interconnection network
a structure that interconnects various elements (usually processors and units of memory) of a computer system.
hypercube (cosmic cube)
An example of an interconnection network. The structure is a boolean hypercube. That is if n is the number of dimensions, and each dimension has two values (0 and 1), there are 2n connection points in the network. Each connection point is connected to n other connection points. Each connection point is at most n hops from every other connection point. See my handout on interconnection networks.
bus
an interconnection network where every element is connected to the same wire. Only one transaction can occur on the bus at any point in time. This is the structure of worst case contention. See my handout on interconnection networks.
full cross-bar switch
a set of buses, such that there are multiple transactions that can occur, one on each bus, at the same time. See my handout on interconnection networks.
tree
a connection network that forms a binary tree. See my handout on interconnection networks.
mesh
a connection network that forms a two dimensional array. See my handout on interconnection networks.
contention
one of the three common metrics for goodness of an interconnection network. Contention is a measure of how many independent transfers can be made on the network simultaneously.
cost
one of the three common metrics for goodness of an interconnection network. Cost is a measure of the number of connection points in the network.
latency
one of the three common metrics for goodness of an interconnection network. Latency is a measure of the number of hops through the network a transfer has to make from source to destination.