WARFP 2005: The first paper on FAST.
WARFP 2006: A four page summary.
ICCAD 2007: How the timing models are composed.
MICRO 2007: Our first prototype.
CAL 2009: How to efficiently extend FAST to multicore.
The FAST simulation methodology is a new approach to simulating computer systems that aims to be orders of magnitude faster than other RTL-level cycle-accurate techniques, while running unmodified applications on top of unmodified operating systems and providing full transparency into the running system with no additional simulation slowdown. Such simulators are only useful if they are relatively easy to create, use, and modify. We are in the process of proving the methodology by building such a simulator and the infrastructure needed to use it efficiently.
This speed and functionality are achieved by partitioning the simulation task between a functional model and a timing model. The functional model executes the desired programs and produces a trace of all executed instructions that is immediately piped to the timing model. The timing model pushes those instructions through a model of the target (simulated machine) micro-architecture. Since the instructions have already been executed, all information about them, such as the virtual and physical addresses of both the instruction itself and any data memory accesses, the source and destination registers, exceptions, and so on, has already been generated and can be made available to the timing model.
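For illustration, the kind of per-instruction record the functional model streams could be sketched as follows. This is a minimal C++ sketch with assumed field names, not the actual FAST trace format.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // One data memory access made by an already-executed instruction.
    struct MemAccess {
        uint64_t vaddr;    // virtual address of the access
        uint64_t paddr;    // physical address after translation
        bool     is_store;
    };

    // One entry of the trace streamed from the functional model to the
    // timing model (illustrative fields only).
    struct TraceRecord {
        uint64_t pc_vaddr;                  // virtual address of the instruction
        uint64_t pc_paddr;                  // physical address of the instruction
        std::vector<uint8_t> src_regs;      // source architectural registers
        std::vector<uint8_t> dst_regs;      // destination architectural registers
        std::optional<MemAccess> mem;       // data access, if any
        std::optional<uint16_t> exception;  // exception raised, if any
        bool     is_branch;
        uint64_t next_pc;                   // where the functional model went next
    };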
Of course, not every target micro-architecture fetches only right-path instructions; in fact, virtually all modern micro-architectures at least occasionally fetch wrong-path instructions. We call the instruction stream that the functional model would naturally produce (all branches correctly resolved) the functional path, and the instruction stream that the timing model would actually fetch (which includes both right-path and wrong-path instructions) either the target path or the correct path. Without the correct instructions, the timing model cannot accurately predict the behavior of the target. Thus, the timing model checks that the functional-path instructions are on the correct path. If they are not, the timing model notifies the functional model of that fact, forcing the functional model back onto the correct path.
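The check itself can be sketched roughly as follows (assumed C++ interfaces, not the real FAST code): as the timing model commits instructions along the target's correct path, it compares the next PC recorded in the corresponding functional trace entry against the next PC the target actually commits, and any mismatch triggers a correction of the functional model.

    #include <cstdint>

    // Where the functional model should resume and how much of its trace to keep.
    struct RollbackRequest {
        uint64_t resume_pc;    // correct next PC for the functional model
        uint64_t trace_index;  // last trace entry known to be on the correct path
    };

    // Stub standing in for the functional model's rollback interface; a real
    // implementation would squash later trace entries and re-execute from resume_pc.
    struct FunctionalModel {
        void rollback(const RollbackRequest&) {}
    };

    class TimingModel {
    public:
        explicit TimingModel(FunctionalModel* fm) : functional_(fm) {}

        // Called once per committed target instruction, with the next PC recorded
        // in the matching functional trace entry and the next PC the target
        // actually commits.
        void on_commit(uint64_t traced_next_pc, uint64_t committed_next_pc,
                       uint64_t trace_index) {
            if (traced_next_pc != committed_next_pc) {
                // The functional path diverged from the target's correct path
                // (e.g., different interrupt timing): correct the functional model.
                functional_->rollback({committed_next_pc, trace_index});
            }
        }

    private:
        FunctionalModel* functional_;
    };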
FAST is sometimes mistakenly thought to be the same as FastSim or Aria/Turandot. Those two simulators also execute in a functional-first fashion, generating a trace that is fed to the timing model to predict performance. However, their functional model queries the timing model's branch predictor at each branch to determine the next fetch address. Such a scheme enables branch prediction to be accurately modeled, including wrong-path instructions.
This scheme, however, tightly couples the functional model and the timing model. The functional model cannot execute beyond a basic block without a round trip of communication (functional->timing->functional). Thus, it is difficult to parallelize such simulators at the functional/timing boundary. In addition, such a scheme cannot accurately model branch predictors that may misspeculate that a non-branch instruction is a branch, unless the branch predictor is queried on every instruction.
Also, such a scheme cannot accurately model a multicore target, because memory operation order can affect functional behavior; yet until the functional model executes instructions, there are no instructions to move the timing model forward and allow it to determine the memory order. In this case, the circular dependency causes deadlock.
The FAST simulator addresses all of these issues by (i) having its functional model execute speculatively, (ii) having its timing model detect when the functional model has misspeculated, and (iii) correcting the functional model when it has misspeculated. The only time synchronization between the functional model and the timing model is necessary is when the timing model corrects the functional model, which occurs infrequently if the target micro-architecture is efficient (i.e., rarely misspeculates).
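The resulting decoupling can be pictured with the following single-threaded C++ sketch, whose queue-based plumbing and names are illustrative assumptions only: the functional model streams trace records into a deep FIFO without waiting for the timing model, and the only messages flowing the other way are infrequent rollback requests.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <deque>

    struct TraceEntry { uint64_t pc; uint64_t next_pc; };
    struct Rollback   { uint64_t resume_pc; std::size_t keep_entries; };

    std::deque<TraceEntry> trace_fifo;     // functional -> timing (deep, no waiting)
    std::deque<Rollback>   rollback_fifo;  // timing -> functional (rare)

    // One step of the functional model: apply a pending correction if one has
    // arrived, otherwise keep running ahead of the timing model.
    void functional_step(uint64_t& pc) {
        if (!rollback_fifo.empty()) {
            Rollback rb = rollback_fifo.front();
            rollback_fifo.pop_front();
            trace_fifo.resize(std::min(trace_fifo.size(), rb.keep_entries));
            pc = rb.resume_pc;               // forced back onto the correct path
        }
        uint64_t next = pc + 4;              // stand-in for real ISA emulation
        trace_fifo.push_back({pc, next});
        pc = next;
    }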
Accurately simulating multicore targets requires additional support to deal with memory reorderings, where the functional memory execution order differs from the target order. Rather than trying to determine the functional order and then compare it to the timing order, our scheme allows us to compare values instead. Target-correct values are maintained in the target memory oracle (TMO) using data generated by the functional model. As the timing model executes load instructions, the TMO value is compared with the functional load value. If they differ, the functional model rolls back to correct its load value and all dependent instructions, regenerating the trace as it does so. This scheme also allows us to efficiently parallelize the functional model.
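The value comparison can be sketched roughly as follows (the class and method names are assumptions, not the actual TMO interface): the oracle's memory is updated with store values taken from the functional trace in the order the timing model commits stores, and each committed load checks the oracle's value against the value the functional model originally loaded.

    #include <cstdint>
    #include <unordered_map>

    class TargetMemoryOracle {
    public:
        // Stores are applied in the order the timing model commits them, using
        // the store values carried in the functional trace.
        void on_target_store(uint64_t paddr, uint64_t value) {
            mem_[paddr] = value;
        }

        // Returns true if the functional model must roll back this load because
        // the value it used differs from the value the target would observe.
        bool on_target_load(uint64_t paddr, uint64_t functional_value) const {
            auto it = mem_.find(paddr);
            uint64_t target_value =
                (it != mem_.end()) ? it->second : functional_value;
            return target_value != functional_value;
        }

    private:
        std::unordered_map<uint64_t, uint64_t> mem_;  // sparse, target-ordered memory
    };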
We have implemented a FAST simulator. We parallelize between the functional model and the timing model, within the functional model, and within the timing model itself.
Our current prototype uses a heavily modified QEMU (an open-source, high-performance, full-system functional simulator) as the functional model and our own hand-written timing model that models a generic out-of-order superscalar processor and some of the system around it. Our timing model is written in Bluespec, a high-level, rule-based hardware description language. The timing model runs on an FPGA, an extraordinarily efficient parallel platform. Our modified QEMU runs on a standard general-purpose multicore processor.
We have done some work on automatic generation of power models that run efficiently on the FPGA, enabling power simulation at the same speeds. We also added fault-injection capability to an earlier version of the simulator and are about to start porting fault injection to our current version.
Our current development host platform (what the simulator runs on) is a Nallatech ACP system: a four-socket Intel server populated with one Intel multicore processor and one FPGA board. The two communicate over the Intel front-side bus (FSB). Reliability issues in the hardware currently prevent us from running and measuring the performance of the entire system. However, we are able to run the functional model at roughly 25 MIPS per host core with tracing and rollback support turned on (but no rollbacks being performed).
This material is based upon work supported by the National Science Foundation under Grants No. 0541416 and 0615352, the Department of Energy, and SRC (power modeling), as well as gifts from Bluespec, Intel, IBM, Freescale, and Xilinx.
We are in the process of releasing parts of the simulator. Below are our publicly released components.
BRAM.v: includes a quad-ported block RAM
implemented by double clocking the Xilinx block RAMs.