The FAST simulation
methodology is a new approach to simulating computer systems that aims to be
orders of magnitude faster than other RTL-level cycle-accurate
techniques, while being able to run unmodified applications on top of
unmodified operating systems and providing full transparency into the
running system with no simulation slowdown. Such simulators are
only useful if they are relatively easy to create, use and
modify. We are in the process of proving the methodology by
building such a simulator and the infrastructure needed to efficiently
use it.
Such speed and functionality are achieved by partitioning the simulation
task between a functional model and a timing model. The
functional model executes the desired programs and produces a trace of
all instructions executed that is immediately piped to the timing
model. The timing model pushes those instructions through a model
of the target (simulated machine) micro-architecture. Since the
instructions have already been executed, all information about them,
such as the virtual and physical addresses of the instruction itself
and of any data memory accesses, source and destination registers,
exceptions, and so on, has already been generated and can be made
available to the timing model.
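As an illustration only, the functional/timing split described above can be sketched as a producer of trace entries feeding a consumer that assigns cycles. Everything named here (`TraceEntry`, its fields, `PAGE_OFFSET`, the toy program format, and the one-cycle-plus-load-latency cost rule) is a hypothetical stand-in chosen for the sketch, not the trace format or timing model used in FAST.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class TraceEntry:
    """One executed instruction as a functional model might report it (hypothetical layout)."""
    pc_virtual: int
    pc_physical: int
    opcode: str
    src_regs: Tuple[int, ...] = ()
    dst_regs: Tuple[int, ...] = ()
    data_virtual: Optional[int] = None
    data_physical: Optional[int] = None
    exception: Optional[str] = None


PAGE_OFFSET = 0x1000_0000  # toy address translation: physical = virtual + offset


def functional_model(program) -> Iterator[TraceEntry]:
    """Walk the toy program in order and yield one trace entry per instruction."""
    for index, (opcode, dst, srcs, addr) in enumerate(program):
        pc = 4 * index
        yield TraceEntry(
            pc_virtual=pc,
            pc_physical=pc + PAGE_OFFSET,
            opcode=opcode,
            src_regs=srcs,
            dst_regs=dst,
            data_virtual=addr,
            data_physical=None if addr is None else addr + PAGE_OFFSET,
        )


def timing_model(trace, load_latency=3) -> int:
    """Charge one cycle per instruction plus extra cycles per load; return total cycles."""
    cycles = 0
    for entry in trace:
        cycles += 1
        if entry.opcode == "load":
            cycles += load_latency
    return cycles


# Each tuple is (opcode, destination registers, source registers, data address).
PROGRAM = [
    ("add", (1,), (2, 3), None),
    ("load", (4,), (1,), 0x2000),
    ("store", (), (4, 1), 0x2004),
]

total_cycles = timing_model(functional_model(PROGRAM))
```

The timing model in this sketch never computes an address or a register value itself; it only reads what the functional model has already produced, which is the point of the partition.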
Of course, not every target micro-architecture is capable of fetching
only right-path instructions; in fact, virtually all modern
micro-architectures at least occasionally fetch wrong-path
instructions. We call the instruction stream that the functional
model would naturally produce (branches all correctly resolved) the functional path, and the instruction
stream that the timing model would actually fetch (which would include
both right-path and wrong-path instructions) either the target path or the correct path.
Without correct instructions, the timing model cannot accurately
predict the behavior of the target. Thus, the timing model checks
to make sure that the functional path instructions are the correct
path. If they are not, the timing model notifies the functional
model of that fact, forcing the functional model back onto the correct
path.
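The correction loop described above can be sketched with a toy model that tracks only the program counter. The names `Instr`, `FunctionalModel`, `set_pc`, `run_timing_model`, `predict_taken`, and `resolve_delay` are hypothetical and exist only for this sketch; a real functional model would also have to roll back architectural state when it is re-steered, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    pc: int
    is_branch: bool = False
    taken: bool = False  # actual outcome, as resolved by the functional model
    target: int = 0


class FunctionalModel:
    """Toy functional model: follows correctly resolved branches unless re-steered."""

    def __init__(self, program):
        self.program = program  # dict: pc -> Instr
        self.pc = 0

    def step(self):
        ins = self.program[self.pc]
        self.pc = ins.target if (ins.is_branch and ins.taken) else ins.pc + 1
        return ins

    def set_pc(self, pc):
        # Correction from the timing model (state rollback is not modeled).
        self.pc = pc


def run_timing_model(fm, fetches, predict_taken, resolve_delay=2):
    """Fetch `fetches` instructions along the target path; count corrections sent."""
    target_path, corrections = [], 0
    pending = None  # (wrong-path fetches left before the branch resolves, right-path pc)
    for _ in range(fetches):
        if pending is not None and pending[0] == 0:
            fm.set_pc(pending[1])  # branch resolved: return to the right path
            corrections += 1
            pending = None
        ins = fm.step()
        target_path.append(ins.pc)
        if pending is not None:
            pending = (pending[0] - 1, pending[1])
        elif ins.is_branch:
            right_next = fm.pc
            predicted_next = ins.target if predict_taken(ins) else ins.pc + 1
            if predicted_next != right_next:
                fm.set_pc(predicted_next)  # misprediction: force the wrong path
                corrections += 1
                pending = (resolve_delay, right_next)
    return target_path, corrections


def make_program():
    program = {pc: Instr(pc) for pc in range(10)}
    program[1] = Instr(1, is_branch=True, taken=True, target=5)
    return program


mispredicted = run_timing_model(FunctionalModel(make_program()), 6, lambda ins: False)
predicted = run_timing_model(FunctionalModel(make_program()), 6, lambda ins: True)
```

With an always-not-taken predictor the target path picks up two wrong-path instructions (pcs 2 and 3) and costs two corrections; with a correct prediction the functional path and the target path coincide and no correction is sent.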
This scheme of a functional/timing split with timing correction of the
functional model was first used in the FastSim simulator (Schnarr and
Larus). FAST differs from FastSim in that it leverages our
observation that the timing model seldom corrects the functional
model, meaning that the functional model can be separated from the
timing model and efficiently run in parallel. The only time a
synchronization event is necessary is when the timing model corrects
the functional model, which occurs infrequently if
the target micro-architecture is efficient.
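A minimal sketch of this decoupling, using two threads and a bounded queue, is given below. It is not the FAST implementation: the epoch tag used to discard trace entries produced before a correction took effect is an assumption made for this sketch, as are the names `functional_thread`, `timing_thread`, `run_decoupled`, and the `resteer` table that stands in for the timing model's decisions.

```python
import queue
import threading

STOP = object()  # command telling the functional model to shut down


def functional_thread(trace_q, cmd_q):
    """Run ahead, emitting (epoch, pc) entries; look at commands without blocking."""
    pc, epoch = 0, 0
    while True:
        try:
            cmd = cmd_q.get_nowait()
            if cmd is STOP:
                return
            epoch, pc = cmd  # correction: restart from the requested pc
        except queue.Empty:
            pass
        try:
            trace_q.put((epoch, pc), timeout=0.01)
            pc += 1
        except queue.Full:
            pass  # timing model is behind; re-check commands and retry


def timing_thread(trace_q, cmd_q, wanted, resteer, accepted):
    """Consume entries; the only message sent back is a correction (or STOP)."""
    epoch = 0
    while len(accepted) < wanted:
        entry_epoch, pc = trace_q.get()
        if entry_epoch != epoch:
            continue  # stale entry produced before the last correction
        accepted.append(pc)
        if pc in resteer:
            epoch += 1
            cmd_q.put((epoch, resteer[pc]))
    cmd_q.put(STOP)


def run_decoupled(wanted, resteer, depth=8):
    trace_q = queue.Queue(maxsize=depth)
    cmd_q = queue.Queue()
    accepted = []
    producer = threading.Thread(target=functional_thread, args=(trace_q, cmd_q))
    consumer = threading.Thread(
        target=timing_thread, args=(trace_q, cmd_q, wanted, resteer, accepted)
    )
    producer.start()
    consumer.start()
    consumer.join()
    producer.join()
    return accepted


with_correction = run_decoupled(6, {2: 10})
without_correction = run_decoupled(4, {})
```

Between corrections the two threads communicate only through the trace queue, so the producer can run as far ahead as the queue depth allows. The fewer the corrections, the less work is thrown away as stale entries.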
We have a prototype of a parallelized simulator based on this
observation. We parallelize between the functional model and the
timing model, as well as within the timing model itself. We plan
to also parallelize the functional model in the very near future.
Our current prototype uses a heavily modified QEMU (an open source,
high-performance, full-system functional simulator) as a functional
model and our own hand-written timing model that models a generic
out-of-order superscalar processor and some of the system around
it. Our timing model is written in Bluespec, a high-level
hardware description language that has lots of nice features. The
timing model runs on an FPGA, an extraordinarily efficient parallel
platform. FAST-QEMU, our modified QEMU, runs on a standard
general-purpose processor.
Our current development host platform (what the simulator runs on) is a
DRC Computer prototyping system that is essentially a Linux system
built from a dual-processor motherboard with an AMD Opteron in one
socket and a Xilinx FPGA in the other socket. The two communicate
over HyperTransport. On the DRC platform, our current
prototype runs at an average of about 1.2 MIPS across a wide range of
applications including Windows XP and Linux boot, MySQL, Sweep3D (a DOE
benchmark) and the SPEC2000 integer benchmarks.
This material is based upon work supported by the National Science
Foundation under Grant No. 0615352 and the Department of Energy as well
as gifts from Bluespec, Intel, IBM, Freescale, and Xilinx.
We are in the process of releasing parts of the simulator. Below
are our publicly released components.
BRAM.v: includes a quad-ported block RAM
implemented by double clocking the Xilinx block RAMs.
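The double-clocking idea can be illustrated with a behavioral model: a dual-port memory clocked twice per system cycle can service four accesses in that cycle. The sketch below is not derived from BRAM.v. The class names, the port ordering (ports 0 and 1 in the first fast cycle, ports 2 and 3 in the second), and the read-before-write behavior within a fast cycle are all assumptions made for illustration.

```python
class DualPortRAM:
    """Behavioral stand-in for a dual-port block RAM: two accesses per clock."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def clock(self, port_a, port_b):
        """Each port op is None or (addr, write_enable, write_data).

        Returns the read data seen by each port (None for an idle port).
        """
        ops = (port_a, port_b)
        # Assumed read-before-write: sample all reads, then commit writes.
        reads = [None if op is None else self.mem[op[0]] for op in ops]
        for op in ops:
            if op is not None and op[1]:
                self.mem[op[0]] = op[2]
        return reads


class QuadPortRAM:
    """Four ports per system cycle, built from two fast cycles of a dual-port RAM."""

    def __init__(self, depth):
        self.ram = DualPortRAM(depth)

    def clock(self, p0, p1, p2, p3):
        first = self.ram.clock(p0, p1)   # first fast-clock cycle
        second = self.ram.clock(p2, p3)  # second fast-clock cycle
        return first + second


quad = QuadPortRAM(16)
# Port 0 writes 42 to address 3; port 2 reads address 3 in the same system cycle.
same_cycle = quad.clock((3, True, 42), None, (3, False, 0), None)
next_cycle = quad.clock((3, False, 0), (3, False, 0), None, None)
```

One consequence of this ordering is that a write on port 0 or 1 is visible to a read on port 2 or 3 within the same system cycle, a property that users of such a memory would need to check against the actual implementation.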