FPGA-Accelerated Simulation Technologies (FAST)

2005 WARFP: The first paper on FAST.
2006 WARFP: A four page summary.
ICCAD 2007: How the timing models are composed.
MICRO 2007: Our first prototype.

The FAST simulation methodology is a new approach to simulating computer systems that aims to be orders of magnitude faster than other RTL-level cycle-accurate techniques, while running unmodified applications on top of unmodified operating systems and providing full transparency into the running system with no simulation slowdown.  Such simulators are only useful if they are relatively easy to create, use, and modify.  We are in the process of proving the methodology by building such a simulator and the infrastructure needed to use it efficiently.

Such speed and functionality are achieved by partitioning the simulation task between a functional model and a timing model.  The functional model executes the desired programs and produces a trace of all instructions executed, which is immediately piped to the timing model.  The timing model pushes those instructions through a model of the target (simulated machine) micro-architecture.  Since the instructions have already been executed, all information about them (the virtual and physical addresses of the instruction itself and of any data memory accesses, source and destination registers, exceptions, and so on) has already been generated and can be made available to the timing model.
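
As a rough illustration of this split, the sketch below runs a toy functional model that executes a five-instruction program and emits one trace entry per instruction, and a toy timing model that consumes the trace and only counts cycles.  Everything in it is an assumption made for illustration: the instruction format, the TraceEntry fields, the latency table, and the single-issue scoreboard are hypothetical and are not taken from the FAST prototype.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One executed instruction as handed to the timing model (hypothetical fields)."""
    pc: int
    opcode: str
    srcs: Tuple[int, ...]      # source register numbers
    dst: Optional[int]         # destination register number, if any
    mem_addr: Optional[int]    # data address, if the instruction touches memory


# Toy program. Each tuple is (opcode, dst, a, b):
#   li:    dst = immediate a        add:  dst = reg[a] + reg[b]
#   store: mem[b] = reg[a]          load: dst = mem[a]
PROGRAM = [
    ("li", 1, 5, None),
    ("li", 2, 7, None),
    ("add", 3, 1, 2),
    ("store", None, 3, 64),
    ("load", 4, 64, None),
]

# Assumed execution latencies in cycles (illustrative only).
LATENCY = {"li": 1, "add": 1, "store": 1, "load": 3}


def functional_model(program) -> Iterator[TraceEntry]:
    """Execute the program and yield a trace entry for each instruction."""
    regs = [0] * 8
    mem = {}
    for pc, (op, dst, a, b) in enumerate(program):
        if op == "li":
            regs[dst] = a
            yield TraceEntry(pc, op, (), dst, None)
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
            yield TraceEntry(pc, op, (a, b), dst, None)
        elif op == "store":
            mem[b] = regs[a]
            yield TraceEntry(pc, op, (a,), None, b)
        elif op == "load":
            regs[dst] = mem[a]
            yield TraceEntry(pc, op, (), dst, a)
        else:
            raise ValueError(f"unknown opcode {op!r}")


def timing_model(trace: Iterable[TraceEntry]) -> int:
    """Count cycles for the trace without re-executing any instruction.

    Single-issue scoreboard: an instruction issues once its source
    registers are ready. Memory dependences are not modeled.
    """
    ready = {}      # register -> cycle at which its value becomes available
    next_issue = 0  # earliest cycle at which the next instruction may issue
    finish = 0
    for entry in trace:
        issue = max([next_issue] + [ready.get(r, 0) for r in entry.srcs])
        done = issue + LATENCY[entry.opcode]
        if entry.dst is not None:
            ready[entry.dst] = done
        finish = max(finish, done)
        next_issue = issue + 1
    return finish


if __name__ == "__main__":
    print(timing_model(functional_model(PROGRAM)))
```

Note that the timing model never computes a register value or touches memory; it only needs the register numbers and addresses that the functional model already recorded.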

Of course, not every target micro-architecture is capable of fetching only right-path instructions; in fact, virtually all modern micro-architectures at least occasionally fetch wrong-path instructions.  We call the instruction stream that the functional model would naturally produce (branches all correctly resolved) the functional path, and the instruction stream that the timing model would actually fetch (which would include both right-path and wrong-path instructions) either the target path or the correct path.

Without correct instructions, the timing model cannot accurately predict the behavior of the target.  Thus, the timing model checks to make sure that the functional path instructions are the correct path.  If they are not, the timing model notifies the functional model of that fact, forcing the functional model back onto the correct path. 
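
The correction loop described above can be sketched as follows.  This is a minimal sketch under stated assumptions, not the prototype's mechanism: the toy program, the static not-taken predictor, the fixed resolve_delay, and the snapshot/restore/set_pc interface are all hypothetical names introduced here.  When the predicted next PC differs from the functional path, the timing model forces the functional model onto the path the target would fetch, and restores it once the branch resolves.

```python
# Toy program: a countdown loop followed by two nops and a halt.
PROGRAM = [
    ("dec",),      # 0: counter -= 1
    ("bnez", 0),   # 1: if counter != 0, branch to 0
    ("nop",),      # 2
    ("nop",),      # 3
    ("halt",),     # 4
]


class FunctionalModel:
    """Executes instructions and can be redirected by the timing model."""

    def __init__(self, program, counter):
        self.program = program
        self.pc = 0
        self.counter = counter

    def snapshot(self):
        return (self.pc, self.counter)

    def restore(self, snap):
        self.pc, self.counter = snap

    def set_pc(self, pc):
        self.pc = pc

    def step(self):
        """Execute one instruction; return (pc, opcode, next_pc)."""
        pc = self.pc
        op = self.program[pc]
        if op[0] == "dec":
            self.counter -= 1
            next_pc = pc + 1
        elif op[0] == "bnez":
            next_pc = op[1] if self.counter != 0 else pc + 1
        else:
            next_pc = pc + 1
        self.pc = next_pc
        return pc, op[0], next_pc


def timing_model_run(fm, resolve_delay=2):
    """Fetch along the target path, correcting the functional model as needed.

    Returns (fetched, corrections), where fetched is a list of
    (pc, on_wrong_path) pairs in fetch order.
    """
    fetched = []
    corrections = 0
    while True:
        pc, op, next_pc = fm.step()
        fetched.append((pc, False))
        if op == "halt":
            break
        if op == "bnez":
            predicted = pc + 1            # assumed static not-taken predictor
            if predicted != next_pc:      # functional path != target path
                corrections += 1
                snap = fm.snapshot()      # right-path state just after the branch
                fm.set_pc(predicted)      # force the functional model down the wrong path
                for _ in range(resolve_delay):
                    wrong_pc, wrong_op, _ = fm.step()
                    fetched.append((wrong_pc, True))
                    if wrong_op == "halt":
                        break
                fm.restore(snap)          # branch resolved: resume the right path
    return fetched, corrections


if __name__ == "__main__":
    fetched, corrections = timing_model_run(FunctionalModel(PROGRAM, counter=3))
    print(corrections, fetched)
```

With a starting counter of 3 the loop branch is taken twice, so the not-taken predictor is wrong twice and each time two wrong-path instructions (PCs 2 and 3) are fetched before the functional model is returned to the loop.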

This scheme of a functional/timing split with timing correction of the functional model was first used in the FastSim simulator (Schnarr and Larus).  FAST differs from FastSim in that it leverages our observation that the timing model only seldom corrects the functional model, meaning that the functional model can be separated from the timing model and efficiently run in parallel.  A synchronization event is necessary only when the timing model corrects the functional model, which occurs infrequently if the target micro-architecture is efficient.
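
A back-of-the-envelope estimate shows why infrequent correction makes the parallel split attractive.  The sketch below assumes that corrections come only from branch mispredictions, and the branch fraction and misprediction rate used in the example are illustrative values, not measurements from the prototype.

```python
def instructions_per_correction(branch_fraction: float, mispredict_rate: float) -> float:
    """Average number of instructions between synchronization events.

    Assumes every correction is caused by a mispredicted branch, so the
    per-instruction correction probability is the product of the two rates.
    """
    corrections_per_instruction = branch_fraction * mispredict_rate
    if corrections_per_instruction <= 0:
        raise ValueError("rates must be positive")
    return 1.0 / corrections_per_instruction


if __name__ == "__main__":
    # Illustrative only: 20% of instructions are branches, 5% of those mispredict.
    print(round(instructions_per_correction(0.20, 0.05)))
```

Under these assumed rates the two models would need to synchronize roughly once per hundred instructions; between those points the functional model can run ahead and the trace can be streamed one way.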

We have a prototype of a parallelized simulator based on this observation.  We parallelize between the functional model and the timing model, as well as within the timing model itself.  We plan to also parallelize the functional model in the very near future. 

Our current prototype uses a heavily modified QEMU (an open-source, high-performance, full-system functional simulator) as the functional model and our own hand-written timing model that models a generic out-of-order superscalar processor and some of the system around it.  Our timing model is written in Bluespec, a high-level hardware description language that has lots of nice features.  The timing model runs on an FPGA, an extraordinarily efficient parallel platform.  FAST-QEMU runs on a standard general-purpose processor.

Our current development host platform (what the simulator runs on) is a DRC Computer prototyping system that is essentially a Linux system built from a dual-processor motherboard with an AMD Opteron in one socket and a Xilinx FPGA in the other socket.  The two communicate over HyperTransport.  
On the DRC platform, our current prototype runs at an average of about 1.2 MIPS across a wide range of workloads, including Windows XP and Linux boot, MySQL, Sweep3D (a DOE benchmark), and the SPEC2000 integer benchmarks.

This material is based upon work supported by the National Science Foundation under Grant No. 0615352 and the Department of Energy as well as gifts from Bluespec, Intel, IBM, Freescale, and Xilinx.

We are in the process of releasing parts of the simulator.  Below are our publicly released components.

BRAM.v: includes a quad-ported block RAM implemented by double clocking the Xilinx block RAMs.
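
The idea behind double clocking can be sketched with a small behavioral model: a memory with two physical ports is run at twice the system clock, so it can serve four logical accesses per system cycle.  This Python sketch is not BRAM.v and makes no claim about its interface; the class name, the request format, the port ordering (ports 0 and 1 in the first fast cycle, ports 2 and 3 in the second), and the read-before-write behavior within a fast cycle are all assumptions.

```python
class DoubleClockedQuadPortRam:
    """Behavioral model of four logical ports time-multiplexed onto two physical ports."""

    PHYSICAL_PORTS = 2
    LOGICAL_PORTS = 4

    def __init__(self, depth):
        self.mem = [0] * depth

    def _fast_cycle(self, requests):
        """One fast-clock cycle: at most two accesses, reads see pre-write contents."""
        assert len(requests) <= self.PHYSICAL_PORTS
        results = []
        for req in requests:
            if req is not None and req[0] == "r":
                results.append(self.mem[req[1]])
            else:
                results.append(None)
        for req in requests:
            if req is not None and req[0] == "w":
                _, addr, data = req
                self.mem[addr] = data
        return results

    def system_cycle(self, requests):
        """One system-clock cycle serving up to four logical ports.

        Each request is None, ("r", addr), or ("w", addr, data).
        Returns a list of four results: read data for reads, None otherwise.
        """
        assert len(requests) <= self.LOGICAL_PORTS
        padded = list(requests) + [None] * (self.LOGICAL_PORTS - len(requests))
        first = self._fast_cycle(padded[0:2])    # first half of the system cycle
        second = self._fast_cycle(padded[2:4])   # second half of the system cycle
        return first + second


if __name__ == "__main__":
    ram = DoubleClockedQuadPortRam(depth=16)
    print(ram.system_cycle([("w", 3, 42), ("w", 5, 7), ("r", 3), ("r", 5)]))
```

One consequence of this scheme, at least in the model above, is that ports served in the second fast cycle observe writes made by ports served in the first, so the four logical ports are not perfectly symmetric.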