FPGA-Accelerated Simulation Technologies (FAST)

WARFP 2005: The first paper on FAST.
WARFP 2006: A four-page summary.
ICCAD 2007: How the timing models are composed.
MICRO 2007: Our first prototype.
CAL 2009: How to efficiently extend FAST to multicore.

The FAST simulation methodology is a new approach to simulating computer systems that aims to be orders of magnitude faster than other RTL-level cycle-accurate techniques, while running unmodified applications on top of unmodified operating systems and providing full transparency into the running system with no simulation slowdown.  Such simulators are only useful if they are relatively easy to create, use, and modify.  We are in the process of proving the methodology by building such a simulator and the infrastructure needed to use it efficiently.

Such speed and functionality is achieved by partitioning the simulation task between a functional model and a timing model.  The functional model executes the desired programs and produces a trace of all instructions executed that is immediately piped to the timing model.  The timing model pushes those instructions through a model of the target (simulated machine) micro-architecture.  Since the instructions have already been executed, all information about them, such as the virtual and physical addresses of both the instruction itself and of any data memory accesses, source and destination registers, exceptions, and so on, has already been generated and can be made available to the timing model.
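
As a rough illustration, the per-instruction record handed to the timing model could look like the following C++ sketch.  The field names are hypothetical and not the actual FAST trace format:

    #include <cstdint>

    // Hypothetical per-instruction trace record (field names are illustrative,
    // not the actual FAST trace format).  Because the instruction has already
    // executed, everything the timing model might need is known up front.
    struct TraceRecord {
        uint64_t inst_vaddr;       // virtual address of the instruction
        uint64_t inst_paddr;       // physical address of the instruction
        uint32_t opcode;           // decoded opcode / instruction class
        uint8_t  src_regs[3];      // source register numbers
        uint8_t  dst_regs[2];      // destination register numbers
        bool     is_branch;        // control-flow instruction?
        uint64_t next_pc;          // where the functional model went next
        bool     is_mem;           // memory access?
        uint64_t mem_vaddr;        // data virtual address, if any
        uint64_t mem_paddr;        // data physical address, if any
        uint64_t load_value;       // value a load returned
        bool     raised_exception; // did this instruction fault?
    };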

Of course, not every target micro-architecture is capable of fetching only right path instructions; in fact, virtually all modern micro-architectures at least occasionally fetch wrong path instructions.  We call the instruction stream that the functional model would naturally produce (with all branches correctly resolved) the functional path, and the instruction stream that the timing model would actually fetch (which would include both right path and wrong path instructions) either the target path or the correct path.

Without the correct instructions, the timing model cannot accurately predict the behavior of the target.  Thus, the timing model checks that the functional path instructions are on the correct path.  If they are not, it notifies the functional model, forcing the functional model back onto the correct path.
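
The essence of this check is sketched below in C++ with illustrative names; this is not the actual FAST interface (the real timing model is hardware, as described later):

    #include <cstdint>
    #include <deque>

    // Simplified sketch of the correct-path check.  Once the timing model
    // knows the correct-path PC, it compares it against the next
    // functional-path trace record; a mismatch means the functional model
    // misspeculated and must be corrected.

    struct TraceRecord { uint64_t pc; /* addresses, registers, values, ... */ };

    struct FunctionalModel {
        void rollback_to(uint64_t correct_pc) {
            // Hypothetical hook: discard speculative state, resume execution
            // at correct_pc, and regenerate the trace from that point.
            (void)correct_pc;
        }
    };

    void check_functional_path(uint64_t correct_path_pc,
                               std::deque<TraceRecord> &trace,
                               FunctionalModel &fm)
    {
        if (trace.empty())
            return;                          // nothing to check yet
        if (trace.front().pc == correct_path_pc)
            return;                          // functional path is correct
        // Functional path diverged: notify the functional model so it rolls
        // back onto the correct path and regenerates the trace.
        fm.rollback_to(correct_path_pc);
        trace.clear();                       // discard the stale speculative trace
    }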

FAST is sometimes mistakenly thought to be the same as FastSim and Aria/Turandot.  Those two simulators execute in a functional-first fashion, generating a trace that is fed to the timing model to predict performance; however, the functional model queries the timing model's branch predictor at each branch to determine the next fetch address.  Such a scheme enables branch prediction, including wrong path instructions, to be modeled accurately.

This scheme, however, tightly couples the functional model and the timing model.  The functional model cannot execute beyond a basic block without a round-trip communication (functional->timing->functional).  Thus, it is difficult to parallelize such simulators across the functional/timing boundary.  In addition, such a scheme cannot accurately model branch predictors that may misspeculate that a non-branch instruction is a branch, unless the branch predictor is queried on every instruction rather than only at branches.

Such a scheme also cannot accurately model a multicore target: memory operation order can affect functionality, yet until the functional model executes instructions there are no instructions to move the timing model forward and let it determine the memory order.  The resulting circular dependency causes deadlock.

The FAST simulator addresses all of these issues by (i) having its functional model execute speculatively, (ii) having its timing model detect when the functional model has misspeculated, and (iii) correcting the functional model when it misspeculates.  Synchronization between the functional model and the timing model is necessary only when the timing model corrects the functional model, which occurs infrequently if the target micro-architecture is efficient.
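
A software-only sketch of the functional-model side of this decoupling, with hypothetical hook names and stubbed functions (the real functional model is a modified QEMU), might look like:

    #include <cstdint>
    #include <optional>

    // The functional model runs ahead speculatively and only synchronizes
    // when the timing model reports a misspeculation.

    struct RollbackRequest { uint64_t correct_pc; };

    std::optional<RollbackRequest> poll_timing_model() { return std::nullopt; } // stub
    void execute_and_trace_one_instruction() {}                                 // stub
    void restore_checkpoint_and_jump(uint64_t /*pc*/) {}                        // stub

    void functional_model_loop() {
        for (;;) {
            // Non-blocking poll: there is no round trip per branch or per
            // basic block, only a check for an outstanding correction.
            if (auto rb = poll_timing_model()) {
                // The only synchronization point: undo speculative work and
                // resume on the correct path reported by the timing model.
                restore_checkpoint_and_jump(rb->correct_pc);
                continue;
            }
            // Otherwise keep running ahead, emitting trace records.
            execute_and_trace_one_instruction();
        }
    }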

Accurately simulating multicore targets requires additional support to deal with memory reorderings, where the functional memory execution order differs from the target order.  Rather than trying to determine the functional order and then compare it to the timing order, our scheme compares values instead.  Target-correct values are maintained in the target memory oracle (TMO) using data generated by the functional model.  As the timing model executes load instructions, the TMO value is compared with the functional load value.  If they differ, the functional model rolls back to correct its load value and all dependent instructions, regenerating the trace as it does so.  This scheme also allows us to parallelize the functional model efficiently.
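
The following sketch shows the idea of the load-value check against the TMO; the data structure and names are illustrative, not the actual FAST implementation:

    #include <cstdint>
    #include <unordered_map>

    // Stores update the oracle in target (timing-model) order with values
    // produced by the functional model; each load the timing model executes
    // is checked against the oracle.

    using TargetMemoryOracle = std::unordered_map<uint64_t, uint64_t>; // paddr -> value

    void tmo_store(TargetMemoryOracle &tmo, uint64_t paddr, uint64_t value) {
        tmo[paddr] = value;              // target-correct value for this address
    }

    // Returns true if the functionally observed load value disagrees with the
    // target-order value, i.e. the functional model must roll back this load
    // and everything that depends on it.
    bool tmo_load_mismatch(const TargetMemoryOracle &tmo,
                           uint64_t paddr, uint64_t functional_load_value) {
        auto it = tmo.find(paddr);
        if (it == tmo.end())
            return false;                // no conflicting store tracked yet
        return it->second != functional_load_value;
    }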

We have implemented a FAST simulator.  We parallelize across the functional/timing boundary, within the functional model, and within the timing model itself.
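
As a software-only illustration of the parallelism across the functional/timing boundary (the real prototype streams trace records to hardware rather than to another host thread), a minimal producer/consumer sketch:

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct TraceRecord { uint64_t pc; };

    // Simple blocking queue standing in for the trace channel between the
    // functional model and the timing model.
    class TraceQueue {
        std::queue<TraceRecord> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(TraceRecord r) {
            { std::lock_guard<std::mutex> g(m_); q_.push(r); }
            cv_.notify_one();
        }
        TraceRecord pop() {
            std::unique_lock<std::mutex> g(m_);
            cv_.wait(g, [&] { return !q_.empty(); });
            TraceRecord r = q_.front(); q_.pop();
            return r;
        }
    };

    int main() {
        TraceQueue trace;
        // Functional model: runs ahead, producing trace records.
        std::thread fm([&] {
            for (uint64_t pc = 0x1000; pc < 0x1000 + 64 * 4; pc += 4)
                trace.push({pc});
        });
        // Timing model: consumes records and charges target cycles to them.
        std::thread tm([&] {
            for (int i = 0; i < 64; ++i)
                (void)trace.pop();   // model the instruction's timing here
        });
        fm.join();
        tm.join();
        return 0;
    }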

Our current prototype uses a heavily modified QEMU (an open-source, high-performance, full-system functional simulator) as its functional model, together with our own hand-written timing model of a generic out-of-order superscalar processor and some of the system around it.  The timing model is written in Bluespec, a high-level hardware description language built around guarded atomic actions, and runs on an FPGA, an extraordinarily efficient parallel platform.  Our modified QEMU runs on a standard general-purpose multicore processor.

We have done some work on the automatic generation of power models that run efficiently on the FPGA, enabling power simulation at the same speeds.  We have also added fault injection to an earlier version of the simulator and are about to start porting it to the current version.

Our current development host platform (what the simulator runs on) is a Nallatech ACP system: a four-socket Intel server populated with one Intel multicore processor and an FPGA board, which communicate over the Intel front-side bus (FSB).  Reliability issues in the hardware currently prevent us from running and measuring the performance of the entire system.  However, we are able to run the functional model at roughly 25 MIPS per host core with tracing and rollback support turned on (but with no rollbacks actually being performed).

This material is based upon work supported by the National Science Foundation under Grants No. 0541416 and
0615352, the Department of Energy, and SRC (power modeling) as well as gifts from Bluespec, Intel, IBM, Freescale, and Xilinx.

We are in the process of releasing parts of the simulator.  The publicly released components are listed below.

BRAM.v: includes a quad-ported block RAM implemented by double clocking the Xilinx block RAMs; since each physical block RAM is dual-ported, running it at twice the system clock provides four logical ports per system-clock cycle.