System-on-Chip (SoC) Design
EE382M.20, Fall 2018
Lab #2
Due: 11:59pm, October 14, 2018
Instructions:
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board on Piazza for Q&A.
• All reports and code MUST be submitted to the assignment on Canvas.
The goals of this lab are to:
• Partition the Darknet code into software on the ARM and external hardware accelerators.
• Prototype the target platform in a SystemC TLM2.0 environment.
The assignment for this lab includes the following:
• Isolate the GEMM and convert it into a SystemC-modeled hardware module.
• Develop a hardware abstraction layer (HAL) for access to the external GEMM accelerator from the ARM.
• Cross-compile the YOLO/Darknet application to run under Linux on the ARM board.
• Simulate the YOLO/Darknet application running on a virtual platform model to validate the HW/SW implementation.
Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:
Here are some guidelines for generating a standalone General Matrix-Matrix Multiplication (GEMM) accelerator module modeled in SystemC:
a) Make sure to start from a GEMM that has already been converted into a fixed-point version. You can start with the code you developed in Lab 1.
b) Make sure the GEMM is a single function that is side-effect free, i.e. any and all required inputs and outputs are passed as function parameters or return value as you proceed with the isolation.
c) Turn the function into a SystemC process and wrap it in a SystemC module.
d) Insert wait() statements into the GEMM process to model estimated execution delays.
e) The GEMM accelerator will have to communicate and synchronize with the ARM to exchange input and output data. Define the accelerator interfaces and describe them in TLM form in the SystemC module. You may have to instantiate local registers or scratchpad memories, as well as additional interfacing methods or processes, for temporary data buffering/storage and external communication in the accelerator module. However, keep in mind that local BRAM memory in the physical FPGA used in the final project will be limited, i.e. your model should not use more local memory than what will be available in the final system implementation later (140 Block RAMs with 36kbit each for a total of 4.9Mbit in the case of the Z-7020 Zynq devices used in our boards).
f) Design the overall system architecture and integrate the SystemC model of the GEMM accelerator into the virtual QEMU/SystemC platform accordingly. You can use the zynq_demo platform example from the QEMU/SystemC tutorial as a starting point for system integration.
g) (Extra credit) The zynq_demo only connects the debug demo device as a pure slave to the M_AXI_GP0 system bus port. This requires the ARM to shuffle all output and input data back and forth from/to the hardware device. The Zynq-7000 provides other ports between the processing subsystem (PS) and the programmable logic (PL)/FPGA fabric that will allow for a bus-mastering accelerator including direct cache or memory access. These ports are also exposed to the SystemC side as TLM sockets in the provided co-simulation library (see ‘libsystemctlm-soc/zynq/xilinx-zynq.h’). For example, you can connect an initiator socket of your accelerator module to the S_AXI_GP0 port for bus-mastering main memory access as follows:
mbus = new iconnect<1,1>("membus");
mbus->memmap(0x0LL, 0x2000000 - 1, ADDRMODE_RELATIVE, -1, *(zynq.s_axi_gp[0]));
accelerator->master_socket.bind(*(mbus->t_sk[0]));
Note that our boards and hence the QEMU-modeled system only have 512MB of DDR3 memory. Furthermore, as documented in the Zynq-7000 manual (see address map), only the address range from 0x00100000 upwards is accessible by all masters. As such, the addressable DRAM range shared between the CPU and the FPGA fabric (and hence SystemC) is 0x00100000 to 0x1FFFFFFF. In addition, the Zynq has 256KB of on-chip scratchpad memory (OCM) that can be accessed from both the CPU and FPGA (by default mapped high to addresses 0xFFFC0000 to 0xFFFFFFFF in our platform) for sharing of data with faster SRAM access times.
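As a starting point for guidelines a), b), d) and e), the following is a minimal sketch of an isolated fixed-point GEMM kernel in plain C++, before it is wrapped in a SystemC module. The Q8.8 format, the names gemm_fixed, estimated_cycles and TILE, and the one-MAC-per-cycle delay model are all assumptions made for illustration and are not taken from Lab 1 or the tutorial. It computes a plain C = A x B and omits the scaling and leading-dimension parameters of Darknet's gemm().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: Q8.8 fixed point stored in 16 bits. Substitute the Lab 1 format.
using fixed_t = std::int16_t;
constexpr int FRAC_BITS = 8;

// Total block RAM of the Z-7020, roughly 4.9 Mbit as stated above.
constexpr std::size_t BRAM_BUDGET_BITS = 4900000;

// Hypothetical local buffering: one TILE x TILE tile each for A, B and C.
constexpr std::size_t TILE = 64;
constexpr std::size_t LOCAL_BITS = 3 * TILE * TILE * sizeof(fixed_t) * 8;
static_assert(LOCAL_BITS <= BRAM_BUDGET_BITS,
              "local buffers exceed the BRAM budget");

// Side-effect-free GEMM: C = A * B for row-major A (m x k) and B (k x n).
// All inputs arrive as parameters and the result is the return value.
// No saturation is applied; results are assumed to fit into fixed_t.
std::vector<fixed_t> gemm_fixed(const std::vector<fixed_t>& a,
                                const std::vector<fixed_t>& b,
                                std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<fixed_t> c(m * n, 0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::int64_t acc = 0;  // wide accumulator to avoid overflow
            for (std::size_t p = 0; p < k; ++p) {
                acc += static_cast<std::int64_t>(a[i * k + p]) * b[p * n + j];
            }
            // Rescale the product sum from 2*FRAC_BITS back to FRAC_BITS.
            c[i * n + j] = static_cast<fixed_t>(acc >> FRAC_BITS);
        }
    }
    return c;
}

// Hypothetical delay model: one multiply-accumulate per clock cycle.
constexpr std::uint64_t estimated_cycles(std::uint64_t m, std::uint64_t n,
                                         std::uint64_t k) {
    return m * n * k;
}
```

In a SystemC thread process, such a cycle estimate would typically be multiplied by an assumed clock period and passed to wait() as an sc_time value after the kernel call. The tile size only illustrates how a local buffer choice can be checked against the BRAM budget at compile time.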
To set up a simulation of the YOLO/Darknet application running on the virtual prototype of our accelerated target platform, the following steps need to be performed:
a) Using the application example in the tutorial as a reference, develop a hardware abstraction layer (HAL) that can serve as a stub for the GEMM functionality. The HAL should be a drop-in substitute for the existing GEMM function call, i.e. it replaces the call with an implementation that makes equivalent calls to the external hardware instead. Modify the fpga_drv as needed.
b) Integrate your HAL into the Darknet application, cross-compile the 'darknet' executable, add it to the simulated platform and run the setup in the emulated ARM+FPGA environment. Verify that the hardware/software interaction is working properly and that the correct output is produced when running the Tiny YOLO network on top of your Darknet framework.
c) (Extra credit) Explore possible parallelization of the processing chain to exploit any available concurrency between the GEMM running in hardware and the rest of the software running on the CPU. Refer to the instructions in Lab 1 about general parallelization hints and strategies. Note that this may require you to turn the software side into a multi-threaded application, such that a part of the processing chain containing the external GEMM calls runs in parallel with other parts of the Darknet framework.
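The HAL idea from step a) can be sketched on the host before moving to QEMU. The example below assumes a pure-slave accelerator behind a word-addressed register window, with a mock device standing in for the SystemC model. The register map (REG_M, REG_CTRL, REG_DATA and so on), the Bus interface, the MockGemm class and the gemm_hal signature are hypothetical. They do not describe zynq_demo or fpga_drv, and a real HAL would route the word accesses through the driver instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using fixed_t = std::int16_t;  // assumption: Q8.8 fixed point
constexpr int FRAC_BITS = 8;

// Hypothetical register map in word offsets; data buffers start at REG_DATA.
enum : std::uint32_t { REG_M = 0, REG_N = 1, REG_K = 2,
                       REG_CTRL = 3, REG_STATUS = 4, REG_DATA = 16 };

// Minimal bus abstraction: word-sized reads and writes only.
struct Bus {
    virtual ~Bus() = default;
    virtual void write32(std::uint32_t word, std::int32_t value) = 0;
    virtual std::int32_t read32(std::uint32_t word) = 0;
};

// Host-side stand-in for the accelerator, so the HAL can be tested alone.
class MockGemm : public Bus {
public:
    explicit MockGemm(std::size_t words) : mem_(words, 0) {}
    void write32(std::uint32_t word, std::int32_t value) override {
        mem_.at(word) = value;
        if (word == REG_CTRL && value == 1) run();  // start bit triggers GEMM
    }
    std::int32_t read32(std::uint32_t word) override { return mem_.at(word); }

private:
    void run() {
        mem_.at(REG_STATUS) = 0;
        const std::uint32_t m = mem_.at(REG_M), n = mem_.at(REG_N),
                            k = mem_.at(REG_K);
        const std::uint32_t a = REG_DATA, b = a + m * k, c = b + k * n;
        for (std::uint32_t i = 0; i < m; ++i)
            for (std::uint32_t j = 0; j < n; ++j) {
                std::int64_t acc = 0;
                for (std::uint32_t p = 0; p < k; ++p)
                    acc += static_cast<std::int64_t>(mem_.at(a + i * k + p)) *
                           mem_.at(b + p * n + j);
                mem_.at(c + i * n + j) =
                    static_cast<std::int32_t>(acc >> FRAC_BITS);
            }
        mem_.at(REG_STATUS) = 1;  // done flag
    }
    std::vector<std::int32_t> mem_;
};

// HAL entry point: same data in and out as a software GEMM call, but the
// computation is delegated to the device behind the bus.
void gemm_hal(Bus& dev, std::uint32_t m, std::uint32_t n, std::uint32_t k,
              const fixed_t* a, const fixed_t* b, fixed_t* c) {
    dev.write32(REG_M, static_cast<std::int32_t>(m));
    dev.write32(REG_N, static_cast<std::int32_t>(n));
    dev.write32(REG_K, static_cast<std::int32_t>(k));
    const std::uint32_t a_off = REG_DATA, b_off = a_off + m * k,
                        c_off = b_off + k * n;
    for (std::uint32_t i = 0; i < m * k; ++i) dev.write32(a_off + i, a[i]);
    for (std::uint32_t i = 0; i < k * n; ++i) dev.write32(b_off + i, b[i]);
    dev.write32(REG_CTRL, 1);                  // start the accelerator
    while (dev.read32(REG_STATUS) == 0) {}     // poll until the done flag is set
    for (std::uint32_t i = 0; i < m * n; ++i)
        c[i] = static_cast<fixed_t>(dev.read32(c_off + i));
}
```

Keeping the bus behind an interface lets the same gemm_hal code be checked against a software reference on the host and later pointed at the memory-mapped device. The busy-wait poll and the word-by-word copies reflect the pure-slave setup; a bus-mastering accelerator would pass buffer addresses instead.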
Submit your report and files to Canvas. The report should summarize your workflow, describe your simulation setup, document the model and application code modifications, and analyze your simulation results. Attach an archive (.tar.gz or .zip) that includes:
• The SystemC platform model, including sources.
• Source code for the GEMM SystemC module.
• Source code for the ARM HAL and driver code.
• A README file that describes how to compile and run your platform simulation.