System-on-Chip (SoC) Design
ECE382M.20, Fall 2021
Lab #2
Due: 11:59pm, October 14, 2021 (originally October 11, 2021)
Instructions:
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board on Piazza for Q&A.
• All reports and code MUST be submitted to the assignment on Canvas.
The goals of this lab are to:
• Partition the Darknet code into software on the ARM and external hardware accelerators.
• Prototype the target platform in a SystemC TLM2.0 environment.
The assignment for this lab includes the following:
• Isolate the GEMM and convert it into a SystemC-modeled hardware module.
• Develop a hardware abstraction layer (HAL) for access to the external GEMM accelerator from the ARM.
• Cross-compile the YOLO/Darknet application to run under Linux on the ARM board.
• Simulate the YOLO/Darknet application running on a virtual platform model to validate the HW/SW implementation.
Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:
Here are some guidelines for generating a standalone General Matrix-Matrix Multiplication (GEMM) accelerator module modeled in SystemC:
a) Make sure to start from a GEMM that has already been converted into a fixed-point version. You can start with the code you developed in Lab 1.
b) Make sure the GEMM is a single function that is side-effect free, i.e., all required inputs and outputs are passed as function parameters or return values as you proceed with the isolation.
c) Turn the function into a SystemC process and wrap it in a SystemC module.
d) Insert wait() statements into the GEMM process to model estimated execution delays.
e) The GEMM accelerator will have to communicate and synchronize with the ARM to exchange input and output data. Define the accelerator interfaces and describe them in TLM form in the SystemC module. You may have to instantiate local registers or scratchpad memories, as well as additional interfacing methods or processes, for temporary data buffering/storage and external communication in the accelerator module. However, keep in mind that local BRAM memory in the physical FPGA used in the final project will be limited, i.e., your model should not use more local memory than will be available in the final system implementation (216 block RAMs of 36 Kb each, for a total of 7.6 Mb, in the case of the ZU3EG devices used in our boards).
f) Design the overall system architecture and integrate the SystemC model of the GEMM accelerator into the virtual QEMU/SystemC platform accordingly. You can use the zynqmp_demo platform example from the QEMU/SystemC tutorial as a starting point for system integration.
g) (Extra credit) The zynqmp_demo only connects the debug demo device as a pure slave to the M_AXI_GP0 system bus port. This requires the ARM to shuffle all input and output data back and forth to/from the hardware device. The Zynq UltraScale+ provides other ports between the processing subsystem (PS) and the programmable logic (PL)/FPGA fabric that allow for a bus-mastering accelerator, including direct cache or memory access. These ports are also exposed to the SystemC side as TLM sockets in the provided co-simulation library (see 'libsystemctlm-soc/soc/Xilinx/zynqmp/xilinx-zynqmp.h'). For example, you can connect an initiator socket of your accelerator module to the S_AXI_HP0_FPD port for non-coherent bus-mastering main memory access as follows:
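    // Local interconnect with one initiator port and one target port.
    mbus = new iconnect<1,1>("membus");
    // Route the accelerator's accesses to the S_AXI_HP0_FPD port.
    mbus->memmap(0x0LL, 0x8000000 - 1, ADDRMODE_RELATIVE, -1, *(zynq.s_axi_hp_fpd[0]));
    // Bind the accelerator's initiator socket to the interconnect.
    accelerator->master_socket.bind(*(mbus->t_sk[0]));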
Note that our boards only have 2GB of DDR4 memory. As such, and as documented in the Zynq UltraScale+ manual (see address map), the addressable DRAM range shared between the CPU and the FPGA fabric (and hence SystemC) is 0x00000000 to 0x7FFFFFFF. In addition, the Zynq has 256KB of on-chip scratchpad memory (OCM) that can be accessed from both the CPU and the FPGA (by default mapped high, to addresses 0xFFFC0000 to 0xFFFFFFFF in our platform) for sharing data with faster SRAM access times.
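As a rough illustration of guidelines b) and e) above, the following C++ sketch shows a GEMM written as a side-effect-free function, together with a compile-time check of a tile-buffer footprint against the 216 x 36 Kb block RAM budget. The Q8.8 format, the 16-bit element width, the simplified C = A * B form, and all names (gemm_fixed, tile_bits, fits_in_bram) are assumptions made for illustration; your Lab 1 code may differ on each of these.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed Q8.8 fixed-point format; the format chosen in Lab 1 may differ.
constexpr int kFracBits = 8;

// Local memory budget stated in the handout: 216 block RAMs of 36 Kb each.
constexpr std::size_t kBramBits = 216u * 36u * 1024u;

// Bits needed to buffer an m x k tile of A, a k x n tile of B and an
// m x n tile of C, assuming 16-bit elements (hypothetical buffering scheme).
constexpr std::size_t tile_bits(std::size_t m, std::size_t n, std::size_t k) {
    return (m * k + k * n + m * n) * 16u;
}

constexpr bool fits_in_bram(std::size_t m, std::size_t n, std::size_t k) {
    return tile_bits(m, n, k) <= kBramBits;
}

// Side-effect-free GEMM: returns C = A * B for row-major matrices.
// All inputs arrive as parameters and the result is the return value;
// no globals are read or written. There is no saturation on overflow.
std::vector<std::int16_t> gemm_fixed(std::size_t m, std::size_t n, std::size_t k,
                                     const std::vector<std::int16_t>& a,
                                     const std::vector<std::int16_t>& b) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<std::int16_t> c(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t acc = 0;  // wider accumulator for the products
            for (std::size_t p = 0; p < k; ++p) {
                acc += static_cast<std::int32_t>(a[i * k + p]) *
                       static_cast<std::int32_t>(b[p * n + j]);
            }
            // Rescale the Q16.16 sum of products back to Q8.8.
            c[i * n + j] = static_cast<std::int16_t>(acc >> kFracBits);
        }
    }
    return c;
}
```

Because the function touches nothing outside its arguments, its body can later be moved into a SystemC process without hidden dependencies, and a check such as fits_in_bram(64, 64, 64) can guard the chosen tile size against the local memory limit.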
To set up a simulation of the YOLO/Darknet application running on the virtual prototype of our accelerated target platform, the following steps need to be performed:
a) Using the application example in the tutorial as a reference, develop a hardware abstraction layer (HAL) that can serve as a stub for the GEMM functionality. The HAL should be a drop-in substitute for the existing GEMM function call, i.e., it replaces the call with an implementation that makes equivalent calls to the external hardware instead. Modify the fpga_drv as needed.
b) Integrate your HAL into the Darknet application, cross-compile the 'darknet' executable, add it to the simulated platform, and run the setup in the emulated ARM+FPGA environment. Verify that the hardware/software interaction is working properly and that the correct output is produced when running the Tiny YOLO network on top of your Darknet framework.
c) (Extra credit) Explore possible parallelization of the processing chain to exploit any available concurrency between the GEMM running in hardware and the rest of the software running on the CPU. Refer to the instructions in Lab 1 for general parallelization hints and strategies. Note that this may require you to turn the software side into a multi-threaded application, such that the part of the processing chain containing the external GEMM calls runs in parallel with other parts of the Darknet framework.
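The drop-in HAL idea from step a) can be sketched in plain C++ as follows. This is only a host-side illustration under several assumptions: the register map (REG_M, REG_N, REG_K, REG_CTRL, REG_STATUS), the MockAccel class standing in for the memory-mapped device, the polling-based handshake, and the simplified gemm_sw/gemm_hal signatures are all hypothetical and are not taken from the tutorial, the fpga_drv, or Darknet. On the target, the register and buffer accesses would go through the driver instead of a C++ object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical register map (word indices); the real layout is a design decision.
enum Reg : std::size_t { REG_M, REG_N, REG_K, REG_CTRL, REG_STATUS, REG_COUNT };

// Software stand-in for the accelerator so the sketch runs on a host machine.
struct MockAccel {
    std::uint32_t regs[REG_COUNT] = {};
    std::vector<std::int32_t> a, b, c;  // stand-ins for device-side buffers

    void write_reg(Reg r, std::uint32_t v) {
        regs[r] = v;
        if (r == REG_CTRL && (v & 1u)) run();  // start bit triggers the GEMM
    }
    std::uint32_t read_reg(Reg r) const { return regs[r]; }

    void run() {
        const std::size_t m = regs[REG_M], n = regs[REG_N], k = regs[REG_K];
        c.assign(m * n, 0);
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j)
                for (std::size_t p = 0; p < k; ++p)
                    c[i * n + j] += a[i * k + p] * b[p * n + j];
        regs[REG_STATUS] = 1u;  // raise the done flag
    }
};

// Reference software GEMM: C = A * B, row-major.
void gemm_sw(std::size_t m, std::size_t n, std::size_t k,
             const std::int32_t* a, const std::int32_t* b, std::int32_t* c) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t acc = 0;
            for (std::size_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

// HAL stub: same data parameters as gemm_sw, plus a handle to the device.
void gemm_hal(MockAccel& dev, std::size_t m, std::size_t n, std::size_t k,
              const std::int32_t* a, const std::int32_t* b, std::int32_t* c) {
    dev.a.assign(a, a + m * k);  // copy the inputs to the device
    dev.b.assign(b, b + k * n);
    dev.write_reg(REG_M, static_cast<std::uint32_t>(m));
    dev.write_reg(REG_N, static_cast<std::uint32_t>(n));
    dev.write_reg(REG_K, static_cast<std::uint32_t>(k));
    dev.write_reg(REG_STATUS, 0u);               // clear a stale done flag
    dev.write_reg(REG_CTRL, 1u);                 // start the computation
    while (dev.read_reg(REG_STATUS) == 0u) { }   // poll until done
    for (std::size_t i = 0; i < m * n; ++i) c[i] = dev.c[i];  // copy result back
}
```

Keeping the HAL's data parameters identical to those of the software GEMM means the call site in the application changes only in the function name, and the two versions can be run side by side to compare outputs, as step b) asks.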
Submit your report and files to Canvas. The report should summarize your workflow, describe your simulation setup, document the model and application code modifications, and analyze your simulation results. Attach an archive (.tar.gz or .zip) that includes:
• The SystemC platform model, including sources.
• Source code for the GEMM SystemC module.
• Source code for the ARM HAL and driver code.
• A README file that describes how to compile and run your platform simulation.