System-on-Chip (SoC) Design
ECE382M.20, Fall 2023
Lab #3
Due: 11:59pm, November 6, 2023
Instructions:
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board on Ed for Q&A.
• Submit the report on Canvas and code on GitHub Classroom.
The goals of this lab are to:
• Integrate the GEMM hardware accelerator with the rest of the ARM-based SoC platform and the YOLO/Darknet application.
• Prototype the target platform running the YOLO/Darknet application in a SystemC TLM-2.0 environment.
The assignment for this lab includes the following:
• Develop a SystemC module for the GEMM hardware accelerator, and integrate it into an overall SystemC TLM platform model.
• Develop a hardware abstraction layer (HAL) and driver for access to the external GEMM accelerator from the ARM, and integrate it with the rest of the YOLO/Darknet application.
• Cross-compile the accelerated YOLO/Darknet application to run under Linux on the (simulated) ARM platform.
• Co-simulate the YOLO/Darknet application running on a virtual platform model to validate the HW/SW implementation.
Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:
Here are some guidelines on generating a standalone General Matrix-Matrix Multiplication (GEMM) accelerator module modeled in SystemC and integrated into a TLM-based virtual platform:
a) Make sure to start from the standalone fixed-point GEMM that served as input for hardware synthesis in Lab 2.
b) Turn the function into a SystemC process and wrap it in a SystemC module.
c) Insert wait() statements into the GEMM process to model estimated execution delays. You can use the measurements you obtained from the co-simulation and HLS reports in Lab 2 for a design point of your choice as the basis for delay modeling. A minimal sketch combining steps b) and c) follows.
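As a rough illustration of steps b) and c), the fixed-point gemm() body can be wrapped in an SC_THREAD that waits for a start trigger and models the compute latency with a timed wait(). The module name, the start event, and the 10 us figure below are placeholders; substitute the delay you derived from your Lab 2 co-simulation and HLS reports.

#include <systemc.h>

// Sketch of a GEMM accelerator module; names and the latency value
// are placeholders, not part of the provided infrastructure.
SC_MODULE(GemmAcc) {
    sc_event start;          // notified when the ARM issues a start command

    void gemm_thread() {
        for (;;) {
            wait(start);     // block until triggered by the interface logic
            // ... fixed-point gemm() body from Lab 2, operating on
            //     local scratchpad copies of A, B and C ...
            wait(10, SC_US); // estimated execution delay (use Lab 2 numbers)
            // ... signal completion back to the ARM ...
        }
    }

    SC_CTOR(GemmAcc) {
        SC_THREAD(gemm_thread);
    }
};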
d) The GEMM accelerator will have to communicate and synchronize with the ARM to exchange input and output data. Define the accelerator interfaces and describe them in TLM form in the SystemC module. You may have to instantiate local scratchpad memories and additional interfacing methods or processes for temporary data buffering/storage and external communication in the accelerator module. This scratchpad memory corresponds to the local accelerator SRAM that contained the A, B and C inputs/outputs of the gemm() function in Lab 2. However, keep in mind that local SRAM in the physical FPGA used in the final project will be limited, i.e. your model should not use more local memory than what will be available in the final system implementation later (216 block RAMs of 36 Kb each, for a total of 7.6 Mb, in the case of the ZU3EG devices used in our boards). One possible shape of such a TLM interface is sketched below.
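The sketch below shows one way the accelerator could expose a TLM-2.0 target socket that decodes ARM accesses into a start register and a local scratchpad. The register offsets, the 64KB scratchpad size, and the socket name are illustrative assumptions, not a required memory map.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_target_socket.h>
#include <cstdint>
#include <cstring>

// Sketch of the accelerator's slave-side TLM interface; all offsets
// and sizes are placeholders.
SC_MODULE(GemmAccSlave) {
    tlm_utils::simple_target_socket<GemmAccSlave> tgt_socket;
    sc_event start;
    uint8_t sram[64 * 1024];   // local scratchpad holding A, B and C tiles

    // Blocking transport: decode reads/writes from the ARM into register
    // and scratchpad accesses.
    void b_transport(tlm::tlm_generic_payload &gp, sc_time &delay) {
        uint64_t addr = gp.get_address();
        uint8_t *data = gp.get_data_ptr();
        unsigned len  = gp.get_data_length();

        if (addr == 0x0 && gp.is_write()) {
            start.notify();    // control register: kick off a GEMM run
        } else if (addr >= 0x1000 && addr - 0x1000 + len <= sizeof(sram)) {
            uint8_t *mem = &sram[addr - 0x1000];
            if (gp.is_write()) memcpy(mem, data, len);
            else               memcpy(data, mem, len);
        }
        gp.set_response_status(tlm::TLM_OK_RESPONSE);
    }

    SC_CTOR(GemmAccSlave) : tgt_socket("tgt_socket") {
        tgt_socket.register_b_transport(this, &GemmAccSlave::b_transport);
    }
};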
e) Design the overall system architecture and integrate the SystemC model of the GEMM accelerator into the virtual QEMU/SystemC platform accordingly. You can use the zynqmp_demo platform example from the QEMU/SystemC tutorial as a starting point for system integration; a possible connection of the accelerator's slave interface is sketched below.
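Mirroring the way zynqmp_demo attaches its debug device, the accelerator's target socket could be mapped behind the M_AXI_GP0 port roughly as follows. The base address 0xa0000000, the mapping size, and the s_axi_hpm_fpd[0] port member name are assumptions based on the tutorial's demo and may differ in your version of the co-simulation library.

// Attach the accelerator as a slave behind the PS master port (assumed names).
iconnect<1,1> *bus = new iconnect<1,1> ("bus");
zynq.s_axi_hpm_fpd[0]->bind(*(bus->t_sk[0]));
bus->memmap(0xa0000000LL, 0x10000 - 1, ADDRMODE_RELATIVE, -1,
            accelerator->tgt_socket);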
f) (Extra credit) The zynqmp_demo only connects the debug demo device as a pure slave to the M_AXI_GP0 system bus port. This requires the ARM to copy all input and output data back and forth to/from the hardware device. As an alternative, you can design a bus-mastering accelerator that handles all such copying itself. The accelerator in this case will need to include an active load/store unit (as a SystemC process) that copies data between external system DRAM and local accelerator SRAM under control of and in coordination with the GEMM computations within the accelerator. The Zynq UltraScale+ provides ports between the processing subsystem (PS) and the programmable logic (PL)/FPGA fabric that allow for a bus-mastering accelerator with direct cache or memory access. These ports are also exposed on the SystemC side as TLM sockets in the provided co-simulation library (see libsystemctlm-soc/soc/Xilinx/zynqmp/xilinx-zynqmp.h). For example, you can connect an initiator socket of your accelerator module to the S_AXI_HP0_FPD port for non-coherent bus-mastering main memory access as follows:
mbus = new iconnect<1,1> ("membus");
mbus->memmap(0x0LL, 0x80000000LL - 1, ADDRMODE_RELATIVE, -1, *(zynq.s_axi_hp_fpd[0]));
accelerator->master_socket.bind(*(mbus->t_sk[0]));
Note that our boards only have 2GB of DDR4 memory. As such, and as documented in the Zynq UltraScale+ manual (see address map), the addressable DRAM range shared between the CPU and the FPGA fabric (and hence SystemC) is 0x00000000 to 0x7FFFFFFF. In addition, the Zynq has 256KB of on-chip scratchpad memory (OCM) that can be accessed from both the CPU and the FPGA (by default mapped high, to addresses 0xFFFC0000 through 0xFFFFFFFF, in our platform) for sharing data with faster SRAM access times.
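Building on the snippet above, the load/store unit inside such a bus-mastering accelerator could issue TLM transactions through its initiator socket roughly as follows. The DRAM source address, transfer size, and module/socket names are placeholders for illustration.

#include <systemc.h>
#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>
#include <cstdint>

// Sketch of an active load/store unit; addresses and sizes are placeholders.
SC_MODULE(GemmLoadStore) {
    tlm_utils::simple_initiator_socket<GemmLoadStore> master_socket;
    uint8_t sram[64 * 1024];   // local accelerator scratchpad

    // Copy one input tile from system DRAM into local SRAM.
    void fetch(uint64_t dram_addr, unsigned len) {
        tlm::tlm_generic_payload gp;
        sc_time delay = SC_ZERO_TIME;
        gp.set_command(tlm::TLM_READ_COMMAND);
        gp.set_address(dram_addr);
        gp.set_data_ptr(sram);
        gp.set_data_length(len);
        gp.set_streaming_width(len);
        gp.set_byte_enable_ptr(nullptr);
        gp.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
        master_socket->b_transport(gp, delay);
        wait(delay);           // account for the bus/memory latency returned
    }

    void ls_thread() {
        fetch(0x10000000, sizeof(sram));  // placeholder DRAM address
        // ... trigger the GEMM on the fetched tile, then write C back
        //     with a matching TLM_WRITE_COMMAND transaction ...
    }

    SC_CTOR(GemmLoadStore) : master_socket("master_socket") {
        SC_THREAD(ls_thread);
    }
};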
To set up a simulation of the YOLO/Darknet application running on the virtual prototype of our accelerated target platform, the following steps need to be performed:
a) Using the application example in the tutorial as a reference, develop a hardware abstraction layer (HAL) and driver (kernel module) that can serve as a stub for the GEMM functionality. The HAL should be a drop-in substitute for the existing GEMM function call, i.e. it replaces the call with an implementation that makes equivalent calls to the external hardware instead. Modify the kernel module as needed. A sketch of such a stub is shown below.
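The sketch below outlines what such a drop-in HAL stub could look like, assuming the kernel module exposes the accelerator's registers and scratchpad through a memory-mappable device node. The /dev/gemm_acc path, mapping size, register offsets, and the short element type are all placeholders that must match your driver, memory map, and Lab 2 fixed-point design.

#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Base pointer to the accelerator's memory-mapped register/SRAM window.
static volatile uint8_t *acc_base;

int gemm_hw_init(void) {
    int fd = open("/dev/gemm_acc", O_RDWR | O_SYNC);  // placeholder node
    if (fd < 0) return -1;
    void *m = mmap(NULL, 0x20000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m == MAP_FAILED) return -1;
    acc_base = (volatile uint8_t *)m;
    return 0;
}

// Drop-in substitute for the software gemm() call (placeholder offsets).
void gemm_hw(int M, int N, int K, const short *A, const short *B, short *C) {
    memcpy((void *)(acc_base + 0x1000), A, M * K * sizeof(short));   // load A
    memcpy((void *)(acc_base + 0x8000), B, K * N * sizeof(short));   // load B
    *(volatile uint32_t *)(acc_base + 0x0) = 1;                      // start
    while (*(volatile uint32_t *)(acc_base + 0x4) == 0)              // poll done
        ;
    memcpy(C, (const void *)(acc_base + 0x10000), M * N * sizeof(short));
}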
b) Integrate your HAL into the Darknet application, cross-compile the 'darknet' executable, add it to the simulated platform, and run the setup in the co-simulated ARM+FPGA environment. Verify that the hardware/software interaction is working properly and that the correct output is produced when running the Tiny YOLO network on top of your Darknet framework.
c) (Extra credit) Explore possible parallelization of the processing chain to exploit any available concurrency between the GEMM running in hardware and the rest of the software running on the CPU. Refer to the instructions in Lab 1 for general parallelization hints and strategies. Note that this may require you to turn the software side into a multi-threaded application, such that the part of the processing chain containing the external GEMM calls runs in parallel with other parts of the Darknet framework. One possible threading pattern is sketched below.
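Purely as an illustration of the pattern, the snippet below overlaps an offloaded GEMM call with CPU-side work using std::thread. The gemm_hw() and cpu_postprocess() functions are hypothetical stand-ins for your HAL call and the surrounding Darknet stages; the sleeps merely model some amount of work.

#include <chrono>
#include <thread>

// Placeholder stand-ins for the HAL offload and a CPU-side Darknet stage.
static void gemm_hw()         { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }
static void cpu_postprocess() { std::this_thread::sleep_for(std::chrono::milliseconds(3)); }

int main() {
    // Launch the hardware GEMM for the current layer asynchronously...
    std::thread hw(gemm_hw);
    // ...while the CPU post-processes the previous layer's output.
    cpu_postprocess();
    hw.join();   // synchronize before consuming the GEMM results
    return 0;
}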
Submit your report to Canvas and code to GitHub Classroom. The report should summarize your workflow, describe your simulation setup, document the model and application code modifications, and analyze your simulation results. Include the following in your GitHub Classroom repository:
• The SystemC platform model, including sources.
• Source code for the GEMM SystemC module.
• Source code for the ARM HAL and driver.
• A README file that describes how to compile and run your platform simulation.