1 Overview

The goals of this lab are to:

• Partition the Darknet code into software on the ARM and external hardware accelerators.

• Prototype the target platform in a SystemC TLM2.0 environment.

The assignment of this lab includes the following:

• Isolate the GEMM and convert it into a SystemC-modeled hardware module.

• Develop a hardware abstraction layer (HAL) for access to the external GEMM accelerator from the ARM.

• Cross-compile the YOLO/Darknet application to run under Linux on the ARM board.

• Simulate the YOLO/Darknet application running on a virtual platform model to validate the HW/SW implementation.

2 Tutorial

Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:

QEMU/SystemC Tutorial

3 Isolating the GEMM

Here are some guidelines for generating a standalone General Matrix-Matrix Multiplication (GEMM) accelerator module modeled in SystemC:

a) Make sure to start from a GEMM that has already been converted into a fixed-point version. You can start with the code you developed in Lab 1.

b) As you proceed with the isolation, make sure the GEMM is a single function that is free of side effects, i.e. all required inputs and outputs are passed as function parameters or as the return value.

c) Turn the function into a SystemC process and wrap it in a SystemC module.

d) Insert wait() statements to model estimated execution delays into the GEMM process.

e) The GEMM accelerator will have to communicate and synchronize with the ARM to exchange input and output data. Define the accelerator interfaces and describe them in TLM form in the SystemC module. You may have to instantiate local registers or scratchpad memories, along with additional interfacing methods or processes, for temporary data buffering/storage and external communication in the accelerator module. However, keep in mind that local BRAM memory in the physical FPGA used in the final project will be limited, i.e. your model should not use more local memory than will be available in the final system implementation later (216 block RAMs of 36 Kb each, for a total of 7.6 Mb, in the case of the ZU3EG devices used in our boards).

f) Design the overall system architecture and integrate the SystemC model of the GEMM accelerator into the virtual QEMU/SystemC platform accordingly. You can use the zynqmp_demo platform example from the QEMU/SystemC tutorial as a starting point for system integration.

g) (Extra credit) The zynqmp_demo only connects the debug demo device as a pure slave to the M_AXI_GP0 system bus port. This requires the ARM to shuffle all output and input data back and forth from/to the hardware device. The Zynq UltraScale+ provides other ports between the processing subsystem (PS) and the programmable logic (PL)/FPGA fabric that will allow for a bus-mastering accelerator including direct cache or memory access. These ports are also exposed to the SystemC side as TLM sockets in the provided co-simulation library (see ‘libsystemctlm-soc/soc/Xilinx/zynqmp/xilinx-zynqmp.h’). For example, you can connect an initiator socket of your accelerator module to the S_AXI_HP0_FPD port for non-coherent bus-mastering main memory access as follows:

mbus = new iconnect<1,1> ("membus");
mbus->memmap(0x0LL, 0x8000000 - 1, ADDRMODE_RELATIVE, -1, *(zynq.s_axi_hp_fpd[0]));
accelerator->master_socket.bind(*(mbus->t_sk[0]));

Note that our boards only have 2GB of DDR4 memory. As such, and as documented in the Zynq UltraScale+ manual (see address map), the addressable DRAM range shared between the CPU and the FPGA fabric (and hence SystemC) is 0x00000000 to 0x7FFFFFFF. In addition, the Zynq has 256KB of on-chip scratchpad memory (OCM) that can be accessed from both the CPU and FPGA (by default mapped high to addresses 0xFFFC0000 to 0xFFFFFFFF in our platform) for sharing of data with faster SRAM access times.
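As a starting point for guidelines b) and e), the following plain C++ sketch shows one possible shape of a side-effect-free fixed-point GEMM kernel, together with a compile-time check of a tile size against the BRAM budget quoted above. The Q8.8 operand format, the names gemm_fixed and fits_in_bram, and the row-major layout are assumptions made for illustration only; your Lab 1 code may use a different format and signature.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed fixed-point format for this sketch: Q8.8 operands (int16_t),
// results accumulated in int32_t with 8 fractional bits.
constexpr int kFracBits = 8;

// Local memory budget from the text: 216 block RAMs of 36 Kb each.
constexpr std::size_t kBramBits  = 216u * 36u * 1024u;  // 7,962,624 bits (about 7.6 Mb)
constexpr std::size_t kBramBytes = kBramBits / 8;       // 995,328 bytes

// True if local buffers for A (m x k), B (k x n) and C (m x n) fit in BRAM.
constexpr bool fits_in_bram(std::size_t m, std::size_t n, std::size_t k) {
    return (m * k + k * n) * sizeof(std::int16_t)
         + (m * n) * sizeof(std::int32_t) <= kBramBytes;
}

// C += A * B for row-major matrices. Every input and output is a parameter,
// so the function touches no global state and can later be moved into a
// SystemC process unchanged.
void gemm_fixed(int m, int n, int k,
                const std::int16_t* a, int lda,
                const std::int16_t* b, int ldb,
                std::int32_t* c, int ldc) {
    for (int i = 0; i < m; ++i) {
        for (int p = 0; p < k; ++p) {
            const std::int32_t a_ip = a[i * lda + p];
            for (int j = 0; j < n; ++j) {
                // Q8.8 * Q8.8 yields 16 fractional bits; shift back to 8.
                // Arithmetic shift is assumed for negative products
                // (guaranteed by the language since C++20).
                c[i * ldc + j] += (a_ip * b[p * ldb + j]) >> kFracBits;
            }
        }
    }
}
```

With these assumptions, a 256 x 256 x 256 tile needs 512 KB of local storage and fits in the budget, whereas a 1024 x 1024 x 1024 tile does not, so larger matrices would have to be processed in tiles or streamed from main memory.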

4 YOLO/Darknet Application Mapping

To set up a simulation of the YOLO/Darknet application running on the virtual prototype of our accelerated target platform, the following steps need to be performed:

a) Using the application example in the tutorial as a reference, develop a hardware abstraction layer (HAL) that can serve as a stub for the GEMM functionality. The HAL should be a drop-in substitute for the existing GEMM function call, i.e. it replaces the call with an implementation that makes equivalent calls to the external hardware instead. Modify the fpga_drv if necessary.

b) Integrate your HAL into the Darknet application, cross-compile the 'darknet' executable, add it to the simulated platform and run the setup in the emulated ARM+FPGA environment. Verify that the hardware/software interaction is working properly and that the correct output is produced when running the Tiny YOLO network on top of your Darknet framework.

c) (Extra credit) Explore possible parallelization of the processing chain to exploit any available concurrency between the GEMM running in hardware and the rest of the software running on the CPU. Refer to the instructions in Lab 1 about general parallelization hints and strategies. Note that this may require you to turn the software side into a multi-threaded application, such that a part of the processing chain containing the external GEMM calls runs in parallel with other parts of the darknet framework.
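The HAL described in step a) could be structured roughly as in the sketch below for the pure-slave case, where the CPU copies operands into the device, starts it, polls for completion and reads the result back. Everything here is hypothetical: the register layout (GemmReg), the GemmBus abstraction, the gemm_hal signature and the FakeGemmDevice stand-in are illustrative assumptions, not the interface of fpga_drv or of the tutorial example. On the target, GemmBus would be backed by whatever access mechanism your driver provides.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical register map, given as 32-bit word offsets.
enum GemmReg : std::size_t {
    REG_CTRL   = 0,   // write 1 to start a computation
    REG_STATUS = 1,   // reads 1 once the result is ready
    REG_M = 2, REG_N = 3, REG_K = 4,
    REG_DATA   = 16   // operand window: A, then B, then C
};

// Abstract register access, so the same HAL code can talk to a
// memory-mapped device or to a software model.
struct GemmBus {
    virtual ~GemmBus() = default;
    virtual void write32(std::size_t word, std::uint32_t value) = 0;
    virtual std::uint32_t read32(std::size_t word) = 0;
};

// Computes C += A * B on the device (dense row-major matrices).
void gemm_hal(GemmBus& bus, int m, int n, int k,
              const std::int16_t* a, const std::int16_t* b, std::int32_t* c) {
    bus.write32(REG_M, static_cast<std::uint32_t>(m));
    bus.write32(REG_N, static_cast<std::uint32_t>(n));
    bus.write32(REG_K, static_cast<std::uint32_t>(k));
    std::size_t off = REG_DATA;
    for (int i = 0; i < m * k; ++i)   // sign-extend each operand to 32 bits
        bus.write32(off++, static_cast<std::uint32_t>(static_cast<std::int32_t>(a[i])));
    for (int i = 0; i < k * n; ++i)
        bus.write32(off++, static_cast<std::uint32_t>(static_cast<std::int32_t>(b[i])));
    const std::size_t c_off = off;
    for (int i = 0; i < m * n; ++i)
        bus.write32(off++, static_cast<std::uint32_t>(c[i]));
    bus.write32(REG_CTRL, 1);                  // start the accelerator
    while (bus.read32(REG_STATUS) == 0) {}     // busy-wait until done
    for (int i = 0; i < m * n; ++i)
        c[i] = static_cast<std::int32_t>(bus.read32(c_off + i));
}

// Software stand-in for the accelerator, useful for testing the HAL on the
// host. It uses plain integer multiply-accumulate (no fixed-point scaling).
struct FakeGemmDevice : GemmBus {
    std::vector<std::uint32_t> regs = std::vector<std::uint32_t>(4096, 0);
    void write32(std::size_t word, std::uint32_t value) override {
        regs.at(word) = value;
        if (word == REG_CTRL && value == 1) { regs[REG_STATUS] = 0; run(); }
    }
    std::uint32_t read32(std::size_t word) override { return regs.at(word); }
    void run() {
        const std::size_t m = regs[REG_M], n = regs[REG_N], k = regs[REG_K];
        std::uint32_t* a = &regs[REG_DATA];
        std::uint32_t* b = a + m * k;
        std::uint32_t* c = b + k * n;
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                std::int32_t acc = static_cast<std::int32_t>(c[i * n + j]);
                for (std::size_t p = 0; p < k; ++p)
                    acc += static_cast<std::int32_t>(a[i * k + p]) *
                           static_cast<std::int32_t>(b[p * n + j]);
                c[i * n + j] = static_cast<std::uint32_t>(acc);
            }
        regs[REG_STATUS] = 1;
    }
};
```

Keeping the register access behind an interface like this is one way to check the HAL logic on the host before running it in the emulated ARM+FPGA environment. The busy-wait loop is also the natural place to start when exploring the parallelization in step c), since it marks the time during which the CPU is otherwise idle.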

5 Lab Report Submission

Submit your report and files to Canvas. The report should summarize your workflow, describe your simulation setup, document the model and application code modifications, and analyze your simulation results. Attach an archive (.tar.gz or .zip) that includes:

• The SystemC platform model, including sources.

• Source code for the GEMM SystemC module.

• Source code for the ARM HAL and driver code.

• A README file that describes how to compile and run your platform simulation.