1 Overview

The goals of this lab are to:

• Integrate the GEMM hardware accelerator with the rest of the ARM-based SoC platform and YOLO/Darknet application.

• Prototype the target platform running the YOLO/Darknet application in a SystemC TLM2.0 environment.

The assignment of this lab includes the following:

• Develop a SystemC module for the GEMM hardware accelerator, and integrate it into an overall SystemC TLM platform model.

• Develop a hardware abstraction layer (HAL) and driver for access to the external GEMM accelerator from the ARM, and integrate it with the rest of the YOLO/Darknet application.

• Cross-compile the accelerated YOLO/Darknet application to run under Linux on the (simulated) ARM platform.

• Co-simulate the YOLO/Darknet application running on a virtual platform model to validate the HW/SW implementation.

2 Tutorial

Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:

QEMU/SystemC Tutorial

3 Integrating the GEMM Accelerator

Here are some guidelines on generating a standalone General Matrix-Matrix Multiplication (GEMM) accelerator module modeled in SystemC and integrated into a TLM-based virtual platform:

a) Make sure to start from the standalone fixed-point GEMM that served as input for hardware synthesis in Lab 2.

b) Turn the function into a SystemC process and wrap it in a SystemC module.

c) Insert wait() statements into the GEMM process to model estimated execution delays. You can use the measurements you obtained from the co-simulation and HLS reports in Lab 2, for a design point of your choice, as a basis for delay modeling.

d) The GEMM accelerator will have to communicate and synchronize with the ARM to exchange input and output data. Define the accelerator interfaces and describe them in TLM form in the SystemC module. You may have to instantiate local scratchpad memories and additional interfacing methods or processes for temporary data buffering/storage and external communication in the accelerator module. This scratchpad memory corresponds to the local accelerator SRAM that held the A, B and C inputs/outputs of the gemm() function in Lab 2. However, keep in mind that local SRAM in the physical FPGA used in the final project will be limited, i.e. your model should not use more local memory than will be available in the final system implementation later (216 block RAMs of 36 Kb each, for a total of 7.6 Mb, in the case of the ZU3EG devices used in our boards).

e) Design the overall system architecture and integrate the SystemC model of the GEMM accelerator into the virtual QEMU/SystemC platform accordingly. You can use the zynqmp_demo platform example from the QEMU/SystemC tutorial as a starting point for system integration.

f) (Extra credit) The zynqmp_demo only connects the debug demo device as a pure slave to the M_AXI_GP0 system bus port. This requires the ARM to copy all input and output data to and from the hardware device. As an alternative, you can design a bus-mastering accelerator that handles all such copying itself. In this case, the accelerator will need to include an active load/store unit (as a SystemC process) that copies data between external system DRAM and local accelerator SRAM under control of, and in coordination with, the GEMM computations within the accelerator. The Zynq UltraScale+ provides ports between the processing subsystem (PS) and the programmable logic (PL)/FPGA fabric that allow for a bus-mastering accelerator, including direct cache or memory access. These ports are also exposed to the SystemC side as TLM sockets in the provided co-simulation library (see libsystemctlm-soc/soc/Xilinx/zynqmp/xilinx-zynqmp.h). For example, you can connect an initiator socket of your accelerator module to the S_AXI_HP0_FPD port for non-coherent bus-mastering main memory access as follows:

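// Interconnect between the accelerator and the S_AXI_HP0_FPD port
mbus = new iconnect<1,1> ("membus");
mbus->memmap(0x0LL, 0x8000000 - 1, ADDRMODE_RELATIVE, -1, *(zynq.s_axi_hp_fpd[0]));
// Bind the accelerator's initiator socket to the interconnect
accelerator->master_socket.bind(*(mbus->t_sk[0]));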

Note that our boards only have 2GB of DDR4 memory. As such, and as documented in the Zynq UltraScale+ manual (see address map), the addressable DRAM range shared between the CPU and the FPGA fabric (and hence SystemC) is 0x00000000 to 0x7FFFFFFF. In addition, the Zynq has 256KB of on-chip scratchpad memory (OCM) that can be accessed from both the CPU and the FPGA (by default mapped high to addresses 0xFFFC0000 to 0xFFFFFFFF in our platform) for sharing data with faster SRAM access times.
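The numeric constraints in the guidelines above (the block RAM budget, the shared DRAM window, and cycle-based delays) can be written down as small checks before committing to buffer sizes in the model. The following is a minimal sketch in plain C++, not part of the provided platform code. All function names are hypothetical, the tile-size formula counts raw bits only and ignores how a synthesis tool actually packs arrays into block RAMs, and the clock frequency passed to cycles_to_ns() is an assumption you would take from your own Lab 2 design point.

```cpp
#include <cassert>
#include <cstdint>

// Local memory budget quoted in guideline d): 216 block RAMs of 36 Kb each.
constexpr std::uint64_t kBramBlocks  = 216;
constexpr std::uint64_t kBitsPerBram = 36 * 1024;
constexpr std::uint64_t kBudgetBits  = kBramBlocks * kBitsPerBram;
static_assert(kBudgetBits == 7962624, "216 x 36 Kb, roughly 7.6 Mb");

// Raw bits needed to buffer an M x K tile of A, a K x N tile of B and an
// M x N tile of C. Idealized: real BRAM usage depends on width and depth.
constexpr std::uint64_t tile_bits(std::uint64_t m, std::uint64_t n,
                                  std::uint64_t k, std::uint64_t elem_bits) {
    return (m * k + k * n + m * n) * elem_bits;
}

// True if the three tiles fit into the stated budget (hypothetical helper).
constexpr bool fits_in_bram(std::uint64_t m, std::uint64_t n,
                            std::uint64_t k, std::uint64_t elem_bits) {
    return tile_bits(m, n, k, elem_bits) <= kBudgetBits;
}

// Last address of the shared DRAM window quoted above (2 GB).
constexpr std::uint64_t kDramLast = 0x7FFFFFFFULL;

// True if the buffer [addr, addr + len) lies entirely inside shared DRAM.
bool in_shared_dram(std::uint64_t addr, std::uint64_t len) {
    assert(len > 0);
    return addr <= kDramLast && (len - 1) <= (kDramLast - addr);
}

// Convert a cycle count (e.g. from an HLS report) into nanoseconds for a
// given clock in MHz. The result could be used as the argument of a timed
// wait() in the GEMM process. Integer division truncates.
std::uint64_t cycles_to_ns(std::uint64_t cycles, std::uint64_t clock_mhz) {
    assert(clock_mhz > 0);
    return cycles * 1000 / clock_mhz;
}
```

For instance, with 16-bit elements, 128 x 128 tiles of A, B and C need 786,432 bits and fit within the budget, whereas 512 x 512 tiles need 12,582,912 bits and do not.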

4 YOLO/Darknet Application Mapping

To set up a simulation of the YOLO/Darknet application running on the virtual prototype of our accelerated target platform, the following steps need to be performed:

a) Using the application example in the tutorial as a reference, develop a hardware abstraction layer (HAL) and driver (kernel module) that can serve as a stub for the GEMM functionality. The HAL should be a drop-in substitute for the existing GEMM function call, i.e. it replaces the call with an implementation that makes equivalent calls to the external hardware instead. Make any necessary modifications to the kernel module if you need to.

b) Integrate your HAL into the Darknet application, cross-compile the 'darknet' executable, add it to the simulated platform and run the setup in the co-simulated ARM+FPGA environment. Verify that the hardware/software interaction is working properly and that the correct output is produced when running the Tiny YOLO network on top of your Darknet framework.

c) (Extra credit) Explore possible parallelization of the processing chain to exploit any available concurrency between the GEMM running in hardware and the rest of the software running on the CPU. Refer to the instructions in Lab 1 about general parallelization hints and strategies. Note that this may require you to turn the software side into a multi-threaded application, such that a part of the processing chain containing the external GEMM calls runs in parallel with other parts of the darknet framework.
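The drop-in idea from step a) can be sketched as follows. This is a host-side illustration under stated assumptions, not the tutorial's driver interface: FakeGemmDevice is a hypothetical software stand-in for the accelerator (a real HAL would reach the device through the kernel module instead), the Q16.16 fixed-point format is an assumption, and hal_gemm_nn() covers only the non-transposed case, accumulating alpha * A * B into C.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kFracBits = 16;  // assumed Q16.16 format, not taken from Lab 2
constexpr float kScale = static_cast<float>(1 << kFracBits);

inline std::int32_t to_fixed(float x) {
    return static_cast<std::int32_t>(std::lround(x * kScale));
}
inline float to_float(std::int32_t x) { return static_cast<float>(x) / kScale; }

// Hypothetical stand-in for the accelerator and its local scratchpads.
struct FakeGemmDevice {
    std::vector<std::int32_t> a, b, c;  // dense, row-major tiles
    int m = 0, n = 0, k = 0;

    // Computes c = a * b in fixed point, as the hardware would.
    void run() {
        c.assign(static_cast<std::size_t>(m) * n, 0);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                std::int64_t acc = 0;
                for (int p = 0; p < k; ++p)
                    acc += static_cast<std::int64_t>(a[i * k + p]) * b[p * n + j];
                c[i * n + j] =
                    static_cast<std::int32_t>(acc / (std::int64_t{1} << kFracBits));
            }
    }
};

// HAL entry point (hypothetical name): C += alpha * A * B, where lda, ldb
// and ldc are the row strides of the caller's float arrays.
void hal_gemm_nn(FakeGemmDevice& dev, int M, int N, int K, float alpha,
                 const float* A, int lda, const float* B, int ldb,
                 float* C, int ldc) {
    assert(M > 0 && N > 0 && K > 0);
    dev.m = M; dev.n = N; dev.k = K;
    dev.a.resize(static_cast<std::size_t>(M) * K);
    dev.b.resize(static_cast<std::size_t>(K) * N);

    // Convert and copy the inputs into the device buffers.
    for (int i = 0; i < M; ++i)
        for (int p = 0; p < K; ++p) dev.a[i * K + p] = to_fixed(A[i * lda + p]);
    for (int p = 0; p < K; ++p)
        for (int j = 0; j < N; ++j) dev.b[p * N + j] = to_fixed(B[p * ldb + j]);

    // A real HAL would start the device here and wait for completion.
    dev.run();

    // Copy the result back, converting to float and accumulating into C.
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            C[i * ldc + j] += alpha * to_float(dev.c[i * N + j]);
}
```

Keeping the conversion and copy steps inside the HAL leaves the call site in the application unchanged apart from the function being called, which is what makes the substitution drop-in. Matrices larger than the local buffers would additionally need to be split into tiles, which this sketch does not attempt.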

5 Lab Report Submission

Submit your report to Canvas and your code to GitHub Classroom. The report should summarize your workflow, describe your simulation setup, document the model and application code modifications, and analyze your simulation results. Include the following in your GitHub Classroom repository:

• The SystemC platform model, including sources.

• Source code for the GEMM SystemC module.

• Source code for the ARM HAL and driver.

• A README file that describes how to compile and run your platform simulation.