1 Overview

The goals of this lab are to:

• Learn the structure of the Darknet source code and compile the code for the ARM platform

• Identify and propose ways to remove the bottleneck of the code when run on the ARM platform

The assignment of this lab includes the following:

• Set up the design and board environment

• Profile the code to identify its time-consuming portions

• Complete an exercise to remove a type of bottleneck

• Isolate modules of the Darknet and perform floating-to-fixed point conversion

• Perform additional software optimizations

2 Environment Setup

Lab work for this class can be done either on the ECE Department’s LRC Linux servers or on the Ultra96 board. For Lab 1, you can either compile the application on the board or cross-compile it on the LRC servers. Our target is the ARM platform, however, so all profiling must be done on the board itself. Note that for functional testing, you can also natively compile and execute Darknet on any other platform, e.g. an Intel machine.

(a) ECE Linux Servers

We will be using the LRC servers for the class. Instructions for remote access via ssh are listed here: https://wikis.utexas.edu/display/eceit/ECE+Linux+Application+Servers

You can use the /misc/scratch directory on the LRC machines as your own workspace. The scratch directory will not be wiped out until the end of the semester. However, scratch space is also not backed up, i.e. use at your own risk. Execute the following commands:

% cd /misc/scratch

% mkdir <your username>

For software development targeting the board, we will be using Xilinx’s SDK. This includes the capability to compile and link applications for the board using the aarch64-linux-gnu-gcc cross-compiler tool chain, which is installed on the LRC machines and provided by Xilinx together with their development environment:

% module load xilinx/2018
% source /usr/local/packages/xilinx_2018/vivado_hl/SDK/2018.3/settings64.sh

(b) Boards

Each team will get an Ultra96 board pre-installed with Ubuntu 18.04. You can connect to the board initially from a Linux or Windows host via USB-UART as follows:

1. Power on and connect the Ultra96 board to the host machine using the provided USB-UART serial cable. If the board doesn’t boot automatically, press the Power Button (SW4). The blue Power On and Done LEDs (D1/D2) next to the microSD card socket should be on.

2. On a Linux host, search the kernel messages with the command dmesg | grep tty and look for an indication that the USB-UART has been enumerated as a device (typically listed as /dev/ttyUSB1). Connect to the device with the minicom application, using the following command:

% minicom -D /dev/ttyUSB1 -b 115200 -8 -o

The minicom terminal will connect and let you interact with the Ultra96 board’s terminal. For further details about the board and its bringup, you can consult the Open HW Wiki.

3. On Windows, go into the Device Manager to find the COM port for the USB connection and use a terminal application like Putty to connect with a baudrate of 115200. See this getting started link for general details. If the device driver for the USB UART is not automatically installed, or for further troubleshooting, please see the USB-to-JTAG/UART pod documentation by Avnet.

The login/password will be provided with the board. This account has root access via sudo. To set up Wifi on the board, first put the SSID and pre-shared key (PSK) of the network you want to connect to into the /root/wpa_supplicant.conf file. To generate the PSK from a plain-text password, run:

% wpa_passphrase <ssid> <password>

and copy and paste the PSK entry into /root/wpa_supplicant.conf.

If you are on campus, you need to use the “utexas-iot” network and register the board’s MAC address (stamped onto the Wifi chip) with ITS here. This will give you the PSK value to put into wpa_supplicant.conf. Important: don’t forget to de-register the device from your EID at the end of the semester or you will be on the hook for any shenanigans by future users of the board!

Then start Wifi with:

% sudo /root/wifi.sh

This command may take ~30s to execute, but as long as the SSID and PSK are correct, it should connect. To run an ssh server on the board, you can follow this guide. You can then connect your board to the network and potentially use ssh to access the board remotely via Wifi. Install any necessary tools/libraries as you wish.

3 Cloning and Compiling the Darknet Source Code

Again, you can compile the application directly on the board or cross-compile it on the LRC servers:

a) Get the latest Darknet code from the following link: https://github.com/AlexeyAB/darknet

% git clone https://github.com/AlexeyAB/darknet

b) Go to the Darknet directory

% cd darknet

c) Compile the Darknet sources. If you are cross-compiling for the board on the LRC servers, first update the Makefile to use the correct compiler settings:

CC=aarch64-linux-gnu-gcc
CPP=aarch64-linux-gnu-g++

Then, run make in the Darknet directory:

% make

d) We will be using the pre-trained Tiny YOLO CNN for small and embedded devices. Get the pre-trained weights file from the following link:

% wget https://pjreddie.com/media/files/yolov3-tiny.weights

e) If you cross-compiled on the LRC machines, transfer the darknet executable and all configuration settings (weights file and cfg/ and data/ subdirectories) to the board. Test and run Darknet/YOLO with the following command on the board:

% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg -save_labels

The -save_labels flag will produce the golden reference output with detected classes and bounding boxes and save it in the file data/dog.txt. You can also look at the generated predictions.jpg for a visual representation of the detection results.

For more information, and to get familiar with Darknet concepts and the source code, read the material and go through the following links:

https://pjreddie.com/darknet/yolo/

https://pjreddie.com/media/files/papers/yolo.pdf

4 Profiling Darknet

Next, identify the performance bottlenecks in the Darknet code and report on your results:

a) Before you can profile your program, you must first recompile it specifically for profiling. To do so, add the -pg option to the CFLAGS line in the Makefile. Then, recompile the code:

% make clean
% make

b) Profile the code using:

% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg

This command does not overwrite the reference data/dog.txt output file by default (unless the -save_labels option is included). This allows us to use the original reference output as ground truth to compare against when we start modifying and optimizing Darknet as discussed below.

c) Running the program to completion causes a file named gmon.out to be created in the current directory. gprof works by analyzing the data collected during the execution of your program after your program has finished running. gmon.out holds this data in a gprof-readable format.

d) Run gprof as follows:

% gprof darknet gmon.out > darknet.perf

e) Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.

5 Floating- to Fixed-point Conversion

As you probably realize by now, the general matrix-matrix multiply (GEMM) in the convolutional layers accounts for the dominant share of the total execution time. GEMM is known to be a computationally intensive and expensive operation. Now, let’s try to optimize the GEMM to improve its execution speed. Image processing and object detection applications like YOLO generally rely on algorithms that are specified using floating-point operations. However, for power, cost, and performance reasons, they are usually implemented with fixed-point operations, either in software or as special-purpose hardware accelerators. To that end, we will convert the floating-point GEMM in Darknet to a fixed-point GEMM.

First, isolate the GEMM as a standalone program from the darknet code. By default, Darknet’s GEMM uses a float data type. Convert the GEMM data type from floating-point to fixed-point using only integer data types, such as short/long ints (signed or unsigned). This code snippet shows how to perform floating- to fixed-point conversion in C/C++.
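As one possible illustration of such a conversion, the following is a minimal sketch that assumes a signed 16-bit format with 8 fractional bits. The format, the FRAC_BITS constant, and the names fix16_t, float_to_fix, fix_to_float and fix_mul are hypothetical choices made for this example and are not taken from Darknet; the best format for your GEMM depends on the value ranges you observe.

```c
#include <stdint.h>

/* Assumption: 16-bit signed fixed-point with 8 fractional bits.
   FRAC_BITS is an illustrative choice, not a value prescribed by the lab. */
#define FRAC_BITS 8

typedef int16_t fix16_t;

/* Convert float to fixed-point: scale by 2^FRAC_BITS, round to nearest,
   and saturate to the representable 16-bit range. */
fix16_t float_to_fix(float x)
{
    float scaled = x * (float)(1 << FRAC_BITS);
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (scaled > (float)INT16_MAX) return INT16_MAX;
    if (scaled < (float)INT16_MIN) return INT16_MIN;
    return (fix16_t)scaled;                      /* cast truncates toward zero */
}

/* Convert fixed-point back to float by undoing the scaling. */
float fix_to_float(fix16_t x)
{
    return (float)x / (float)(1 << FRAC_BITS);
}

/* Multiply two fixed-point values. The 32-bit product carries 2*FRAC_BITS
   fractional bits, so it is shifted right by FRAC_BITS to return to the
   original format. This sketch does not saturate the result, and it relies
   on an arithmetic right shift for negative values (as GCC provides). */
fix16_t fix_mul(fix16_t a, fix16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return (fix16_t)(prod >> FRAC_BITS);
}
```

With these assumptions, 1.5 is stored as 384 and 2.0 as 512, and fix_mul(384, 512) yields 768, which represents 3.0. More fractional bits increase precision at the cost of a smaller representable range.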

As you are converting the GEMM to fixed-point, a certain amount of accuracy loss is unavoidable. This whole idea of trading off accuracy with execution speed is often called Approximate Computing. In the context of the standalone GEMM, we can define an accuracy metric by the signal-to-noise ratio (SNR). An example for calculating SNR using Matlab is given below, where the output matrices of the floating point GEMM and fixed-point GEMM are assumed to be cout_flp and cout_fxp, respectively:

ddiff = cout_fxp - cout_flp;

disp(['SNR is ', num2str(10*log10(sum(cout_flp(:).^2)/sum(ddiff(:).^2))), ' dB']);

Try to maximize the SNR of your fixed-point GEMM. Aim to achieve an SNR of at least 40 dB. Report the SNR of your converted GEMM. You can use this test bench to report the SNR.
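If you prefer to check the SNR inside your standalone C program, the same power ratio can be computed directly. The sketch below is an assumption-laden illustration, not the provided test bench: the function names snr_power_ratio and meets_snr_40db are hypothetical, and the fixed-point output is assumed to have been converted back to float before the comparison. To avoid a dependency on the math library, it returns the linear ratio; a target of 40 dB corresponds to a ratio of 10^4.

```c
#include <stddef.h>

/* Ratio of signal power to noise power between the reference output
   (cout_flp) and the fixed-point output converted back to float (cout_fxp).
   Returns -1.0 when the two outputs are identical (noise power is zero). */
double snr_power_ratio(const float *cout_flp, const float *cout_fxp, size_t n)
{
    double signal = 0.0, noise = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)cout_fxp[i] - (double)cout_flp[i];
        signal += (double)cout_flp[i] * (double)cout_flp[i];
        noise  += d * d;
    }
    if (noise == 0.0)
        return -1.0;
    return signal / noise;
}

/* 40 dB means 10*log10(ratio) >= 40, i.e. ratio >= 1e4.
   Identical outputs (ratio reported as -1.0) trivially meet the target. */
int meets_snr_40db(const float *cout_flp, const float *cout_fxp, size_t n)
{
    double ratio = snr_power_ratio(cout_flp, cout_fxp, n);
    return ratio < 0.0 || ratio >= 1.0e4;
}
```

To report the value in dB, apply 10*log10(ratio) using log10 from math.h (link with -lm).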

6 Darknet Conversion

Integrate the fixed-point GEMM back into the Darknet code and explore opportunities for further optimizations in the larger Darknet context. Some hints for possible avenues:

• So far, we have performed the floating- to fixed-point at the GEMM boundary. This will require conversion overhead on every GEMM call. To gain more significant system-wide performance, you can explore pushing the conversion boundary further beyond the GEMM.

• Hint: When and where in the code do we first operate on floating-point images or weights? Instead of waiting until the GEMM is called to convert to fixed-point, can we convert the values earlier, e.g. the first time we see them?

• More specifically, many of the weight values used in the GEMM are constant. Can we convert the weights into fixed-point constants at compile time (rather than doing run-time conversion)?

• Some pre-processing operations before the GEMM in the convolutional layers fill the matrix C with zeros. The larger matrix C is, the longer this takes. Can we do something smarter? Do we always have to fill with zeros?

• Explore the design space of fixed-point data types. What is the smallest fixed-point data type that you can use during conversion? In general, the smaller the data type, the better the performance. In particular, you can exploit more SIMD parallelism (data packing) with smaller data types (see below).
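To make two of these hints concrete, here is a minimal sketch of a fixed-point GEMM that takes 16-bit inputs, accumulates each output element in a 32-bit local variable, and stores the result exactly once. Because every element of C is overwritten, C does not have to be zeroed beforehand. This assumes that nothing else needs to be accumulated into C, that all matrices are row-major, and that values stay within range (there is no saturation); the name gemm_fix16 and the FRAC_BITS format are hypothetical and not part of Darknet.

```c
#include <stdint.h>

/* Assumption: inputs and outputs use 8 fractional bits. */
#define FRAC_BITS 8

/* C = A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
   Each output element is accumulated in a 32-bit local variable and
   written once, so C needs no prior zero fill. Products carry 2*FRAC_BITS
   fractional bits; one shift per output element restores the format.
   No saturation is performed, and the shift of negative sums relies on
   an arithmetic right shift (as GCC provides). */
void gemm_fix16(int M, int N, int K,
                const int16_t *A, const int16_t *B, int16_t *C)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            C[i * N + j] = (int16_t)(acc >> FRAC_BITS);
        }
    }
}
```

Shifting once per output element instead of once per product also keeps more precision in the accumulated sum. Note that this loop order walks B column-wise, which is not cache-friendly; it is chosen here only to keep the accumulator local.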

Use profiling to measure and guide your progress towards achieving as much improvement in total Darknet runtime as you can, with as little loss as possible in the detection accuracy of the overall YOLO application that includes your converted fixed-point modules and interfaces. Note that as you are optimizing the entire Darknet software, a certain amount of prediction accuracy loss is expected, as discussed above. That being said, your optimized version should at least predict that there are four objects in the picture: dog, bicycle, car, truck. The prediction accuracy for these four objects might vary, but the accuracy vs. performance tradeoff should be optimized.

To measure the accuracy of object detection applications, a commonly used quality metric is the so-called mean Average Precision (mAP), which is essentially the average of the maximum precisions at different recall values. For further theoretical background, refer to this link. Darknet includes the capability to compute the mAP of your modified program as follows:

a) Unfortunately, the mAP computation in Darknet has a bug and crashes if fewer than four images are provided. To fix the bug, apply the following patch and recompile Darknet. The patch will also modify Darknet to only report mAP for object classes that are actually included in the provided image test set (as opposed to reporting average detection accuracy across all classes that the CNN was originally trained for, even if those are not tested). Make sure you are in the darknet directory and apply the patch:

% wget http://www.ece.utexas.edu/~gerstl/ece382m_f21/labs/lab1/darknet-map.patch
% patch -p0 -b < darknet-map.patch
% make clean
% make

b) Put the (relative) paths of the images you want to be included in the mAP computation into a coco_testdev file in the darknet directory. For example:

% echo "data/dog.jpg" > coco_testdev

c) Make sure that the ground truth reference files (e.g. data/dog.txt) are the ones produced by a run of the original, unmodified floating-point Darknet implementation. Then, run your modified fixed-point implementation on the images listed in the coco_testdev file and compute the mAP:

% ./darknet detector map cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights

Report the following:

• Total execution time of Darknet using your optimized fixed-point versus the original floating-point implementation.

• mAP of Darknet using your optimized fixed-point version as compared against the original floating-point detection results.

7 Additional Software Optimizations (Extra Credit)

As an extra credit item (that will, however, be very useful for your final project design), start from your fixed-point code base developed in Step 6 and find additional opportunities to implement other optimizations that further improve software performance on the ARM. Report on the optimizations applied and results achieved.

Some suggestions for possible optimizations are:

• Exploiting SIMD vector processing. Many high-performance computing applications exploit vector instructions and SIMD processing capabilities. Our ARM A53 CPU includes a NEON SIMD vector unit. Leverage such hardware capabilities to further improve run-time performance. See this link as a starting point:

– https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

• Cache locality-aware GEMM optimization. By default, Darknet uses a naïve triple-nested loop to implement the GEMM. This does not consider data reuse opportunities from the underlying cache and memory hierarchies in the ARM platform. Implement a locality-aware GEMM and measure the performance improvement accordingly. See these links as starting points:

– https://github.com/flame/how-to-optimize-gemm/wiki

– https://sites.google.com/lbl.gov/cs267-spr2019/hw-1

• Parallelization and/or pipelining of the Darknet processing chain on our quad-core ARM platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the GEMM out into hardware in Lab 2 and the final project). This requires a deeper understanding of the Darknet processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among Darknet blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.

• Use of the ARM Mali-400 MP2 GPU on the board.
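As a starting point for the cache locality-aware GEMM suggested above, the following sketch shows only the loop structure of a tiled (blocked) GEMM. It uses float for readability, assumes row-major matrices, and accumulates into C (so C must hold valid initial values). The name gemm_tiled and the tile size TILE are hypothetical; a suitable tile size has to be found experimentally on the board.

```c
/* Assumed tile size; tune it experimentally against the cache sizes. */
#define TILE 32

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
   The three outer loops walk over tiles; the three inner loops perform
   the multiply-accumulate within one tile, so the tiles of A, B and C
   that are currently in use can stay in the cache while they are reused. */
void gemm_tiled(int M, int N, int K,
                const float *A, const float *B, float *C)
{
    for (int i0 = 0; i0 < M; i0 += TILE) {
        for (int k0 = 0; k0 < K; k0 += TILE) {
            for (int j0 = 0; j0 < N; j0 += TILE) {
                int i_end = min_int(i0 + TILE, M);
                int k_end = min_int(k0 + TILE, K);
                int j_end = min_int(j0 + TILE, N);
                for (int i = i0; i < i_end; i++) {
                    for (int k = k0; k < k_end; k++) {
                        float a = A[i * K + k];
                        /* Innermost loop walks rows of B and C contiguously. */
                        for (int j = j0; j < j_end; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

The contiguous innermost loop is also a natural candidate for the SIMD and multi-threading optimizations listed above. Measure each variant with the profiler rather than assuming a speedup.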

Talk to us (instructor or TA) if you are interested, have questions or are looking for ideas/advice around any of these topics.

8 Lab Report Submission

Submit your report and files in Canvas. The report should list the bottlenecks identified during profiling and discuss the ways you used or propose to remove them. List the differences between the original floating-point (flp) and the fixed-point (fxp) versions of the Darknet code with respect to what you observed by profiling them. Finally, report on the results of the floating-point to fixed-point conversion (Steps 5 and 6, including achieved performance improvements and accuracy analysis) and any additional optimizations you performed (Step 7). Also include the fixed-point code (tarball archives created with tar -czvf of the code from Steps 5 and 6/7) as part of your report.