System-on-Chip (SoC) Design
ECE382M.20, Fall 2021
Lab #1
Due: 11:59pm, September 20, 2021
Instructions:
•
This lab is a team exercise; groups will be
assigned in class.
•
Please use the discussion board on Piazza
for Q&A.
•
All reports and code MUST be submitted to the
assignment on Canvas.
•
Please check relevant web pages.
The goals of this lab are to:
•
Learn the structure of the Darknet source code and
compile the code for the ARM platform
•
Identify and propose ways to remove the bottleneck
of the code when run on the ARM platform
The assignment of this lab includes the following:
•
Set up the design and board environment
•
Profile the code to identify the time
consuming portions of the code
•
Complete an exercise to remove a type of bottleneck
•
Isolate modules of the Darknet and perform
floating-to-fixed point conversion
•
Perform additional software optimizations
Lab work for this class can be done either on the ECE Department’s
LRC Linux servers or the Ultra96 Board. For Lab 1, you can compile the
application either on the board or cross-compile it on the LRC servers, but our
target is the ARM platform, i.e. all profiling will need to be done on the
board itself. Note that for functional testing, you can also natively compile
and execute Darknet on any other platform, e.g. Intel.
(a) ECE Linux Servers
We will be using
the LRC servers for the class. Instructions for remote access via ssh are listed here: https://wikis.utexas.edu/display/eceit/ECE+Linux+Application+Servers
You can use the /misc/scratch directory on the
LRC machines as your own workspace. The scratch directory will not be wiped out
until the end of the semester. However, scratch space is also not backed up,
i.e. use at your own risk. Execute the following commands:
% cd /misc/scratch
% mkdir <your username>
For software
development targeting the board, we will be using Xilinx’s SDK. This
includes the capability to compile and link applications for the board using the
aarch64-linux-gnu-gcc cross-compiler tool chain, which is installed on the LRC
machines and provided by Xilinx together with their development environment:
% module load xilinx/2018
% source /usr/local/packages/xilinx_2018/vivado_hl/SDK/2018.3/settings64.sh
(b) Boards
Each team will get an Ultra96 board
pre-installed with Ubuntu 18.04. You can connect to the board initially from a
Linux or Windows host via USB-UART as follows:
1.
Power on and connect the Ultra96 board to the host
machine using the provided USB-UART serial cable. If the board doesn’t
boot automatically, press the Power Button (SW4). The blue Power On and Done LEDs (D1/D2) next to the microSD card socket
should be on.
2.
On a Linux host, search the kernel messages with
the command dmesg | grep tty and look for an
indication that the USB-UART has been enumerated as a device (typically listed as /dev/ttyUSB1). Connect to the
device with the minicom application,
using the following command:
% minicom -D /dev/ttyUSB1 -b 115200 -8 -o
The minicom
terminal will connect and let you interact with the terminal of the Ultra96 board. For further details about the board and its bring-up,
you can consult the Open HW Wiki.
3.
On Windows, go into the Device Manager to find the
COM port for the USB connection and use a terminal application like PuTTY to
connect with a baud rate of 115200. See this getting started link for general details. If the
device driver for the USB UART is not automatically installed, or for further
troubleshooting, please see the USB-to-JTAG/UART
pod documentation by Avnet.
The login/password will be provided with the board.
This account has root access via sudo. To set up Wifi on the board, first put the SSID and pre-shared key
(PSK) of the network you want to connect to into the /root/wpa_supplicant.conf file. To generate
the PSK from a plain-text password, run:
% wpa_passphrase <ssid> <password>
and copy and paste the PSK entry into /root/wpa_supplicant.conf.
If you are on campus, you need to use the “utexas-iot” network and register the board’s
MAC address (stamped onto the Wifi chip) with ITS here.
This will give you the PSK value to put into wpa_supplicant.conf. Important: don’t forget to
de-register the device from your EID at the end of the semester or you will be
on the hook for any shenanigans by future users of the board!
Then start Wifi with:
% sudo /root/wifi.sh
This command may take ~30s to execute, but as long
as the SSID and PSK are correct, it should connect. To run an
ssh server on the board, you can follow this guide.
You can then connect your board to the network and potentially use ssh to access the
board remotely via Wifi. Install any necessary
tools/libraries as you wish.
Again, you can compile the
application directly on the board or cross-compile it on the LRC servers:
a) Get the latest Darknet code from the following link: https://github.com/AlexeyAB/darknet
% git clone https://github.com/AlexeyAB/darknet
b) Go to the Darknet directory
% cd darknet
c) Compile the Darknet sources. If you are cross-compiling for the board on the LRC servers, first update the Makefile to use the correct compiler settings:
CC=aarch64-linux-gnu-gcc
CPP=aarch64-linux-gnu-g++
Then, run make in the Darknet directory:
% make
d) We will be using the pre-trained Tiny YOLO CNN for small and embedded devices. Get the pre-trained weights from the following link:
% wget https://pjreddie.com/media/files/yolov3-tiny.weights
e) If you cross-compiled on the LRC machines, transfer the darknet executable and all configuration settings (weights file and cfg/ and data/ subdirectories) to the board. Test and run Darknet/YOLO with the following command on the board:
% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg -save_labels
The -save_labels flag will produce the golden reference output with detected classes and bounding boxes and save it in the file data/dog.txt. You can also look at the generated predictions.jpg for a visual representation of the detection results.
For more information, and to get familiar with Darknet concepts and the source code, read the material and go through the following links:
https://pjreddie.com/darknet/yolo/
https://pjreddie.com/media/files/papers/yolo.pdf
a)
Before you can profile your program, you must first
recompile it specifically for profiling. To do so, add the -pg option to the CFLAGS line in the Makefile. Then, recompile the code from scratch, e.g. by running make clean followed by make.
b)
Profile the code using:
% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg
This command does not overwrite the reference data/dog.txt output file by
default (unless the -save_labels option is
included). This will allow us to use the original reference output as ground
truth to compare against when we start making modifications and optimizations
of Darknet as discussed below.
c)
Running the program to completion causes a file
named gmon.out to be created in the current directory. gprof works by
analyzing the data collected during the execution of your program after your
program has finished running. gmon.out holds this data in a gprof-readable
format.
d)
Run gprof as follows:
% gprof darknet gmon.out > darknet.perf
e)
Identify the bottleneck of the code based on the
execution time of each function. Report your profiling results.
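As a cross-check on the gprof numbers, you can also time a suspected hot spot directly. The following is a minimal sketch in C and not part of Darknet: time_call and busy_work are hypothetical names, and busy_work only stands in for whatever code you want to measure (e.g. a wrapper around a GEMM call).

```c
#include <time.h>

/* Hypothetical helper (not a Darknet function): returns the CPU time,
 * in seconds, consumed by a single call of fn(arg). */
double time_call(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload: sums i * 0.5 for i = 0..999999 and stores the
 * result through arg. Replace it with a wrapper around the code under test. */
void busy_work(void *arg)
{
    double sum = 0.0;
    for (long i = 0; i < 1000000L; i++)
        sum += (double)i * 0.5;
    *(double *)arg = sum;
}
```

A call such as time_call(busy_work, &result) returns the elapsed CPU time in seconds. Since clock() has limited resolution, short functions should be called in a loop and the total divided by the iteration count.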
As you probably realize by now, the general
matrix-matrix multiply (GEMM) part in the convolutional layers occupies the
dominant share of the total execution time. GEMM is known to be a
computationally intensive and expensive operation. Now, let’s try to do
some optimization to improve the execution speed of the GEMM. Image processing
or object detection applications like YOLO in general require algorithms that
are typically specified using floating-point operations. However, for power,
cost, and performance reasons, they are usually implemented with fixed-point
operations either in software or as special-purpose hardware accelerators. To that end, we
will convert the floating-point GEMM in Darknet to a fixed-point GEMM.
First, isolate the GEMM as a standalone program from the darknet code.
By default, Darknet’s GEMM uses a float data type. Convert the GEMM data type from
floating-point to fixed-point using only integer data types, such as short/long
ints (signed or unsigned). This code
snippet shows how to perform floating- to fixed-point conversion in C/C++.
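As an illustration of the conversion step, here is a minimal sketch that assumes a signed 16-bit format with 12 fractional bits. The format, the type name fix16_t and the function names are assumptions made for this example only; choosing the actual bit widths is part of your design exploration.

```c
#include <stdint.h>

#define FRAC_BITS 12            /* assumed number of fractional bits */

typedef int16_t fix16_t;        /* hypothetical fixed-point type */

/* Convert float to fixed-point with round-to-nearest and saturation. */
fix16_t float_to_fix(float x)
{
    float scaled = x * (float)(1 << FRAC_BITS);
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */
    if (scaled >= 32767.0f)  return INT16_MAX;   /* saturate instead of wrapping */
    if (scaled <= -32768.0f) return INT16_MIN;
    return (fix16_t)scaled;                      /* cast truncates toward zero */
}

/* Convert fixed-point back to float (useful for checking accuracy). */
float fix_to_float(fix16_t q)
{
    return (float)q / (float)(1 << FRAC_BITS);
}

/* Multiply two fixed-point values. The 32-bit product has 2*FRAC_BITS
 * fractional bits, so it is rounded and shifted back by FRAC_BITS.
 * Note: >> on a negative value is implementation-defined in C; GCC and
 * Clang perform an arithmetic shift, which is what this sketch assumes. */
fix16_t fix_mul(fix16_t a, fix16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p += 1 << (FRAC_BITS - 1);
    p >>= FRAC_BITS;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (fix16_t)p;
}
```

With 12 fractional bits, the representable range is roughly -8 to +8 with a resolution of 1/4096. Values outside that range saturate, so the format has to be matched to the actual range of the GEMM inputs and outputs.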
As you are converting the GEMM to fixed-point, a certain amount of
accuracy loss is unavoidable. This whole idea of trading off accuracy for
execution speed is often called Approximate Computing. In
the context of the standalone GEMM, we can define an accuracy metric by the
signal-to-noise ratio (SNR). An example for calculating SNR using Matlab is
given below, where the output matrices of the floating point GEMM and
fixed-point GEMM are assumed to be cout_flp and cout_fxp, respectively:
ddiff = cout_fxp - cout_flp;
disp(['SNR is ', num2str(10*log10(sum(cout_flp(:).^2)/sum(ddiff(:).^2))), ' dB']);
Try to maximize the SNR of your fixed-point GEMM. Aim to achieve an SNR of at
least 40 dB. Report the SNR of your converted GEMM. You can use this test
bench to report the SNR.
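If you prefer to check the SNR directly in C, the same metric can be sketched as follows. The function names are hypothetical, and the fixed-point output is assumed to have been converted back to float before the comparison. To avoid a dependency on the math library, the sketch returns the linear power ratio; the value in dB is 10*log10(ratio), so 40 dB corresponds to a ratio of 10^4.

```c
#include <float.h>
#include <stddef.h>

/* Ratio of signal power to error power between a floating-point reference
 * output (ref) and a converted fixed-point output (test), both of length n.
 * Returns DBL_MAX if the two outputs are identical (no noise). */
double snr_power_ratio(const float *ref, const float *test, size_t n)
{
    double signal = 0.0, noise = 0.0;
    for (size_t i = 0; i < n; i++) {
        double r = (double)ref[i];
        double d = (double)test[i] - r;
        signal += r * r;
        noise  += d * d;
    }
    if (noise == 0.0)
        return DBL_MAX;
    return signal / noise;
}

/* Returns 1 if the SNR is at least 40 dB, i.e. the power ratio is >= 1e4. */
int snr_at_least_40db(const float *ref, const float *test, size_t n)
{
    return snr_power_ratio(ref, test, n) >= 1.0e4;
}
```

To print the value in dB, pass the returned ratio to log10() from <math.h> (link with -lm) and multiply by 10.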
Integrate the fixed-point GEMM back into the Darknet code and explore
opportunities for further optimizations in the larger Darknet context. Some
hints for possible avenues:
•
So far, we have performed the floating- to
fixed-point conversion at the GEMM boundary. This incurs conversion overhead on
every GEMM call. To gain more significant system-wide performance, you can
explore pushing the conversion boundary further beyond the GEMM.
•
Hint: When and where is the first time in the code
that we operate with floating-point images or weights? Instead of waiting to
convert to fixed-point until the GEMM is called, can we convert the values
earlier, e.g. the first time we see them?
•
More specifically, many of the weight values used
in the GEMM are constant. Can we convert the weights into fixed-point constants
at compile time (rather than doing run-time conversion)?
•
Some pre-processing operations before the GEMM in
the convolutional layers fill the matrix C with zeros. The
larger the matrix C, the longer this takes.
Can we do something smarter? Do we have to always fill with zeros?
•
Explore the fixed-point data types design space.
What is the smallest fixed-point data type that you can use during conversion?
In general, the smaller the data type, the better the performance. In
particular, you can exploit more SIMD parallelism (data packing) with smaller
data types (see below).
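To illustrate the zero-fill hint, the following sketch shows one possible way to avoid the separate fill pass: each output element is accumulated in a local 32-bit register and written exactly once, so the previous contents of C never matter. This is an assumption-laden example and not Darknet's GEMM. The function name, the format with 8 fractional bits and the dense row-major layout without leading dimensions are all hypothetical, and saturation of the result is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAC_BITS 8   /* assumed number of fractional bits */

/* C (M x N) = A (M x K) * B (K x N), all row-major 16-bit fixed-point.
 * C is overwritten, not accumulated into, so no zero-fill is required.
 * Note: the final >> assumes an arithmetic shift for negative values. */
void gemm_q8_overwrite(size_t M, size_t N, size_t K,
                       const int16_t *A, const int16_t *B, int16_t *C)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            int32_t acc = 0;                      /* wide accumulator */
            for (size_t k = 0; k < K; k++)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            /* round and rescale once per output element */
            C[i * N + j] = (int16_t)((acc + (1 << (FRAC_BITS - 1))) >> FRAC_BITS);
        }
    }
}
```

Rescaling once per output element instead of once per product also reduces rounding error. The price of this loop order is a strided access pattern on B, which is a locality problem in its own right.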
Use profiling to measure and guide you towards achieving as much
improvement in total Darknet runtime as you can, with as little loss as possible in the
detection accuracy of the overall YOLO application that includes your converted
fixed-point modules and interfaces. As discussed above, a certain amount of prediction accuracy
loss is expected when optimizing the entire Darknet software. That being said, your optimized version should at
least predict that there are four objects in the picture: dog, bicycle, car, and
truck. The prediction
accuracy for these four objects might vary, but the accuracy vs. performance
tradeoff should be optimized.
To measure the accuracy of object detection applications, a commonly used
quality metric is the so-called mean
Average Precision (mAP), which is essentially the
average of the maximum precisions at different recall values. For further
theoretical background, refer to this link.
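To make the definition concrete, the sketch below computes the average precision (AP) of a single class using 11-point interpolation, one common variant known from the PASCAL VOC 2007 benchmark. It is only an illustration: the function name is hypothetical, the precision/recall pairs are assumed to be given, and Darknet's own mAP code may use a different interpolation scheme. The mAP is then the mean of the per-class AP values.

```c
#include <stddef.h>

/* 11-point interpolated average precision for one class.
 * recall[i] and precision[i] describe n points on the precision/recall
 * curve. For each recall threshold r in {0.0, 0.1, ..., 1.0}, take the
 * maximum precision among all points with recall >= r, then average. */
double average_precision_11pt(const double *recall, const double *precision,
                              size_t n)
{
    double sum = 0.0;
    for (int t = 0; t <= 10; t++) {
        double r = t / 10.0;
        double pmax = 0.0;            /* 0 if no point reaches this recall */
        for (size_t i = 0; i < n; i++) {
            if (recall[i] >= r && precision[i] > pmax)
                pmax = precision[i];
        }
        sum += pmax;
    }
    return sum / 11.0;
}
```

For example, a detector with the two curve points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5) scores 1.0 at the six thresholds up to 0.5 and 0.5 at the remaining five, giving an AP of 8.5/11, or about 0.77.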
Darknet includes the capability to compute the mAP of your modified program as
follows:
a)
Unfortunately, the mAP computation in Darknet has a
bug and crashes if fewer than 4 images are provided. To fix the bug, apply the
following patch
and recompile Darknet. The patch will also modify Darknet to only report mAP for object
classes that are actually included in the provided image test set (as opposed
to reporting average detection accuracy across all classes that the CNN was
originally trained for, even if those are not tested). Make sure you are in the
darknet directory and
apply the patch:
b)
Put the (relative) paths of the images you want to
be included in the mAP
computation into a coco_testdev file in the darknet directory. For
example:
c)
Make sure that the ground truth reference files
(e.g. data/dog.txt) are the ones
produced by a run of the original, unmodified floating-point Darknet
implementation. Then, run your modified fixed-point implementation on the
images listed in the coco_testdev file and compute
the mAP:
Report the following:
•
Total execution time of Darknet using your
optimized fixed-point versus the original floating-point implementation.
•
mAP of Darknet using your optimized fixed-point
version as compared against the original floating-point detection results.
As an extra credit item (that will,
however, be very useful for your final project design), start from your
fixed-point code base developed in Step 6 and find additional opportunities to
implement other optimizations that further improve software performance on the
ARM. Report on the optimizations applied and results achieved.
Some suggestions for possible
optimizations are:
• Exploiting SIMD vector processing. Many high-performance
computing applications exploit vectorized instructions and SIMD processing capabilities.
Our ARM Cortex-A53 CPU includes a NEON SIMD vector unit. Leverage such
hardware capabilities to further improve run-time performance. See
this link as a starting point:
– https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
•
Cache locality-aware GEMM optimization. By default,
Darknet uses a naïve triple-nested loop to implement the GEMM. This does
not consider data reuse opportunities from the underlying cache and memory
hierarchies in the ARM platform. Implement a locality-aware GEMM and measure
the performance improvement accordingly. See these links as starting points:
–
https://github.com/flame/how-to-optimize-gemm/wiki
–
https://sites.google.com/lbl.gov/cs267-spr2019/hw-1
•
Parallelization and/or pipelining of the Darknet
processing chain on our quad-core ARM platform (this may also expose
opportunities for exploiting hardware/software parallelism when mapping the
GEMM out into hardware in Lab 2 and the final project). This requires a deeper
understanding of the Darknet processing chain, specifically to analyze
dependencies (and hence parallelization opportunities) among Darknet blocks.
Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the
Linux hosts) are available here.
•
Use of the ARM Mali-400 MP2 GPU on the board.
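As a starting point for the SIMD suggestion in the list above, the following sketch computes a dot product of 16-bit fixed-point vectors, which is the inner loop of a fixed-point GEMM. It uses NEON intrinsics when compiled for AArch64 and falls back to plain C elsewhere, so it can also be tested on the LRC servers. The function name is hypothetical, and the 32-bit accumulators are an assumption that only holds if the sum cannot overflow for your vector lengths and value ranges.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product of two int16 vectors of length n with 32-bit accumulation. */
int32_t dot_s16(const int16_t *a, const int16_t *b, size_t n)
{
    size_t i = 0;
    int32_t sum = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
    int32x4_t acc = vdupq_n_s32(0);          /* four 32-bit partial sums */
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(a + i);     /* load 8 values from a */
        int16x8_t vb = vld1q_s16(b + i);     /* load 8 values from b */
        /* widening multiply-accumulate: int16 x int16 -> int32 */
        acc = vmlal_s16(acc, vget_low_s16(va),  vget_low_s16(vb));
        acc = vmlal_s16(acc, vget_high_s16(va), vget_high_s16(vb));
    }
    sum = vaddvq_s32(acc);                   /* horizontal add of the 4 lanes */
#endif

    /* scalar loop: handles the tail, or everything on non-NEON builds */
    for (; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

With 16-bit data, each NEON register holds eight values, compared to four for 32-bit floats. This is the data-packing benefit of smaller fixed-point types mentioned earlier.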
Talk to us (instructor or TA) if you are interested, have questions or
are looking for ideas/advice around any of these topics.
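For the cache locality suggestion above, a blocked (tiled) GEMM is one standard technique. The sketch below is a generic example, not Darknet code. The function name, the dense row-major layout and the block size of 32 are assumptions, and the block size in particular should be tuned against the cache sizes of the board.

```c
#include <stddef.h>

#define BLOCK 32   /* assumed tile size; tune for the target caches */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C (M x N) += A (M x K) * B (K x N), all row-major.
 * The caller must initialize C (e.g. to zero) beforehand.
 * The three outer loops walk over tiles; the three inner loops multiply
 * one tile of A by one tile of B while both are likely still in cache. */
void gemm_blocked(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < M; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, M);
        for (size_t kk = 0; kk < K; kk += BLOCK) {
            size_t k_end = min_size(kk + BLOCK, K);
            for (size_t jj = 0; jj < N; jj += BLOCK) {
                size_t j_end = min_size(jj + BLOCK, N);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * K + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

Because floating-point addition is not associative, a blocked GEMM can produce results that differ from the unblocked version in the last bits. A fixed-point version with integer accumulators does not have this issue as long as nothing overflows.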
Submit your report and files in Canvas.
The report should list the bottlenecks identified during profiling and
discuss/propose ways used to remove them. List the differences between the
original floating-point (flp) and fixed-point (fxp)
versions of the Darknet code with respect to what you observed by profiling
them. Finally, report on the results of floating-point to fixed-point
conversion (Steps 5 and 6, including achieved performance improvements and
accuracy analysis) and any additional optimizations you performed (Step 7).
Also include the fixed-point code (tarball archives created
with tar -czvf of the code from Steps 5 and 6/7) as part of your
report.