System-on-Chip (SoC) Design
ECE382M.20, Fall 2023
Lab #1
Due: 11:59pm, September 18, 2023
Instructions:
• This lab is a team exercise (teams of 2-3 students).
• Please use the discussion board on Ed for Q&A.
• Submit the report on Canvas and the code on GitHub Classroom.
• Please check the relevant web pages.
The goals of this lab are to:
• Learn the structure of the Darknet source code and compile the code for the ARM platform
• Identify and propose ways to remove the bottleneck of the code when run on the ARM platform
The assignment of this lab includes the following:
• Set up the design and board environment
• Profile the code to identify the time-consuming portions of the code
• Complete an exercise to remove a type of bottleneck
• Isolate modules of Darknet and perform floating- to fixed-point conversion
• Perform additional software optimizations
Lab work for this class can be done either on the ECE Department's LRC Linux servers or on the Ultra96 board. For Lab 1, you can compile the application either on the board or cross-compile it on the LRC servers, but our target is the ARM platform, i.e., all profiling must be done on the board itself. Note that for functional testing, you can also natively compile and execute Darknet on any other platform, e.g., an Intel machine.
(a) ECE Linux Servers
We will be using the LRC servers for the class. Instructions for remote access via ssh are listed here: https://wikis.utexas.edu/display/eceit/ECE+Linux+Application+Servers
You can use the /misc/scratch directory on the LRC machines as your own workspace. The scratch directory will not be wiped until the end of the semester. However, scratch space is not backed up, i.e., use it at your own risk. Execute the following commands:
% cd /misc/scratch
% mkdir <your username>
For software development targeting the board, we will be using Xilinx's SDK that matches the Ubuntu setup (gcc compiler version) on the board. The SDK includes the capability to compile and link applications for the board using the aarch64-linux-gnu-gcc cross-compiler tool chain, which is installed on the LRC machines and provided by Xilinx together with their development environment:
% module load xilinx/2018
% source /usr/local/packages/xilinx_2018/vivado_hl/SDK/2018.3/settings64.sh
(b) Boards
Each team will get an Ultra96 board pre-installed with Ubuntu 18.04. You can connect to the board initially from a Linux or Windows host via USB-UART as follows:
1. Power on and connect the Ultra96 board to the host machine using the provided USB-UART serial cable. If the board doesn't boot automatically, press the Power Button (SW4). The blue Power On and Done LEDs (D1/D2) next to the microSD card socket should be on.
2. On a Linux host, search the kernel messages with the command dmesg|grep tty and look for an indication that the USB-UART is enumerated as a device (typically listed as /dev/ttyUSB1). Connect to the device with the minicom application, using the following command:
% minicom -D /dev/ttyUSB1 -b 115200 -8 -o
The minicom terminal will connect and let you interact with the Ultra96 board's terminal output.
3. On Windows, go into the Device Manager to find the COM port for the USB connection and use a terminal application like PuTTY to connect with a baud rate of 115200.
For further details about the board and its bringup, you can consult the Open HW Wiki. See this getting started link for general details about setting up the serial connection. If the device driver for the USB-UART is not automatically installed, or for further troubleshooting, please see the USB-to-JTAG/UART pod documentation by Avnet.
The login/password will be provided with the board. This account has root access via sudo. To set up Wifi on the board, first put the SSID and pre-shared key (PSK) of the network to connect to into the /root/wpa_supplicant.conf file. To generate the PSK from a plain-text password, run:
% wpa_passphrase <ssid> <password>
and copy and paste the PSK entry into /root/wpa_supplicant.conf. If you are on campus, you need to use the “utexas-iot” network. The boards share a common PSK value for the utexas IoT network, which can be found in:
/root/wpa_supplicant.conf.Utexas-IOT or /root/wpa_supplicant.conf.utexas
If none of these files can be found in your filesystem, please contact the TA.
Then start Wifi with:
% sudo /root/wifi.sh
This command may take ~30s to execute, but as long as the SSID and PSK are correct, it should connect. To run an ssh server on the board, you can follow this guide. You can then connect your board to the network and use ssh to access the board remotely via Wifi. Install any necessary tools/libraries as you wish.
Important: Before unplugging the board from a power source, always make sure to first run:
% sudo halt
It will take a few seconds for the kernel to halt. You can then unplug the board safely.
Again, you can compile the application directly on the board or cross-compile it on the LRC servers:
a) Get the latest Darknet code from the following link: https://github.com/AlexeyAB/darknet
% git clone https://github.com/AlexeyAB/darknet
b) Go to the Darknet directory:
% cd darknet
c) Compile the Darknet sources. If you are cross-compiling for the board on the LRC servers, first update the Makefile to use the correct compiler settings:
CC=aarch64-linux-gnu-gcc
CPP=aarch64-linux-gnu-g++
Then, run make in the Darknet directory:
% make
d) We will be using the pre-trained Tiny YOLO CNN for small and embedded devices. Get the pre-trained weight model from the following link:
% wget https://pjreddie.com/media/files/yolov3-tiny.weights
e) If you cross-compiled on the LRC machines, transfer the darknet executable and all configuration settings (the weights file and the cfg/ and data/ subdirectories) to the board. Test and run Darknet/YOLO with the following command on the board:
% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg -save_labels
The -save_labels flag will produce the golden reference output with detected classes and bounding boxes and save it in the file data/dog.txt. You can also look at the generated predictions.jpg for a visual representation of the detection results.
To get started and familiar with Darknet concepts and the source code, look at this Darknet starter guide. For more information, you can read the material provided at the following links:
https://pjreddie.com/darknet/yolo/
https://pjreddie.com/media/files/papers/yolo.pdf
a) Before you can profile your program, you must first recompile it specifically for profiling. To do so, add the -pg option to the CFLAGS line in the Makefile. Then, recompile the code.
b) Profile the code using:
% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg
This command does not overwrite the reference data/dog.txt output file unless the -save_labels option is included. This will allow us to use the original reference output as ground truth to compare against when we start making modifications and optimizations to Darknet as discussed below.
c) Running the program to completion causes a file named gmon.out to be created in the current directory. gprof works by analyzing the data collected during the execution of your program after your program has finished running; gmon.out holds this data in a gprof-readable format.
d) Run gprof as follows:
% gprof darknet gmon.out > darknet.perf
e) Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.
As you have probably realized by now, the general matrix-matrix multiply (GEMM) in the convolutional layers occupies the dominant share of the total execution time. GEMM is known to be a computationally intensive and expensive operation. Now, let's do some optimization to improve the execution speed of the GEMM. Image processing and object detection applications like YOLO generally require algorithms that are typically specified using floating-point operations. However, for power, cost, and performance reasons, they are usually implemented with fixed-point operations, either in software or as special-purpose hardware accelerators. To that end, we will convert the floating-point GEMM in Darknet to a fixed-point GEMM.
First, isolate the GEMM as a standalone program separate from Darknet. Place the standalone code in a directory named part5 in your repository, i.e., your repository should contain two directories, part5 and darknet. Please maintain the same function prototype as that of the gemm() function in darknet/src/gemm.c for your standalone version, and develop a testbench around it. To generate testbench data, you can capture the input and output data of the gemm() function as it executes in the Darknet code. For grading, we will run your standalone gemm(), which must conform to the prototype of the original gemm() function in Darknet, against our own testbench data.
By default, Darknet's GEMM uses a float data type. Convert the GEMM data type from floating-point to fixed-point using only integer data types, such as short/long int (signed or unsigned). This code snippet shows how to perform floating- to fixed-point conversion in C/C++. As you convert the GEMM to fixed-point, a certain amount of accuracy loss is unavoidable. This idea of trading off accuracy against execution speed is often called Approximate Computing. In the context of the standalone GEMM, we can define an accuracy metric via the signal-to-noise ratio (SNR). An example of calculating the SNR is included in the code snippet mentioned above.
Try to maximize the SNR of your fixed-point GEMM; aim to achieve at least 40 dB. Report the SNR of your converted GEMM for the matrices that correspond to the first execution of the gemm() function when running darknet with the command line mentioned in part 4, using the dog.jpg image.
Integrate the fixed-point GEMM back into the Darknet code and explore opportunities for further optimizations in the larger Darknet context. Some hints for possible avenues:
• So far, we have performed the floating- to fixed-point conversion at the GEMM boundary. This incurs conversion overhead on every GEMM call. To gain more significant system-wide performance, you can explore pushing the conversion boundary further out beyond the GEMM.
• Hint: When and where is the first time in the code that we operate with floating-point images or weights? Instead of waiting to convert to fixed-point until the GEMM is called, can we convert the values earlier, e.g., the first time we see them?
• More specifically, many of the weight values used in the GEMM are constant. Can we convert the weights into fixed-point constants at compile time (rather than doing run-time conversion)?
• Some pre-processing operations before the GEMM in the convolutional layers fill the matrix C with zeros. The larger the size of matrix C, the longer this takes to complete. Can we do something smarter? Do we always have to fill with zeros?
• Explore the fixed-point data type design space. What is the smallest fixed-point data type that you can use during conversion? In general, the smaller the data type, the better the performance. In particular, this will be the case in hardware, which can be tuned to implement arbitrary precisions. But even in software, you can exploit more SIMD parallelism (data packing) with smaller data types (see below).
Use profiling to measure and guide your optimizations toward achieving as much improvement in total Darknet runtime as you can, with as little loss as possible in the detection accuracy of the overall YOLO application that includes your converted fixed-point modules and interfaces. Note that, as discussed above, a certain amount of prediction accuracy loss is expected when optimizing the entire Darknet software. That being said, your optimized version should at least predict that there are four objects in the picture: dog, bicycle, car, truck. The prediction accuracy for these four objects might vary, but the accuracy vs. performance tradeoff should be optimized.
To measure the accuracy of object detection applications, a commonly used quality-loss metric is the so-called mean Average Precision (mAP), which is essentially the average of the maximum precisions at different recall values. For further theoretical background, refer to this link.
Darknet includes the capability to compute the mAP of your modified program as follows:
a) Unfortunately, the mAP computation in Darknet has a bug and crashes if fewer than 4 images are provided. To fix the bug, apply the following patch and recompile Darknet. The patch will also modify Darknet to only report mAP for object classes that are actually included in the provided image test set (as opposed to reporting average detection accuracy across all classes that the CNN was originally trained for, even if those are not tested). Make sure you are in the darknet directory and apply the patch:
b) Put the (relative) paths of the images you want included in the mAP computation into a coco_testdev file in the darknet directory. For example:
c) Make sure that the ground truth reference files (e.g., data/dog.txt) are the ones produced by a run of the original, unmodified floating-point Darknet implementation. Then, run your modified fixed-point implementation on the images listed in the coco_testdev file and compute the mAP:
Report the following:
• Total execution time of Darknet using your optimized fixed-point versus the original floating-point implementation.
• mAP of Darknet using your optimized fixed-point version as compared against the original floating-point detection results.
As an extra credit item (which will, however, be very useful for your final project design), start from the fixed-point code base developed in Step 6 and find additional opportunities to implement other optimizations that further improve software performance on the ARM. Report on the optimizations applied and the results achieved.
Some suggestions for possible optimizations are:
• Exploiting SIMD vector processing. Many high-performance computing applications exploit the vectorized instructions and SIMD processing capabilities of our ARM A53 CPU, which includes a NEON SIMD vector unit. Leverage such hardware capabilities to further improve run-time performance. You can use this link as a starting point:
– https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
• Cache locality-aware GEMM optimization. By default, Darknet uses a naïve triple-nested loop to implement the GEMM. This does not consider data reuse opportunities from the underlying cache and memory hierarchies of the ARM platform. Implement a locality-aware GEMM and measure the performance improvement accordingly. See the homework and these links as starting points:
– https://github.com/flame/how-to-optimize-gemm/wiki
– https://sites.google.com/lbl.gov/cs267-spr2023/hw-1
• Parallelization and/or pipelining of the Darknet processing chain on our quad-core ARM platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the GEMM into hardware). This requires a deeper understanding of the Darknet processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among Darknet blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.
• Use of the ARM Mali-400 MP2 GPU on the board.
Talk to us (instructor or TA) if you are interested, have questions, or are looking for ideas/advice around any of these topics.
Submit your modified Darknet code and standalone GEMM code from Part 5 as separate directories in your GitHub Classroom repository, and your report on Canvas. Your darknet code must successfully compile using make and run the same way as in the instructions above. Make sure you have removed any printf() statements that you introduced before submitting the final version. Please also include a Readme with any necessary information.
Similarly, the standalone GEMM code must have a Makefile and a Readme. As mentioned, do not modify the gemm() function prototype.
The report should list the bottlenecks identified during profiling and discuss/propose the ways used to remove them. List the differences between the original floating-point (flp) and fixed-point (fxp) versions of the Darknet code with respect to what you observed when profiling them. Finally, report on the results of the floating-point to fixed-point conversion (Steps 5 and 6, including achieved performance improvements and accuracy analysis) and any additional optimizations you performed (Step 7).