System-on-Chip (SoC) Design
EE382M.20, Fall 2018
Lab #1
Due: 11:59pm, September 23, 2018
Instructions:
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board on Piazza for Q&A.
• All reports and code MUST be submitted to the assignment on Canvas.
• Please check relevant web pages.
The goals of this lab are to:
• Learn the structure of the Darknet source code and compile the code for the ARM platform
• Identify and propose ways to remove the bottleneck of the code when run on the ARM platform
The assignment of this lab includes the following:
• Set up the design and board environment
• Profile the code to identify the time-consuming portions of the code
• Complete an exercise to remove a type of bottleneck
• Isolate modules of Darknet and perform floating- to fixed-point conversion
• Perform additional software optimizations
Lab work for this class will be done on the ECE Department’s LRC
Linux servers and/or the ZedBoard. For Lab 1, you can
compile the application either on the board or cross-compile it on the LRC
servers, but our target is the ARM platform, i.e. all profiling will need to be
done on the board itself. Note that for functional testing, you can also
natively compile and execute Darknet on any other platform,
e.g. an Intel machine. However, the gcc compiler on the
LRC machines is too old, i.e. Darknet will not compile natively there.
(a)
Linux Servers
We will be using
the LRC servers for the class. Available machines and instructions for remote
access via ssh are listed
here: http://www.ece.utexas.edu/it/remote-linux
You can use the /misc/scratch directory on the
LRC machines as your own workspace. The scratch directory will not be wiped out
until the end of the semester. However, scratch space is also not backed up,
i.e. use at your own risk. Execute the following commands:
% cd /misc/scratch
% mkdir <your username>
For software
development targeting the board, we will be using Xilinx’s PetaLinux tools and SDK. This includes the capability to compile and link applications for the
board using the arm-linux-gnueabihf-gcc
cross-compiler tool chain, which is installed on the LRC machines and provided
by Xilinx together with their development environment:
% module load xilinx/2017.4
% source /usr/local/packages/xilinx_2017.4/petalinux/2017.4/settings.[c]sh /usr/local/packages/xilinx_2017.4/petalinux/2017.4
(b)
Boards
Each team will get a ZedBoard
pre-installed with Ubuntu 18.04. You can connect to the board initially from a
Linux or Windows host via USB-UART as follows:
1. Power on and connect the ZedBoard to the host machine using the provided USB-UART serial cable.
2. On a Linux host, search the kernel messages with the command dmesg | grep tty and look for an indication that the USB-UART is enumerated as a device (typically listed as /dev/ttyACM0). Connect to the device with the minicom application, using the following command:
% minicom -D /dev/ttyACM0 -b 115200 -8 -o
The minicom terminal will connect and let you interact with the ZedBoard terminal.
3. For Windows users and more details, visit this getting started link
You can then connect your board to the network and
potentially use ssh to access the board remotely. Install any
necessary tools/libraries as you wish.
Again, you can compile the
application directly on the board or cross-compile it on the LRC servers:
a) Get the latest Darknet code from the following link: https://github.com/AlexeyAB/darknet
% git clone https://github.com/AlexeyAB/darknet
b) Go to the Darknet directory
% cd darknet
c) Compile the Darknet sources. If you are cross-compiling for the board on the LRC servers, first update the Makefile to use the correct compiler settings:
CC=arm-linux-gnueabihf-gcc
CPP=arm-linux-gnueabihf-g++
Then, run make in the Darknet directory:
% make
d) We will be using the pre-trained Tiny YOLO CNN for small and embedded devices. Get the pre-trained weight model from the following link:
% wget https://pjreddie.com/media/files/yolov3-tiny.weights
e) If you cross-compiled on the LRC machines, transfer the darknet executable and all configuration settings (weights file and cfg/ and data/ subdirectories) to the board. Test and run Darknet/YOLO with the following command on the board:
% ./darknet detect cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg
This will produce the golden reference output with detected classes and bounding boxes and save it in the file data/dog.txt. You can also look at the generated predictions.jpg for a visual representation of detection results.
For more information, and to get familiar with Darknet concepts and the source code, read the material and go through the following links:
https://pjreddie.com/darknet/yolo/
https://pjreddie.com/media/files/papers/yolo.pdf
a) Before you can profile your program, you must first recompile it specifically for profiling. To do so, add the -pg option to the CFLAGS line in the Makefile. Then, recompile the code.
b) Profile the code using:
% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg
This command uses the expanded, general form of the above detect shortcut in darknet. It has the
advantage that it does not overwrite the reference data/dog.txt output file by
default (unless the -save_labels option is
included). This will allow us to use the original reference output as ground
truth to compare against when we start making modifications and optimizations
of Darknet as discussed below.
c) Running the program to completion causes a file named gmon.out to be created in the current directory. gprof works by analyzing the data collected during the execution of your program after your program has finished running. gmon.out holds this data in a gprof-readable format.
d) Run gprof as follows:
% gprof darknet gmon.out > darknet.perf
e) Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.
As you probably realize by now, the general
matrix-matrix multiply (GEMM) part in the convolutional layers occupies the
dominant share of the total execution time. GEMM is known to be a
computationally intensive and expensive operation. Now, let’s try to do
some optimization to improve the execution speed of the GEMM. Image processing
or object detection applications like YOLO in general require algorithms that
are typically specified using floating-point operations. However, for power,
cost, and performance reasons, they are usually implemented with fixed-point
operations either in software or as special-purpose hardware accelerators. To that end, we
will convert the floating-point GEMM in Darknet to a
fixed-point GEMM.
First, isolate the GEMM as a standalone program from the Darknet code. By default, Darknet’s
GEMM uses a float data type. Convert the GEMM data type from floating-point to fixed-point using only
integer data types, such as short/long ints (signed or unsigned). This code
snippet shows how to perform floating- to fixed-point conversion in C/C++.
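As a rough illustration of the conversion described above, the following C sketch shows one possible fixed-point representation. The Q7.8 format (FRAC_BITS set to 8) and the helper names float_to_fxp, fxp_to_float and fxp_mul are assumptions made for this example only and are not part of Darknet; choosing the actual word length and number of fractional bits is part of the exercise.

```c
#include <stdint.h>

/* Assumed format for illustration: 16-bit signed, 8 fractional bits (Q7.8). */
#define FRAC_BITS 8
#define FXP_ONE   (1 << FRAC_BITS)

/* Clamp a 32-bit intermediate value to the int16_t range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Float to fixed-point: scale by 2^FRAC_BITS, round to nearest, saturate. */
int16_t float_to_fxp(float x)
{
    float scaled = x * (float)FXP_ONE;
    /* Clamp before the cast so out-of-range inputs cannot overflow. */
    if (scaled >= (float)INT16_MAX) return INT16_MAX;
    if (scaled <= (float)INT16_MIN) return INT16_MIN;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Fixed-point back to float, e.g. for comparing against the reference. */
float fxp_to_float(int16_t q)
{
    return (float)q / (float)FXP_ONE;
}

/* Fixed-point multiply: widen to 32 bits, round, then drop FRAC_BITS.
 * Assumes an arithmetic right shift for negative values (true for gcc). */
int16_t fxp_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* 2*FRAC_BITS fractional bits */
    prod += 1 << (FRAC_BITS - 1);             /* round to nearest */
    return saturate16(prod >> FRAC_BITS);
}
```

With these assumptions, float_to_fxp(1.5f) gives 384 (1.5 times 2^8), and values outside roughly ±128 saturate. Fewer fractional bits widen the range but reduce precision, which directly affects the accuracy of the converted GEMM.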
As you are converting the GEMM to fixed-point, a certain amount of
accuracy loss is unavoidable. This whole idea of trading off accuracy with
execution speed is often called Approximate
Computing. In
the context of the standalone GEMM, we can define an accuracy
metric by the signal-to-noise ratio (SNR). An example for calculating SNR using Matlab is
given below, where the output matrices of the floating point GEMM and
fixed-point GEMM are assumed to be cout_fp and cout_fxp, respectively:
ddiff = cout_fxp - cout_fp;
disp(['SNR is ', num2str(10*log10(sum(cout_fp(:).^2)/sum(ddiff(:).^2))), ' dB']);
Try to maximize the SNR of your fixed-point GEMM. Aim
to achieve at least 40 dB of SNR. Report the SNR of your converted
GEMM. You can use this test
bench to report the SNR.
Integrate the fixed-point GEMM back into the Darknet
code and explore opportunities for further optimizations in the larger Darknet context. Some hints for possible avenues:
• So far, we have performed the floating- to fixed-point conversion at the GEMM boundary. This incurs conversion overhead on every GEMM call. To gain more significant system-wide performance, you can explore pushing the conversion boundary further beyond the GEMM.
• Hint: When and where is the first time in the code that we operate with floating-point images or weights? Instead of waiting to convert to fixed-point until the GEMM is called, can we convert the values earlier, e.g. the first time we see them?
• More specifically, many of the weight values used in the GEMM are constant. Can we convert the weights into fixed-point constants at compile time (rather than doing run-time conversion)?
• Some pre-processing operations before the GEMM in the convolutional layers fill the matrix C with zeros. The larger the matrix C, the longer this takes. Can we do something smarter? Do we always have to fill with zeros?
• Explore the fixed-point data type design space. What is the smallest fixed-point data type that you can use during conversion? In general, the smaller the data type, the better in terms of performance. In particular, you can exploit more SIMD parallelism (data packing) with smaller data types (see below).
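To make two of the hints above concrete, here is a minimal sketch of a fixed-point GEMM that takes operands already converted to 16-bit fixed-point and writes each element of C exactly once, so C does not need to be zero-filled beforehand. The function name gemm_nn_fxp, the Q7.8 format and the row-major layout with leading dimensions are assumptions for this example; it is not Darknet's implementation, and the right format and accumulator width depend on your value ranges.

```c
#include <stdint.h>

#define FRAC_BITS 8   /* assumed Q7.8 inputs, for illustration only */

/* C = A * B with row-major A (M x K), B (K x N) and C (M x N).
 * Inputs carry FRAC_BITS fractional bits; the output is rescaled to the
 * same format. A 64-bit accumulator is used for safety; a 32-bit one may
 * be faster on the Cortex-A9 if the value ranges are known to permit it. */
void gemm_nn_fxp(int M, int N, int K,
                 const int16_t *A, int lda,
                 const int16_t *B, int ldb,
                 int32_t *C, int ldc)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            int64_t acc = 0;   /* products have 2*FRAC_BITS fractional bits */
            for (int k = 0; k < K; ++k)
                acc += (int32_t)A[i * lda + k] * (int32_t)B[k * ldb + j];
            /* Overwrite instead of accumulate: no pre-zeroing of C needed.
             * Assumes arithmetic right shift for negative values (gcc). */
            C[i * ldc + j] = (int32_t)(acc >> FRAC_BITS);
        }
    }
}
```

Note the trade-off in this loop order: keeping the accumulator in a local variable avoids the zero fill, but B is accessed with a stride of ldb in the inner loop, which is less cache-friendly than walking along rows. Profiling on the board should decide which variant wins.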
Use profiling to measure and guide you towards achieving as much
improvement in total Darknet runtime as you can, with
as little loss as possible in the detection accuracy of the overall YOLO application
that includes your converted fixed-point modules and interfaces. Note that as
you are optimizing the entire Darknet software, a certain amount of
prediction accuracy loss is expected, as discussed above. That
being said, your optimized version should at least be able to predict the
four objects in the picture: dog, bicycle, car, truck. The prediction accuracy of these
four objects might vary, but the accuracy vs. performance tradeoff should be
optimized.
To measure the accuracy of object detection applications, a commonly used
quality metric is the so-called mean Average Precision (mAP), which is essentially the
average of the maximum precisions at different recall values. For further
theoretical background, refer to this link.
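To illustrate "the average of the maximum precisions at different recall values", the sketch below computes the average precision for one class using 11 equally spaced recall thresholds (0.0, 0.1, ..., 1.0). The 11-point scheme and the function name average_precision_11pt are assumptions chosen for this example; Darknet's own mAP code may use a different interpolation, so use this only to build intuition. The mAP is then the mean of the per-class values.

```c
#include <stddef.h>

/* 11-point interpolated average precision for a single class.
 * precision[i] and recall[i] describe one point on the precision/recall
 * curve. For each recall threshold t in {0.0, 0.1, ..., 1.0}, take the
 * maximum precision among points with recall >= t, then average. */
double average_precision_11pt(const double *precision,
                              const double *recall, size_t n)
{
    double sum = 0.0;
    for (int t = 0; t <= 10; ++t) {
        double threshold = t / 10.0;
        double best = 0.0;   /* stays 0 if no point reaches this recall */
        for (size_t i = 0; i < n; ++i) {
            if (recall[i] >= threshold && precision[i] > best)
                best = precision[i];
        }
        sum += best;
    }
    return sum / 11.0;
}
```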
Darknet includes the capability to compute the mAP of your
modified program as follows:
a) Unfortunately, the mAP computation in Darknet has a bug and crashes if fewer than four images are provided. To fix the bug, apply the following patch and recompile Darknet. The patch will also modify Darknet to only report mAP for object classes that are actually included in the provided image test set (as opposed to reporting average detection accuracy across all classes that the CNN was originally trained for, even if those are not tested). Make sure you are in the darknet directory and apply the patch:
b) Put the (relative) paths of the images you want to be included in the mAP computation into a coco_testdev file in the darknet directory. For example:
c) Make sure that the ground truth reference files (e.g. data/dog.txt) are the ones produced by a run of the original, unmodified floating-point Darknet implementation. Then, run your modified fixed-point implementation on the images listed in the coco_testdev file and compute the mAP:
Report the following:
• Total execution time of Darknet using your optimized fixed-point versus the original floating-point implementation.
• mAP of Darknet using your optimized fixed-point version as compared against the original floating-point detection results.
As an extra credit item (that will,
however, be very useful for your final project design), start from your
fixed-point code base developed in Step 6 and find additional opportunities to
implement other optimizations that further improve software performance on the
ARM. Report on the optimizations applied and results achieved.
Some suggestions for possible
optimizations are:
• Exploiting SIMD vector processing. Many high-performance computing applications exploit vectorized instructions and SIMD processing capabilities. Our ARM Cortex-A9 CPU includes a NEON SIMD vector unit. Leverage such hardware capabilities to further improve run-time performance. See this link as a starting point:
– https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference
• Cache locality-aware GEMM optimization. By default, Darknet uses a naïve triple-nested loop to implement the GEMM. This does not consider data reuse opportunities from the underlying cache and memory hierarchies in the ARM platform. Implement a locality-aware GEMM and measure the performance improvement accordingly. See these links as starting points:
– https://github.com/flame/how-to-optimize-gemm/wiki
– https://people.eecs.berkeley.edu/~knight/cs267/hw1/dgemm-blocked.c
• Parallelization and/or pipelining of the Darknet processing chain on our dual-core ARM platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the GEMM out into hardware in Lab 2 and the final project). This requires a deeper understanding of the Darknet processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among Darknet blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.
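As a starting point for the cache locality-aware GEMM suggested above, the following sketch tiles the three loops so that each tile of A, B and C is reused while it is still in cache. The function name gemm_blocked and the tile size BLOCK of 32 are assumptions for illustration; the best tile size depends on the cache sizes of the ARM platform and should be found by measurement. The sketch uses float to match the original GEMM, but the same structure applies to a fixed-point version.

```c
#include <stddef.h>

#define BLOCK 32   /* assumed tile size; tune for the target's caches */

static int min_int(int a, int b) { return a < b ? a : b; }

/* C += A * B with row-major A (M x K), B (K x N) and C (M x N).
 * The outer three loops walk over tiles; the inner three loops perform
 * an ordinary GEMM on one tile, reading B and C along rows. */
void gemm_blocked(int M, int N, int K,
                  const float *A, int lda,
                  const float *B, int ldb,
                  float *C, int ldc)
{
    for (int i0 = 0; i0 < M; i0 += BLOCK) {
        int i_end = min_int(i0 + BLOCK, M);
        for (int k0 = 0; k0 < K; k0 += BLOCK) {
            int k_end = min_int(k0 + BLOCK, K);
            for (int j0 = 0; j0 < N; j0 += BLOCK) {
                int j_end = min_int(j0 + BLOCK, N);
                for (int i = i0; i < i_end; ++i) {
                    for (int k = k0; k < k_end; ++k) {
                        float a = A[i * lda + k];   /* reused across j */
                        for (int j = j0; j < j_end; ++j)
                            C[i * ldc + j] += a * B[k * ldb + j];
                    }
                }
            }
        }
    }
}
```

Because this version accumulates into C, the caller is still responsible for initializing C. Compare its runtime against the original loop nest on the board for the matrix sizes that actually occur in Tiny YOLO.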
Talk to us (instructor or TA) if you are
interested, have questions or are looking for ideas/advice around any of these
topics.
Submit your report and files in Canvas.
The report should list the bottlenecks identified during profiling and
discuss the ways you used or propose to remove them. List the differences between the
original floating-point (flp) and fixed-point (fxp)
versions of the Darknet code with respect to what you
observed by profiling them. Finally, report on the results of floating-point to
fixed-point conversion (Steps 5 and 6, including achieved performance
improvements and accuracy analysis) and any additional optimizations you
performed (Step 7). Also include the fixed-point code (tarball archives created with tar -czvf of the code from Steps 5 and 6/7) as part of your
report.