System-on-a-Chip (SoC) Design
EE382V, Fall 2014
Lab #1
Due: 11:59pm, September 22, 2014
Instructions:
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board on Canvas for Q&A.
• All reports and code MUST be submitted to the assignment on Canvas.
• Please check relevant web pages.
The goals of this lab are to:
• Learn the structure of the DRM source code and compile the code for the ARM platform
• Identify and propose ways to remove the bottleneck of the code when run on different platforms
The assignment of this lab includes the following:
• Set up the design and simulation environment
• Profile the code to identify the time-consuming portions of the code
• Complete an exercise to remove a type of bottleneck
• Isolate modules of the DRM and perform floating-to-fixed point conversion
We will use both Linux and Windows machines in this course. This lab work will be done on the LRC Linux servers. The Windows 7 PCs in the physical lab will be used in the final project for running the Xilinx tools necessary to synthesize hardware down to the board. Access to the Windows machines will be explained later in class as part of the project instructions.
(a) Linux Machines
We will be using the new 64-bit LRC servers for the class. Available machines are listed here: http://www.ece.utexas.edu/it/remote-linux
You will need to use secure shell to access these machines. Execute the following Linux command to access a server:
% ssh -X -Y <server>.ece.utexas.edu
The options allow remote X-Windows viewing. Substitute the server name to reach any of the 64-bit LRC machines. If you do not have an ECE account, check with an LRC proctor.
(b) Disk space
Please use the /scratch directory on the LRC machines as your own workspace. The scratch directory will not be wiped out until the end of the semester. Execute the following commands:
% cd /scratch
% mkdir <your username>
(c) QEMU-based virtual platform & ARM cross-compiler tool chain
A QEMU/SystemC-based virtual platform simulator of our ARM+FPGA board setup is pre-installed on the LRC machines. See the QEMU/SystemC Tutorial for how to set up the simulation environment.
To compile and link applications for the board, we need to use the arm-xilinx-linux-gnueabi-gcc cross-compiler tool chain, which is provided by Xilinx together with their development environment:
% module load xilinx/2014.2
Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard.
Read the following tutorials on the Dream DRM Receiver before starting on Section 5:
• Open-Source Implementation of a Digital Radio Mondiale (DRM) Receiver
• Frequency Synchronization Strategy for a PC based DRM receiver
• Software Implementation of a DRM Receiver
More information and references are available here and on the SPARK homepage.
a) Run the "standard" data stream through the floating-point (flp) version of the DRM code:
1. Download the DRM source code: drm-1.2.4-flp.tar.gz
2. % tar -xzvf drm-1.2.4-flp.tar.gz
3. % cd drm-1.2.4-flp
4. % perl config_linux.pl
5. Follow the instructions on the screen
(Note: input stream: RTL_ModeB_10kHz.wav; output stream: dummy.wav)
After successfully compiling the source code, the executable drm is generated in the linux subdirectory. Two input streams are provided for the DRM: a short and a long version. Before running the DRM executable, make sure that, in the linux subdirectory, RTL_ModeB_10kHz.wav is a symbolic link to the long version, wave/RTL_ModeB_10kHz.wav, not to RTL_ModeB_10kHz_short.wav. Because the DRM receiver has to spend a lot of effort on initialization and synchronization to recover the signal and timing in the first frames of each stream, the relative profiling percentages will be skewed for the shortened input.
There are two output files: dummy.wav and gmon.out. First, compare the output dummy.wav with RTL_ModeB_10kHz_gold.wav, which is a golden reference output, using diff or cmp. They should be exactly the same:
% diff dummy.wav RTL_ModeB_10kHz_gold.wav
gmon.out is used for profiling. You can profile the code as it was run on the Linux host using gprof (in the linux subdirectory):
% gprof .libs/drm
You don't need to specify gmon.out as an argument since it is the default filename that gprof looks for. For documentation about profiling under Linux, run man gprof.
Identify the bottlenecks under Linux and report on the results.
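When a function that gprof reports as a hot spot needs a quick cross-check, a small timing harness can be useful. The following is a minimal sketch in C and is not part of the DRM sources: cpu_seconds_per_call and busy_work are hypothetical names, and busy_work only stands in for whatever routine is being measured.

```c
#include <time.h>

/* Hypothetical stand-in workload: adds 1 + 2 + ... + 1000 to the counter
 * that arg points to. Replace it with the routine under investigation. */
void busy_work(void *arg)
{
    unsigned long *acc = arg;
    for (unsigned long i = 1; i <= 1000; i++)
        *acc += i;
}

/* Call fn(arg) reps times (reps must be > 0) and return the average
 * CPU time per call in seconds, measured with the standard clock(). */
double cpu_seconds_per_call(void (*fn)(void *), void *arg, unsigned reps)
{
    clock_t start = clock();
    for (unsigned i = 0; i < reps; i++)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

The resolution of clock() is coarse, so a large repetition count is usually needed before the average becomes meaningful.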
b) Set up the virtual prototyping environment (see step 3) and simulate the floating-point code running on the QEMU simulator of our ARM board.
1. Cross-compile the executable for the ARM on the Linux host.
% cd drm-1.2.4-flp
% make clean
% perl config_linux_arm.pl
% ./config_linux_arm
% make
2. Copy the cross-compiled DRM code into the booted QEMU platform:
DRM executable: drm-1.2.4-flp/linux/drm
Input wave file: drm-1.2.4-flp/wave/RTL_ModeB_10kHz.wav
3. Log in to the root shell of the QEMU simulator and copy the DRM executable and input wave file into the QEMU simulator. After running the executable, a profiling data file gmon.out and a decoded stream file dummy.wav will be created.
4. Copy the generated profiling data (gmon.out) and decoded audio file (dummy.wav) from the QEMU simulator back to the Linux host.
5. On the Linux host, run gprof on the DRM executable (drm) that you ran on QEMU and the profile data file (gmon.out).
% gprof drm gmon.out > outfile.prf
Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.
c) In the same manner, profile the fixed-point code of DRM both on the Linux host and the ARM. This is legacy code that was partially converted from floating point to fixed point by previous students in the class: drm-1.2.4-fxp.tar.gz. Report on the results both under Linux and ARM.
(a) Complete an exercise to convert all floating-point (flp) variables of a simple matrix inversion example to fixed point. This code snippet shows how to perform floating-point to fixed-point conversion in C/C++.
(b) Make sure that the fixed-point code output is the same as the floating-point code output for the following test cases. Here, by fixed-point data types we mean integer data types such as short/long ints (signed or unsigned). If the same output cannot be obtained, discuss why in the report. Maintain the same code structure. The function to calculate the determinant should be a fixed-point function, but it should not be made inline.
Test cases:
1. Row1: 1, 2, 3; Row2: 10, 0, 1; Row3: 12, 1, 3
2. Row1: 0.2, 0.3, 1.4; Row2: 100, 12.1, 0; Row3: 1, -0.3, 10
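One common way to approach such a conversion is a Q-format representation, where a real value is stored as an integer scaled by a fixed power of two. The sketch below is only an illustration under stated assumptions: the Q16.16 format, the 64-bit intermediate product, and all names (fix16_t, fix16_mul, fix16_det3, and so on) are hypothetical choices and are not taken from the provided example code. Note that values such as 0.2 in the second test case have no exact binary representation in any such format.

```c
#include <stdint.h>

/* Hypothetical Q16.16 format: 16 integer bits, 16 fractional bits. */
typedef int32_t fix16_t;
#define FIX16_FRAC_BITS 16
#define FIX16_ONE ((fix16_t)1 << FIX16_FRAC_BITS)

/* Convert a double to Q16.16, rounding to the nearest representable value. */
fix16_t fix16_from_double(double x)
{
    return (fix16_t)(x * FIX16_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Convert back to double (intended for checking results only). */
double fix16_to_double(fix16_t x)
{
    return (double)x / FIX16_ONE;
}

/* Multiply two Q16.16 values. The product is formed in 64 bits and then
 * rescaled; the division truncates toward zero. */
fix16_t fix16_mul(fix16_t a, fix16_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    return (fix16_t)(p / FIX16_ONE);
}

/* Determinant of a 3x3 matrix by cofactor expansion along the first row.
 * No overflow checks: intermediate values must fit in Q16.16. */
fix16_t fix16_det3(fix16_t m[3][3])
{
    fix16_t c0 = fix16_mul(m[1][1], m[2][2]) - fix16_mul(m[1][2], m[2][1]);
    fix16_t c1 = fix16_mul(m[1][0], m[2][2]) - fix16_mul(m[1][2], m[2][0]);
    fix16_t c2 = fix16_mul(m[1][0], m[2][1]) - fix16_mul(m[1][1], m[2][0]);
    return fix16_mul(m[0][0], c0) - fix16_mul(m[0][1], c1)
         + fix16_mul(m[0][2], c2);
}
```

For the first test case, where every entry is an integer, this sketch produces the determinant -7 exactly. The choice of the number of fractional bits trades range against precision and would have to be revisited for the inverse, which requires a division.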
In the legacy code given to you in step 5c), there still exist opportunities for floating- to fixed-point conversion. Specifically, the Viterbi decoder still uses floating-point arithmetic. Convert the Viterbi decoder and perform additional floating- to fixed-point conversions to achieve the best performance gain that you can. Report on the profiling results that show your optimizations.
In the process of conversion, try to convert not only the module itself but also the class interfaces (buffers and method calls) of the module (and parent modules) into fixed-point form. Use profiling to measure and guide you towards achieving as much improvement in total DRM runtime as you can, with as little loss as possible in the signal-to-noise ratio (SNR) of the overall DRM that includes your converted fixed-point modules and interfaces. You can individually substitute the floating-point modules back in to analyze the cause of any SNR degradation. Any noise is due to the loss of precision during the conversion from floating-point to fixed-point arithmetic.
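When an interface buffer is converted, the samples it carries have to be rescaled into an integer format, and out-of-range values need well-defined behavior. The following is a minimal sketch, assuming samples normalized to [-1.0, 1.0) and a Q15 (int16_t) target format; neither assumption is taken from the DRM code, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one sample to Q15 with rounding to nearest. Values outside the
 * representable range saturate instead of wrapping around. */
int16_t float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)
        return INT16_MAX;
    if (scaled <= -32768.0f)
        return INT16_MIN;
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Convert a whole buffer, for example at the boundary between a
 * floating-point module and a fixed-point module. */
void buffer_to_q15(const float *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = float_to_q15(in[i]);
}
```

Saturation keeps an occasional overshoot from turning into a large sign-flipped error, which would otherwise show up as a disproportionate SNR loss.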
An example for calculating SNR using Matlab is given below, where the output of the modified DRM is dummy.wav and the golden file is RTL_ModeB_10kHz_gold.wav (note that Matlab is currently only available on the 32-bit LRC machines via module load matlab):
du = wavread('RTL_ModeB_10kHz_gold.wav');   % golden reference output
du2 = wavread('dummy.wav');                 % output of the modified DRM
d = du(:,2);                                % second channel of each file
d2 = du2(:,2);
ddiff = d - d2;                             % error signal
disp(['SNR is ', num2str(10*log10(sum(d(:).^2)/sum(ddiff(:).^2))), ' dB']);
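The quantity inside the 10*log10(...) of the Matlab snippet above is the ratio of signal energy to error energy. For use inside a C test harness, that ratio can be sketched as follows. This is an illustration only: it assumes 16-bit PCM samples that have already been read into memory, the function name snr_power_ratio is hypothetical, and the conversion to dB is deliberately left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute (sum of ref^2) / (sum of (ref - out)^2) over n samples.
 * Returns 0 and stores the linear power ratio in *ratio, or returns -1
 * if the buffers are identical (zero error energy, unbounded SNR). */
int snr_power_ratio(const int16_t *ref, const int16_t *out, size_t n,
                    double *ratio)
{
    int64_t signal = 0;
    int64_t noise = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t diff = (int64_t)ref[i] - out[i];
        signal += (int64_t)ref[i] * ref[i];
        noise += diff * diff;
    }
    if (noise == 0)
        return -1;
    *ratio = (double)signal / (double)noise;
    return 0;
}
```

As an example, a ratio of 40000 corresponds to about 46 dB.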
Information about the conversion process and the existing codebase/infrastructure that was set up for the partially converted DRM code by previous teams can be found here.
There exist many other software optimization opportunities, such as exploiting vector or parallel processing capabilities of our dual-core ARM Cortex-A9 board (each core of which includes a NEON SIMD vector unit). As an extra credit item, start from your fixed-point code base developed in step 7, find additional opportunities, and implement other optimizations to further improve software performance on the ARM. Report on the optimizations applied and results achieved.
Some suggestions for possible optimizations are:
• Additional floating- to fixed-point conversions of other remaining floating-point modules/parts of the code.
• Vectorization of butterfly (add-compare-select) computations in the Viterbi decoder using SIMD assembly instructions (or corresponding compiler intrinsics) available on our ARM. As a starting point, you can look at the SSE and MMX implementations on x86 that are part of the original floating-point code.
• Parallelization and/or pipelining of the DRM processing chain on the dual-core platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the Viterbi out into hardware in Lab 2 and the final project). This requires a deeper understanding of the DRM processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among DRM blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.
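To make the add-compare-select suggestion concrete, the scalar sketch below shows one butterfly on integer path metrics. It is not the Dream implementation: the function names are hypothetical, and it assumes a shift-register trellis in which old states j and j + half feed new states 2j and 2j + 1, with the two branch metrics swapped between the pair. Each butterfly is independent of the others within a trellis step, which is what makes this loop a candidate for SIMD.

```c
#include <stdint.h>

/* Add-compare-select for one new state: add each branch metric to its
 * predecessor's path metric, keep the smaller sum, and record which
 * predecessor survived (0 or 1). Metrics wrap on overflow, so a real
 * decoder has to renormalize them periodically. */
uint16_t acs(uint16_t pm0, uint16_t bm0, uint16_t pm1, uint16_t bm1,
             uint8_t *decision)
{
    uint16_t cand0 = (uint16_t)(pm0 + bm0);
    uint16_t cand1 = (uint16_t)(pm1 + bm1);
    *decision = (uint8_t)(cand1 < cand0);
    return *decision ? cand1 : cand0;
}

/* One butterfly: old states j and j + half update new states 2j and
 * 2j + 1. bm_a and bm_b are the two branch metrics of this butterfly
 * (assumed to be swapped between the two new states). */
void acs_butterfly(const uint16_t *old_pm, uint16_t *new_pm, uint8_t *dec,
                   unsigned j, unsigned half, uint16_t bm_a, uint16_t bm_b)
{
    new_pm[2 * j] = acs(old_pm[j], bm_a, old_pm[j + half], bm_b,
                        &dec[2 * j]);
    new_pm[2 * j + 1] = acs(old_pm[j], bm_b, old_pm[j + half], bm_a,
                            &dec[2 * j + 1]);
}
```

Keeping the metrics in narrow integer types is what lets several butterflies fit into one vector register once the scalar version is verified.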
Talk to us (instructor or TA) if you are interested, have questions or are looking for ideas/advice around any of these topics.
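As a starting point for experiments with the dual-core platform, the sketch below splits an independent computation across two threads with the Pthreads library. It is a generic data-parallel illustration and not a pipelined DRM chain: the workload (sum of squares over a sample buffer), the struct, and the function names are all hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work item: one slice of a sample buffer plus its result. */
struct slice {
    const int16_t *data;
    size_t len;
    int64_t energy;   /* written by the thread that processes the slice */
};

/* Thread entry point: sum of squares over one slice. */
static void *slice_energy(void *p)
{
    struct slice *s = p;
    int64_t acc = 0;
    for (size_t i = 0; i < s->len; i++)
        acc += (int64_t)s->data[i] * s->data[i];
    s->energy = acc;
    return NULL;
}

/* Split the buffer into two halves, one per core. The calling thread
 * processes the first half itself. Returns 0 on success, -1 on error. */
int energy_two_threads(const int16_t *data, size_t n, int64_t *out)
{
    struct slice lo = { data, n / 2, 0 };
    struct slice hi = { data + n / 2, n - n / 2, 0 };
    pthread_t t;
    if (pthread_create(&t, NULL, slice_energy, &hi) != 0)
        return -1;
    slice_energy(&lo);
    if (pthread_join(t, NULL) != 0)
        return -1;
    *out = lo.energy + hi.energy;
    return 0;
}
```

The two slices share no writable data, so no locking is needed; blocks of the DRM chain with dependencies between them would need explicit synchronization instead. Depending on the tool chain, building may require the -pthread compiler option.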
Submit your report and files in Canvas.
The report should list the bottlenecks identified during profiling and discuss/propose ways to remove them. List the differences between the original flp and fxp versions of the DRM code with respect to what you observed by profiling them. Finally, report on the results of floating-point to fixed-point conversion (steps 6 and 7, including achieved performance improvements and SNR analysis for step 7) and any additional optimizations you performed (step 8). Also include the fixed-point code (tarball archives created with tar -czvf, containing the output of step 6 and step 7/step 8) as part of your report.