System-on-a-Chip (SoC) Design
EE382V, Fall 2014
Due: 11:59pm, September 22, 2014
• This lab is a team exercise; groups will be assigned in class.
• Please use the discussion board of Canvas for Q&A.
• All reports and code MUST be submitted through the Canvas assignment for this lab.
• Please check relevant web pages.
The goals of this lab are to:
• Learn the structure of the DRM source code and compile the code for the ARM platform
• Identify and propose ways to remove the bottleneck of the code when run on different platforms
The assignment of this lab includes the following:
• Set up the design and simulation environment
• Profile the code to identify the time consuming portions of the code
• Complete an exercise to remove a type of bottleneck
• Isolate modules of the DRM and perform floating-to-fixed point conversion
We will use both Linux and Windows machines in this course. This lab will be done on the LRC Linux servers. The Windows 7 PCs in the physical lab will be used in the final project to run the Xilinx tools needed to synthesize hardware down to the board. Access to the Windows machines will be explained later in class as part of the project instructions.
(a) Linux Machines
We will be using the new 64-bit LRC servers for the class. Available machines are listed here:
You will need to use secure shell to access these machines. Execute the following Linux command to access a server:
% ssh -X -Y <server>.ece.utexas.edu
The options allow remote X-Windows viewing. Substitute the server name to reach any of the 64-bit LRC machines. If you do not have an ECE account, check with an LRC proctor.
(b) Disk space
Please use the /scratch directory on the LRC machines as your own workspace. The scratch directory will not be wiped out until the end of the semester. Execute the following commands:
% cd /scratch
% mkdir <your username>
(c) QEMU-based virtual platform & ARM cross-compiler tool chain
A QEMU/SystemC-based virtual platform simulator of our ARM+FPGA board setup is pre-installed on the LRC machines. See the QEMU/SystemC Tutorial for how to set up the simulation environment.
To compile and link applications for the board, we need to use an arm-xilinx-linux-gnueabi-gcc cross-compiler tool chain, which is provided by Xilinx together with their development environment:
% module load xilinx/2014.2
Go through the tutorial on how to set up the QEMU/SystemC simulation environment for ARM processor emulation of the Zedboard:
Read the following tutorials on the Dream DRM Receiver before starting on Section 5:
a) Run the "standard" data stream through the floating point (flp) version of the DRM code:
1. Download the DRM source code: drm-1.2.4-flp.tar.gz
2. % tar -xzvf drm-1.2.4-flp.tar.gz
3. % cd drm-1.2.4-flp
4. % perl config_linux.pl
5. Follow the instructions on the screen
(Note: the input stream is RTL_ModeB_10kHz.wav; the output stream is dummy.wav.)
After the source code compiles successfully, the executable drm is generated in the linux subdirectory. The DRM comes with two input streams: a short version and a long version. Before running the DRM executable, make sure that, in the linux subdirectory, RTL_ModeB_10kHz.wav is a symbolic link to the long version, wave/RTL_ModeB_10kHz.wav, not to RTL_ModeB_10kHz_short.wav. Because the DRM receiver spends considerable effort on initialization and synchronization to recover the signal and timing in the first frames of each stream, the relative profiling percentages will be skewed for the shortened input.
There are two output files: dummy.wav and gmon.out. First, compare the output dummy.wav with RTL_ModeB_10kHz_gold.wav, which is a golden reference output, using diff or cmp. They should be exactly the same:
% diff dummy.wav RTL_ModeB_10kHz_gold.wav
gmon.out is used for profiling. You can profile the code as it was run on the Linux host using gprof (in the linux subdirectory):
% gprof .libs/drm
You don't need to specify gmon.out as an argument since it is the default filename that gprof looks for. For documentation about profiling under Linux, run man gprof.
Identify the bottlenecks under Linux and report on the results.
b) Set up the virtual prototyping environment (see step 3) and simulate the floating point code running on the QEMU simulator of our ARM board.
1. Cross-compile the executable for the ARM on the Linux host.
% cd drm-1.2.4-flp
% make clean
% perl config_linux_arm.pl
2. Copy the cross-compiled DRM code into the booted QEMU platform:
DRM executable: drm-1.2.4-flp/linux/drm
Input wave file: drm-1.2.4-flp/wave/RTL_ModeB_10kHz.wav
3. Log into the root shell of the QEMU simulator and copy the DRM executable and input wave file into the QEMU simulator. After running the executable, a profiling data file gmon.out and a decoded stream file dummy.wav will be created as a result.
4. Copy the generated profiling data (gmon.out) and decoded audio file (dummy.wav) from the QEMU simulator back to the Linux host.
5. On the Linux host, run gprof on the DRM executable (drm) that you ran on QEMU and the profile data file (gmon.out).
% gprof drm gmon.out > outfile.prf
Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.
c) In the same manner, profile the fixed point code of DRM both on the Linux host and the ARM. This is legacy code that was partially converted from floating point to fixed point by previous students in the class: drm-1.2.4-fxp.tar.gz. Report on the results both under Linux and ARM.
(a) Complete an exercise to convert all floating-point (flp) variables of a simple matrix inversion example to fixed point. This code snippet shows how to perform floating-point to fixed-point conversion in C/C++.
(b) Make sure that the output of the fixed-point code is the same as that of the floating-point code for the following test cases. By fixed-point data types, we mean integer data types such as short/long ints (signed or unsigned). If the same output cannot be obtained, discuss why in the report. Maintain the same code structure. The function that calculates the determinant should be a fixed-point function, but it should not be made inline.
1. Row1: 1, 2, 3; Row2: 10, 0, 1; Row3: 12, 1, 3
2. Row1: 0.2, 0.3, 1.4; Row2: 100, 12.1, 0; Row3: 1, -0.3, 10
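As an illustration of what such a conversion can look like, here is a minimal sketch in C. It is not the code snippet provided with the lab. The Q16.16 format (32-bit integer, 16 fractional bits), the type name fix_t, and the helper names fix_from_double, fix_to_double, fix_mul, and fix_det3 are all assumptions chosen for this example. The sketch covers only the determinant; a full inversion also needs a fixed-point division.

```c
#include <stdint.h>

/* Assumed format for this sketch: Q16.16 (16 integer bits, 16 fractional bits). */
typedef int32_t fix_t;
#define FIX_FRAC_BITS 16
#define FIX_ONE ((int64_t)1 << FIX_FRAC_BITS)

/* Convert a double to Q16.16, rounding to the nearest representable value. */
fix_t fix_from_double(double x)
{
    return (fix_t)(x * (double)FIX_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Convert Q16.16 back to double, e.g. for printing or comparison. */
double fix_to_double(fix_t x)
{
    return (double)x / (double)FIX_ONE;
}

/* Multiply two Q16.16 values through a 64-bit intermediate product. */
fix_t fix_mul(fix_t a, fix_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;  /* product has 32 fractional bits */
    return (fix_t)(p / FIX_ONE);          /* rescale to Q16.16, truncating toward zero */
}

/* 3x3 determinant by cofactor expansion along the first row (not inline). */
fix_t fix_det3(fix_t m[3][3])
{
    fix_t c0 = fix_mul(m[1][1], m[2][2]) - fix_mul(m[1][2], m[2][1]);
    fix_t c1 = fix_mul(m[1][0], m[2][2]) - fix_mul(m[1][2], m[2][0]);
    fix_t c2 = fix_mul(m[1][0], m[2][1]) - fix_mul(m[1][1], m[2][0]);
    return fix_mul(m[0][0], c0) - fix_mul(m[0][1], c1) + fix_mul(m[0][2], c2);
}
```

With integer-valued inputs such as test case 1, every intermediate value is exactly representable in this format. Values such as 0.2 or 12.1 in test case 2 have no exact binary representation, so each conversion and each truncating multiply introduces a small error that propagates through the determinant. The choice of fractional bits trades precision against the largest magnitude that fits without overflow.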
In the legacy code given to you in step 5c), there are still opportunities for floating- to fixed-point conversion. Specifically, the Viterbi decoder still uses floating-point arithmetic. Convert the Viterbi decoder and perform additional floating- to fixed-point conversions to achieve the best performance gain that you can. Report on the profiling results that show your optimizations.
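To give a feel for what integer arithmetic in a Viterbi decoder involves, the following C sketch shows a single add-compare-select (ACS) step on integer path metrics, plus a helper that quantizes a soft input to an integer. This is a generic illustration and not the interface of the Dream DRM code. The names acs_result_t, acs_step, and quantize_soft, the [-1.0, 1.0] input range, and the convention that a smaller metric is better are all assumptions made for the example.

```c
#include <stdint.h>

/* Result of one ACS step for a single trellis state (hypothetical type). */
typedef struct {
    int32_t metric;    /* surviving path metric */
    uint8_t decision;  /* 0: first predecessor survived, 1: second predecessor */
} acs_result_t;

/* pm0/pm1: path metrics of the two predecessor states.
   bm0/bm1: branch metrics of the two incoming transitions.
   Assumes distance-style metrics, so the smaller candidate survives. */
acs_result_t acs_step(int32_t pm0, int32_t bm0, int32_t pm1, int32_t bm1)
{
    int32_t cand0 = pm0 + bm0;              /* add */
    int32_t cand1 = pm1 + bm1;
    acs_result_t r;
    r.decision = (uint8_t)(cand1 < cand0);  /* compare (ties keep predecessor 0) */
    r.metric = r.decision ? cand1 : cand0;  /* select */
    return r;
}

/* Quantize a soft input, assumed to lie in [-1.0, 1.0], to a signed integer
   with frac_bits fractional bits, saturating at the range limits. */
int32_t quantize_soft(double x, int frac_bits)
{
    double scale = (double)((int32_t)1 << frac_bits);
    if (x > 1.0)  x = 1.0;
    if (x < -1.0) x = -1.0;
    return (int32_t)(x * scale + (x >= 0 ? 0.5 : -0.5));
}
```

Because path metrics only grow as the decoder advances, a fixed-width implementation also needs to renormalize them periodically (for example by subtracting the current minimum) to avoid overflow. The number of fractional bits used for the soft inputs directly affects the precision lost in the conversion.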
In the process of conversion, try not only to convert the module itself, but also to convert the class interfaces (buffers and method calls) of the module (and its parent modules) into fixed-point form. Use profiling to measure and guide you towards achieving as much improvement in total DRM runtime as you can, with as little loss as possible in the signal-to-noise ratio (SNR) of the overall DRM that includes your converted fixed-point modules and interfaces. You can substitute the floating-point modules back in one at a time to analyze the cause of any SNR degradation. Any noise is due to the loss of precision during the conversion from floating-point to fixed-point arithmetic.
An example for calculating SNR using Matlab is given below, where the output of the modified DRM is dummy.wav and the golden file is RTL_ModeB_10kHz_gold.wav (note that Matlab is currently only available on the 32-bit LRC machines via module load matlab):
du = wavread('RTL_ModeB_10kHz_gold.wav');
du2 = wavread('dummy.wav');
d = du(:,2);
d2 = du2(:,2);
ddiff = d - d2;
disp(['SNR is',num2str(10*log10(sum(d(:).^2)/sum(ddiff(:).^2))),' dB']);
Information about the conversion process and the existing codebase/infrastructure that was set up for the partially converted DRM code by previous teams can be found here.
There exist many other software optimization opportunities, such as exploiting the vector or parallel processing capabilities of our dual-core ARM Cortex-A9 board (each core of which includes a NEON SIMD vector unit). As an extra credit item, start from the fixed-point code base you developed in step 7, find additional opportunities, and implement other optimizations to further improve software performance on the ARM. Report on the optimizations applied and the results achieved.
Some suggestions for possible optimizations are:
• Additional floating- to fixed-point conversions of other remaining floating-point modules/parts of the code.
• Vectorization of butterfly (add-compare-select) computations in the Viterbi decoder using SIMD assembly instructions (or corresponding compiler intrinsics) available on our ARM. As a starting point, you can look at the SSE and MMX implementations on x86 that are part of the original floating-point code.
• Parallelization and/or pipelining of the DRM processing chain on the dual-core platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the Viterbi out into hardware in Lab 2 and the final project). This requires a deeper understanding of the DRM processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among DRM blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.
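As a starting point for the dual-core suggestion, the following C sketch splits a simple computation (the energy of a block of 16-bit samples) across two threads with Pthreads. It is a generic data-parallel example and not part of the DRM code. The names slice_t, slice_worker, and energy_two_threads are hypothetical, and a real parallelization of the DRM chain first requires the dependency analysis described above.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Work description for one thread: sum of squares over a slice of samples. */
typedef struct {
    const int16_t *samples;
    size_t count;
    int64_t energy;   /* output, written by the worker */
} slice_t;

/* Thread entry point: accumulates the energy of one slice. */
static void *slice_worker(void *arg)
{
    slice_t *s = (slice_t *)arg;
    int64_t acc = 0;
    for (size_t i = 0; i < s->count; i++)
        acc += (int64_t)s->samples[i] * s->samples[i];
    s->energy = acc;
    return NULL;
}

/* Computes the energy of n samples using the calling thread plus one
   extra thread. Returns 0 on success, -1 if the thread could not be
   created or joined. */
int energy_two_threads(const int16_t *samples, size_t n, int64_t *out)
{
    size_t half = n / 2;
    slice_t lo = { samples, half, 0 };
    slice_t hi = { samples + half, n - half, 0 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, slice_worker, &hi) != 0)
        return -1;
    slice_worker(&lo);                 /* first half runs on the calling thread */
    if (pthread_join(tid, NULL) != 0)  /* wait for the second half */
        return -1;
    *out = lo.energy + hi.energy;
    return 0;
}
```

Each thread writes only to its own slice_t, so no locking is needed; the join is the only synchronization point. With gcc, compile and link with -pthread.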
Talk to us (instructor or TA) if you are interested, have questions or are looking for ideas/advice around any of these topics.
Submit your report and files in Canvas. The report should list the bottlenecks identified during profiling and discuss the ways you used, or propose, to remove them. List the differences between the original flp and fxp versions of the DRM code with respect to what you observed by profiling them. Finally, report on the results of floating-point to fixed-point conversion (steps 6 and 7, including achieved performance improvements and SNR analysis for step 7) and any additional optimizations you performed (step 8). Also include the fixed-point code as part of your submission, as tarball archives (created with tar -czvf) containing the output of step 6 and of step 7/step 8.