Compiler Support for Digital Signal Processors and
Multimedia Processors
Papers at the 1999 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
Prof. Brian L. Evans,
UT Austin
Conventional digital signal processors (DSPs) have a dramatically different
architecture from general-purpose processors, as described below in
Table 1. The differences arise primarily because the architectures
of digital signal processors have been optimized for the low-latency
high-throughput data processing common in signal processing and digital
communications systems. Digital signal processors are the enabling
technology behind low-cost high-volume consumer electronics such as
audio CD players, sound cards, disk drives, voiceband modems, and
cell phones.
| Conventional Digital Signal Processor | Conventional General-Purpose Processor |
|---|---|
| Separate program and data memories | Common program and data memories |
| Separate program and data buses | One common bus for instructions and data |
| Separate computational units (ALU, multiplier, shifter, accumulator) and large amounts of functional (operational) parallelism | No separate computational units |
| Single-cycle multiply-accumulate (MAC) instruction with extended precision accumulator | Multiply and addition are separate instructions, and multiply loses precision |
| Optimized for single-cycle instruction execution | Many multiple-cycle instructions |
| For real-time signal processing, multifunction instructions are implemented with the help of parallel architecture | Usually do not contain multifunction instructions |
| Independent data address generators | No independent data address generators |
| Hardware support for special addressing modes of modulo and bit-reversed addressing | Special addressing modes would have to be emulated in software |
Table 1: Comparison of the architectures of conventional digital signal
processors and conventional general-purpose processors.
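
The last row of Table 1 can be made concrete with a minimal C sketch of how the two special addressing modes might be emulated in software on a general-purpose processor. The function names `bit_reverse` and `modulo_advance` are hypothetical and chosen for illustration only. A digital signal processor would typically perform the same index updates in its address generators rather than in extra instructions.

```c
#include <stdint.h>

/* Reverse the lowest `bits` bits of `index`. This is the reordering
   needed for the input or output of a radix-2 FFT of length 2^bits. */
uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t reversed = 0;
    for (unsigned i = 0; i < bits; i++) {
        reversed = (reversed << 1) | (index & 1u);
        index >>= 1;
    }
    return reversed;
}

/* Advance a circular-buffer index by `step`, wrapping at `length`
   (modulo addressing). Assumes index < length and step <= length. */
uint32_t modulo_advance(uint32_t index, uint32_t step, uint32_t length)
{
    uint32_t next = index + step;
    return (next >= length) ? next - length : next;
}
```

Each call costs several instructions here, which is the overhead that hardware support removes.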
C and C++ compilers have been developed and optimized for the architectures
of general-purpose processors, and are not as efficient at generating code
for digital signal processors. Table 2 compares the code generated by
the May 1996 version of the Motorola MC56000 C compiler vs. hand coding for three
kernels (fundamental signal processing operations). For the three kernels,
the compiler generates code with an average overhead of 27% for data memory,
41% for program memory, and 47% for execution time. The speed of kernels
is often the bottleneck in signal processing and multimedia applications,
so they are generally programmed directly in assembly language.
| Kernel | Program Memory | X Data Memory | Y Data Memory | Execution Time |
|---|---|---|---|---|
| IIR Filter, Hand Coded | 43 | 7 | 8 | 517 |
| IIR Filter, Compiled Code | 59 | 11 | 11 | 1127 |
| 256-point FFT, Hand Coded | 130 | 67 | 60 | 29,172 |
| 256-point FFT, Compiled Code | 178 | 69 | 77 | 30,927 |
| Goertzel DFT, Hand Coded | 67 | 41 | 3 | 17,341 |
| Goertzel DFT, Compiled Code | 111 | 39 | 5 | 20,123 |
Table 2: Comparison of code generated by C compilers vs. hand-written
assembly code for the Motorola 56000 digital signal processor.
[Unpublished Reference]
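
To give a sense of what such a kernel looks like at the C source level, here is a minimal sketch of one second-order IIR filter section in direct form I. It is an illustration only: the order and structure of the IIR filter measured in Table 2 are not stated, and the type `biquad` and function `biquad_run` are hypothetical names.

```c
#include <stddef.h>

/* One second-order IIR section (direct form I):
   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] */
typedef struct {
    double b0, b1, b2;   /* feedforward coefficients */
    double a1, a2;       /* feedback coefficients */
    double xz1, xz2;     /* previous two inputs */
    double yz1, yz2;     /* previous two outputs */
} biquad;

/* Filter n samples from `in` to `out`, updating the state in `f`. */
void biquad_run(biquad *f, const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double y = f->b0 * in[i] + f->b1 * f->xz1 + f->b2 * f->xz2
                 - f->a1 * f->yz1 - f->a2 * f->yz2;
        f->xz2 = f->xz1;
        f->xz1 = in[i];
        f->yz2 = f->yz1;
        f->yz1 = y;
        out[i] = y;
    }
}
```

The loop body is a short chain of multiply-accumulate operations plus state updates, which is the pattern a hand coder maps onto MAC instructions and parallel data moves.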
The modern trend in both digital signal processors and general-purpose
processors is to evolve hybrid architectural features.
Newer digital signal processors such as the Analog Devices TigerSHARC and
general-purpose processors such as Pentium, UltraSPARC, and PowerPC have
added single-instruction multiple-data (SIMD) instructions to their
instruction sets to accelerate signal processing and multimedia applications.
At present, only the Metrowerks CodeWarrior C compiler (Version 5 Pro)
generates SIMD instructions.
In general, a programmer would have to mix the SIMD assembly language
instructions with C code.
In contrast, the Texas Instruments TMS320C6x is a very-long instruction
word (VLIW) architecture with many digital signal processor features.
The architecture was developed simultaneously with the assembler and
C compiler.
The simultaneous development of a processor architecture and a C compiler
for it is a trend in multimedia processors.
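
As an illustration of what the SIMD instructions mentioned above compute, the following sketch adds four unsigned 8-bit values packed into one 32-bit word using only portable C. It is an emulation under stated assumptions, not the instruction set of any of the processors named here, and `packed_add_u8` is a hypothetical name. A SIMD instruction would produce the same result in a single operation.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into 32-bit words.
   Each lane wraps modulo 256 and no carry crosses into its neighbor. */
uint32_t packed_add_u8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; any carry lands in bit 7
       of the same lane and cannot spill into the next lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Sum of the top bits of every lane, without the carry out. */
    uint32_t top = (a ^ b) & 0x80808080u;
    return low7 ^ top;
}
```

For example, `packed_add_u8(0x01020304u, 0x10203040u)` adds the lane pairs (0x01, 0x10), (0x02, 0x20), (0x03, 0x30), and (0x04, 0x40) independently.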
An overview of digital signal processors and native signal processing
is available at
Preprocessing for C Compilers
"A Floating-Point to Integer C Converter with Shift Reduction
for Fixed-Point Digital Signal Processors"
pp. 2163 - 2166
Processors and processor cores used in high-volume consumer electronics
are generally fixed-point. The fixed-point processors may process
speech or images (8 bits/sample) or audio (16 bits/sample), or may be
used in a voiceband modem (8 bits/sample) or digital subscriber line
modem (20 bits/sample). A key problem is that many algorithms are
initially developed using floating-point arithmetic, and the tedious
conversion of the algorithms to fixed-point arithmetic is generally
performed manually.
This paper presents a general approach to convert the floating-point
variables and arithmetic in a C program to integer variables and arithmetic,
while minimizing the loss of precision and the implementation cost on a
target digital signal processor.
The implementation cost includes the cost of implementing shifts on the
target processor.
The output of the conversion is another C program.
The binary point for each fixed-point variable may vary.
Each fixed-point variable has a sign bit, a number of integer bits, and a
number of fractional bits.
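
A minimal sketch of such a representation for a 16-bit word follows. The type `fixed16` and its helper functions are hypothetical names introduced for illustration. They are not the converter's output format, which is not reproduced here.

```c
#include <stdint.h>

/* A 16-bit fixed-point value: 1 sign bit, (15 - frac_bits) integer
   bits, and frac_bits fractional bits. */
typedef struct {
    int16_t raw;     /* stored integer */
    int frac_bits;   /* position of the binary point */
} fixed16;

/* Convert a floating-point value, rounding to nearest and saturating
   at the limits of the 16-bit word. */
fixed16 fixed16_from_double(double value, int frac_bits)
{
    double scaled = value * (double)(1L << frac_bits);
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;
    if (scaled > 32767.0) scaled = 32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    fixed16 f = { (int16_t)scaled, frac_bits };
    return f;
}

/* Recover the real value represented by a fixed-point variable. */
double fixed16_to_double(fixed16 f)
{
    return (double)f.raw / (double)(1L << f.frac_bits);
}
```

With 15 fractional bits the representable range is [-1,1), so 0.5 is stored as 16384. With 14 fractional bits there is one integer bit, and 1.5 is stored as 24576.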
The determination of the bit widths is performed by the following steps:
- The program is first simulated to determine the dynamic
range of each floating-point variable.
Each variable is initially assigned a number of integer and
fractional bits for the target word size (16 for most digital
signal processors).
- Simulated annealing is used to choose the integer wordlengths for
each variable to minimize the number of scaling (shift) operations.
- The floating-point variables and constants are replaced by the
corresponding integer types, and appropriate scaling code is inserted.
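
The bookkeeping behind the steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's algorithm: it derives an integer wordlength directly from an observed dynamic range and shows the shift needed to move a value between two binary-point positions, but it omits the simulated annealing search that trades precision against the number of shifts. All function names are hypothetical.

```c
#include <stdint.h>

/* Smallest integer wordlength iwl such that max_abs < 2^iwl, where
   max_abs is the largest magnitude seen for a variable in simulation. */
int integer_bits_for_range(double max_abs)
{
    int iwl = 0;
    double bound = 1.0;
    while (max_abs >= bound && iwl < 62) {
        bound *= 2.0;
        iwl++;
    }
    return iwl;
}

/* Fractional bits left in a word after the sign bit and integer bits. */
int fractional_bits(int word_bits, int iwl)
{
    return word_bits - 1 - iwl;
}

/* Move a value from from_frac to to_frac fractional bits. A nonzero
   difference is one scaling (shift) operation in the generated code.
   The right shift assumes an arithmetic shift for negative values. */
int32_t rescale(int32_t raw, int from_frac, int to_frac)
{
    if (to_frac >= from_frac)
        return raw * (int32_t)(1L << (to_frac - from_frac));
    return raw >> (from_frac - to_frac);
}
```

Two variables that share the same number of fractional bits can be added with no shift at all, which is why the choice of integer wordlengths affects the implementation cost.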
The conversion uses the Stanford University Intermediate Format compiler
system to parse, analyze, convert, and generate source code.
The authors demonstrate the conversion on the following systems:
- Fourth-order infinite impulse response filter,
with 7 floating-point variables - speedup of 406 to 1
- Qualcomm Code-Excited Linear Predictive (QCELP) speech codec,
with 381 floating-point variables - speedup of 24.6 to 1 on
a Texas Instruments TMS320C62x digital signal processor
The TMS320C6x family requires 4 instruction cycles for a
floating-point multiplication and 1 instruction cycle for
a fixed-point multiplication.
"Source-Level Loop Optimization for DSP Code Generation"
pp. 2155 - 2158
This paper applies a type of source-level loop optimization, which is
commonly used in C compilers for desktop workstations, to the compilation
of C code for digital signal processors. The paper states that the
"overhead of compiled code, in terms of clock cycles and code space,
falls typically in the range of 2 to 8". In the introduction, we showed
that an older 1996 compiler on the oldest Motorola DSP, the 56000, had
an overhead of about 40%. Nonetheless, the authors correctly point out
that conventional digital signal processors have very few data registers,
which makes software pipelining extremely difficult to implement.
For example, the Motorola MC56300 has four data registers.
The authors apply their loop unrolling techniques to reduce the
execution time of the innermost loops of seven signal and image
processing kernels (dot product, vector multiply, finite impulse
response filter, lattice synthesis, infinite impulse response,
vector codebook search, and JPEG compression). On average,
their technique reduces the execution time by 17%. The performance
increase is limited by the available data registers. Note that
the modern Texas Instruments TMS320C62x digital signal processor has
32 registers and 8 parallel functional units, so it is a prime
candidate for source-level loop optimization techniques. These
techniques are built into the TMS320C62x compiler.
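
As an illustration of source-level loop unrolling, the sketch below shows a dot product before and after unrolling by a factor of four. It is a generic example, not the paper's exact transformation, and the function names are hypothetical. The unrolled version keeps four partial sums live at once, which is why the number of available data registers limits how far the technique can go.

```c
#include <stddef.h>
#include <stdint.h>

/* Original loop: one multiply-accumulate per iteration. */
int32_t dot_rolled(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by four with independent partial sums, so that a compiler
   can schedule the four multiply-accumulates in parallel. */
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += (int32_t)a[i]     * b[i];
        acc1 += (int32_t)a[i + 1] * b[i + 1];
        acc2 += (int32_t)a[i + 2] * b[i + 2];
        acc3 += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        acc0 += (int32_t)a[i] * b[i];
    return acc0 + acc1 + acc2 + acc3;
}
```

Both functions return the same result for integer data. The unrolled form reduces loop overhead per element at the cost of larger code and more live values.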
Architectural Support for C Compilers
"C/C++ Compiler Support for Siemens TriCore DSP Instruction Set"
pp. 2147 - 2150
This paper is part of the trend to co-develop an architecture of a
media processor and its C compiler. This example is the Siemens
TriCore processor. TriCore has a hybrid DSP/microcontroller/SIMD
architecture. The authors add DSP data types and functions to the C/C++
languages to increase the coverage of the TriCore instruction set
in compiled code from 50% to 80%. This involves making changes
to the compiler to take advantage of the new features.
Digital signal processors operate on two types of fixed-point data:
fractional data and integer data. With fractional data, the binary
point is left-justified. With integer data, the binary point is
right-justified. Most DSPs have a barrel shifter so that the binary
point can be placed anywhere in the bit field without a performance
penalty; however, the programmer must keep track of where the binary
point is. The problem with the C and C++ programming languages is
that their standards do not define fractional data types.
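
Because standard C has no fractional type, a programmer typically stores fractional data in an integer type and tracks the binary point by hand. The sketch below shows this for 16-bit fractional data with 15 fractional bits. The names `q15_to_double` and `q15_mul` are hypothetical, and the code is an illustration rather than the data types proposed in the paper.

```c
#include <stdint.h>

/* Interpret a 16-bit word as fractional data in [-1,1): the binary
   point sits just right of the sign bit. */
double q15_to_double(int16_t raw)
{
    return (double)raw / 32768.0;
}

/* Multiply two fractional values. The 32-bit product has 30 fractional
   bits, so it is scaled back to 15. Division truncates toward zero. */
int16_t q15_mul(int16_t a, int16_t b)
{
    /* -1 * -1 = +1 is the only product outside [-1,1); saturate it. */
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;
    int32_t product = (int32_t)a * b;
    return (int16_t)(product / 32768);
}
```

The same bit pattern 0x4000 means 16384 as integer data and 0.5 as fractional data. The C type system cannot tell the two apart, so nothing stops a programmer from mixing them by mistake.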
The authors take a conventional four-tiered approach to extending C
and C++ to provide better support for signal processing and multimedia
algorithms on the TriCore processor:
- Libraries: Assembly-coded functions in a library that may be called from C
- Assembly representation: Basic macros in C and in-line functions in C++
that represent TriCore instructions
- Data representation: New data types as C language extensions and
C++ classes, as well as definitions for coercion, reading, writing,
and arithmetic operations
(coercion occurs when data types are mixed in arithmetic computations).
The new data types include
  - a type for 16-bit numbers in [-1,1)
  - a type for 32-bit numbers in [-1,1)
  - a type for 64-bit accumulation in [2^-17, 2^17], which has 45 bits in the
    mantissa and provides 17 guard bits
  - a type to represent 4 bytes packed into 32 bits
  - a type to represent 2 16-bit shorts packed into 32 bits
  - a type for circular buffers, which are common in filters
- Compiler: Determines when to use MAC, zero-overhead loop, and other
efficient instructions based on the expression
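
The kind of expression a compiler would need to recognize as a MAC loop, and the role of guard bits in a wide accumulator, can be sketched in standard C as follows. This is an illustration only: it uses `int64_t` as a stand-in accumulator and does not reproduce the TriCore data types or their exact bit layout. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply-accumulate loop over 16-bit fractional data. Each product
   has 30 fractional bits. The 64-bit accumulator provides guard bits,
   so intermediate sums may exceed the range [-1,1) without overflow. */
int64_t mac_q15(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Convert the accumulator back to 16-bit fractional data, truncating
   toward zero and saturating at the limits of the 16-bit range. */
int16_t acc_to_q15(int64_t acc)
{
    int64_t scaled = acc / 32768;
    if (scaled > INT16_MAX) return INT16_MAX;
    if (scaled < INT16_MIN) return INT16_MIN;
    return (int16_t)scaled;
}
```

Saturation is applied once, at the end, rather than after every addition. This is the behavior that an extended-precision accumulator makes possible.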
In order to generate efficient compiled C/C++ code for the TriCore
processor, the custom non-ANSI-standard Siemens C/C++ compiler must
be used. The drawback is that developing code for the TriCore
processor in this manner will prevent the code from being portable
to other processors.
"Rapid Prototyping of Multimedia Chip Sets"
pp. 2175 - 2178
The authors construct a hardware/software codesign framework
for designing application-specific architectures for multimedia applications,
e.g. video codecs. The framework enables the interactive exploration of
hardware/software tradeoffs. It consists of the following design stages:
- Algorithm development: algorithms are block diagrams of
communicating macroblocks written in C
- Performance modeling: maps algorithms to candidate architectures
and models their performance at a high level by taking into
account processor utilization, memory size, bus activity, and
interrupt overhead
- Virtual prototyping: simulate applications represented by VHDL
and C code on the candidate architecture, and perform fixed-point
- Low-level implementation: validation is performed using
gate-level simulation and emulation.
The approach bridges the gap between algorithm development and low-level
implementation. Historically, algorithm developers and system designers
have used completely different toolsets, e.g. MATLAB for algorithm
development and Synopsys tools for system design.