Compiler Support for Digital Signal Processors and Multimedia Processors

Papers at the 1999 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing

Prof. Brian L. Evans, UT Austin

bevans@ece.utexas.edu

Introduction

Conventional digital signal processors (DSPs) have a dramatically different architecture from general-purpose processors, as summarized in Table 1 below. The differences arise primarily because the architectures of digital signal processors have been optimized for the low-latency, high-throughput data processing common in signal processing and digital communications systems. Digital signal processors are the enabling technology behind low-cost, high-volume consumer electronics such as audio CD players, sound cards, disk drives, voiceband modems, and cell phones.

Conventional Digital Signal Processor | Conventional General-Purpose Processor
Separate program and data memories | Common program and data memories
Separate program and data buses | One common bus for both instructions and data
Separate computational units (ALU, multiplier, shifter, accumulator) with a large amount of functional (operational) parallelism | No separate computational units
Single-cycle multiply-accumulate (MAC) instruction with an extended-precision accumulator | Multiply and add are separate instructions, and the multiply loses precision
Optimized for single-cycle instruction execution | Many multiple-cycle instructions
Multifunction instructions, supported by the parallel architecture, for real-time signal processing | Usually no multifunction instructions
Independent data address generators | No independent data address generators
Hardware support for modulo and bit-reversed addressing modes | Special addressing modes must be emulated in software

Table 1: Comparison of the architectures of conventional digital signal processors and conventional general-purpose processors.
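
To make the first few rows of Table 1 concrete, the C fragment below (an illustrative sketch, not taken from the proceedings) shows the inner loop of a finite impulse response filter. On a conventional DSP, the loop body maps to a single-cycle MAC instruction, with the circular delay line handled by hardware modulo addressing; on a general-purpose processor, the multiply, add, and pointer wrap are separate instructions.

    /* Illustrative sketch (not from the papers): FIR inner loop in C.
     * On a conventional DSP, the loop body compiles to one single-cycle
     * multiply-accumulate (MAC) with hardware modulo addressing for the
     * circular delay line; a general-purpose processor needs separate
     * multiply, add, and pointer-wrap instructions. */
    int fir(const short *coef, const short *delay, int ntaps, int start)
    {
        long acc = 0;                    /* extended-precision accumulator */
        int  idx = start;                /* newest sample in circular delay line */

        for (int k = 0; k < ntaps; k++) {
            acc += (long)coef[k] * delay[idx];   /* MAC */
            idx = (idx + 1) % ntaps;             /* modulo addressing, in software */
        }
        return (int)(acc >> 15);         /* rescale Q15 x Q15 product back to Q15 */
    }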

C and C++ compilers have been developed and optimized for the architectures of general-purpose processors, and are not as efficient at generating code for digital signal processors. Table 2 compares the code generated by the May 1996 version of the Motorola MC56000 C compiler with hand-written assembly code for three kernels (fundamental signal processing operations). For the three kernels, the compiled code has an average overhead of 27% in data memory, 41% in program memory, and 47% in execution time. Kernel speed is often the bottleneck in signal processing and multimedia applications, so kernels are generally programmed directly in assembly language.

Implementation | Program Memory | X Data Memory | Y Data Memory | Execution Time
IIR Filter, Hand Coded | 43 | 7 | 8 | 517
IIR Filter, Compiled Code | 59 | 11 | 11 | 1127
256-point FFT, Hand Coded | 130 | 67 | 60 | 29,172
256-point FFT, Compiled Code | 178 | 69 | 77 | 30,927
Goertzel DFT, Hand Coded | 67 | 41 | 3 | 17,341
Goertzel DFT, Compiled Code | 111 | 39 | 5 | 20,123

Table 2: Comparison of code generated by C compilers vs. hand-written assembly code for the Motorola 56000 digital signal processor. [Unpublished Reference]
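
To show what such a kernel looks like at the source level, here is a minimal C sketch of the Goertzel algorithm for a single DFT bin (my illustration, not the benchmarked code, which is hand-written fixed-point assembly). The entire inner loop is a second-order recurrence with one multiply, one add, and one subtract per sample, exactly the kind of tight loop that rewards hand scheduling on a DSP.

    #include <math.h>

    /* Illustrative sketch (not the benchmarked code): Goertzel algorithm for
     * one DFT bin k of an N-point real input sequence x[].  The inner loop is
     * the recurrence s[n] = x[n] + 2*cos(2*pi*k/N)*s[n-1] - s[n-2], i.e. one
     * multiply-accumulate per sample. */
    double goertzel_power(const double *x, int N, int k)
    {
        const double pi = 3.14159265358979323846;
        double c  = 2.0 * cos(2.0 * pi * (double)k / (double)N);
        double s1 = 0.0, s2 = 0.0;       /* state: s[n-1], s[n-2] */

        for (int n = 0; n < N; n++) {
            double s0 = x[n] + c * s1 - s2;
            s2 = s1;
            s1 = s0;
        }
        /* squared magnitude |X[k]|^2 from the final two state variables */
        return s1 * s1 + s2 * s2 - c * s1 * s2;
    }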

The modern trend in both digital signal processors and general-purpose processors is to evolve hybrid architectural features. Newer digital signal processors such as the Analog Devices TigerSHARC, and general-purpose processors such as the Pentium, UltraSPARC, and PowerPC, have added single-instruction multiple-data (SIMD) instructions to their instruction sets to accelerate signal processing and multimedia applications. At present, only the Metrowerks Code Warrior C compiler (Version 5 Pro) generates SIMD instructions; in general, a programmer has to mix SIMD assembly language instructions with C code. In contrast, the Texas Instruments TMS320C6x is a very-long-instruction-word (VLIW) architecture with many digital signal processor features. The architecture was developed simultaneously with its assembler and C compiler. The simultaneous development of a processor architecture and a C compiler for it is a trend in multimedia processors.
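
As a rough illustration of what a SIMD instruction buys (a sketch of the semantics, not tied to any particular instruction set above), the C function below computes what a single two-way packed 16-bit add produces in one instruction; written in portable C, the same operation requires explicit unpack, add, and repack steps.

    #include <stdint.h>

    /* Illustrative sketch: what one 2-way 16-bit SIMD add computes.
     * Two 16-bit samples are packed into each 32-bit word; a SIMD add
     * sums the corresponding lanes in a single instruction, with no
     * carry propagating between lanes.  In portable C the same result
     * takes explicit unpack, add, and repack steps. */
    static uint32_t packed_add16(uint32_t a, uint32_t b)
    {
        uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));  /* low lane  */
        uint16_t hi = (uint16_t)((a >> 16)     + (b >> 16));      /* high lane */
        return ((uint32_t)hi << 16) | lo;
    }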

An overview of digital signal processors and native signal processing is available at

http://www.ece.utexas.edu/~bevans/talks/hp-dsp-processors/index.html

Preprocessing for C Compilers

"A Floating-Point to Integer C Converter with Shift Reduction for Fixed-Point Digital Signal Processors" pp. 2163 - 2166

Processors and processor cores used in high-volume consumer electronics are generally fixed-point. The fixed-point processors may process speech or images (8 bits/sample) or audio (16 bits/sample), or may be used in a voiceband modem (8 bits/sample) or digital subscriber line modem (20 bits/sample). A key problem is that many algorithms are initially developed using floating-point arithmetic, and the tedious conversion of the algorithms to fixed-point arithmetic is generally performed manually.

This paper presents a general approach to convert the floating-point variables and arithmetic in a C program to integer variables and arithmetic, while minimizing the loss of precision and the implementation cost on a target digital signal processor. The implementation cost includes the cost of implementing shifts on the target processor. The output of the conversion is another C program.

The binary point for each fixed-point variable may vary. Each fixed-point variable has a sign bit, a number of integer bits, and a number of fractional bits. The bit widths are determined by the following steps:

  1. The program is first simulated to determine the dynamic range of each floating-point variable. Each variable is initially assigned a number of integer and fractional bits for the target word size (16 for most digital signal processors).
  2. Simulated annealing is used to choose the integer wordlengths for each variable to minimize the number of scaling (shift) operations.
  3. The floating-point variables and constants are replaced by the corresponding integer types, and appropriate scaling code is inserted.
The conversion uses the Stanford University Intermediate Format (SUIF) compiler system to parse, analyze, convert, and generate source code. The authors demonstrate the conversion on the following systems:
  1. Fourth-order infinite impulse response filter, with 7 floating-point variables - speedup of 406 to 1
  2. Qualcomm Code-Excited Linear Predictive (QCELP) speech codec, with 381 floating-point variables - speedup of 24.6 to 1 on a Texas Instruments TMS320C62x digital signal processor
The TMS320C6x family requires 4 instruction cycles for a floating-point multiplication and 1 instruction cycle for a fixed-point multiplication.
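
To illustrate the kind of transformation the converter automates (a hand-written sketch, not the tool's actual output), the fragment below shows a first-order recursive filter converted from floating-point to 16-bit Q15 fixed-point arithmetic, with explicit shifts carrying the scaling that floating-point handles implicitly.

    #include <stdint.h>

    /* Hand-written sketch of the transformation the converter automates
     * (not the tool's actual output).  Original floating-point code:
     *
     *     float y = 0.0f;
     *     float a = 0.9375f;
     *     float filter(float x) { y = a * y + x; return y; }
     *
     * Converted code: the coefficient and state are stored as 16-bit
     * integers with an implied binary point (Q15 format), the product is
     * kept in a 32-bit accumulator, and a shift restores the scaling. */
    static int16_t y_q15 = 0;
    static const int16_t a_q15 = 30720;        /* 0.9375 * 2^15 */

    int16_t filter_fixed(int16_t x_q15)
    {
        int32_t acc = (int32_t)a_q15 * y_q15;  /* Q15 * Q15 -> Q30 product */
        acc >>= 15;                            /* scale back to Q15 */
        acc += x_q15;                          /* add input, still Q15 */
        y_q15 = (int16_t)acc;                  /* store state (no saturation here) */
        return y_q15;
    }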

"Source-Level Loop Optimization for DSP Code Generation" pp. 2155 - 2158

This paper applies source-level loop optimizations, of the kind commonly used in C compilers for desktop workstations, to the compilation of C code for digital signal processors. The paper states that the "overhead of compiled code, in terms of clock cycles and code space, falls typically in the range of 2 to 8", i.e., a factor of 2 to 8. In the introduction, we showed that an older 1996 compiler for the oldest Motorola DSP, the 56000, had an overhead of about 40%. Nonetheless, the authors correctly point out that conventional digital signal processors have very few data registers, which makes software pipelining extremely difficult to implement. For example, the Motorola MC56300 has four data registers. The authors apply their loop unrolling techniques to reduce the execution time of the innermost loops of seven signal and image processing kernels (dot product, vector multiply, finite impulse response filter, lattice synthesis, infinite impulse response filter, vector codebook search, and JPEG compression). On average, their technique reduces execution time by 17%. The performance increase is limited by the number of available data registers. Note that the modern Texas Instruments TMS320C62x digital signal processor has 32 registers and 8 parallel functional units, so it is a prime candidate for source-level loop optimization techniques; these techniques are built into the TMS320C62x compiler.
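
As a rough sketch of the kind of source-level transformation involved (mine, not the authors' code), unrolling a dot-product loop by two exposes independent multiply-accumulates that the compiler can schedule onto parallel functional units, at the cost of holding more partial sums in registers:

    /* Illustrative sketch (not the authors' code): unrolling a dot product
     * by a factor of two.  The two partial sums are independent, so a
     * compiler can schedule their multiply-accumulates in parallel --
     * provided enough data registers are available to hold them. */
    long dot_unrolled2(const short *a, const short *b, int n)
    {
        long acc0 = 0, acc1 = 0;
        int i;

        for (i = 0; i + 1 < n; i += 2) {        /* main unrolled loop */
            acc0 += (long)a[i]     * b[i];
            acc1 += (long)a[i + 1] * b[i + 1];
        }
        if (i < n)                              /* cleanup iteration for odd n */
            acc0 += (long)a[i] * b[i];

        return acc0 + acc1;
    }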

Architectural Support for C Compilers

"C/C++ Compiler Support for Siemens TriCode DSP Instruction Set" pp. 2147 - 2150

This paper is part of the trend toward codeveloping a media processor architecture and its C compiler, in this case the Siemens TriCore, which has a hybrid DSP/microcontroller/SIMD architecture. The authors add DSP data types and functions to the C/C++ languages to increase the coverage of the TriCore instruction set in compiled code from 50% to 80%. This involves modifying the compiler to take advantage of the new features.

Digital signal processors operate on two types of fixed-point data: fractional data and integer data. With fractional data, the binary point is left-justified. With integer data, the binary point is right-justified. Most DSPs have a barrel shifter so that the binary point can be placed anywhere in the bit field without a performance penalty; however, the programmer must keep track of where the binary point is. The problem with the C and C++ programming languages is that their standards do not define fractional data types.
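
To see why a fractional type matters, the following C sketch (my illustration) spells out the fractional multiply that a DSP performs in a single instruction: two Q15 operands yield a Q30 product, and a 15-bit right shift restores the left-justified binary point. In standard C, the programmer must write the scaling by hand.

    #include <stdint.h>

    /* Illustrative sketch: fractional (Q15) multiply written in standard C.
     * Operands are 16-bit values with the binary point left-justified, i.e.
     * they represent values in [-1, 1).  The 32-bit product is in Q30; a
     * 15-bit right shift restores a Q15 result.  A DSP with a fractional
     * multiply instruction does all of this, plus saturation, in one cycle;
     * standard C has no fractional type, so the scaling is written by hand
     * (and the corner case -1 * -1 overflows here without saturation). */
    static int16_t mul_q15(int16_t a, int16_t b)
    {
        int32_t prod = (int32_t)a * (int32_t)b;   /* Q15 * Q15 -> Q30 */
        return (int16_t)(prod >> 15);             /* back to Q15 (truncating) */
    }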

The authors take a conventional four-tiered approach to extending C and C++ to provide better support for signal processing and multimedia algorithms on the TriCore processor:

  1. Libraries: Assembly-coded functions in a library that may be called from C
  2. Assembly representation: Basic macros in C and in-line functions in C++ that represent TriCore instructions
  3. Data representation: New data types as C language extensions and C++ classes, as well as definitions for coercion, reading, writing, and arithmetic operations (coercion occurs when data types are mixed in arithmetic computations)
  4. Compiler: Determines when to use MAC, zero-overhead loop, and other efficient instructions based on the expression

In order to generate efficient compiled C/C++ code for the TriCore processor, the custom, non-ANSI-standard Siemens C/C++ compiler must be used. The drawback is that code developed for the TriCore in this manner will not be portable to other processors.
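
The fragment below is a hypothetical sketch of how the first and fourth tiers differ at the source level; the names dsp_fir_q15 and fir_plain_c are my own placeholders, not the actual Siemens library functions or language extensions.

    /* Hypothetical sketch of two of the tiers above; dsp_fir_q15() and
     * fir_plain_c() are placeholder names, not the actual Siemens library
     * or language extensions. */
    #include <stdint.h>

    /* Tier 1 (library): call an assembly-coded kernel from C. */
    extern int16_t dsp_fir_q15(const int16_t *coef, const int16_t *x, int ntaps);

    /* Tier 4 (compiler): write plain C and rely on the compiler to recognize
     * the multiply-accumulate pattern and map it to a MAC instruction inside
     * a zero-overhead loop. */
    int16_t fir_plain_c(const int16_t *coef, const int16_t *x, int ntaps)
    {
        int32_t acc = 0;
        for (int k = 0; k < ntaps; k++)
            acc += (int32_t)coef[k] * x[k];    /* candidate for a MAC instruction */
        return (int16_t)(acc >> 15);
    }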

"Rapid Prototyping of Multimedia Chip Sets" pp. 2175 - 2178

The authors construct a hardware/software codesign framework for designing application-specific architectures for multimedia applications, e.g., video codecs. The framework enables interactive exploration of hardware/software tradeoffs. It consists of the following design stages:
  1. Algorithm development: algorithms are block diagrams of communicating macroblocks written in C
  2. Performance modeling: maps algorithms to candidate architectures and models their performance at a high-level by taking into account processor utilization, memory size, bus activity, and interrupt overhead
  3. Virtual prototyping: simulate applications represented by VHDL and C code on the candidate architecture, and perform fixed-point optimization
  4. Low-level implementation: validation is performed using gate-level simulation and emulation.
The approach bridges the gap between algorithm development and low-level implementation. Historically, algorithm developers and system designers have used completely different toolsets, e.g. MATLAB for algorithm development and Synopsys tools for system design.