Advances in Native Signal Processing
Papers at the 1999 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
Prof. Brian L. Evans,
UT Austin
bevans@ece.utexas.edu
Introduction
In 1995, the Sun UltraSPARC processor was the first to extend an existing
instruction set architecture to support signal processing primitives,
a.k.a. native signal processing. Sun's extensions are called the Visual
Instruction Set (VIS). In 1996, Intel added native signal processing in
its Pentium processors with 57 MMX instructions. The intention in both
cases was to accelerate multimedia applications. For MMX and VIS, speedups
for digital signal processing applications are modest (1.5:1 to 2:1), but
speedups for graphics operations are high (4:1 to 6:1).
In VIS and MMX, native signal processing works only with integer data.
The integer data is packed into 64-bit words, and the same operation is
performed on all of the packed elements simultaneously. This is known as
single-instruction multiple-data (SIMD) processing. Each 64-bit word can
hold 8-bit, 16-bit, or 32-bit elements. In general, compilers do not
generate native signal processing code. One exception is
Metrowerks CodeWarrior Pro 5,
which generates Pentium MMX and AMD 3DNow! code. In order to obtain high
performance, however, a programmer must inline assembly code (either by hand
or by using intrinsics) or write the entire application in assembly language.
A less efficient approach is to call libraries of native signal processing
functions, e.g. as provided by the Intel signal processing and image
processing libraries.
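As a rough illustration of the packed-integer programming model (this sketch
is mine, not code from any of the papers), the fragment below uses the MMX C
intrinsics from <mmintrin.h> to add two arrays of 16-bit samples four
elements at a time; the function name and data layout are hypothetical.

    #include <mmintrin.h>

    /* Add two arrays of 16-bit samples four at a time with MMX intrinsics.
       Each __m64 value packs four 16-bit integers into one 64-bit word, and
       _mm_add_pi16 performs all four additions with a single instruction. */
    void add_packed16(const short *a, const short *b, short *c, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m64 va = *(const __m64 *)(a + i);
            __m64 vb = *(const __m64 *)(b + i);
            *(__m64 *)(c + i) = _mm_add_pi16(va, vb);
        }
        for (; i < n; i++)          /* scalar cleanup for leftover samples */
            c[i] = a[i] + b[i];
        _mm_empty();                /* clear MMX state before x87 code runs */
    }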
A good introduction to native signal processing is available at
http://www.ece.utexas.edu/~ravib/nsp/
A good overview of digital signal processors and native signal processing
is available at
http://www.ece.utexas.edu/~bevans/talks/hp-dsp-processors/index.html
Advances in Native Signal Processing
"AMD 3DNow! Vectorization for Signal Processing Applications"
pp. 2127-2130
3DNow! complements fixed-point MMX technology with high-speed
floating-point operations. It executes two single-precision (32-bit)
floating-point operations concurrently in a single instruction cycle
(one in the U pipeline and one in the L pipeline).
3DNow! adds 21 new instructions for vector (SIMD) multiplication, addition,
subtraction, reduction, reciprocal, reciprocal square root, maximum,
minimum, and comparison. Reduction accumulates the
entries in a vector into a single number; for example, it is used at the
end of a vector dot product to sum the element-by-element products of two
vectors. Reciprocal and reciprocal square root are used in computer
graphics. The eight 64-bit 3DNow! floating-point SIMD registers and the
eight 64-bit MMX registers are both aliased onto the eight x87
floating-point registers.
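As a sketch of how the multiply, add, and reduction instructions fit
together (my illustration, not the paper's code), the dot product below is
written with the 3DNow! C intrinsics as exposed by GCC's legacy
<mm3dnow.h>; the function name and the even-length assumption are mine.

    #include <mm3dnow.h>

    /* Two-wide single-precision dot product: pfmul performs two multiplies,
       pfadd accumulates two partial sums, and pfacc is the final reduction
       (horizontal add) of the two lanes. Assumes n is even. */
    float dot2(const float *x, const float *y, int n)
    {
        union { __m64 v; float f[2]; } acc, vx, vy;
        float result;
        int i;
        acc.f[0] = acc.f[1] = 0.0f;
        for (i = 0; i + 2 <= n; i += 2) {
            vx.f[0] = x[i];  vx.f[1] = x[i + 1];
            vy.f[0] = y[i];  vy.f[1] = y[i + 1];
            acc.v = _m_pfadd(acc.v, _m_pfmul(vx.v, vy.v));
        }
        acc.v = _m_pfacc(acc.v, acc.v);   /* reduction: lane 0 + lane 1 */
        result = acc.f[0];
        _m_femms();                       /* fast clear of MMX/3DNow! state */
        return result;
    }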
Speedup for the 1-D FIR, IIR, and FFT routines increases with the size of
the input data:

                              1-D FIR    1-D IIR    1-D FFT    2-D Wavelet Transform
upper bound                   2          2          2          2
vectorized arithmetic +
  vectorized memory moves     1.5        1.5-1.6    1.4        1.5
vectorized arithmetic         1.3-1.5    1.2-1.3    1.2-1.3    1.3

Fig. 1: Speedup for floating-point kernels using 3DNow! instructions.
The 2-D Wavelet Transform is a 6-level wavelet transform applied
to an entire 512 x 512 image.
"Some Fast Speech Processing Algorithms Using AltiVec Technology"
pp. 2135-2139
AltiVec is a SIMD extension to the PowerPC architecture. AltiVec adds 32
128-bit SIMD registers, each of which can be partitioned into 8-bit, 16-bit,
or 32-bit integers or 32-bit IEEE single-precision floating-point numbers.
The SIMD registers are separate from the integer and floating-point registers
on the PowerPC. AltiVec adds permutation operations (pack data, unpack
data, and table lookup) and arithmetic operations (multiply-accumulate,
multiply-sum, and sum-across). Sum-across plays the same role as
reduction in the AMD 3DNow! extensions. An AltiVec processor can
simultaneously issue one vector arithmetic and one vector permutation
instruction without placing any restrictions on scalar PowerPC instructions.
On each instruction cycle, a single AltiVec multiply-accumulate instruction
can compute 16 multiplications and 16 additions on 8-bit data packed into
a 128-bit SIMD word.
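To make the multiply-sum and sum-across operations concrete, here is a sketch
of a 16-bit dot product written with the standard <altivec.h> C interface; it
is my illustration of the programming model, not code from the paper, and it
assumes 16-byte-aligned inputs whose length is a multiple of 8.

    #include <altivec.h>

    /* 16-bit dot product: vec_msum does 8 multiplies per instruction and
       accumulates adjacent products into 4 partial sums; vec_sums is the
       sum-across (reduction), leaving the total in element 3. */
    int dot16(const signed short *x, const signed short *y, int n)
    {
        vector signed int acc = vec_splat_s32(0);
        union { vector signed int v; int s[4]; } out;
        int i;
        for (i = 0; i < n; i += 8) {
            vector signed short vx = vec_ld(0, x + i);
            vector signed short vy = vec_ld(0, y + i);
            acc = vec_msum(vx, vy, acc);
        }
        out.v = vec_sums(acc, vec_splat_s32(0));
        return out.s[3];     /* big-endian element ordering assumed */
    }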
The paper evaluates the speedup of three speech processing kernels
when using AltiVec instructions: autocorrelation, linear prediction,
and cross-correlation. Autocorrelation and linear prediction are
commonly used in speech compression standards. Cross-correlation
is used in speech coding for wireless GSM systems.
The autocorrelation function takes unsigned 8-bit speech samples as
input and computes enough autocorrelation coefficients to build the
autocorrelation matrix from which a second kernel computes the linear
prediction coefficients. The speedup in the autocorrelation kernel
increases with the filter order, reaching a maximum of 30.74:1 for
16 autocorrelation terms on a 256-sample sequence.
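For reference, the quantity being vectorized is just the standard
autocorrelation sum; the plain-C version below (my notation, not the paper's)
shows what the AltiVec kernel computes with packed arithmetic.

    /* Scalar autocorrelation: for a frame of n unsigned 8-bit samples,
       compute r[k] = sum over i of x[i] * x[i+k] for k = 0 .. order.
       The paper's AltiVec version computes the same sums with SIMD
       multiply-sum instructions; no normalization is applied here. */
    void autocorr(const unsigned char *x, int n, double *r, int order)
    {
        int k, i;
        for (k = 0; k <= order; k++) {
            double sum = 0.0;
            for (i = 0; i < n - k; i++)
                sum += (double)x[i] * (double)x[i + k];
            r[k] = sum;
        }
    }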
Two different algorithms were used to compute linear prediction
coefficients from the autocorrelation sequence: (1) Levinson-Durbin
recursion and (2) Schur recursion. For N samples of an autocorrelation
sequence, both recursions take quadratic time. On a sequential
processor, Levinson-Durbin recursion is about 25% faster than Schur
recursion. On a SIMD architecture, however, the Schur recursion is
about 40% faster than the most efficient Levinson-Durbin recursion.
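As a reminder of what both recursions compute, here is a textbook scalar
Levinson-Durbin recursion in C (my sketch, not the paper's SIMD
implementation); the Schur recursion derives the same reflection coefficients
from the autocorrelation sequence, but with inner loops that vectorize more
naturally.

    /* Levinson-Durbin recursion: given autocorrelation values r[0..p],
       solve for linear prediction coefficients a[1..p]; err tracks the
       prediction error energy and k is the reflection coefficient. */
    void levinson_durbin(const double *r, double *a, int p)
    {
        double err = r[0];
        int i, j;
        for (i = 1; i <= p; i++) {
            double acc = r[i], k, tmp;
            for (j = 1; j < i; j++)
                acc -= a[j] * r[i - j];
            k = acc / err;
            a[i] = k;
            for (j = 1; j <= i / 2; j++) {   /* symmetric in-place update */
                tmp = a[j] - k * a[i - j];
                a[i - j] -= k * a[j];
                a[j] = tmp;
            }
            err *= 1.0 - k * k;
        }
    }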
Cross-correlation is used in wireless GSM systems for long-term
prediction. In GSM, a cross-correlation of a sequence of 40 samples
with a sequence of 120 samples is computed. Using the AltiVec extensions,
the speedup was 12.5:1.
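The shape of that computation, in scalar C, looks roughly like the lag search
below. This is my simplified sketch of a GSM-style long-term-prediction
search, not the paper's code; the lag range of 40 to 120 follows my reading
of the full-rate GSM standard, and the names and indexing are mine.

    /* Simplified long-term-prediction lag search: d[0..39] is the current
       subframe and dp[-120..-1] holds the previous reconstructed samples.
       For each lag from 40 to 120, compute the cross-correlation and keep
       the lag that maximizes it. */
    int ltp_lag(const short *d, const short *dp)
    {
        int lag, k, best_lag = 40;
        long best = 0, corr;
        for (lag = 40; lag <= 120; lag++) {
            corr = 0;
            for (k = 0; k < 40; k++)
                corr += (long)d[k] * (long)dp[k - lag];
            if (corr > best) {
                best = corr;
                best_lag = lag;
            }
        }
        return best_lag;
    }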
Multimedia Processors
"Radix-4 FFT Implementation Using SIMD Multimedia Instructions"
pp. 2131-2135
The NEC V830R
has a 32-bit integer pipeline and a 64-bit multimedia coprocessor.
The architecture can issue an instruction for each pipeline on each
instruction cycle. The architecture has 32 64-bit multimedia registers.
It can perform 4 16-bit multiply-accumulate operations simultaneously.
It has an extended precision 32-bit accumulator with 1 guard bit and
saturation arithmetic. The extra guard bit means that accumulation
effectively has 33 bits of precision. Saturation arithmetic means that
when a result exceeds the maximum (or minimum) representable value, it is
clamped to that extreme instead of wrapping around as it would in ordinary
two's complement arithmetic.
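A plain-C model of that behavior (my illustration of saturation and guard
bits in general, not NEC's instruction set) is sketched below.

    #include <limits.h>

    /* Model of a saturating 32-bit accumulation: the sum is formed in a
       wider intermediate (playing the role of the guard bit) and clamped
       to the 32-bit range on overflow instead of wrapping around. */
    long long sat_acc32(long long acc, int x)
    {
        long long sum = acc + x;
        if (sum > INT_MAX) return INT_MAX;   /* clamp positive overflow */
        if (sum < INT_MIN) return INT_MIN;   /* clamp negative overflow */
        return sum;                          /* in range: ordinary add  */
    }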
The authors implement a radix-4 complex FFT. FFTs are used in audio
coding, as well as in Asymmetric Digital Subscriber Line (ADSL) modems
and third-generation wireless systems. A radix-4 FFT is iteratively
decomposed into 4-point discrete Fourier transforms and requires
log_4 N stages to compute the FFT of N points. A radix-4 complex FFT
has simpler address calculations and fewer arithmetic operations than
a radix-2 complex FFT. On a Texas Instruments TMS320C62x digital
signal processor, for example, a 256-point radix-4 FFT requires 35%
fewer instructions than a 256-point radix-2 FFT. The authors report
that the radix-4 FFT is 33% slower on the V830R than on the TMS320C62x.
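The core of each radix-4 stage is the 4-point DFT butterfly sketched below in
plain C (my illustration of the standard radix-4 structure, not the authors'
V830R code). Because the 4-point twiddle factors are only +1, -1, +j, and -j,
each butterfly needs only additions, subtractions, and real/imaginary swaps,
and the number of stages (and hence of twiddle-factor multiplications between
stages) drops from log_2 N to log_4 N.

    /* One radix-4 (4-point DFT) butterfly, decimation-in-time form.
       The outputs are Y[k] = sum over m of x[m] * exp(-j*2*pi*k*m/4),
       which reduces to adds, subtracts, and multiplications by +j or -j
       (a swap of real and imaginary parts with a sign change). */
    typedef struct { float re, im; } cplx;

    void radix4_butterfly(const cplx x[4], cplx y[4])
    {
        cplx a, b, c, d;
        a.re = x[0].re + x[2].re;  a.im = x[0].im + x[2].im;
        b.re = x[0].re - x[2].re;  b.im = x[0].im - x[2].im;
        c.re = x[1].re + x[3].re;  c.im = x[1].im + x[3].im;
        d.re = x[1].re - x[3].re;  d.im = x[1].im - x[3].im;

        y[0].re = a.re + c.re;     y[0].im = a.im + c.im;
        y[2].re = a.re - c.re;     y[2].im = a.im - c.im;
        y[1].re = b.re + d.im;     y[1].im = b.im - d.re;   /* b - j*d */
        y[3].re = b.re - d.im;     y[3].im = b.im + d.re;   /* b + j*d */
    }

A 256-point FFT, for example, needs log_4 256 = 4 such stages, with
twiddle-factor multiplications applied between stages.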
"A New Parallel DSP with Short-Vector Memory Architecture"
pp. 2139-2142
The paper describes the new TigerSHARC from Analog Devices Inc., which
combines SIMD processing with the company's traditional Super Harvard
Architecture (SHARC) floating-point digital signal processor (DSP) line.
At a clock rate of 250 MHz, its peak performance is 1.5 GFLOPS for 32-bit
floating-point arithmetic and 6 billion operations per second (BOPS) for
16-bit fixed-point arithmetic. The processor has parallel multiply-accumulate
units that can deliver either 2 32-bit floating-point or 8 16-bit arithmetic
operations per instruction cycle. The processor has 128 32-bit registers,
which is a significant departure from the usual 8 or 16 registers in
floating-point DSPs. Two consecutive 32-bit registers may be treated as a
single 64-bit SIMD register. The TigerSHARC, like the Texas Instruments
TMS320C62x processor, uses a Very Long Instruction Word (VLIW) architecture.