Advances in Native Signal Processing

Papers at the 1999 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing

Prof. Brian L. Evans, UT Austin

bevans@ece.utexas.edu

Introduction

In 1995, the Sun UltraSPARC processor became the first to extend an existing instruction set architecture to support signal processing primitives, an approach known as native signal processing. Sun's extensions are called the Visual Instruction Set (VIS). In 1996, Intel added native signal processing to its Pentium processors with 57 MMX instructions. The intention in both cases was to accelerate multimedia applications. For MMX and VIS, speedups for digital signal processing applications are modest (1.5:1 to 2:1), but speedups for graphics operations are high (4:1 to 6:1).

In VIS and MMX, native signal processing works only with integer data. The integers are packed into 64-bit words, and the same operation is performed on all of the packed elements simultaneously. This is known as single-instruction multiple-data (SIMD) processing. Each 64-bit word can contain 8-bit, 16-bit, or 32-bit elements. In general, compilers do not generate native signal processing code. One exception is Metrowerks CodeWarrior Pro 5, which generates Pentium MMX and AMD 3DNow! code. In order to obtain high performance, however, a programmer must inline assembly code (either by hand or by using intrinsics) or write the entire application in assembly language. A less efficient approach is to call libraries of native signal processing functions, e.g. those provided by the Intel signal processing and image processing libraries.

A good introduction to native signal processing is available at

http://www.ece.utexas.edu/~ravib/nsp/

A good overview of digital signal processors and native signal processing is available at

http://www.ece.utexas.edu/~bevans/talks/hp-dsp-processors/index.html

Advances in Native Signal Processing

"AMD 3DNow! Vectorization for Signal Processing Applications" pp. 2127 - 2130

3DNow! complements fixed-point MMX technology with high-speed floating-point operations. It provides concurrent execution of two single-precision (32-bit) floating-point data operations in a single instruction cycle (one in the U pipeline, and one in the L pipeline). 3DNow! adds 21 new instructions: multiplication, addition, subtraction, reduction, reciprocal, reciprocal square root, maximum, minimum, and comparison for vector (SIMD) operations. Reduction accumulates the entries in a vector into a single number; it is used at the end of a vector dot product to accumulate the element-by-element product of two vectors. Reciprocal and reciprocal square root are used in computer graphics. The eight 64-bit 3DNow! floating-point SIMD registers, like the eight 64-bit MMX registers, are aliased to the eight Pentium floating-point registers.

Speedup for 1-D FIR, IIR, and FFT routines increases with the size of the input data:

                                 1-D FIR    1-D IIR    1-D FFT    2-D Wavelet Transform
  upper bound                    2          2          2          2
  vectorized arithmetic +
    vectorized memory moves      1.5        1.5 - 1.6  1.4        1.5
  vectorized arithmetic          1.3 - 1.5  1.2 - 1.3  1.2 - 1.3  1.3

Fig. 1: Speedup for floating-point kernels using 3DNow! instructions. The 2-D Wavelet Transform is a 6-level wavelet transform applied to an entire 512 x 512 image.

"Some Fast Speech Processing Algorithms Using AltiVec Technology" pp. 2135-2139

AltiVec is a SIMD extension to the PowerPC. AltiVec adds 32 128-bit SIMD registers, each of which can be divided into 8-bit, 16-bit, or 32-bit integers or 32-bit IEEE single-precision floating-point numbers. The SIMD registers are separate from the integer and floating-point registers on the PowerPC. AltiVec adds permutation operations (pack data, unpack data, and table lookup) and arithmetic operations (multiply-accumulate, multiply-sum, and sum-across). The sum-across plays the same role as reduction in the AMD 3DNow! extensions. AltiVec can issue one arithmetic and one permutation instruction simultaneously without placing any restrictions on scalar PowerPC instructions. On each instruction cycle, an AltiVec instruction can compute 16 multiplications and 16 additions on 8-bit data that has been packed into a 128-bit SIMD word.

The paper evaluates the speedup of three speech processing kernels when using AltiVec instructions: autocorrelation, linear prediction, and cross-correlation. Autocorrelation and linear prediction are commonly used in speech compression standards. Cross-correlation is used in speech coding for wireless GSM systems.

The autocorrelation function takes unsigned 8-bit speech samples as input and computes enough autocorrelation coefficients to build an autocorrelation matrix, which serves as input to a kernel that computes the linear prediction coefficients. The speedup in the autocorrelation increases with the filter order until it reaches a maximum of 30.74:1 for 16 autocorrelation terms for a 256-sample sequence.

Two different algorithms were used to compute linear prediction coefficients from the autocorrelation sequence: (1) Levinson-Durbin recursion and (2) Schur recursion. For N samples of an autocorrelation sequence, both recursions take quadratic time. On a sequential processor, Levinson-Durbin recursion is about 25% faster than Schur recursion. On a SIMD architecture, however, the Schur recursion is about 40% faster than the most efficient Levinson-Durbin recursion.

Cross-correlation is used in wireless GSM systems for long-term prediction. In GSM, a cross-correlation of a sequence of 40 samples with a sequence of 120 samples is computed. Using the AltiVec extensions, the speedup was 12.5:1.

Multimedia Processors

"Radix-4 FFT Implementation Using SIMD Multimedia Instructions" pp. 2131 - 2135

The NEC V830R has a 32-bit integer pipeline and a 64-bit multimedia coprocessor. The architecture can issue an instruction for each pipeline on each instruction cycle. The architecture has 32 64-bit multimedia registers. It can perform 4 16-bit multiply-accumulate operations simultaneously. It has an extended precision 32-bit accumulator with 1 guard bit and saturation arithmetic. The extra guard bit means that accumulation effectively has 33 bits of precision. Saturation arithmetic means that when an integer exceeds its maximum, it is assigned the maximum value instead of wrapping around as it would in two's complement arithmetic.

The authors implement a radix-4 complex FFT. FFTs are used in audio coding, as well as Asymmetric Digital Subscriber Line modems and Third-Generation Wireless Systems. A radix-4 FFT is iteratively decomposed into 4-point discrete Fourier transforms, and requires log_4 N stages to compute the FFT of N points. A radix-4 complex FFT has simpler address calculations and fewer arithmetic operations than a radix-2 complex FFT. On a Texas Instruments TMS320C62x digital signal processor, for example, the radix-4 FFT requires 35% fewer instructions than a radix-2 FFT for a 256-point FFT. The authors report that their radix-4 FFT runs 33% slower on the V830R than on the TMS320C62x.

"A New Parallel DSP with Short-Vector Memory Architecture" pp. 2139 - 2142

The paper describes the new TigerSHARC from Analog Devices Inc., which combines SIMD processing with its traditional super-Harvard architecture (SHARC) floating-point digital signal processor (DSP) line. At a clock rate of 250 MHz, its peak performance is 1.5 GFLOPS for 32-bit floating-point arithmetic and 6 BOPS for 16-bit fixed-point arithmetic. The processor has parallel multiply-accumulate units which can deliver either 2 32-bit floating-point or 8 16-bit arithmetic operations per instruction cycle. The processor has 128 32-bit registers, which is a significant departure from the usual 8 or 16 registers in floating-point DSPs. Two consecutive 32-bit registers may be treated as a 64-bit SIMD register. The TigerSHARC, like the Texas Instruments TMS320C62x processor, uses a Very Long Instruction Word (VLIW) architecture.