Memory and DSP Processors

A Brief History and Survey of On-Going Innovations

Purpose

In the research effort into the feasibility of intelligent DRAM, one question that must be answered is "What type of processor would be useful to embed into memory?" The possible solutions range from small microcontrollers to a CRAY-1. To better answer the question, an analysis of the history and current development of processors is needed. This paper addresses one subset of processors, the digital signal processor (DSP).

Overview of a DSP Processor

Digital signal processors operate on many samples of data per second, require a large memory bandwidth, and perform very intense computations. They can be programmable, as in the case of the Texas Instruments TMS320 series, or dedicated. A typical DSP architecture can be seen in Figure 1. The principal components of a DSP are a multiplier, adder, shifter, fast registers, and memory. In this paper, the relationship between a digital signal processor and its memory will be examined.

Figure 1: Typical DSP Architecture

[Mad95]

DSP Memory Architecture

Today's DSPs utilize a Harvard architecture, with modifications for particular applications. Using Madisetti's [Mad95] classification of SISC (Special Instruction Set Computer) memories, the memory architectures can be divided into the basic Harvard architecture (single data memory and single program memory) and five modifications.

Basic Harvard Architecture

The instruction and data memories are each provided with their own bus. The instruction fetch is pipelined with the data access for an operand from the previous cycle. The TMS320C10 uses this architecture.

Modification 1

The data and instructions are stored in one memory, so two memory cycles are required to access the operands. This problem is alleviated by having two parallel memories. The memory cycle time is half of the basic instruction cycle time, so three-operand instructions can still be executed in one instruction cycle, assuming one of the three operands is in a different memory bank from the instruction. In the DSP32C chip, the processor fetches two 32-bit operands, performs a calculation, and writes back the result in one instruction cycle.

Modification 2

Modification 2 assumes use of a multi-port memory, allowing data to be accessed several times per cycle. This is a solution that would be appropriate for on-chip memory. An example of a chip using this is the Fujitsu MB86232, which provides three-port access to memory. Three-operand instructions can be processed in one cycle. This might be an architecture to consider for IRAM.

Modification 3

This modification alleviates the data/instruction conflict seen in Modification 1: a cache is added to store frequently executed instructions, yielding a throughput of one instruction per cycle. This modification can be seen in the TMS320C25 in the form of a one-instruction cache for instructions that appear in a loop; signal processing routines are loop-intensive. Other chips have caches that hold 15-16 instructions.

Modification 4

Instead of the cache seen in Modification 3, two memories are used for data, and a third memory is added to store instructions. The TMS320C30 and C40 use RAMs for data and a ROM for the instructions.

Modification 5

This modification uses multiple memory banks (>2). Multiple operand instructions are executed concurrently with I/O access, but the programmer must optimize the data storage to benefit from this architecture.

Optimizing Off-Chip Memory Access

DMA

DMA access is utilized to optimize off-chip memory access. DMA stands for direct memory access: a transfer that, once initialized, requires no CPU intervention. The DMA process can perform memory transfers between internal and external memory. DMA works as follows: (1) The first address of the source block is loaded. (2) The first address of the destination is loaded. (3) The block size is loaded to initialize a counter. (4) The DMA begins the transfer of the entire block, using incremental addresses calculated from the initial source and destination addresses.
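
As a concrete illustration of the four steps above, the following C sketch programs a hypothetical memory-mapped DMA controller; the register names and addresses are invented for illustration and do not belong to any particular DSP.

    /* Hypothetical memory-mapped DMA registers; names and addresses are
       illustrative only. */
    #include <stdint.h>

    #define DMA_SRC   (*(volatile uint32_t *)0x00400000u) /* source block address */
    #define DMA_DST   (*(volatile uint32_t *)0x00400004u) /* destination address  */
    #define DMA_COUNT (*(volatile uint32_t *)0x00400008u) /* block size counter   */
    #define DMA_CTRL  (*(volatile uint32_t *)0x0040000Cu) /* start/status word    */

    /* Program a block transfer and return immediately; the controller
       increments both addresses itself, so the CPU is free to compute. */
    void dma_block_copy(uint32_t src, uint32_t dst, uint32_t words)
    {
        DMA_SRC   = src;    /* step 1: load first address of source block */
        DMA_DST   = dst;    /* step 2: load first address of destination  */
        DMA_COUNT = words;  /* step 3: initialize the block-size counter  */
        DMA_CTRL  = 1u;     /* step 4: start the block transfer           */
    }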

VRAM

As data throughput increases, the arbitrator (built into the ASIC) must prioritize accesses, and eventually this arbitration limits the data throughput. A video RAM is a solution for these data bottlenecks. For instance, a triple-port VRAM can enhance the data throughput: the host interface and the read/write channel can read and write data to the serial buffers at very high speeds without affecting the DSP/microcontroller's access to the data in the DRAM cells. When a serial buffer is full, the DSP/microcontroller can transfer a block of data to the DRAM array in a few cycles. The DSP/microcontroller can therefore execute its program from the VRAM without sacrificing performance.

Required Memory Capacity

Most fixed-point DSPs are aimed at embedded applications, which require only a small amount of memory. These processors therefore tend to have small-to-medium on-chip memories (between 256 and 12K words) and small address spaces.
Floating-point DSPs provide relatively little (or no) on-chip memory but feature large external address spaces. Additionally, these chips provide caches to allow efficient use of slower external memories.

[Mad95], [Bier95], [TI95]

DSP Operations

Pipelining

A DSP operation proceeds as follows: instruction fetch, instruction decode, operand read, execute. These steps can be pipelined using a reservation table to allocate resources wisely, and DSP operations are easily pipelined. Madisetti goes into great depth about pipelining the DSP process [Mad95]. Keshab Parhi and David Messerschmitt also discuss pipeline interleaving and parallelism in recursive filters, using techniques called scattered look-ahead and decomposition [Par89]. The computational latency associated with recursive operations had previously limited pipelining efforts, but Parhi and Messerschmitt developed a way to pipeline recursive loops while guaranteeing stability. They also developed an optimization for non-recursive operations that extends to time-varying recursive systems.
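
To make the scattered look-ahead idea concrete, here is a minimal C sketch of my own (an illustration of the transformation, not code from [Par89]): the first-order recursion y[n] = a*y[n-1] + x[n] is rewritten as y[n] = a^2*y[n-2] + a*x[n-1] + x[n], so each output depends only on the output two samples back, and two multiply-accumulates can be in flight at once.

    /* Direct form: each output depends on the immediately preceding one,
       so the loop-carried dependence blocks pipelining. */
    void iir_direct(const float *x, float *y, int n, float a)
    {
        float y1 = 0.0f;                       /* y[-1] */
        for (int i = 0; i < n; i++) {
            y1 = a * y1 + x[i];
            y[i] = y1;
        }
    }

    /* One level of scattered look-ahead: y[i] depends only on y[i-2],
       which doubles the slack available for pipelining the multiply-add. */
    void iir_lookahead(const float *x, float *y, int n, float a)
    {
        float y1 = 0.0f, y2 = 0.0f;            /* y[-1], y[-2] */
        float x1 = 0.0f;                       /* x[-1]        */
        float a2 = a * a;
        for (int i = 0; i < n; i++) {
            float yi = a2 * y2 + a * x1 + x[i];
            y2 = y1; y1 = yi; x1 = x[i];
            y[i] = yi;
        }
    }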

Addressing Requirements

There are many types of addressing used with memory, most of them common to all microprocessor architectures. These include register addressing, direct addressing, indirect addressing, immediate addressing, and parallel addressing. However, DSPs use two additional types of addressing that merit mention.

Circular Addressing

DSP operations are typically computations involving an infinite stream of real-time data. The data is accumulated into a buffer, and the oldest sample is overwritten by the newest sample. The block size for this circular buffer should be specified and reserved in physical memory. The memory essentially has a round-robin FIFO instantiated within it. Access is provided through a base address and a system of pointers.
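
A minimal C sketch of such a circular buffer follows; the block size BUF_LEN and the function names are illustrative, and dedicated DSP address generators perform the modulo arithmetic in hardware rather than in software as done here.

    #include <stdint.h>
    #define BUF_LEN 64u                  /* reserved block size (power of 2) */

    static int16_t buf[BUF_LEN];         /* block reserved in physical memory */
    static unsigned head = 0;            /* next write position               */

    /* The newest sample overwrites the oldest; the pointer wraps at the
       block boundary instead of marching through memory. */
    void put_sample(int16_t s)
    {
        buf[head] = s;
        head = (head + 1u) & (BUF_LEN - 1u);   /* modulo addressing */
    }

    /* Fetch the sample that arrived k samples ago (k < BUF_LEN). */
    int16_t get_delayed(unsigned k)
    {
        return buf[(head - 1u - k) & (BUF_LEN - 1u)];
    }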

Bit-Reversed Addressing

When performing a butterfly fast Fourier transform, the addresses of the outputs are bit-reversed with respect to the inputs. Some DSPs perform this bit-reversal as part of the addressing process.
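
For example, in an 8-point FFT, index 3 (011 in binary) maps to index 6 (110). A short C sketch of the reordering that such addressing hardware provides for free:

    /* Reverse the low `bits` bits of index i (bits = log2(N) for an
       N-point FFT); DSP address generators do this step in hardware. */
    unsigned bit_reverse(unsigned i, unsigned bits)
    {
        unsigned r = 0;
        while (bits-- > 0) {
            r = (r << 1) | (i & 1u);
            i >>= 1;
        }
        return r;
    }

    /* Permute an array of N = 2^bits samples into bit-reversed order. */
    void bit_reverse_permute(float *x, unsigned bits)
    {
        unsigned n = 1u << bits;
        for (unsigned i = 0; i < n; i++) {
            unsigned j = bit_reverse(i, bits);
            if (j > i) {                /* swap each pair exactly once */
                float t = x[i]; x[i] = x[j]; x[j] = t;
            }
        }
    }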

[Mad95], [Par89]

DSP Cores

One advent in DSP processor technology that seems to bode well for IRAM is the emergence of the DSP core: a DSP processor designed to be used as a building block in a custom or semi-custom integrated circuit. Using a DSP core as part of an ASIC can enable higher integration than is possible with packaged DSP processors. This higher integration in turn can yield smaller products that consume less power and are less costly to produce. Most DSP cores today are 16-bit fixed-point architectures. A core may also include memory and peripherals. Silicon area measurements for common DSP cores typically range from 4 to 10 mm^2.

Another benefit to the DSP core is low power consumption. DSP core-based ASICs can consume significantly less power than equivalent printed-circuit-board-based designs, since signals that remain on-chip drive significantly smaller loads than signals that move off-chip.

[Bier95]

Power Management

Although not directly related to the memory issue, power management is an important consideration when deciding how to embed a DSP core into memory. DSPs are beginning to follow the industry's low-power trend, providing reduced-voltage operation at 3-3.3 volts, idle modes that turn off the clock in areas of the chip where it is not needed, programmable clock dividers that allow the minimum necessary clock speed to be used, and the ability to disable peripheral control when peripheral access is not needed.

[Bier95]

Current DSP Processor Specs

There are a variety of digital signal processors on the market, ranging from low-end fixed-point DSPs to high-end processors with concurrent floating-point operations. The low-cost fixed-point DSPs typically process 16-bit data and contain 32-bit registers. They have various RAM and ROM configurations, a 16-bit I/O bus, and serial ports. The mid-range processors operate between 27 and 50 MHz, with 16-32 bit floating-point operations and 16-24 bit fixed-point operations, and typically have 32-40 bit registers. To increase throughput, some DSPs include a DMA controller, dual serial ports, a cache, and, in the case of the TMS320C80, even a RISC processor to arbitrate between four fixed-point DSP processors. The operation of two higher-end Texas Instruments DSPs is discussed below. For a survey of DSP chips, consult the DSP FAQ entry "What are the available DSP chips and chip architectures?" [Anon95].

The TMS320C82 is the latest TI DSP chip in its family of high-performance DSPs. It delivers the equivalent of 300 MIPS. The chip contains two DSPs, a RISC master processor with a 100-MFLOPS IEEE-compatible floating-point unit, and enhanced on-chip memory capacity and transfer control. The 'C82 has a direct interface to various memory types such as DRAM, SDRAM, and VRAM [TI95_2].

The TMS320C40 from Texas Instruments, an older DSP chip, has six high-speed bidirectional communications ports that can carry data at up to 20 MBytes/second. A six-channel DMA coprocessor allows concurrent I/O and CPU operations. The high-performance DSP CPU runs at 50 MHz and is capable of 275 million operations/second and 50 MFLOPS. It utilizes a 40/32-bit floating/fixed-point multiplier/ALU and has hardware support for divide and inverse square root operations. The TMS320C40's link to the external world is a pair of identical 100-MBytes/sec data and address buses. Four sets of memory control signals support memories of different speeds in hardware, allowing choices of SRAM and DRAM. On-chip memory includes a 512-byte instruction cache, 8K of single-cycle, dual-access program or data RAM, and a ROM-based bootloader for downloaded program storage.

Merged Memory and DSP Solutions

BASAVA

In 1990, Texas Instruments published a paper that touted merging some of the basic computational elements of DSP into memory [Paw90]. This architecture was christened BASAVA. It claimed to reduce fetch and store overhead and to reduce compute time by placing multiple ALUs on board one memory chip. The system was simulated in Explorer-Odyssey; a 10-fold improvement was observed for speech recognition applications and a 16-fold improvement for 128x128 matrix multiplication.

The Basava architecture consisted of a standard 256-Kbit memory organized as 32K x 8 bits, a multiply-accumulate pipeline, and six words that could be programmed by the CPU. The first three words specify the starting address and the number of rows and columns of a matrix; the fourth word specifies where to store the result, the fifth word starts the operation, and the sixth word configures it. There are also three address registers and two general-purpose registers.
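
This description suggests a memory-mapped control block along the following lines; the C struct below is my reconstruction from the prose above, not the published Basava register map, and the field widths are assumptions.

    #include <stdint.h>

    /* Hypothetical layout of the six CPU-programmable control words. */
    struct basava_ctrl {
        uint16_t src_addr;   /* word 1: starting address of the operand matrix */
        uint16_t rows;       /* word 2: number of rows                         */
        uint16_t cols;       /* word 3: number of columns                      */
        uint16_t dst_addr;   /* word 4: where to store the result              */
        uint16_t start;      /* word 5: writing here launches the operation    */
        uint16_t config;     /* word 6: selects and configures the operation   */
    };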

The advantages to placing the processors on-board were as follows:

1. Fetch and store overhead is reduced by alleviating the I/O bottleneck at the CPU.

2. A wide data bus is naturally available in a memory chip.

3. Multiple ALUs can be added to the chip to make efficient use of silicon.

4. Since data is not moved across chip boundaries, power is conserved.

The speed-up of Basava compared to a TMS320C25 is calculated as follows:

Speed-up = 2pc(1+n) / (pc + bn)

where p = number of Basava processing elements, c = number of Basava SRAM chips, b = relative cycles to
perform a multiply accumulate, and n = the number of rows of the square matrix to be operated upon.
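
For illustration only (these parameter values are assumed, not taken from [Paw90]): with p = 4 processing elements, c = 4 SRAM chips, b = 2, and n = 128,

    Speed-up = 2*4*4*(1+128) / (4*4 + 2*128) = 32*129/272 ≈ 15.2

which is the same order as the 16-fold improvement reported for 128x128 matrix multiplication.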

Figure 2: Basava Architecture

Computational RAM

Computational RAM is semiconductor RAM with processors built into the design in order to implement a massively parallel computer. In a paper from the University of Toronto, Duncan Elliott [Ell92] examined C-RAM applications for DSP operations. C-RAM attaches SIMD (single instruction, multiple data) processors to the sense amplifiers of conventional RAM, along one edge of the 2-D array of memory cells. The processors are bit-serial and externally programmed, adding only a small amount of area to the chip. In a 32-MB memory, 13 billion 32-bit operations can be performed in one second.

Applications for C-RAM include use as a video frame buffer, as computer main memory, or for stand-alone signal processing. The processing elements added 9-20% to the cost of the chip. A working 64-processing-element chip has been fabricated, and the processor for a 2048-processor, 4-MB chip has been designed. Performance data was determined via simulations using a prototype compiler. The applications benchmarked came from a variety of fields, including CAD, signal processing, databases, and computer graphics.

The architecture of C-RAM uses 32n 1-bit processors instead of n 32-bit processors; these bit-serial processors use less area per bit. The ALU is a three-input device (two registers and memory) whose function is programmable. Processors can communicate via a bus tie, which, when enabled, ANDs all of the processors' results.
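
A minimal C sketch of bit-serial SIMD addition follows (my illustration of the principle, not Elliott's actual design): each processing element handles one bit per cycle with a full adder, and all elements step through the 32 bit positions in lock-step.

    #include <stdint.h>
    #define N_PE 8                      /* processing elements modeled */

    /* Add two 32-bit words in every PE's column, one bit per cycle. */
    void bit_serial_add(const uint32_t a[N_PE], const uint32_t b[N_PE],
                        uint32_t sum[N_PE])
    {
        uint32_t carry[N_PE] = {0};
        for (int pe = 0; pe < N_PE; pe++) sum[pe] = 0;

        for (int bit = 0; bit < 32; bit++) {      /* 32 serial cycles */
            for (int pe = 0; pe < N_PE; pe++) {   /* in hardware, all PEs act at once */
                uint32_t x = (a[pe] >> bit) & 1u; /* bit read from the PE's column */
                uint32_t y = (b[pe] >> bit) & 1u;
                sum[pe]  |= (x ^ y ^ carry[pe]) << bit;       /* full-adder sum   */
                carry[pe] = (x & y) | (carry[pe] & (x ^ y));  /* full-adder carry */
            }
        }
    }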

A proof-of-concept prototype was fabricated: an 8-Kbit C-RAM in 1.2-um CMOS, with a read-operate-write cycle time of 114 ns. A charge-sharing problem was discovered, as well as an error in one of the column decoders. The fabrication was done in an ASIC process with SRAM, and the processors occupied 9% of the die area. The addresses were also not multiplexed as they are in DRAM processes.

This highly parallel design is ideal for image processing, which requires quick access to large amounts of data. Not all operations, however, can be optimized in this parallel design.

Of note to the IRAM class is the fact that Northern Telecom provided the fabrication via the Canadian Microelectronics Corporation.

Figure 3: C-RAM Architecture

[Paw90], [Ell92]

Conclusions

The ongoing research in merging DSP and memory seems to be targeted at parallel processing applications. Digital signal processing is a field that makes concurrent operations feasible, so the idea of many processors on a memory chip is appealing. However, I am not sure that the simulations performed in the Basava and C-RAM projects truly represent an implementation in DRAM technology; the proof-of-concept model for C-RAM was actually built with SRAM. Certainly we can merge SRAM and logic, but the allure of DRAM is its low cost and greater density. That is why moving toward a merged-memory DRAM solution appears attractive. DSP processors are very expensive chips that perform vital calculations in consumer products. Cost is a key issue, so if we can build processors into memory, we save money not only on packaging but also on SRAM.

One issue that has been suggested for DSP implementation in IRAM is replacing the multiplier with a look-up table. This might seem like a good idea given the potentially high bandwidth of on-chip memory. However, if IRAM requires as long an access time as DRAM currently does, I think a look-up table would be at best only slightly faster than a fast multiplier, even assuming the access requires only one cycle. Another crucial issue with the look-up table is that it would occupy far more area than a fast multiplier. A 16-bit multiply (used in lower-end DSP processors) has 2^16 x 2^16 = 2^32 input combinations, each producing a 32-bit product, so the table would require on the order of 2^32 * 32 = 2^37 bits. This is stretching the "area is free" mantra a bit far.

I feel that IRAM might be a good solution for image processing, due to the block nature of the computations. If the interface to the DRAM core were designed to facilitate block accesses, the high density of DRAM could offer a considerable advantage over SRAM in area saved. High-density memory is particularly important for image processing because of the large quantities of data inherent in images.

References

[Anon95] DSP FAQ. http://www.bdti.com/faq/31.htm.

[Bier95] J. Bier. "DSP Processors and Cores -- The Options Multiply."
	http://www.bdti.com/articles/multiply.htm, 1995.

[Ell92] D. Elliott. "Computational RAM: A Memory-SIMD Hybrid and Its Application to DSP."
	Proceedings of the Custom Integrated Circuits Conference, May 3-6, 1992,
	pp. 30.6.1-4.

[Mad95] V. Madisetti. "Digital Signal Processors: An Introduction to Rapid Prototyping
	and Design Synthesis." Butterworth-Heinemann, Newton, MA, 1995.

[Par89] K. Parhi and D. G. Messerschmitt. "Pipeline Interleaving and Parallelism
	in Recursive Digital Filters, Part I: Pipelining Using Scattered Look-Ahead
	and Decomposition." IEEE Trans. on Acoustics, Speech, and Signal
	Processing, July 1989.

[Paw90] B. Pawate and G. Doddington. "Memory Based Digital Signal Processing."
	ICASSP, 1990, pp. 941-944, Vol. 2.

[TI95] "Using VRAMs and DSPs for System Performance."
	TMS320 DSP Designers Notebook, no. 28, 1995.

[TI95_2] Details on the 'C82.
	http://www.ti.com/sc/docs/dsps/details/41/c82lead.htm.