In the research effort into the feasibility of intelligent DRAM, one question that must be answered is "What type of processor would be useful to embed into memory?" The possible solutions vary from small micro-controllers to a CRAY-1. To better answer the question, an analysis of the history and current development of processors is needed. This paper addresses one subset of processors, the digital signal processor (DSP).
Digital signal processors operate on many samples of data per second, require a large memory bandwidth, and perform very intense computations. They can be programmable, as in the case of the Texas Instruments TMS320 series, or dedicated. A typical DSP architecture can be seen in Figure 1. The principal components of a DSP are a multiplier, adder, shifter, fast registers, and memory. In this paper, the relationship between a digital signal processor and its memory will be examined.
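The multiply-accumulate (MAC) operation that these components implement can be sketched in software. The following is a hypothetical Python model of an FIR filter, the canonical DSP workload; a real DSP performs one tap per cycle in fixed-point hardware rather than in a sequential loop:

```python
def fir_filter(samples, coeffs):
    """FIR filter built from the DSP primitives above: for each input
    sample, a multiply-accumulate loop forms the dot product of the
    coefficients with the most recent samples (zero history assumed
    before the first sample)."""
    outputs = []
    for i in range(len(samples)):
        acc = 0  # the accumulator register
        for k in range(len(coeffs)):
            if i - k >= 0:
                acc += coeffs[k] * samples[i - k]  # multiplier feeds the adder
        outputs.append(acc)
    return outputs
```

The inner loop is exactly the multiplier-plus-adder datapath: one multiply and one accumulate per coefficient, which is why a single-cycle MAC unit is the defining feature of DSP architectures.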
[Mad95]
Today's DSPs utilize a Harvard architecture, with modifications for particular applications. Using Madisetti's [Mad95] classification of SISC (Special-Instruction-Set Computer) memories, the memory architectures can be divided into the basic Harvard architecture (single data memory and single program memory) and five modifications.
VRAM
[Mad95], [Bier95], [TI95]
[Mad95], [Par89]
One development in DSP processor technology that seems to bode well for IRAM is the emergence of the DSP core. A DSP core is a DSP processor designed to be used as a building block in a custom or semi-custom integrated circuit. Using a DSP core as part of an ASIC can enable higher integration than is possible with packaged DSP processors. This higher integration in turn can yield smaller products that consume less power and are less costly to produce. Most DSP cores today are 16-bit fixed-point architectures. The core may also include memory and peripherals. Silicon area measurements for common DSP cores typically range from 4 to 10 mm^2.
Another benefit of the DSP core is low power consumption. DSP core-based ASICs can consume significantly less power than equivalent printed-circuit-board-based designs, since signals that remain on-chip drive significantly smaller loads than signals that move off-chip.
[Bier95]
[Bier95]
There is a variety of digital signal processors on the market, ranging from low-end fixed-point DSPs to high-end processors with concurrent floating-point operations. The low-cost fixed-point DSPs typically process 16-bit data and contain 32-bit registers. They have various RAM and ROM configurations, a 16-bit I/O bus, and serial ports. The mid-range processors operate between 27-50 MHz, with 16-32 bit floating-point operations and 16-24 bit fixed-point operations. A mid-range processor typically has 32-40 bit registers. To increase throughput, some DSPs include a DMA controller, dual serial ports, cache, and, in the case of the TMS320C80, even a RISC processor to arbitrate among four fixed-point DSP processors. The operation of two higher-end Texas Instruments DSPs is discussed below. For a survey of DSP chips, consult the DSP FAQ entry "What are the available DSP chips and chip architectures?" [Anon95]
The TMS320C82 is the latest TI DSP chip in the family of high-performance DSPs. It delivers performance equivalent to 300 MIPS. The chip contains two DSPs, a RISC master processor with a 100-MFLOPS IEEE-compatible floating-point unit, and enhanced on-chip memory capacity and transfer control. The 'C82 has a direct interface to various memory types such as DRAM, SDRAM, and VRAM [TI95_2].
The TMS320C40 from Texas Instruments, an older DSP chip, has six high-speed bidirectional communication ports that can carry data at up to 20 MBytes/second. A six-channel DMA coprocessor allows concurrent I/O and CPU operations. The high-performance DSP CPU runs at 50 MHz and is capable of 275 million operations/second and 50 MFLOPS. It utilizes a 40/32-bit floating/fixed-point multiplier/ALU and has hardware support for divide and inverse-square-root operations. The TMS320C40's link to the external world is two identical 100-MByte/sec data and address buses. Four sets of memory control signals support memories of different speeds in hardware, allowing a choice of SRAM or DRAM. On-chip memory includes a 512-byte instruction cache, 8K of single-cycle, dual-access program or data RAM, and a ROM-based boot loader for downloaded program storage.
The Basava architecture consisted of a standard 256-Kbit memory organized as 32K x 8 bits, a multiply-accumulate pipeline, and six words that could be programmed by the CPU. The first three words specify the starting address and the number of rows and columns in a matrix. The fourth word specifies where to store the result, the fifth word starts the operation, and the sixth word configures the operation. There are also three address registers and two general-purpose registers.
The advantages to placing the processors on-board were as follows:
1. Fetch and store overhead is reduced by alleviating the I/O bottleneck at the CPU.
2. A wide data bus is naturally available in a memory chip.
3. Multiple ALUs can be added to the chip to make efficient use of silicon.
4. Since data is not moved across chip boundaries, power is conserved.
The speed-up of Basava compared to a TMS320C25 is calculated as follows:
Speed-up = 2pc(1 + n) / (pc + bn)

where p = number of Basava processing elements, c = number of Basava SRAM chips, b = relative number of cycles to perform a multiply-accumulate, and n = number of rows of the square matrix to be operated upon.
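The formula can be evaluated directly for any parameter choice; the following sketch does so (the example values in the test are illustrative, not taken from the paper):

```python
def basava_speedup(p, c, b, n):
    """Speed-up of Basava over a TMS320C25, per the formula above.
    p: Basava processing elements per chip
    c: Basava SRAM chips
    b: relative cycles per multiply-accumulate
    n: rows in the square matrix being operated on"""
    return 2 * p * c * (1 + n) / (p * c + b * n)
```

Note that as n grows large the expression approaches 2pc(1/b) plus lower-order terms, so the speed-up is ultimately bounded by the total number of processing elements across all chips.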
Applications for C-RAM include use as a video frame buffer, as computer main memory, or for stand-alone signal processing. The processing elements add 9-20% to the cost of the chip. A working 64-processing-element chip has been fabricated, and the processor for a 2048-processor 4-MB chip has been designed. Performance data was determined via simulations using a prototype compiler. The applications benchmarked came from a variety of fields, including CAD, signal processing, databases, and computer graphics.
The C-RAM architecture uses 32n 1-bit processors instead of n 32-bit processors. These bit-serial processors use less area per bit. The ALU is a programmable three-input device (two registers and memory). Processors can communicate via a bus tie; when enabled, the bus tie ANDs all of the processors' results.
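A bit-serial processor of this kind consumes its operands one bit per cycle, keeping only a 1-bit carry as state; that is what makes each processing element so small. A minimal sketch (hypothetical model, bit vectors given LSB first):

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length bit vectors one bit per cycle, the way a
    1-bit ALU with a carry register would: each cycle produces one sum
    bit and updates the carry."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                      # sum bit
        carry = (a & b) | (a & carry) | (b & carry)    # carry register
    return out, carry
```

A 32-bit add thus costs 32 cycles per processing element, but with 32n elements running in lockstep the aggregate throughput matches n word-wide processors at a fraction of the area.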
A proof-of-concept prototype was fabricated: an 8-Kbit C-RAM in 1.2-um CMOS. The read-operate-write cycle time is 114 ns. A charge-sharing problem was discovered, as well as an error in one of the column decoders. The fabrication was done in an ASIC process with SRAM, and the processors occupied 9% of the die area. The addresses were also not multiplexed as they are in DRAM processes.
This highly parallel design is ideal for image processing, which requires quick access to large amounts of data. However, not all operations can be optimized in this parallel design.
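The kind of operation that maps well onto this design is an element-wise, per-pixel computation. A hypothetical sketch of one such operation (in a real C-RAM, one processing element per memory column would perform every comparison in the same cycle, rather than this sequential loop):

```python
def simd_threshold(pixels, t):
    """Per-pixel thresholding: a data-parallel operation where every
    element undergoes the same computation independently, so all of a
    C-RAM's processing elements can work simultaneously."""
    return [1 if p >= t else 0 for p in pixels]
```

Operations with data-dependent control flow or long chains of dependencies between pixels are the ones that do not parallelize this way.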
Of note to the IRAM class is the fact that Northern Telecom provided the fab via the Canadian Micro-Electronics Corporation.
[Paw90], [Ell92]
The ongoing research in merging DSP and memory seems to be targeted at parallel processing applications. Digital signal processing is a field that makes concurrent operations feasible, so the idea of many processors on a memory chip is appealing. However, I am not sure that the simulations performed in the Basava and C-RAM projects are truly representative of an implementation in DRAM technology. The proof-of-concept model for C-RAM was actually built with SRAM. Certainly we can merge SRAM and logic, but the allure of DRAM is its low cost and greater density. That is why moving toward a merged-memory DRAM solution appears attractive. DSP processors are very expensive chips that perform vital calculations in consumer products. Cost is a key issue, so if we can build processors into memory, we save money not only on packaging but also on SRAM.
One issue that has been suggested for a DSP implementation in IRAM is replacing the multiplier with a look-up table. This could seem like a good idea given the potentially high bandwidth of on-chip memory. However, if IRAM requires as long an access time as DRAM currently does, a look-up table would at best be only slightly faster than a fast multiplier, even assuming the access requires only one cycle. Another crucial issue is that the look-up table would occupy much more area than a fast multiplier: implementing a 16-bit multiply (used in lower-end DSP processors) would require a look-up table on the order of 2^37 bits. This stretches the "area is free" mantra a bit far.
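The 2^37 figure follows from counting table entries and result width, assuming a full table indexed by both operands with each entry holding a double-width product:

```python
def multiplier_lut_bits(width):
    """Storage for a full look-up-table multiplier: one entry for every
    pair of width-bit operands (2^(2*width) entries), each entry a
    2*width-bit product."""
    entries = 2 ** (2 * width)
    return entries * (2 * width)
```

For width = 16 this gives 2^32 entries of 32 bits each, i.e. 2^37 bits, which is roughly sixteen gigabytes of table for a single 16-bit multiplier.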
I feel that IRAM might be a good solution for image processing, due to the block nature of the computations. If the interface to the DRAM core were designed to facilitate block accesses, the high density of DRAM could offer a considerable advantage over SRAM in area saved. High-density memory is particularly important for image processing because of the large quantities of data inherent in images.
[Anon95] DSP FAQ. http://www.bdti.com/faq/31.htm.
[Bier95] J. Bier. "DSP Processors and Cores -- The Options Multiply." http://www.bdti.com/articles/multiply.htm, 1995.
[Ell92] D. Elliott. "Computational RAM: A Memory-SIMD Hybrid & Application to DSP." Proceedings of the Custom Integrated Circuits Conference, May 3-6, 1992, pp. 30.6.1-4.
[Mad95] Vijay Madisetti. "Digital Signal Processors: An Introduction to Rapid Prototyping and Design Synthesis." Butterworth-Heinemann, Newton, MA, 1995.
[Par89] K. Parhi and D.G. Messerschmitt. "Pipeline Interleaving and Parallelism in Recursive Digital Filters: Part I: Pipelining Using Scattered Look-Ahead and Decomposition" IEEE Trans. on Acoustics, Speech, and Signal Processing, July 1989.
[Paw90] B. Pawate and G. Doddington. "Memory Based Digital Signal Processing." ICASSP, 1990, pp. 941-4, Vol. 2.
[TI95] "Using VRAMs and DSPs for System Performance." TMS 320 DSP Designers Notebook, no. 28, 1995.
[TI95_2] http://www.ti.com/sc/docs/dsps/details/41/c82lead.htm. Details on 'C82.