Literature Survey and Analysis of Low-Power Techniques for Memory and Microprocessors

Low power has become the mantra of circuit design today, driven by the increasing complexity and operating speeds of microprocessors and by the demands of portable electronic equipment. Many techniques have been developed to address low-power concerns. This paper summarizes conventional low-power circuit design techniques, with a special emphasis on low-power memory. It discusses techniques for reducing power in memory, including intelligent and OS-controlled refresh in DRAMs, multi-divided arrays, and power/performance ratios, and it surveys low-power SRAM and DRAM. It also discusses the power requirements of microprocessors, since one aspect of IRAM is adding a microprocessor on board a DRAM.

Proportions of Power

Device            Process   Current    Shrunk to   Shrunk current   Die (shrunk)   Ref
1Gb DRAM          .16um       240mA    .16um              240mA     576mm^2        [Yoo96]
StrongARM/200/1   .35um       550mA    .16um              275mA      12mm^2        [Mon96]
PPC/200/2         .50um      1600mA    .16um              355mA       9mm^2        [San96]
DEC Alpha/433/4   .35um     12500mA    .16um             6250mA      52mm^2        [Gron96]

The table above lists several devices recently presented at ISSCC '96. Power figures (expressed as supply current) are given as reported in the papers for the original process, along with a theoretical shrink to a common .16um DRAM process for comparison. The shrink reduces the power quadratically with the process, then multiplies by two to compensate for the shift to the DRAM fab process. It reflects reduced gate sizes and bit-line capacitances, plus a fudge factor for the optimization opportunities available to processors when technology shrinks. Horowitz [Hor96] reports that the power benefits for logic from a process shrink of x are on the order of x^4, while a strict process shrink would yield only an x improvement. In each processor's device name, the first number is the reported operating frequency in MHz and the last is the issue width (degree of superscalarity). Because of the process-shrink approximations, it is not fair to compare the processors among themselves. All processors include at least 32kB of on-chip cache.
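
The stated shrink rule can be sketched in a few lines of Python. This is only an illustration of the rule as worded above (quadratic in feature size, then a factor of two for the DRAM process); the function name and the dram_penalty parameter are our own, and the sketch deliberately omits the fudge factor, so its results (roughly 230, 328, and 5224 mA) do not reproduce the shrunk currents in the table.

```python
def shrink_current_ma(current_ma, process_um, target_um=0.16, dram_penalty=2.0):
    """Scale a reported supply current to a target process.

    Applies only the rule stated in the text: a quadratic reduction with
    feature size, followed by a factor of two for the move to a DRAM
    process. No additional fudge factor is modeled (assumption).
    """
    scale = (target_um / process_um) ** 2
    return current_ma * scale * dram_penalty


# Reported (current in mA, process in um) pairs taken from the table.
devices = {
    "StrongARM": (550.0, 0.35),
    "PPC": (1600.0, 0.50),
    "DEC Alpha": (12500.0, 0.35),
}

for name, (current_ma, process_um) in devices.items():
    estimate = shrink_current_ma(current_ma, process_um)
    print(f"{name}: {estimate:.0f} mA after shrink")
```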

Comparing the low-end processor's power to that of the 1Gb DRAM device, one sees that the processor would consume slightly more power than the DRAM array. These numbers indicate that the power of a small processor (one not aimed at high-performance computing) included on-chip with a DRAM device is not likely to be a critical issue. Any serious processing on the level of a high-end Alpha, however, would easily dominate the power of any on-chip DRAM. This is further evidence for Horowitz's prediction that IRAM might be best targeted at the embedded applications market.

The big power sink of an IRAM-type device, however, is likely to be the interconnect between the DRAM and processor blocks. These lines would be needed to fill cache lines, feed vector units, or do whatever else is dreamed up by the IRAM students. The sheer number of these lines, and the distance they may have to travel (halfway across the chip), make them a problematic power drain. The interconnect problem is fast becoming a concern for modern logic designers. Specific numbers on interconnect were unobtainable at the time of writing, but this is an area for future research. [Burd96]

Summary of Circuit Design Techniques for Low Power

At the 1994 IEEE Symposium on Low Power Electronics, a paper by Mark Horowitz [Hor94] summarized low-power CMOS circuit design techniques at a relatively high level. The first techniques developed were mostly common-sense design practices, such as lowering the power supply to the chip rather than having a 5 volt supply and internal voltage regulation. The main techniques that Horowitz discusses are voltage scaling, transistor sizing, and adiabatic circuits, as well as technology scaling, transition reduction, and parallelism.

A Look at Power in Microprocessors

StrongARM Microprocessor (.5 W)

The StrongARM microprocessor was presented at the 1996 ISSCC Conference. The StrongARM is a general-purpose, 32-bit microprocessor with a 16KB instruction cache (32-way set associative); a 16KB write-back data cache (32-way set associative); a write buffer; and a memory-management unit (MMU) combined in a single chip. The five-stage pipeline distributes tasks evenly over time to remove bottlenecks, ensuring high throughput for the core logic. The StrongARM is touted as a low-power chip, mainly because of its two power-down modes: idle and sleep. It uses a 1.5 V internal supply, delivers 184 Dhrystone MIPS at 162 MHz, and dissipates .5 W. It optionally operates at 215 MHz with a 2 V internal supply, dissipating 1.1 W. The low-power operation of the device was attributed to simple techniques, such as process shrinks and the simple single-issue ARM architecture. A power breakdown is given below. It is interesting to note that the caches account for the largest share (43%) of processor power.

	icache	27%
	ibox	18%  (instruction issue)
	dcache	16%
	clock	10%
	immu	 9%
	ebox	 8%  (execution units)
	dmmu	 8%
	others   4%

[Mon96]
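
As a quick check on the figures above, the following Python sketch sums the two cache entries of the breakdown and compares the two reported operating points against the standard dynamic-power relation P ~ C * V^2 * f. Treating switched capacitance as constant between the two operating points is our simplifying assumption, not a claim from [Mon96], and the function name is our own.

```python
# Power breakdown percentages as listed in the table above.
breakdown = {
    "icache": 27, "ibox": 18, "dcache": 16, "clock": 10,
    "immu": 9, "ebox": 8, "dmmu": 8, "others": 4,
}

cache_share = breakdown["icache"] + breakdown["dcache"]
total = sum(breakdown.values())
print(f"caches: {cache_share}% of {total}%")


def scaled_power_w(p0_w, v0, f0_mhz, v1, f1_mhz):
    """Scale dynamic power with V^2 * f, assuming constant switched capacitance."""
    return p0_w * (v1 / v0) ** 2 * (f1_mhz / f0_mhz)


# Start from the 1.5 V / 162 MHz / 0.5 W point and predict the 2 V / 215 MHz point.
estimate_w = scaled_power_w(0.5, 1.5, 162.0, 2.0, 215.0)
print(f"predicted power at 2 V, 215 MHz: {estimate_w:.2f} W (reported: 1.1 W)")
```

The prediction comes out near 1.18 W, within about 10% of the reported 1.1 W, which suggests that supply voltage and clock frequency account for most of the difference between the two operating points.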

68040 compatible 1W Microprocessor

A 1-watt 68040-compatible microprocessor was introduced at the 1994 IEEE Symposium on Low Power Electronics. It achieved a 70% reduction in power from its predecessor and delivered 17.7 SPECint92/W at 33 MHz and 3.3 V. How was this power reduction accomplished?

[Big94]

Multi-Divided Arrays

Multi-divided arrays are the mechanism by which high-volume DRAMs are constructed. Individual sense-amp blocks can only handle arrays of approximately 256kB, so memories greater than this size must be multi-divided. These blocks may be accessed either one at a time or all at once. Activating all the blocks at once becomes expensive power-wise with large memories. A group of researchers at Hitachi has noticed this trend toward multi-divided arrays in DRAM. They say that the activation scheme of a multi-divided array is the key to DRAM array power reduction [Itoh94]. The Samsung 1Gb DRAM chip [Yoo96] makes heavy use of multi-divided techniques and activates only a small number of blocks (relative to the total number) at any given time. The drawback of activating only a subset of blocks is the extra delay incurred by the increased levels of selection. (This is a classic example of a power-performance tradeoff.) In order to select only a few blocks, a computation must be performed before the row-activation cycle is initiated. This selection would be on the critical path of the memory access. The extreme case would be to perform one row access at a time, which would incur the maximum (and unacceptable) delay of a 32M:1 selector.
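
The tradeoff described above can be made concrete with a toy Python model. It assumes, purely for illustration, that activation power is proportional to the number of blocks activated and that selection cost grows with the number of address bits needed to pick the active group; the block count of 512 and the function name are hypothetical and not taken from [Itoh94] or [Yoo96].

```python
import math


def activation_tradeoff(total_blocks, active_blocks):
    """Return (fraction of blocks activated, select bits needed).

    Toy model: power scales with the fraction of blocks activated,
    and selection depth scales with log2 of the number of groups.
    """
    if not 0 < active_blocks <= total_blocks:
        raise ValueError("active_blocks must be between 1 and total_blocks")
    groups = total_blocks / active_blocks
    return active_blocks / total_blocks, math.ceil(math.log2(groups))


TOTAL_BLOCKS = 512  # hypothetical block count, for illustration only

for active in (512, 64, 8, 1):
    fraction, select_bits = activation_tradeoff(TOTAL_BLOCKS, active)
    print(f"{active:3d} active: {fraction:7.2%} of array power, "
          f"{select_bits} select bits")

# The 32M:1 selector in the extreme case corresponds to this many select bits.
extreme_bits = math.log2(32 * 2**20)
print(f"32M:1 selector: {extreme_bits:.0f} select bits")
```

Under these assumptions, each halving of the active block count halves the activation power but adds one more bit of selection to the critical path.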

Figure 1: Shared decoder strategies

Figure 1 [Itoh94] shows several possible memory organizations for sharing decoders. Sharing these decoders is one way to exploit an area/power tradeoff. Shared decoders decrease the necessary area (fewer decoders are needed) but increase the capacitance on individual bit lines, because each line is connected to several memory blocks.

Techniques for Lower Power Using Refresh in DRAMs

The Motorola 4Mx1 low-power CMOS DRAM part [Mot93] has a 128ms refresh cycle. With a 110ns cycle time and 1024 banks, the device must spend approximately 0.7% of its time refreshing. This percentage is small enough that it is likely to be overwhelmed by regular data accesses to the device. With the advent of the 1Gb DRAM device, however, the numbers change. The 1Gb DRAM from ISSCC [Yoo96] has a 128ms refresh interval, requiring 1GB/sec of refresh bandwidth, which equals the peak internal data-transfer bandwidth of 1GB/sec. At peak theoretical operation, then, 50% of the power goes toward refresh; in periods of non-peak operation, refresh will dominate the power consumed. This ratio indicates that the always-refresh line of thinking may not be ideal.
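
The 1Gb refresh figures above can be checked with a short Python calculation. It assumes, as the argument above implicitly does, that power is proportional to the bandwidth moved through the array regardless of whether the traffic is refresh or data; the function names and the 10% utilization example are our own illustrations.

```python
def refresh_bandwidth_bytes_per_s(capacity_bits, refresh_interval_s):
    """Bandwidth needed to touch every bit once per refresh interval."""
    return capacity_bits / 8 / refresh_interval_s


def refresh_power_fraction(refresh_bw, data_bw):
    """Share of array power spent on refresh, assuming power tracks bandwidth."""
    return refresh_bw / (refresh_bw + data_bw)


GBIT = 2**30           # 1Gb device
PEAK_DATA_BW = 1e9     # 1GB/sec peak internal bandwidth, from the text

refresh_bw = refresh_bandwidth_bytes_per_s(GBIT, 0.128)
at_peak = refresh_power_fraction(refresh_bw, PEAK_DATA_BW)
at_ten_percent = refresh_power_fraction(refresh_bw, 0.1 * PEAK_DATA_BW)

print(f"refresh bandwidth: {refresh_bw / 1e9:.2f} GB/s")
print(f"refresh share at peak data rate: {at_peak:.0%}")
print(f"refresh share at 10% utilization: {at_ten_percent:.0%}")
```

Refreshing 128MB every 128ms works out to roughly 1GB/sec, so refresh takes about half the power at peak data rate and more than 90% when the device is only 10% utilized.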

A Survey of Power Reduction in SRAM Caches

Techniques to Reduce Power in Wide Fast Memories

CMOS memories have an access path that can be examined in two parts: from the address to the local wordline select, and from the local wordline to the sense amps. Driving the wordline bus and sensing the data consume the most power in this process.

In a paper at the 1994 IEEE Symposium on Low Power Electronics, Bharadwaj Amrutur proposed that power consumption could be reduced by limiting the energy consumed by each bitline. This energy is conserved by limiting the swing of the bitline, which is done by controlling the local wordline drive strength. This circuit technique adds an overhead of two extra columns and rows to implement a reference cell and reference bitline used in the drive strength regulation. The swing on the data lines is also limited.

One other optimization used was to only pre-charge selected blocks that were to be accessed, instead of pre-charging the whole array.

Supply Voltage      Power            Gate Delay
      (V)           (mW)                (ns)
      1.5           5.2                 2.63
      3.0           75.0                0.62
      5.0           66.0                0.38

[Amr94]

6 ns 1.5 V 4 Mb BiCMOS SRAM

One of the problems in designing with bipolar CMOS (BiCMOS) is that it is extremely difficult to scale. The fixed .8 V threshold voltage prevents scaling the voltage down as much as in other processes. So why design in a BiCMOS process? Speed. However, the speed benefit is not so great as to exclude consideration of other technologies. The speed-up is around a factor of 2, and BiCMOS designs usually require more area.

The 4 Mb BiCMOS SRAM presented at the 1996 ISSCC conference was a 1.5 V, 6 ns SRAM. This low-power SRAM was achieved using several low-power techniques.

Process:   .3 um 4-poly 2-metal p-sub triple well BiCMOS
Supply:    1.5-3.3 V   
Access:    6 ns
Power:     180 mW at 1.5 V

[Kuh96]

A Look at a Low Power DSP

Low power DSPs have used basically the same techniques as low power microprocessors, i.e. using a better process and adding a sleep or idle mode to conserve power when not in use. Here is an example of a low power DSP:

A small embedded power management system was included in a DSP introduced at the ISSCC96 conference. The DSP is targeted for mobile phone applications. When talking occurs, the DSP is activated. Otherwise, it goes into a sleep mode to conserve power.

Process                    0.5 um MTCMOS
Chip Size                  225 mm^2
Operating Frequency        13.2 MHz at 1.0 V
MAC Performance            26.4 MOPS at 1.0 V
Power Consumption          2.2 mW/MHz (1.1 mW/MOPS) at 1.0 V
Standby Power              350 uW (active), 600 nW (sleep) at 1.0 V

[Mutoh96]
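
The figures in the table above can be cross-checked, and the value of the sleep mode illustrated, with a small Python sketch. The 10% talk-time duty cycle and the function name are hypothetical values chosen for illustration; they are not from [Mutoh96].

```python
FREQ_MHZ = 13.2
MOPS = 26.4
MW_PER_MHZ = 2.2
STANDBY_ACTIVE_MW = 0.350   # 350 uW standby power in active mode
STANDBY_SLEEP_MW = 0.0006   # 600 nW standby power in sleep mode

# Total operating power and the per-MOPS figure implied by the table.
operating_mw = MW_PER_MHZ * FREQ_MHZ
mw_per_mops = operating_mw / MOPS
print(f"operating power: {operating_mw:.2f} mW ({mw_per_mops:.1f} mW/MOPS)")

# How much the sleep mode cuts standby power.
standby_ratio = STANDBY_ACTIVE_MW / STANDBY_SLEEP_MW
print(f"sleep mode cuts standby power by about {standby_ratio:.0f}x")


def average_power_mw(duty, busy_mw, idle_mw):
    """Time-weighted average power for a given busy fraction."""
    return duty * busy_mw + (1.0 - duty) * idle_mw


# Hypothetical usage pattern: the DSP is busy 10% of the time.
avg_with_sleep = average_power_mw(0.10, operating_mw, STANDBY_SLEEP_MW)
avg_without_sleep = average_power_mw(0.10, operating_mw, STANDBY_ACTIVE_MW)
print(f"average at 10% duty: {avg_with_sleep:.3f} mW with sleep, "
      f"{avg_without_sleep:.3f} mW without")
```

The first result confirms that 2.2 mW/MHz at 13.2 MHz and 26.4 MOPS is consistent with the quoted 1.1 mW/MOPS. The gap between the two averages grows as the busy fraction shrinks, since idle time then makes up more of the total.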

Conclusions:

To implement low-power IRAMs, it looks like we should target a smaller RISC microprocessor (possibly with a vector extension). This will prevent the microprocessor from dominating the IRAM's power consumption. To reduce power in the DRAM, we can subdivide the memory array into blocks and share the row and column decoders; activating only the blocks we need saves power. We can also perform intelligent refreshes, such as refreshing only blocks that have been written to instead of the entire array. There are also many circuit tricks that can be applied to the DRAM or SRAM cores to optimize for low power. A main power concern in IRAM will be the interconnect. Application-specific designs will further determine what sort of connection grid is required.

A rough estimate of the power required for an IRAM is simply the sum of the power requirements of the microprocessor and the DRAM core. This neglects the interconnect scheme and the conversion of the microprocessor to a DRAM process, but it should provide at least a minimum power requirement. In the case of a 1 Gb DRAM with a StrongARM microprocessor, we are looking at a minimum power requirement of 1.03 W plus the interconnect.

Bibliography

[Amr94]   Bharadwaj Amrutur. "Techniques to Reduce Power in Fast Wide Memories."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Big94]   Terry Biggs. "A 1 Watt 68040-Compatible Microprocessor."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Burd96]  Tom Burd. "An interview with Tom Burd." Tom Burd's Ph.D. thesis
          is power efficient computing.
[Gron96]  Paul Gronowski. "A 433MHz 64b Quad-Issue RISC Microprocessor."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Hor94]   Mark Horowitz. "Low Power Digital Design."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Itoh94]  Kiyoo Itoh. "Trends in Low-Power RAM Circuit Technologies."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Kuh96]   Shigeru Kuhara. "A 6 ns 1.5 V 4 Mb BiCMOS SRAM."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mon96]   James Montanaro, et al. "A 160MHz 32b 0.5W CMOS RISC Microprocessor."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mot93]   Dynamic RAMs and Memory Modules Data Book.
          Motorola, Inc. 1993.
[Mutoh96] Shin'ichiro Mutoh. "A 1 V Multi-Threshold Voltage CMOS DSP
          with an Efficient Power Management Technique for Mobile Phone
          Application." 1996 IEEE ISSCC Digest of Technical
          Papers. Feb. 8-10, 1996.
[Nitta96] Yasuhiko Nitta. "A 1.6GB/s Data-Rate 1Gb Synchronous DRAM with
          Hierarchical Square-Shape Memory Block and Distributed Bank
          Architecture." 1996 IEEE ISSCC Digest of Technical
          Papers. Feb. 8-10, 1996.
[San96]   Hector Sanchez, et al. "A 200MHz 2.5V 4W Superscalar RISC Micro."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Yoo96]   Jei-Hwan Yoo. "A 32-Bank 1 Gb DRAM with 1 GB/s Bandwidth."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.

Trevor Pering / pering@eecs.berkeley.edu
Heather Bowers / hbowers@cory.eecs.berkeley.edu

March 20, 1996