Literature Survey and Analysis of Low-Power Techniques for Memory and Microprocessors

Low power has become the mantra of circuit design today, driven by the increasing complexity and operating speeds of microprocessors and the demands of portable electronic equipment. Many techniques have been developed to address low power concerns. This paper summarizes conventional low power circuit design techniques, with a special emphasis on low power memory. It discusses techniques for reducing power in memory, including intelligent and OS-controlled refresh in DRAMs, multi-divided arrays and power/performance ratios, and surveys low power SRAM and DRAM. It also discusses the power requirements of microprocessors, since one aspect of IRAM is placing a microprocessor on the same chip as the DRAM.

Proportions of Power

Device            Process  Current   Shrunk to  Current (shrunk)  Die (shrunk)  Ref
1Gb DRAM          .16um      240mA   .16um        240mA           576mm^2       [Yoo96]
StrongARM/200/1   .35um      550mA   .16um        275mA            12mm^2       [Mon96]
PPC/200/2         .50um     1600mA   .16um        355mA             9mm^2       [San96]
DEC Alpha/433/4   .35um    12500mA   .16um       6250mA            52mm^2       [Gron96]

Above is a table of several devices presented at ISSCC '96. Power figures (expressed as supply current) are given as reported in the papers for the original process, along with a theoretical shrink to a common .16um DRAM process for comparison. The shrink reduces the power quadratically with the process, and then multiplies by two to compensate for the shift to the DRAM fab process. It reflects the reduced gate sizes and bit-line capacitances, and includes a fudge factor for the optimization opportunities open to processors when technology shrinks. Horowitz [Hor94] reports that the power benefit for logic from a process shrink of x is on the order of x^4, while a strict process shrink would yield only an improvement of x. In the processor device names, the first number is the reported operating frequency in MHz and the last is the issue width (degree of superscalar operation). Because of the approximations in the shrink, it is not fair to compare the processors amongst themselves. All processors include at least 32kB of on-chip cache.
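
The shrunk currents in the table can be reproduced with a short calculation. This is only a sketch, and it rests on one assumption not stated in the text: the ratio of feature sizes appears to have been rounded to the nearest integer before squaring (2 for .35um, 3 for .50um), since that is what matches the tabulated values. The function and variable names are illustrative.

```python
def shrunk_current_ma(current_ma, process_um, target_um=0.16, dram_penalty=2.0):
    """Scale a reported current to the target process.

    Divides by the square of the feature-size ratio, then multiplies by
    dram_penalty for the move to a DRAM fab process. Rounding the ratio
    to an integer is an assumption made here to match the table.
    """
    ratio = round(process_um / target_um)
    return current_ma / ratio**2 * dram_penalty


# (reported current in mA, original process in um), taken from the table
devices = {
    "StrongARM/200/1": (550, 0.35),
    "PPC/200/2": (1600, 0.50),
    "DEC Alpha/433/4": (12500, 0.35),
}

for name, (current_ma, process_um) in devices.items():
    print(f"{name:16s} {shrunk_current_ma(current_ma, process_um):7.1f} mA")
```

The DRAM itself is left out of the loop because it is already built in the target process, so neither the shrink nor the factor of two applies to it.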

Comparing the shrunk power figure of the low-end processor (the StrongARM) to that of the 1Gb DRAM device, one sees that the processor would consume slightly more power than the DRAM array. These numbers indicate that the power of a small processor (not aimed at high-performance computing) included on-chip with a DRAM device is not likely to be a critical issue. Any serious processing, on the level of a high-end Alpha, however, would easily dominate the power of any on-chip DRAM. This is further evidence for Horowitz's prediction that IRAM might be best targeted at the embedded applications market.

The big power sink of an IRAM-type device, however, is likely to be the interconnect between the DRAM and processor blocks. These lines would be needed to fill cache lines, feed vector units, or whatever else is dreamed up by the IRAM students. The sheer number of these lines, and the distance they may have to travel (halfway across the chip), make them a problematic power drain. The interconnect problem is fast becoming a concern for modern logic designers. Specific numbers on interconnect were unobtainable at the time of writing, but this is an area for future research. [Burd96]

Summary of Circuit Design Techniques for Low Power

Mark Horowitz's paper at the 1994 IEEE Symposium on Low Power Electronics [Hor94] does a good job of summarizing low power CMOS circuit design techniques at a relatively high level. The first techniques developed were mostly common sense design practices, such as lowering the power supply to the chip rather than using a 5 volt supply with internal voltage regulation. The main techniques that Horowitz discusses are voltage scaling, transistor sizing, and adiabatic circuits, as well as technology scaling, transition reduction, and parallelism.

Voltage Scaling
Voltage scaling is the easiest and most effective way of controlling power. Adjustments to the operating voltage affect the delay in a linear manner, while having a quadratic effect on the power. The most common technique is to increase the performance of a system architecturally, and then lower the voltage to reduce the power consumption (see parallelism below).
Transistor Sizing
This technique directly trades speed for power. Decreasing the size of a transistor lowers its power requirements but also decreases its current drive, thereby making the gates slower. Decreasing the size of transistors that are not on the critical path can reduce power without affecting the overall delay. This is difficult to implement in synthesis tools, however, and thus is not a widely used technique.
Adiabatic Circuits
Adiabatic circuits are also known as charge recovery circuits. They resonate the load capacitance with an inductor in order to recover some of the energy used to change the capacitor's voltage. This is not a widely used technique because it introduces substantial delay.
Technology Scaling
Technology shrinks cause the capacitance of nets to decrease. This decrease in capacitance not only increases the performance of a design, but also reduces its power requirements. This is not so much a technique as an effect that occurs with the passage of time. At constant performance, the power dissipation of a circuit is related to x^4, where x is the ratio of the process shrink.
Transition Reduction
In static CMOS design, a transition on a bit line is the fundamental event that uses power. Gating clocks to functional blocks is one common and effective method for reducing unnecessary switching. It is also theoretically possible to synthesize circuits so as to reduce the number of spurious transitions, but this is hard to achieve in practice.
Parallelism
Parallelism can be used in a system to increase overall performance. The voltage of the system can then be reduced, bringing the performance back down to its original level while lowering the power consumed. Adding parallelism incurs an overhead (control, inefficiency), so this is not always a win-win situation. (For example, the overhead of super-scalar operation makes it a poor choice for power reduction.)
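
As a rough illustration of the voltage scaling and parallelism trade-off, the sketch below uses the standard first-order dynamic power model P = C * V^2 * f together with the linear delay-versus-voltage relationship stated above. The capacitance, voltage, frequency, and overhead values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(cap_farads, volts, freq_hz):
    """First-order CMOS switching power: P = C * V^2 * f."""
    return cap_farads * volts**2 * freq_hz


def parallel_power(cap_farads, volts, freq_hz, ways, overhead=1.0):
    """Power of `ways` copies of a block, each clocked `ways` times slower.

    Assumes delay scales as 1/V, so each copy can run at V / ways while
    the system keeps its original throughput. `overhead` multiplies the
    switched capacitance to model added control logic (hypothetical).
    """
    per_unit = dynamic_power(cap_farads * overhead, volts / ways, freq_hz / ways)
    return ways * per_unit


base = dynamic_power(1e-9, 3.0, 100e6)
two_way = parallel_power(1e-9, 3.0, 100e6, ways=2, overhead=1.2)
print(f"baseline: {base:.3f} W, two-way parallel: {two_way:.3f} W")
```

In this idealized model a two-way design uses roughly a quarter of the baseline power before overhead is counted. The linear delay model breaks down as the supply approaches the transistor threshold voltage, so real savings are smaller.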

A Look at Power in Microprocessors

StrongARM Microprocessor (.5 W)

The StrongARM microprocessor was presented at the 1996 ISSCC Conference. The StrongARM is a general-purpose, 32-bit microprocessor with a 16KB instruction cache (32-way set associative), a 16KB write-back data cache (32-way set associative), a write buffer, and a memory-management unit (MMU) combined in a single chip. The five-stage pipeline distributes tasks evenly over time to remove bottlenecks, ensuring high throughput for the core logic. The StrongARM is touted as a low power chip, mainly because of its two power-down modes: idle and sleep. It uses a 1.5 V internal supply, delivers 184 Dhrystone MIPS at 162 MHz, and dissipates .5 W. It optionally operates at 215 MHz with a 2 V internal supply, dissipating 1.1 W. The low-power operation of the device was attributed to simple techniques, such as process shrinks and the simple, single-issue ARM architecture. A power breakdown is given below. It is interesting to note that the caches account for the largest share (43%) of processor power.

	icache	27%
	ibox	18%  (instruction issue)
	dcache	16%
	clock	10%
	immu	 9%
	ebox	 8%  (execution units)
	dmmu	 8%
	others   4%

[Mon96]

68040 compatible 1W Microprocessor

A 1-watt 68040 compatible microprocessor was introduced at the 1994 IEEE Symposium on Low Power Electronics. It achieved a 70% reduction in power from its predecessor, and 17.7 SPECint92/W at 33 MHz and 3.3 V. How was this power reduction accomplished?

Synthesized combinational logic replaced all precharge/discharge PLAs resulting in a 330mW reduction at a cost of a 2.2% increase in die area.
On chip clocks were held static during phase lock.
Static Management techniques were implemented. The modes are Low Power STOP mode, Low Power Frequency Operation mode, and power-supply cycling. The static management techniques eliminate the need for an asynchronous global reset, lowering power and die area.

[Big94]

Multi-Divided Arrays

Multi-divided arrays are the mechanism by which high-capacity DRAMs are constructed. Individual sense-amp blocks can only handle arrays of approximately 256kB, so memories greater than this size must be multi-divided. These blocks may be accessed either one at a time or all at once. Activating all the blocks at once becomes expensive power-wise with large memories. A group of researchers at Hitachi have noticed this trend toward multi-divided arrays in DRAM. They say that the activation scheme of a multi-divided array is the key to DRAM array power reduction [Itoh94]. The Samsung 1Gb DRAM chip [Yoo96] makes heavy use of multi-divided techniques and activates only a small number of blocks (relative to the total number) at any given time. The drawback of activating only a subset of blocks is the extra delay incurred by the increased levels of selection. (This is a classic example of a power-performance tradeoff.) In order to select only a few blocks, a computation must be performed before the row-activation cycle is initiated. This selection would be on the critical path of the memory access. The extreme case would be to have one row access performed at a time, which would incur the maximum (and unacceptable) delay of a 32M:1 selector.

Figure 1: Shared decoder strategies

Figure 1 [Itoh94] shows several different possible memory organizations for sharing decoders. Sharing these decoders is one way to exploit an area/power tradeoff. Shared decoders decrease the necessary area (fewer decoders are needed) but increase the capacitance on individual bit lines, because each line is connected to several memory blocks.

Techniques for Lower Power Using Refresh in DRAMs

The Motorola 4Mx1 low-power CMOS DRAM part [Mot93] has a 128ms refresh cycle. With a 110ns cycle time and 1024 rows to refresh, the device needs to spend only approximately 0.09% of its time refreshing (1024 x 110ns out of every 128ms). This percentage is small enough that it is likely to be overwhelmed by regular data accesses to the device. However, with the advent of the 1Gb DRAM device, the numbers change. The 1Gb DRAM from ISSCC [Yoo96] has a 128ms refresh interval, which requires about 1GB/sec of refresh bandwidth. This is equivalent to the device's peak internal data-transfer bandwidth of 1GB/sec. At peak theoretical operation, then, roughly 50% of the power goes towards refresh, and in periods of non-peak operation the power consumed will be dominated by refresh. This ratio indicates that the always-refresh line of thinking may not be ideal.
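
The refresh arithmetic in this section can be checked with a few lines of code. This is a sketch using only the figures quoted above; it assumes each refresh operation occupies one full cycle, that 1Gb means 2^30 bits, and that refresh and data traffic cost the same energy per byte. The function names are illustrative.

```python
def refresh_duty_cycle(rows, cycle_time_s, refresh_interval_s):
    """Fraction of time spent refreshing if each row takes one cycle."""
    return rows * cycle_time_s / refresh_interval_s


def refresh_bandwidth_bytes_per_s(capacity_bits, refresh_interval_s):
    """Bandwidth needed to touch every bit once per refresh interval."""
    return capacity_bits / 8 / refresh_interval_s


# 4Mx1 part: 1024 refresh cycles of 110 ns every 128 ms.
small = refresh_duty_cycle(1024, 110e-9, 128e-3)

# 1Gb part: 2**30 bits every 128 ms, against 1 GB/s of peak data bandwidth.
big = refresh_bandwidth_bytes_per_s(2**30, 128e-3)
peak = 1e9
share = big / (big + peak)

print(f"4Mx1 refresh duty cycle: {small:.3%}")
print(f"1Gb refresh bandwidth:   {big / 1e9:.2f} GB/s")
print(f"refresh share at peak:   {share:.0%}")
```

With the 128ms interval quoted above, the small part refreshes for roughly 0.09% of the time; a 16ms interval would give roughly 0.7%. For the 1Gb part, refresh and peak data traffic come out nearly equal, which is where the figure of about 50% comes from.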

Intelligent Refresh
In a DRAM cell, a read or write operation has the same effect as a refresh. This means that if a cell has been recently read (or written to), it does not need to be refreshed. This has the attraction of more consistent power behavior: the device takes less power to refresh while accesses are being made to it, so the technique would be most effective during periods of heavy use. A direct implementation of this idea is probably not feasible because of the overhead needed to remember which lines have been recently accessed. Some clever algorithm, however, similar to a clock-paging algorithm (one bit used to approximate LRU), may be applicable.
One similar situation where this concept might be useful is in systems with cache and DRAM on the same chip. If a word line is known to be in the cache, then it does not need to be refreshed. However, because on-chip caches are so small relative to the DRAM, this technique is not likely to make much difference. (A 1Mb cache would only be able to 'prevent' .1% of the refreshes in a 1Gb DRAM.)
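
The one-bit, clock-style bookkeeping suggested for intelligent refresh might look roughly like the sketch below. It is a software model written for illustration, not a description of any existing controller; the class name, method names, and row count are all hypothetical.

```python
class OneBitRefreshTracker:
    """One 'recently accessed' bit per DRAM row, cleared by the refresh sweep."""

    def __init__(self, num_rows):
        self.accessed = [False] * num_rows
        self.refreshed = 0
        self.skipped = 0

    def on_access(self, row):
        # A read or write restores the row's charge, so note it.
        self.accessed[row] = True

    def sweep(self):
        """Visit every row once; refresh only rows not accessed since the last visit."""
        for row, was_accessed in enumerate(self.accessed):
            if was_accessed:
                self.accessed[row] = False  # skip now, reconsider on the next sweep
                self.skipped += 1
            else:
                self.refreshed += 1  # stand-in for issuing a real row refresh


tracker = OneBitRefreshTracker(num_rows=8)
for row in (0, 1, 2):
    tracker.on_access(row)
tracker.sweep()  # rows 0-2 are skipped, rows 3-7 are refreshed
tracker.sweep()  # no accesses in between, so all 8 rows are refreshed
print(tracker.refreshed, tracker.skipped)
```

One caveat of a single bit is that it records only that an access happened during the last sweep period, not when. A skipped row may therefore go almost two sweep periods without being restored, so the sweep period would have to be no more than half the cell retention time, which costs extra refreshes when the memory is idle.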
OS Controlled Refresh
With memory sizes continuing to increase (both system memory and single-chip memory), it is more and more likely that some physical memory is unused at any given time. As it is not necessary to refresh unused memory, a considerable amount of power can be saved by intelligently controlling which pages get refreshed. The OS of a system knows which pages are used and unused, so given the opportunity it could disable refresh on selected pages.
Traditionally, a system only worries about swapping out pages when the memory space is full. Under an OS controlled refresh scheme, the OS could start to swap out pages to save power. The performance/benefit tradeoff of such actions is difficult to analyze because no current operating systems do this. This technique, however, would only help reduce the average power dissipation, not the maximum. This means that it can be used for conserving battery life, but not for preventing a chip meltdown.

A Survey of Power Reduction in SRAM Caches

Techniques to Reduce Power in Fast Wide Memories

CMOS memories have an access path that can be examined in two parts: from the address to the local wordline select, and from the local wordline to the sense amps. Driving the wordline bus and sensing the data consume the most power in this process.

In a paper at the 1994 IEEE Symposium on Low Power Electronics, Bharadwaj Amrutur proposed that power consumption could be reduced by limiting the energy consumed by each bitline. This energy is conserved by limiting the swing of the bitline, which is done by controlling the local wordline drive strength. This circuit technique adds an overhead of two extra columns and rows to implement a reference cell and reference bitline used in the drive strength regulation. The swing on the data lines is also limited.

One other optimization used was to only pre-charge selected blocks that were to be accessed, instead of pre-charging the whole array.

Supply Voltage      Power            Gate Delay
      (V)           (mW)                (ns)
      1.5           5.2                 2.63
      3.0           75.0                0.62
      5.0           66.0                0.38

[Amr94]

6 ns 1.5 V 4 Mb BiCMOS SRAM

One of the problems in designing with Bipolar CMOS is that it is extremely difficult to scale. The fixed .8 V threshold voltage prevents scaling the voltage down as much as in other processes. So why design in a BiCMOS process? Speed. However, the speed benefit is not so great as to exclude consideration of other technologies. The speed-up is around a factor of 2, and BiCMOS designs usually require more area.

The 4 Mb BiCMOS SRAM presented at the 1996 ISSCC conference was a 1.5 V, 6 ns SRAM. This low-power SRAM was achieved using several low-power techniques.

Since the voltage is reduced, the speed benefit could be potentially lost. This can be fixed by using a boost voltage to accelerate the speed of the gates used in address decoding.
The standard method of reading and writing low power SRAM involved a word-boost technique on all cells in the array. This SRAM boosts only 1 of the 16,000 word lines.
The chip includes a stepped-down sense amp.
The chip includes an optimized boost voltage generator.

Process:   .3 um 4-poly 2-metal p-sub triple well BiCMOS
Supply:    1.5-3.3 V   
Access:    6 ns
Power:     180 mW at 1.5 V

[Kuh96]

A Look at a Low Power DSP

Low power DSPs have used basically the same techniques as low power microprocessors, i.e. using a better process and adding a sleep or idle mode to conserve power when not in use. Here is an example of a low power DSP:

A small embedded power management system was included in a DSP introduced at the ISSCC96 conference. The DSP is targeted for mobile phone applications. When talking occurs, the DSP is activated. Otherwise, it goes into a sleep mode to conserve power.

Process                    0.5 um MTCMOS
Chip Size                  225 mm^2
Operating Frequency        13.2 MHz at 1.0 V
MAC Performance            26.4 MOPS at 1.0 V
Power Consumption          2.2 mW/MHz (1.1 mW/MOPS) at 1.0 V
Standby Power              350 uW (active), 600 nW (sleep) at 1.0 V

[Mutoh96]
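
The two power figures in the table are consistent with each other, and together with the sleep figure they give a feel for what the sleep mode buys. The sketch below works through that arithmetic. The fraction of time spent talking is a made-up value for illustration, and the helper function is not taken from the paper.

```python
freq_mhz = 13.2
mw_per_mhz = 2.2
mops = 26.4
mw_per_mops = 1.1
sleep_mw = 600e-6  # 600 nW expressed in mW

# Both routes should give the same active power at 1.0 V.
active_mw_from_clock = freq_mhz * mw_per_mhz
active_mw_from_mops = mops * mw_per_mops


def average_power_mw(active_mw, idle_mw, active_fraction):
    """Time-weighted average of active and idle power."""
    return active_fraction * active_mw + (1 - active_fraction) * idle_mw


talk_fraction = 0.4  # hypothetical share of time the DSP is active
avg_mw = average_power_mw(active_mw_from_clock, sleep_mw, talk_fraction)
print(f"active: {active_mw_from_clock:.2f} mW, average: {avg_mw:.2f} mW")
```

Because the sleep power is several orders of magnitude below the active power, the average is set almost entirely by how often the DSP is awake.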

Conclusions:

As far as implementing low power IRAMs, it looks like we should target a smaller RISC microprocessor (possibly with a vector extension). This will prevent the microprocessor from dominating the power consumption of the IRAM. As far as reducing power in DRAM, we can sub-divide the memory array into blocks and share the row and column decoders. If we activate only the blocks we need, we can save power. We can also perform intelligent refreshes, such as refreshing only blocks that have been written to, instead of an entire array. There are also many circuit tricks that can be perpetrated on the DRAM or SRAM cores in order to optimize for low power. A main concern for power consumption in IRAM will be the interconnect. Application-specific designs will further determine what sort of connection grid will be required.

A rough estimate of the power required for an IRAM can be made by simply adding the power requirements of the microprocessor and the DRAM core. This of course neglects the interconnect scheme and the conversion of the microprocessor to a DRAM process, but it should provide at least a minimum power requirement. In the case of a 1 Gb DRAM with a StrongARM microprocessor, we are looking at a minimum power requirement of 1.03 W plus the interconnect.
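
The 1.03 W figure is consistent with summing the two shrunk currents from the Proportions of Power table and multiplying by a 2 V supply. The supply voltage is an assumption made here because it reproduces the number; it is not stated in the text.

```python
dram_ma = 240        # 1Gb DRAM current from the table
strongarm_ma = 275   # StrongARM current after the shrink, from the table
supply_v = 2.0       # assumed supply voltage; not given in the text

total_w = (dram_ma + strongarm_ma) / 1000 * supply_v
print(f"{total_w:.2f} W")
```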

Bibliography

[Amr94]   Bharadwaj Amrutur. "Techniques to Reduce Power in Fast Wide Memories."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Big94]   Terry Biggs. "A 1 Watt 68040-Compatible Microprocessor."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Burd96]  Tom Burd. "An interview with Tom Burd." Tom Burd's Ph.D. thesis
          topic is power efficient computing.
[Gron96]  Paul Gronowski. "A 433MHz 64b Quad-Issue RISC Microprocessor."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Hor94]   Mark Horowitz. "Low Power Digital Design."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Itoh94]  Kiyoo Itoh. "Trends in Low-Power RAM Circuit Technologies."
          1994 IEEE Symposium on Low Power Electronics. October 1994.
[Kuh96]   Shigeru Kuhara. "A 6 ns 1.5 V 4 Mb BiCMOS SRAM."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mon96]   James Montanaro, et al. "A 160MHz 32b 0.5W CMOS RISC Microprocessor."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mot93]   Dynamic Rams and Memory Modules Data Book.
          Motorola, Inc. 1993.
[Mutoh96] Shin'ichiro Mutoh. "A 1 V Multi-Threshold Voltage CMOS DSP
          with an Efficient Power Management Technique for Mobile Phone
          Application." 1996 IEEE ISSCC Digest of Technical
          Papers. Feb. 8-10, 1996.
[Nitta96] Yasuhiko Nitta. "A 1.6GB/s Data-Rate 1Gb Synchronous DRAM with
          Hierarchical Square-Shape Memory Block and Distributed Bank
          Architecture." 1996 IEEE ISSCC Digest of Technical
          Papers. Feb. 8-10, 1996.
[San96]   Hector Sanchez, et al. "A 200MHz 2.5V 4W Superscalar RISC Micro."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Yoo96]   Jei-Hwan Yoo. "A 32-Bank 1 Gb DRAM with 1 GB/s Bandwidth."
          1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.

Trevor Pering / pering@eecs.berkeley.edu
Heather Bowers / hbowers@cory.eecs.berkeley.edu

March 20, 1996