Device            Process  Current   Shrink  Current  Die (shrunk)  Ref
1Gb DRAM          .16um    240mA     .16um   240mA    576mm^2       [Yoo96]
StrongARM/200/1   .35um    550mA     .16um   275mA    12mm^2        [Mon96]
PPC/200/2         .50um    1600mA    .16um   355mA    9mm^2         [San96]
DEC Alpha/433/4   .35um    12500mA   .16um   6250mA   52mm^2        [Gron96]

The table above lists several devices recently presented at ISSCC '96. Power figures, expressed as supply current, are given as reported in the papers for the original process, along with a theoretical shrink to a common .16um DRAM process for comparison. The shrink reduces the power quadratically with the process, then multiplies by two to compensate for the shift to the DRAM fab process. It reflects the reduced gate sizes and bit-line capacitances, as well as a fudge factor for the optimization opportunities that open up for processors when technology shrinks. Horowitz [Hor94] reports that the power benefit for logic from a process shrink of x is on the order of x^4, while a strict process shrink would yield only an improvement of x. In the processor device names, the first number is the reported operating frequency (in MHz) and the last is the issue width (the degree of superscalar operation). Because the shrink figures are approximations, it is not fair to compare the processors among themselves. All processors include at least 32kB of on-chip cache.
Comparing the low-end processor's power to that of the 1Gb DRAM device, one sees that the processor would consume slightly more power than the DRAM array. These numbers indicate that the power of a small processor (not aimed at high-performance computing) included on-chip with a DRAM device is not likely to be a critical issue. Any serious processing on the level of a high-end Alpha, however, would easily dominate the power of any on-chip DRAM. This is further evidence for Horowitz's prediction that IRAM might be best targeted at the embedded applications market.
The big power sink of an IRAM-type device, however, is likely to be the interconnect between the DRAM and processor blocks. These lines would be needed to fill cache lines, feed vector units, or do whatever else is dreamed up by the IRAM students. The sheer number of these lines, and the distance they may have to travel (halfway across the chip), make them a problematic power drain. The interconnect problem is fast becoming a concern for modern logic designers. Specific numbers on interconnect were unobtainable at the time of writing, but this is an area for future research. [Burd96]
A paper by Mark Horowitz at the 1994 IEEE Symposium on Low Power Electronics [Hor94] did a good job of summarizing low power CMOS circuit design techniques at a relatively high level. The first techniques developed were mostly common-sense design practices, such as lowering the power supply to the chip rather than using a 5 volt supply with internal voltage regulation. The main techniques that Horowitz discusses are voltage scaling, transistor sizing, and adiabatic circuits, as well as technology scaling, transition reduction, and parallelism.
Voltage scaling is the easiest and most effective way of controlling power. Adjustments to the operating voltage affect the delay in a linear manner, while having a quadratic effect on the power. The most common technique is to increase the performance of a system architecturally, and then lower the voltage to reduce the power consumption (see parallelism below).
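As a rough illustration of the trade-off just described, the sketch below applies the first-order model stated in the text: power goes with the square of the supply voltage, and delay scales inversely with it. The function name and the 3.3 V to 1.65 V example are assumptions chosen for illustration, and real devices depart from this model as the supply approaches the transistor threshold voltage.

```python
def scale_voltage(v_old, v_new):
    """Return (power_ratio, delay_ratio) for a supply change v_old -> v_new.

    Assumed first-order model, not taken from any specific device:
      power at a fixed clock rate ~ V^2
      gate delay                  ~ 1 / V
    """
    ratio = v_new / v_old
    return ratio ** 2, 1.0 / ratio


# Hypothetical example: halving a 3.3 V supply.
power_ratio, delay_ratio = scale_voltage(3.3, 1.65)
print(f"power x{power_ratio:.2f}, delay x{delay_ratio:.2f}")
```

Under these assumptions, halving the supply cuts power to a quarter while doubling the gate delay, which is why the lost speed is usually recovered architecturally.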
Transistor sizing directly trades speed for power. Decreasing the size of a transistor lowers its power requirements but also decreases its current drive, thereby making the gates slower. Decreasing the size of only those transistors that are off the critical path can decrease power without affecting the delay. This is difficult to implement in synthesis tools, however, and thus is not a widely used technique.
Adiabatic circuits are also known as charge recovery circuits. They resonate the load capacitance with an inductor in order to recover some of the energy used to change the capacitor's voltage. This is not a widely used technique because it introduces substantial delay.
Technology shrinks cause the capacitance of nets to decrease. This decrease in capacitance not only increases the performance of a design, but also reduces its power requirements. This is not so much a technique as an effect that occurs with the passage of time. At constant performance, the power dissipation of a circuit is related to x^4, where x is the ratio of the process shrink.
In static CMOS design, a transition on a bit line is the fundamental event that uses power. Gating clocks to functional blocks is one common and effective method for reducing unnecessary switching. It is also theoretically possible to synthesize circuits so as to reduce the number of spurious transitions, but this is hard to achieve in practice.
Parallelism can be used in a system to increase overall performance. The voltage of the system can then be reduced, lowering the performance to its original level and lowering the power consumed even further. Adding parallelism incurs overhead (control, inefficiency), so this is not always a win-win situation. (For example, the overhead of superscalar operation makes it a poor choice for power reduction.)
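The parallelism argument can be put into numbers with a small sketch. It assumes the same first-order model as the voltage-scaling discussion (power ~ C * V^2 * f, delay ~ 1/V) and holds throughput constant. The function and its `overhead` parameter are hypothetical, introduced only to show how control overhead eats into the saving.

```python
def parallel_power_ratio(n_units, overhead=0.0):
    """Power of an n-way parallel design relative to the original design,
    at the same total throughput.

    Assumed model: each unit runs at 1/n of the original clock, so its
    supply can drop to roughly 1/n of the original (delay ~ 1/V), and
    power ~ C * V^2 * f. `overhead` is the extra switched capacitance
    (control, routing) as a fraction of the replicated logic.
    """
    capacitance = n_units * (1.0 + overhead)  # n copies plus overhead
    voltage = 1.0 / n_units                   # lowered supply
    frequency = 1.0 / n_units                 # slower clock per unit
    return capacitance * voltage ** 2 * frequency


for n, ovh in [(2, 0.0), (2, 0.2), (4, 0.5)]:
    ratio = parallel_power_ratio(n, ovh)
    print(f"{n}-way, {ovh:.0%} overhead: power x{ratio:.3f}")
```

The model ignores the threshold voltage, which limits how far the supply can really be lowered, so the printed ratios are best read as optimistic bounds.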
StrongARM Microprocessor (.5 W)
The StrongARM microprocessor was presented at the 1996 ISSCC Conference. The StrongARM is a general-purpose, 32-bit microprocessor with a 16KB instruction cache (32-way set associative); a 16KB, write-back data cache (32-way set associative); a write buffer; and a memory-management unit (MMU) combined in a single chip. The five-stage pipeline distributes tasks evenly over time to remove bottlenecks, ensuring high throughput for the core logic. The StrongARM is touted as a low power chip, mainly because of its two power-down modes: idle and sleep. It uses a 1.5 V internal supply, delivers 184 Dhrystone MIPS at 162 MHz, and dissipates .5 W. It optionally operates at 215 MHz with a 2 V internal supply, dissipating 1.1 W. The low-power operation of the device was attributed to simple techniques, such as process shrinks and the simple (single-issue) ARM architecture. A power breakdown is given below. It is interesting to note that the caches account for the largest share (43%) of processor power.
icache   27%
ibox     18%  (instruction issue)
dcache   16%
clock    10%
immu      9%
ebox      8%  (execution units)
dmmu      8%
others    4%
[Mon96]
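The 43% cache figure can be checked directly from the breakdown. The short sketch below sums the reported percentages; the grouping into "caches" and "MMUs" is our own bookkeeping, not a categorization taken from [Mon96].

```python
# Reported StrongARM power breakdown, in percent of total chip power.
breakdown = {
    "icache": 27, "ibox": 18, "dcache": 16, "clock": 10,
    "immu": 9, "ebox": 8, "dmmu": 8, "others": 4,
}

total = sum(breakdown.values())
caches = breakdown["icache"] + breakdown["dcache"]
mmus = breakdown["immu"] + breakdown["dmmu"]

print(f"total: {total}%")
print(f"caches: {caches}%")
print(f"caches + MMUs: {caches + mmus}%")
```

Grouped this way, the memory-side structures (caches plus MMUs) account for well over half of the chip's power, while the execution units take only 8%.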
68040 compatible 1W Microprocessor
A 1-watt 68040-compatible microprocessor was introduced at the 1994 IEEE Symposium on Low Power Electronics. It achieved a 70% reduction in power relative to its predecessor, reaching 17.7 SPECint92/W at 33 MHz and 3.3 V. How was this power reduction accomplished?
[Big94]
Figure 1 [Itoh94] shows several possible memory organizations for sharing decoders. Sharing decoders is one way to exploit an area/power trade-off. Shared decoders decrease the necessary area (fewer decoders are needed) but increase the capacitance on individual bit lines, because each line is connected to several memory blocks.
The Motorola 4Mx1 low-power CMOS DRAM part [Mot93] has a 128ms refresh cycle. With a 110ns cycle time and 1024 banks, the device is required to be refreshing approximately 0.7% of the time. This percentage is small enough that it is likely to be overwhelmed by regular data accesses to the device. With the advent of the 1Gb DRAM device, however, the numbers change. The 1Gb DRAM from ISSCC [Yoo96] has a 128ms refresh interval, requiring 1GB/sec of refresh bandwidth. This is equal to the peak internal data-transfer bandwidth of 1GB/sec, which indicates that at peak theoretical operation, 50% of the power goes towards refresh. In periods of non-peak operation, the power consumed will be dominated by refresh. This ratio suggests that the always-refresh line of thinking may not be ideal.
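The 1Gb refresh figure follows from simple arithmetic, sketched below. The sketch assumes that every bit must be touched once per refresh interval and, as the text implies, that power is roughly proportional to bandwidth. The constant and function names are ours.

```python
DRAM_BITS = 2 ** 30           # 1 Gb device [Yoo96]
REFRESH_INTERVAL_S = 0.128    # 128 ms refresh interval
PEAK_DATA_BW = 1e9            # 1 GB/s peak internal bandwidth, in bytes/s


def refresh_bandwidth(bits, interval_s):
    """Bytes per second needed to touch every bit once per interval."""
    return bits / 8 / interval_s


def refresh_share(refresh_bw, data_bw):
    """Fraction of total bandwidth (and, by assumption, power) spent on refresh."""
    return refresh_bw / (refresh_bw + data_bw)


refresh_bw = refresh_bandwidth(DRAM_BITS, REFRESH_INTERVAL_S)
print(f"refresh bandwidth: {refresh_bw / 1e9:.2f} GB/s")
print(f"share at peak data rate: {refresh_share(refresh_bw, PEAK_DATA_BW):.0%}")
print(f"share at 10% of peak:    {refresh_share(refresh_bw, 0.1 * PEAK_DATA_BW):.0%}")
```

With these inputs the refresh stream comes out at about 1 GB/s, roughly half of the total at peak, and it takes over almost completely once the data rate falls to a tenth of peak.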
For a DRAM cell, a read or write operation also accomplishes the refresh. This means that a cell that has recently been read (or written to) does not need to be refreshed. Skipping such refreshes has the attraction of more consistent power behavior. The device takes less power to refresh while accesses are being made to it, so this technique would be most effective during periods of heavy use. Implementing this idea is probably not technically feasible because of the overhead needed to remember which lines have been recently accessed. Some clever algorithm similar to a clock-paging algorithm (one bit used to approximate LRU) may, however, be applicable.
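A minimal sketch of the one-bit, clock-style bookkeeping mentioned above might look like the following. The class and its counters are hypothetical and model only the bookkeeping, not any real DRAM controller: each row has an "accessed" bit that is set on a read or write, and the periodic sweep skips (and clears) rows whose bit is set.

```python
class RefreshTracker:
    """One access bit per row, in the spirit of a clock paging algorithm."""

    def __init__(self, num_rows):
        self.accessed = [False] * num_rows
        self.refreshes_issued = 0
        self.refreshes_skipped = 0

    def access(self, row):
        # A read or write restores the row's charge, so mark it as fresh.
        self.accessed[row] = True

    def sweep(self):
        # One pass of the refresh controller over all rows.
        for row, recently_used in enumerate(self.accessed):
            if recently_used:
                self.accessed[row] = False   # give the row one more period
                self.refreshes_skipped += 1
            else:
                self.refreshes_issued += 1   # row must be refreshed now


tracker = RefreshTracker(num_rows=8)
tracker.access(1)
tracker.access(3)
tracker.sweep()
print(tracker.refreshes_issued, tracker.refreshes_skipped)
```

One consequence of using a single bit: a row accessed just after one sweep is skipped at the next and refreshed only at the one after that, so in this sketch the sweep period would have to be at most half the cell retention time. Idle rows are then refreshed more often than strictly necessary, which is part of the overhead the text worries about.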
One similar situation where this concept might be useful is a system with cache and DRAM on the same chip. If a word line is known to be in the cache, then it does not need to be refreshed. However, because on-chip caches are so small relative to the DRAM, this technique is not likely to make much difference. (A 1Mb cache would only be able to 'prevent' .1% of the refreshes in a 1Gb DRAM.)
With memory sizes steadily increasing (both system memory and single-chip memory), it is more and more likely that part of physical memory is unused at any given time. As it is not necessary to refresh unused memory, a considerable amount of power can be saved by intelligently controlling which pages get refreshed. The OS of a system knows which pages are used and unused, so given the opportunity it could disable refresh on selected pages.
Traditionally, a system only worries about swapping out pages when the memory space is full. Under an OS-controlled refresh scheme, the OS could start to swap out pages to save power. The cost/benefit trade-off of such actions is difficult to analyze because no current operating systems do this. This technique, however, would only help reduce the average power dissipation, not the maximum. This means that it can be used for conserving battery life, but not for preventing a chip meltdown.
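The point that OS-controlled refresh helps the average but not the maximum can be illustrated with a toy model. Everything below is hypothetical: the page count, the full-array refresh power, and the usage trace are invented for illustration, and refresh power is assumed to scale linearly with the number of pages being refreshed.

```python
def refresh_power_mw(pages_in_use, total_pages, full_refresh_mw):
    """Refresh power when only in-use pages are refreshed (assumed linear)."""
    if not 0 <= pages_in_use <= total_pages:
        raise ValueError("pages_in_use must be between 0 and total_pages")
    return full_refresh_mw * pages_in_use / total_pages


TOTAL_PAGES = 1024            # hypothetical memory size, in pages
FULL_REFRESH_MW = 100.0       # hypothetical power to refresh every page
usage_trace = [256, 512, 1024, 128]   # hypothetical pages in use over time

samples = [refresh_power_mw(p, TOTAL_PAGES, FULL_REFRESH_MW) for p in usage_trace]
average_mw = sum(samples) / len(samples)
peak_mw = max(samples)
print(f"average: {average_mw:.1f} mW, peak: {peak_mw:.1f} mW")
```

Whenever the trace reaches full memory use, the peak equals the always-refresh figure, so the thermal design still has to budget for the worst case even though the battery sees a much lower average.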
Techniques to Reduce Power in Fast Wide Memories
CMOS memories have an access path that can be examined in two parts: from the address to the local wordline select, and from the local wordline to the sense amps. Driving the wordline bus and sensing the data consume the most power in this process.
A paper by Bharadwaj Amrutur at the 1994 IEEE Symposium on Low Power Electronics proposed that power consumption could be reduced by limiting the energy consumed by each bitline. This energy is conserved by limiting the swing of the bitline, which is done by controlling the local wordline drive strength. This circuit technique adds an overhead of two extra columns and rows to implement the reference cell and reference bitline used in the drive strength regulation. The swing on the data lines is also limited.
One other optimization used was to only pre-charge selected blocks that were to be accessed, instead of pre-charging the whole array.
Supply Voltage (V)   Power (mW)   Gate Delay (ns)
1.5                  5.2          2.63
3.0                  75.0         0.62
5.0                  66.0         0.38
[Amr94]
6 ns 1.5 V 4 Mb BiCMOS SRAM
One of the problems in designing with BiCMOS (bipolar CMOS) is that it is extremely difficult to scale. The fixed .8 V threshold voltage prevents scaling the voltage down as much as in other processes. So why design in a BiCMOS process? Speed. However, the speed benefit is not so great as to exclude consideration of other technologies. The speed-up is around a factor of 2, and BiCMOS designs usually require more area.
The 4 Mb BiCMOS SRAM presented at the 1996 ISSCC conference was a 1.5 V, 6 ns SRAM. This low-power SRAM was achieved using several low-power techniques.
Process: .3 um 4-poly 2-metal p-sub triple well BiCMOS
Supply:  1.5-3.3 V
Access:  6 ns
Power:   180 mW at 1.5 V
[Kuh96]
Low power DSPs have used basically the same techniques as low power microprocessors, i.e. using a better process and adding a sleep or idle mode to conserve power when not in use. Here is an example of a low power DSP:
A small embedded power management system was included in a DSP introduced at the ISSCC96 conference. The DSP is targeted for mobile phone applications. When talking occurs, the DSP is activated. Otherwise, it goes into a sleep mode to conserve power.
Process              0.5 MTCMOS
Chip Size            225 mm^2
Operating Frequency  13.2 MHz at 1.0 V
MAC Performance      26.4 MOPS at 1.0 V
Power Consumption    2.2 mW/MHz (1.1 mW/MOPS) at 1.0 V
Standby Power        350 uW (active), 600 nW (sleep) at 1.0 V
[Mutoh96]
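To see what the sleep mode buys, the sketch below turns the reported figures into an operating power and a duty-cycled average. The 10% talk-time duty cycle is a hypothetical value, and the reading of the two standby figures (idle with the logic powered versus idle in sleep mode) is our interpretation of the table rather than a statement from [Mutoh96].

```python
FREQ_MHZ = 13.2             # operating frequency at 1.0 V
MW_PER_MHZ = 2.2            # reported power consumption
STANDBY_ACTIVE_MW = 0.350   # 350 uW standby, active mode
STANDBY_SLEEP_MW = 0.0006   # 600 nW standby, sleep mode

operating_mw = FREQ_MHZ * MW_PER_MHZ


def average_power_mw(duty_cycle, standby_mw):
    """Average power if the DSP runs for `duty_cycle` of the time
    and sits in the given standby state for the rest."""
    return duty_cycle * operating_mw + (1.0 - duty_cycle) * standby_mw


DUTY = 0.10   # hypothetical fraction of time spent talking
print(f"operating power: {operating_mw:.2f} mW")
print(f"average, active standby: {average_power_mw(DUTY, STANDBY_ACTIVE_MW):.3f} mW")
print(f"average, sleep standby:  {average_power_mw(DUTY, STANDBY_SLEEP_MW):.3f} mW")
```

The operating power works out to about 29 mW, which agrees with the per-MOPS figure (26.4 MOPS at 1.1 mW/MOPS). Under the assumed duty cycle, the sleep mode removes nearly all of the idle-time contribution.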
For implementing low power IRAMs, it looks like we should target a smaller RISC microprocessor (possibly with a vector extension). This will prevent the microprocessor from dominating the power consumption of the IRAM. To reduce power in the DRAM, we can subdivide the memory array into blocks and share the row and column decoders. If we activate only the blocks we need, we can save power. We can also perform intelligent refreshes, such as refreshing only blocks that have been written to instead of the entire array. There are also many circuit tricks that can be applied to the DRAM or SRAM cores in order to optimize for low power. A main power concern in IRAM will be the interconnect. Application-specific designs will further determine what sort of connection grid will be required.
A rough estimate of the power required for an IRAM is simply the sum of the power requirements of the microprocessor and the DRAM core. This of course neglects the interconnect scheme and the conversion of the microprocessor to a DRAM process, but it should provide at least a minimum power requirement. In the case of a 1 Gb DRAM with a StrongARM microprocessor, we are looking at a minimum power requirement of 1.03 W plus the interconnect.
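The 1.03 W figure is consistent with adding the shrunk supply currents from the device table (240 mA for the 1Gb DRAM, 275 mA for the StrongARM) and multiplying by a 2.0 V supply. The supply voltage is not stated in the text; 2.0 V is an assumption inferred from the arithmetic, and the sketch below only reproduces that calculation.

```python
DRAM_MA = 240.0            # 1Gb DRAM current at .16um [Yoo96]
STRONGARM_MA = 275.0       # StrongARM current after the theoretical shrink [Mon96]
ASSUMED_SUPPLY_V = 2.0     # assumption: not given in the text


def iram_power_w(currents_ma, supply_v):
    """Lower-bound IRAM power: sum of block currents times supply voltage.
    Interconnect power is deliberately left out, as in the estimate above."""
    return sum(currents_ma) * supply_v / 1000.0


estimate_w = iram_power_w([DRAM_MA, STRONGARM_MA], ASSUMED_SUPPLY_V)
print(f"minimum IRAM power: {estimate_w:.2f} W (plus interconnect)")
```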
[Amr94] Bharadwaj Amrutur. "Techniques to Reduce Power in Fast Wide Memories." 1994 IEEE Symposium on Low Power Electronics. October 1994.
[Big94] Terry Biggs. "A 1 Watt 68040-Compatible Microprocessor." 1994 IEEE Symposium on Low Power Electronics. October 1994.
[Burd96] Tom Burd. "An interview with Tom Burd." Tom Burd's Ph.D. thesis topic is power-efficient computing.
[Gron96] Paul Gronowski. "A 433MHz 64b Quad-Issue RISC Microprocessor." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Hor94] Mark Horowitz. "Low Power Digital Design." 1994 IEEE Symposium on Low Power Electronics. October 1994.
[Itoh94] Kiyoo Itoh. "Trends in Low-Power RAM Circuit Technologies." 1994 IEEE Symposium on Low Power Electronics. October 1994.
[Kuh96] Shigeru Kuhara. "A 6 ns 1.5 V 4 Mb BiCMOS SRAM." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mon96] James Montanaro, et al. "A 160MHz 32b 0.5W CMOS RISC Microprocessor." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Mot93] Dynamic RAMs and Memory Modules Data Book. Motorola, Inc. 1993.
[Mutoh96] Shin'ichiro Mutoh. "A 1 V Multi-Threshold Voltage CMOS DSP with an Efficient Power Management Technique for Mobile Phone Application." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Nitta96] Yasuhiko Nitta. "A 1.6GB/s Data-Rate 1Gb Synchronous DRAM with Hierarchical Square-Shape Memory Block and Distributed Bank Architecture." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[San96] Hector Sanchez, et al. "A 200MHz 2.5V 4W Superscalar RISC Micro." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
[Yoo96] Jei-Hwan Yoo. "A 32-Bank 1 Gb DRAM with 1 GB/s Bandwidth." 1996 IEEE ISSCC Digest of Technical Papers. Feb. 8-10, 1996.
Trevor Pering / pering@eecs.berkeley.edu
Heather Bowers / hbowers@cory.eecs.berkeley.edu
March 20, 1996