# Architectures for Extremely Scaled Memories

Paul Franzon

**Department of Electrical and Computer Engineering** 

paulf@ncsu.edu

919.515.7351

# **High Level Overview**

## **Challenges for Memories**

- Bandwidth
- Power consumption
- Resiliency
- Flexibility
- Scaling (density, size, level of integration)

## **Opportunities for Memories**

- 3DIC with TSV
- Architectural Customization
- Use of memory in computing

# **Challenge: Bandwidth**

## Soon to exceed 1 TBps

| Parameter             | 2004 baseline | Multi-core approach | Reverse scaling | Reverse scaling |
|-----------------------|---------------|---------------------|-----------------|-----------------|
| Frequency             | 4 GHz         | 8 GHz               | 8 GHz           | 4 GHz           |
| No. of cores          | 1             | 4                   | 16              | 16              |
| Core rel. IPC         | 1             | 1                   | 0.5             | 1               |
| Total Flops           | 32 GFlops     | 256 GFlops          | 512 GFlops      | 512 GFlops      |
| Supply                | 1.2 V         | 1.0 V               | 1.0 V           | 1.0 V           |
| Power                 | 84 W          | 233 W               | 233 W           | 117-163 W       |
| Bandwidth requirement | 32 GB/s       | 256 GB/s            | 512 GB/s        | 512 GB/s        |

MULTICORE AND REVERSE SCALING



Intel

H. P. Hofstee, "Future microprocessors and off-chip SOP interconnect," IEEE Transactions on Advanced Packaging, Volume 27, Issue 2, May 2004, pp. 301-303.
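The Total Flops and Bandwidth rows of the table above follow a simple pattern, which the sketch below reproduces. The figure of 8 flops per cycle per core and the ratio of 1 byte of memory traffic per flop are inferred from the table's numbers; neither is stated in the source, so treat both as assumptions.

```python
# Sketch: reproduce the "Total Flops" and "Bandwidth requirement" rows.
# ASSUMPTIONS (inferred from the table, not stated in the source):
#   - each core delivers 8 flops per cycle at relative IPC = 1
#   - memory bandwidth demand is 1 byte per flop
FLOPS_PER_CYCLE = 8
BYTES_PER_FLOP = 1

# (label, frequency in GHz, number of cores, relative IPC per core)
SCENARIOS = [
    ("2004 baseline", 4, 1, 1.0),
    ("Multi-core approach", 8, 4, 1.0),
    ("Reverse scaling, 8 GHz", 8, 16, 0.5),
    ("Reverse scaling, 4 GHz", 4, 16, 1.0),
]


def total_gflops(freq_ghz, cores, rel_ipc):
    """Aggregate throughput in GFlops."""
    return freq_ghz * cores * rel_ipc * FLOPS_PER_CYCLE


def bandwidth_gbytes_per_s(gflops):
    """Memory bandwidth demand in GB/s for a given throughput."""
    return gflops * BYTES_PER_FLOP


for label, freq, cores, ipc in SCENARIOS:
    gflops = total_gflops(freq, cores, ipc)
    print(f"{label}: {gflops:.0f} GFlops, {bandwidth_gbytes_per_s(gflops):.0f} GB/s")
```

Under these assumptions, doubling the core count at fixed frequency doubles the bandwidth demand, which is why the requirement is heading past 1 TBps.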



# **Challenge: Power**

# Specifically providing this bandwidth at reduced power

- DDR3: 1 TBps → 600 W of power



Figure 6.25: DDR3 current breakdown for Idle, Active, Read and Write.

## **Comparative power consumptions**

[Figure: energy comparison chart for 64-bit words. Labels, in extraction order:]

- 4.8 nJ/word DDR3
- 0.4 nJ/cycle MIPS 64 core\*
- 45 nm 0.8 V FPU 38 pJ/Op
- 128 pJ/Word
- 20 mV I/O
- 40 pJ/Word
- Rotating Disk

Without better solutions, memory power will dominate computing

\* At 90 nm. Includes 40 kB cache, no FPU
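The 600 W figure for DDR3 at 1 TBps follows directly from the 4.8 nJ/word number for 64-bit words. A minimal sketch of the arithmetic, assuming the energy per word stays constant regardless of bandwidth (a simplification, not a claim from the source):

```python
# Sketch: memory power needed to sustain a given bandwidth.
# The 4.8 nJ/word DDR3 figure and the 64-bit word size come from the slide.
# ASSUMPTION: energy per word is constant at any bandwidth.
WORD_BYTES = 8              # 64-bit words
DDR3_J_PER_WORD = 4.8e-9    # 4.8 nJ/word


def memory_power_watts(bandwidth_bytes_per_s, joules_per_word, word_bytes=WORD_BYTES):
    """Power = (words transferred per second) x (energy per word)."""
    words_per_s = bandwidth_bytes_per_s / word_bytes
    return words_per_s * joules_per_word


ONE_TBPS = 1e12  # 1 TB/s in bytes per second
print(f"DDR3 at 1 TBps: {memory_power_watts(ONE_TBPS, DDR3_J_PER_WORD):.0f} W")
```

The same function can be applied to any other per-word energy to see what an interface would need to reach for a target power budget.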



[Figure: I/O circuit schematic with a 16:1 mux, a 16-way demux, VTERM termination, and data lines DQ15..0 and DQN15..0; annotated "Inefficient".]

# Where does the power go?

### Command, address, data pipeline and "assist" circuits

- Many flip-flops
- DRAM process not ideal

## Input/Output

- Difficult timing specs consume considerable power
- More than 40 mW/pin



Figure 6.24: Block diagram of 1Gbit, X8 DDR2 device.



# **Power Scaling**

## **Scaling Core Voltage**

- Today: 1.8 V
- Tomorrow: possibly 1.0 V, but scaling slowly
- What would be required to scale to 0.6 V?
- Advantages: Core power reduction; Reduced need for charge pumps
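To give a rough sense of the payoff from core voltage scaling, the sketch below applies the first-order CMOS dynamic power model, P ∝ C·V²·f. That model, and holding capacitance and frequency fixed, are assumptions of this example rather than statements from the slides; leakage and charge-pump overheads are ignored.

```python
# Sketch: relative dynamic power versus core supply voltage.
# ASSUMPTION: first-order model P ~ C * V^2 * f with C and f held constant.
V_TODAY = 1.8  # volts, "Today" value from the slide


def relative_dynamic_power(v_new, v_ref=V_TODAY):
    """Dynamic power at v_new as a fraction of the power at v_ref."""
    return (v_new / v_ref) ** 2


for volts in (1.8, 1.0, 0.6):
    frac = relative_dynamic_power(volts)
    print(f"{volts:.1f} V core: {frac:.2f}x the 1.8 V dynamic power")
```

Under this model, moving from 1.8 V to 1.0 V cuts dynamic core power to roughly a third, and 0.6 V to roughly a ninth.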

## Scaling Command/Address/Data power

- Complex pipeline with many registers
- Increased desire for this pipeline to be configurable, increasing its design challenge and power consumption

# **Challenge: Resiliency**

### **Issues:**

- Soft error rate (SER) of SRAM, driven by single-event upsets (SEUs)
- Checkpointing and resiliency of entire processor
- Future scaled server computers could spend 80% of their time checkpointing

| Component                  | FIT per component | Components per 64K system | FIT per system |
|----------------------------|-----------|----------------|---------|
| DRAM                       | 5         | 608,256        | 3,041K  |
| Compute + I/O ASIC         | 20        | 66,560         | 1,331K  |
| ETH Complex                | 160       | 3,024          | 484K    |
| Non-redundant power supply | 500       | 384            | 384K    |
| Link ASIC                  | 25        | 3,072          | 77K     |
| Clock chip                 | 6.5       | 1,200          | 8K      |
| Total FITs                 |           |                | 5,315K  |

Table 6.12: BlueGene FIT budget.

#### Note: DRAM Failures almost all due to packaging
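A FIT is one failure per 10^9 device-hours, so the total FIT budget converts directly into a mean time between failures for the whole system. The sketch below uses the "Total FITs" value from Table 6.12 as given; it assumes constant, independent failure rates and that any single component failure interrupts the system, which are simplifications of this example.

```python
# Sketch: convert a system FIT budget into mean time between failures.
# 1 FIT = 1 failure per 1e9 device-hours (standard definition).
# ASSUMPTIONS: constant, independent failure rates; any failure stops the system.
HOURS_PER_BILLION = 1e9


def mtbf_hours(total_fit):
    """Mean time between failures, in hours, for a given total FIT rate."""
    return HOURS_PER_BILLION / total_fit


SYSTEM_FIT = 5_315_000  # "Total FITs" from Table 6.12 (5,315K)

hours = mtbf_hours(SYSTEM_FIT)
print(f"System MTBF: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

An interruption roughly once a week at this scale is what makes checkpoint overhead a first-order design concern.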

# **Challenge: Cost per bit**

#### **Issues:**

| Technology    | Cell Size     | Comments                         |
|---------------|---------------|----------------------------------|
| DRAM          | 6F²           | Capacitance scaling challenge    |
| Flash         | 4.5F²         | Scaling uncertainties            |
| PCRAM         | 5.5F²         | Density challenges               |
| Resistive RAM | 4F² to 6F²    | Most promising? F can be small.  |

- **Fill Factor** (% of total silicon area used for memory cells)
  - Sub-array size
  - Area of peripheral and interface circuits
  - Most DRAMs: ~30% to 40%
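Cell size and fill factor together set the achievable bit density: each cell occupies k·F² of silicon, and only the fill-factor fraction of the die holds cells. The sketch below works this through. The 45 nm feature size is a hypothetical value chosen for illustration, and applying the DRAM fill factor to the other technologies is also an assumption.

```python
# Sketch: bit density from cell size (in units of F^2) and fill factor.
# ASSUMPTIONS: F = 45 nm is hypothetical; the 35% fill factor (midpoint of the
# 30% to 40% range quoted for DRAM) is applied to every technology.
F_NM = 45
FILL_FACTOR = 0.35


def bits_per_mm2(cell_factor, feature_nm, fill_factor):
    """Bits per mm^2 for a cell of area cell_factor * F^2."""
    f_mm = feature_nm * 1e-6                 # nm -> mm
    cell_area_mm2 = cell_factor * f_mm ** 2  # area of one cell
    return fill_factor / cell_area_mm2


CELL_FACTORS = [("DRAM", 6.0), ("Flash", 4.5), ("PCRAM", 5.5), ("Resistive RAM, best case", 4.0)]

for name, factor in CELL_FACTORS:
    mbit = bits_per_mm2(factor, F_NM, FILL_FACTOR) / 1e6
    print(f"{name}: {mbit:.1f} Mbit/mm^2")
```

Because density scales linearly with fill factor, raising it from 30% to 40% is worth as much as shrinking the cell from 6F² to 4.5F².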



Figure 6.13: ITRS roadmap memory density projections.

# Speed/Power ↔ Area tradeoff

## Example: DRAM vs. Reduced Latency DRAM (RLDRAM)



(a) A Conventional DRAM



(b) A Reduced Latency DRAM

Figure 6.20: Reduced latency DRAM.

# **High Level Overview**

## **Challenges for Memories**

- Bandwidth
- Power consumption
- Resiliency
- Flexibility
- Scaling (density, speed, power)

## **Opportunities for Memories**

- 3DIC with TSV
- Architectural Customization
- 1R1D cell
- Increased use of memory in logic and routing

# **3DIC with Through Silicon Vias**



S. Denda, Nagano Prefectural Institute of Technology.

# **Coarse pitch TSV**

- Pitch: 40 µm to 250 µm

### Advantages

- Reduces need for wafer thinning
- Established production route because of cell phone cameras

### Disadvantages

- Limits architectural solutions
- Really Advanced Packaging, not advanced integration



Samsung



# **High Density TSV**

- Pitch: 1 µm to 10 µm

### Advantages

- Permits architectural optimization

### Disadvantages

- Adds processing cost
- Adds complexity in design and test
- Limited supply chain



MIT LL





# **3-Tier 3DIC Cross-Section**

Second DARPA *Multiproject Run* (3DM2)

Two Digital & One RF 180-nm 1.5V FDSOI CMOS Tiers



#### **3DM2 Process Highlights**

- 11 metal interconnect levels
- 1.75-μm 3D via tier interconnect
- Stacked 3D vias allowed
- Tier-2 back-metal/back-via process
- 2-μm-thick RF back metal
- Tier-3 W gate shunt
- Tier-3 silicide block

#### **MIT Lincoln Labs**

## **Tezzaron 3D Technology: 0.13 µm Bulk CMOS**



# **3DIC and Memory**

## **Immediate application space:**

- 3D memory stacking with coarse pitch TSVs
- Challenges:
  - Justifying initial cost
  - Cost scaling

## More exciting application space:

- 3D-specific architectures
  - Memory-on-logic
  - High-density TSVs
- Challenges
  - Cost; test; design complexity

# Example

## **3D Synthetic Aperture Radar Processor**

Specifically FFT engine

## Opportunities Exploited

- Co-architected memory and logic
- The 3D-specific design achieved the following:
  - 65% power reduction
  - 800% increase in memory bandwidth
  - At the cost of a 22% increase in total silicon area (for the repartitioned memory)
- 1024-point FFT: 16 GFLOPS, 50 GBps in 2.6 x 3 mm

# **3D FFT for Radar Processor**

2DIC "optimal" design (+/-)



One Big Slow Memory on Shared Bus

Table 3: Read and write energy from Cacti comparing the unoptimized to the optimized design.

| Metric                | Unoptimized | Optimized | Improvement |
|-----------------------|-------------|-----------|-------------|
| Wires (#)             | 150         | 2272      | -1414.7%    |
| Bandwidth (GBps)      | 13.4        | 128.4     | 854.9%      |
| Energy per write (pJ) | 14.48       | 6.142     | 57.6%       |
| Energy per read (pJ)  | 68.205      | 26.718    | 60.8%       |

#### **3DIC Optimal design**



Multiple Individual Fast Memories

- 60% reduction in memory power
- 67% increase in memory area
- 8x increase in bandwidth

# **3D FFT Floorplan**

# Support multiple small memories WITHOUT an interconnect penalty

Gives 60% memory power savings

Memories communicate vertically only




# **Implications of 3D**



Multiple Individual Fast Memories

# What are the differences between 2D and 3D implementations of <u>THIS</u> architecture?

| Metric                      | 2D     | 3D    | Improvement |
|-----------------------------|--------|-------|-------------|
| Total Area (mm²)            | 31.36  | 23.40 | 25.3%       |
| Core Area (mm²)             | 29.16  | 20.16 | 30.9%       |
| Mean Net Length (µm)        | 836.0  | 392.9 | 53.0%       |
| Total Wire Length (m)       | 19.107 | 8.238 | 56.9%       |
| Max Speed (MHz)             | 63.7   | 79.4  | 24.6%       |
| Critical Path (ns)          | 15.7   | 12.6  | 19.7%       |
| Logic Power @ 63.7 MHz (mW) | 340.0  | 324.9 | 4.4%        |
| Logic Power @ 79.4 MHz (mW) | -      | 409.2 |             |
| FFT Logic Energy (nJ)       | 3.552  | 3.366 | 5.2%        |

# **Memory bank size tradeoffs**

- E.g. 32 x 2 kbit SRAM: 10x less energy/bit than 1 x 64 kbit SRAM
- With 17% increase in area (partially recoverable in 3D)

[Figure: SRAM\_Energy surface plot, energy from 0.00E+00 to 4.50E-09, axes labeled Rows and Cols with ticks from 8 to 1024, and E\_read series for 128, 256, 512, and 1024 rows.]

# **TSV Placement**

Floorplanning, TSV placement and partitioning are easier in a memory-on-logic device than a logic-on-logic design

[Figure: layout viewer screenshot of the three-tier design, with mask-layer lists (ACT, POLY, CON, M1 to M3, V12, V23, BVIA0) for tiers A, B, and C.]

17,634 TSVs in total:

- Power/Ground: 4554 A ↔ B, 4800 B ↔ C
- Signal: 4140 A ↔ B, 4140 B ↔ C

**0.14 mm² of TSV** (1.7% area)

# **TSV Tradeoffs in FFT Processor**

| Process            | Area loss         | Fraction of area |
|--------------------|-------------------|------------------|
| Lincoln Labs SOI   | 0.14 mm²          | 1.7%             |
| Tezzaron bulk CMOS | 0.02 mm²          | 0.3%             |
| Package style TSV  | 2 mm² or more\*   | 18%              |

\* Assumed "aggressive" effective 15 µm pitch (i.e. TSV + keepout)
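The area cost of a TSV field can be estimated as the TSV count times the square of the effective pitch. The sketch below applies this to the FFT processor's TSV counts at the 15 µm package-style pitch. Counting only the signal TSVs for the lower bound is an assumption of this example; the source does not say which TSVs its "2 mm² or more" figure includes.

```python
# Sketch: silicon area consumed by a TSV field at a given effective pitch.
# TSV counts come from the FFT processor slides.
# ASSUMPTION: each TSV occupies a pitch x pitch square (TSV plus keepout).
SIGNAL_TSVS = 4140 + 4140        # A<->B plus B<->C
POWER_GROUND_TSVS = 4554 + 4800  # A<->B plus B<->C
ALL_TSVS = SIGNAL_TSVS + POWER_GROUND_TSVS


def tsv_area_mm2(count, pitch_um):
    """Area in mm^2 taken by `count` TSVs at an effective pitch in micrometres."""
    return count * pitch_um ** 2 * 1e-6  # um^2 -> mm^2


PACKAGE_PITCH_UM = 15  # "aggressive" effective pitch from the footnote

print(f"Total TSVs: {ALL_TSVS}")
print(f"Signal TSVs only at 15 um: {tsv_area_mm2(SIGNAL_TSVS, PACKAGE_PITCH_UM):.2f} mm^2")
print(f"All TSVs at 15 um: {tsv_area_mm2(ALL_TSVS, PACKAGE_PITCH_UM):.2f} mm^2")
```

Because area grows with the square of the pitch, a move from package-style to high-density TSVs shrinks the overhead by far more than the pitch ratio alone suggests.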

# **Circuit Level Partitioning**

## Above is block level partitioning



### What about circuit level partitioning?

- Distributing banks amongst tiers?
- Distributing peripheral circuits
- Issues:
  - Size of TSV vs. memory cell
  - Capacitance of TSV



# **Distributing banks amongst tiers**

#### SRAM, DRAM:

- Potential advantages in a homogeneous technology memory stack are small
- Little potential to decrease power or area

#### Content Addressable Memory

- Searches memory for content
- Significant potential advantage
  - Due to high capacitance of match line
  - Match line == "found"

Search for "55"
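As a behavioral illustration of a content addressable memory, the sketch below compares a search key against every stored word and reports one match-line value per row. The class name, the stored contents, and reading "55" as hexadecimal are hypothetical choices for this example. Real hardware evaluates all rows in parallel, whereas this model loops over them.

```python
# Sketch: behavioral model of a content addressable memory (CAM).
# ASSUMPTIONS: SimpleCAM, its contents, and the hex key 0x55 are illustrative.
class SimpleCAM:
    """Stores words and answers "which rows hold this value?"."""

    def __init__(self, words):
        self.words = list(words)

    def search(self, key):
        """Return one match-line flag per row: True where the row equals the key."""
        return [word == key for word in self.words]

    def first_match(self, key):
        """Return the lowest matching row index, or None if nothing matches."""
        match_lines = self.search(key)
        return match_lines.index(True) if any(match_lines) else None


cam = SimpleCAM([0x12, 0x55, 0xA0, 0x55])
print(cam.search(0x55))       # match-line state for each row
print(cam.first_match(0x55))  # priority-encoded result
print(cam.first_match(0x99))  # no row matches
```

Every search toggles a match line on every row, which is why match-line capacitance dominates CAM power and why shortening those lines in 3D pays off.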



## **3D CAM: Advantages over 2D**

### In CAM Memory Core,

- 40% C\_ML (matchline capacitance) reduction
- 27% P\_ML (matchline power) reduction
- 23% overall power reduction
- Area (footprint) reduction of CAM core cells: ~50%



[Figure: Q3D model of interconnects for capacitance analysis]

|         | 2D structure | 3D structure with 3 tiers | Power reduction |
|---------|--------------|---------------------------|-----------------|
| P_ML    | 2.9p            | 2.1p                            | 27%                  |
| P_total | 8.0p            | 6.2p                            | 23%                  |

Only makes sense in low-capacitance SOI process



Oh



# **Tezzaron "Dis-integrated RAM"**

## Mixed technology concept

- DRAM arrays in low-leakage DRAM technology (at node N)
- Peripheral circuits in high-performance logic process (at node N-1)
- Bit and word lines fed vertically at array edge

### **Expected results**

- Reduced overall cost/bit
- Faster interfaces
- Lower latency
- Reduced power/bit
- Greater architectural flexibility



# **3DIC "Issues"**

## 1. Cost

- Cost in low volumes with 12" equipment will be high
- Currently at bottom of volume and cost reduction learning curve
- Try to recover through unique product advantage and reduced silicon area

## 2. Test

- Known Good Die (or wafer) issues
- Changes RAM test and burn-in strategies

## 3. Thermal

- Power delivery / thermal dissipation codesign issue
- Must keep DRAM below 90 °C

# **Exascale Computing Node**

## **Snapshot of the future?**

- "Extreme" stacking needed to manage bandwidth and energy
- One computing node:



# **Architectural Solutions**

## **DDR optimized towards cache row refill**

And well suited for little else

## **Architectural Opportunities created by 3DIC RAM:**

- Can separate memory array structure from architectural specification
- E.g. Tezzaron supplies "raw" multi-bank memory with SDRAM style interface
- Permits co-optimization of floorplan, logic, and memory
- With CPU cores, fast 3D RAM removes need for L2 cache

# Nanoscale Emerging Memory Solutions

## 3DIC

"Dis-integrate" with non-MOSFET based memories

## Non-volatile memory

- Integrated functionality to improve resiliency of computers and logic
  - E.g. Embedded check-pointing

## Neuromorphic computing

Need: Analog memory or high density digital memory with DAC

## **Non-memory applications of emerging memory**

Routing; Analog functions

# **Neuromorphic Computing**

FACETS Stage 2 Technology

Neural Processing Unit, up to 2x10<sup>5</sup> Neurons, 5x10<sup>7</sup> Synapses

## **Need for scaling:**

- Fast compact analog memory
- 3DIC



Voltage dependent part, changes membrane conductance

## Synaptic Computation Model



[Figure: slide titled "Computing *Differently*: A Potential Approach to Living with the Constraints of the Nanoscale", with outline items Motivators, Technological Approaches and Achievements, and Future Challenges and Plans.]





## Metal Nanocrystal Floating Gate

- High density of states
- Reliable
- Good retention

# **Example: NC FG-based FPGA**


- Shows benefit of a memory device in a static reconfigurable interconnect application
- Palladium metal nanocrystal flash reduces the programming voltage to 3-4 V

Table 1: Results for 16-bit Carry Ripple Adder (Design I) and 32-tap FIR Filter (Design II)

|                  | NC, Design I | SRAM, Design I | NC, Design II | SRAM, Design II |
|------------------|--------------|----------------|---------------|-----------------|
| **Area**         |              |                |               |                 |
| Logic            | 27 µm²       | 27 µm²         | 128 µm²       | 128 µm²         |
| Connection block | 7 µm²        | 10 µm²         | 317 µm²       | 490 µm²         |
| Switch box       | 33 µm²       | 113 µm²        | 394 µm²       | 1358 µm²        |
| Total            | 66 µm²       | 194 µm²        | 839 µm²       | 1977 µm²        |
| **Power**        |              |                |               |                 |
| Static           | 14 µW        | 87 µW          | 149 µW        | 1273 µW         |
| Total            | 63 µW        | 149 µW         | 1491 µW       | 4101 µW         |

### 8x power savings, 4x area savings





# Conclusions

### Memory Business readying for disruptive change

- Mix of rising challenges and emerging opportunities
- Key: Delivering new technological responses cost-effectively

## Challenges

- Bandwidth
- Power at this bandwidth
- Cost

## Opportunities

- 3DIC
- 1D1R memory
- Non-traditional architectural mixes

# **Acknowledgements**



William Rhett Davis, Michael B. Steer, Mehmet Ozturk, Hua Hao, Steven Lipa, Sonali Luniya, Christopher Mineo, Julie Oh, Ambirish Sule, Thor Thorolfsson, Chris Amsinck, Neil DiSpigna, Shep Pitts, Daniel Schinke

Department of Electrical and Computer Engineering, NC State University

# **Acknowledgments**

## My colleagues on

Final Report Exascale Study Group: Technology Challenges in Achieving Exascale Systems



*ExaScale* Data Center



*TeraScale* Embedded



*PetaScale* Departmental

# **Benefits of 1R1D cell**

- Permits highest core density
- With high on:off ratio, large arrays are possible



| On:off Ratio | Max. Array  |
|--------------|-------------|
| 7:1          | 64 x 64     |
| 13:1         | 128 x 128   |
| 100:1        | 1225 x 1225 |
| 1000:1       | 12k x 12k   |
| 8000:1       | 1M x 1M     |

C. Amsinck, N. DiSpigna, D. Nackashi, P. Franzon, "Scaling constraints in nanoelectronic random-access memories," Nanotechnology 16(10), Oct. 2005, pp. 2251 – 2260.

# **3DIC Test**

 Problem: Yield impact of accumulated (untested) silicon area

#### Wafer on wafer stacking

 Test before assembly has uncertain utility

#### **Chip on wafer stacking**

Known Good Die potentially highly useful

| One tier | Two tiers | Three tiers | Four tiers |
|----------|-----------|-------------|------------|
| 95%      | 90%       | 85%         | 81%        |
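The yield figures above are consistent with compounding a per-tier yield across the stack when tiers are bonded without prior test. A minimal sketch, assuming a 95% yield per tier (taken from the one-tier column) and independent defects between tiers; the results agree with the table to within about one percentage point.

```python
# Sketch: compound yield of a stack assembled from untested tiers.
# ASSUMPTIONS: 95% yield per tier (the one-tier column) and independent
# defects, so a stack is good only if every tier in it is good.
PER_TIER_YIELD = 0.95


def stack_yield(per_tier_yield, tiers):
    """Probability that all tiers in the stack are good."""
    return per_tier_yield ** tiers


for n in range(1, 5):
    print(f"{n} tier(s): {stack_yield(PER_TIER_YIELD, n):.3f}")
```

With Known Good Die, only tested-good dice are stacked, so the per-die term effectively drops out and only assembly yield compounds, which is why chip-on-wafer stacking benefits so much from it.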





# **3DIC Test**

# Wafer probing a multi-thousand pin TSV field is unscalable



- 5 µm pad alignment
- 100 kg contact force
- Logic die:
  - Need Known Good Die solution with compact test set
- Memory stack:
  - Need yield management and Known Good Die solution



# **TSV Self-Test**

1. Self-test for leakage is easy to implement
2. Gives a 1/0 answer for read-out via scan chain



# Power delivery, I/O and thermal

### 1. 2D chip:

- Heat spreader next to heat source
- Short Idd and Iss wires
- Short I/O wires over oxide

### 2. 3D chip:

- Bottom side power and signal delivery
- Top-side heat dissipation
- Through TSVs needed for thermal dissipation
- Through TSVs increase the inductance, capacitance, and resistance (LCR) of the Vdd, Gnd, and I/O paths

