# Compute-in-Memory and AI Accelerator Technologies for the Sub-18Å Era

SRC Industry Talk, August 2023

Ram K. Krishnamurthy, High Performance and Low Voltage Circuits Research, Circuits Research Lab, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA (ram.krishnamurthy@intel.com)

# Internet of Everything (IoE)



#### Need end-to-end energy efficiency, ML everywhere

# DATA DEFINES THE FUTURE






The Future Begins Here



#### Compute and Memory Challenges for AI



- Compute demand growth rate: Doubling every 3-4 months
- Memory capacity growth rate: 10X per year
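To put the two growth rates above on a common footing, a doubling period can be converted to an annual factor and back. The following is a minimal sketch; the helper names `annual_growth` and `equivalent_doubling_months` are illustrative and not from the talk.

```python
import math

def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 / doubling_months)

def equivalent_doubling_months(annual_factor: float) -> float:
    """Doubling period (in months) implied by a given growth factor per year."""
    return 12.0 / math.log2(annual_factor)

fast = annual_growth(3.0)  # doubling every 3 months -> 16x per year
slow = annual_growth(4.0)  # doubling every 4 months -> 8x per year
print(f"Doubling every 3-4 months is about {slow:.0f}x to {fast:.0f}x per year")

# A 10X-per-year rate expressed as a doubling period.
months = equivalent_doubling_months(10.0)
print(f"10X per year is a doubling roughly every {months:.1f} months")
```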





### AI Has Moved to the Edge

#### **Edge Devices**

#### **Cloud Computing**



Source: L. Loh, ISSCC 2020






# **Diversified Workload & Increasing Demands**

| Throughput | Energy Efficiency | Workload            |
|------------|-------------------|---------------------|
| 0.1 TOPS   | 1 TOPS/W          | Vision Perception   |
| 1 TOPS     | 3 TOPS/W          | Vision Construction |
| 10 TOPS    | 10 TOPS/W         | Visual Quality      |
| 100 TOPS   | 30 TOPS/W         | Multi-Streaming     |




# Intel Process Technology







Moore's Law continues when combining the power of processing and packaging innovation

#### Go Wider: Within Package Interconnect Scaling



Power Efficiency

# Normalized Energy



Source: Facebook

#### **Memory Bottleneck**

- Performance gap between processor and memory

#### Von Neumann Bottleneck

- Huge energy consumption of memory
- Reduce data movement between processor and memory
- Computation-in-Memory (CiM)



| Operation            | Energy [pJ] | Relative Cost |
|----------------------|-------------|---------------|
| 32 bit int ADD       | 0.1         | 1             |
| 32 bit float ADD     | 0.9         | 9             |
| 32 bit Register File | 1           | 10            |
| 32 bit int MULT      | 3.1         | 31            |
| 32 bit float MULT    | 3.7         | 37            |
| 32 bit SRAM Cache    | 5           | 50            |
| 32 bit DRAM Memory   | 640         | 6400          |

Source: Song Han et al. "EIE: efficient inference engine on compressed deep neural network," ISCA 2016.
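The relative-cost column in the table above is each energy divided by the 32 bit int ADD energy. The sketch below recomputes it and then compares one integer multiply-accumulate against one 32-bit DRAM access. Treating a MAC as one int MULT plus one int ADD is an assumption made for illustration; the energies are the ones quoted in the table.

```python
# Energy per operation in picojoules, copied from the table above.
ENERGY_PJ = {
    "32 bit int ADD": 0.1,
    "32 bit float ADD": 0.9,
    "32 bit Register File": 1.0,
    "32 bit int MULT": 3.1,
    "32 bit float MULT": 3.7,
    "32 bit SRAM Cache": 5.0,
    "32 bit DRAM Memory": 640.0,
}

baseline = ENERGY_PJ["32 bit int ADD"]
relative_cost = {op: round(pj / baseline) for op, pj in ENERGY_PJ.items()}

# Assumption: one integer MAC costs one int MULT plus one int ADD.
mac_pj = ENERGY_PJ["32 bit int MULT"] + ENERGY_PJ["32 bit int ADD"]
dram_vs_mac = ENERGY_PJ["32 bit DRAM Memory"] / mac_pj

for op, cost in relative_cost.items():
    print(f"{op:22s} {cost:5d}x")
print(f"One DRAM access costs about {dram_vs_mac:.0f}x one integer MAC")
```

Under these numbers, fetching a single operand from DRAM costs far more than the arithmetic performed on it, which is the motivation for reducing data movement.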

SC2-6: Alternate Technologies for SRAM, Hai Li, Duke University, *IEDM*, 2020.

Source: K. Takeuchi, IRPS 2023


# **Compute-in-Memory Motivation**

- Data movement is costly
- Multiply-accumulate (MAC) operation
- Massively parallel processing
- Beyond von Neumann architecture
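The multiply-accumulate operation named above is the inner loop of a matrix-vector product, the core computation of a neural network layer. A minimal sketch follows; the function name `mac_matvec` and the example values are illustrative. In a compute-in-memory design the accumulation for each output would happen inside or next to the array holding the weights, rather than after moving the weights to a separate processor.

```python
def mac_matvec(weights, activations):
    """Compute y = W x with explicit multiply-accumulate (MAC) steps.

    Returns the output vector and the number of MACs performed.
    """
    outputs = []
    mac_count = 0
    for row in weights:                 # each row can be processed in parallel
        acc = 0
        for w, x in zip(row, activations):
            acc += w * x                # one MAC
            mac_count += 1
        outputs.append(acc)
    return outputs, mac_count

W = [[1, -2, 3],
     [0, 4, -1]]
x = [2, 1, 5]
y, macs = mac_matvec(W, x)
print(y, macs)  # [15, -1] 6
```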





# **Compute-in-Memory Challenges**

- TOPS/W versus precision



#### **Intel Labs Analog CIM Architecture Overview**



#### **Intel Labs Analog CIM Measured Performance**

- Energy efficiency: 15.5-32.2 TOPS/W
- Area efficiency: 2.4-4.0 TOPS/mm<sup>2</sup>
- Clock frequency: 145-240 MHz
- Supply Voltage: 0.7-1.1 V





#### Opportunities for in-memory/near-memory Process and Circuit Innovation (Both Digital and Analog/Mixed-Signal)



#### **COMPUTE NEAR MEMORY CHALLENGES AND OPPORTUNITIES**



E. Sumbul, R. Krishnamurthy et al, IEEE ESSCIRC 2021

#### **10nm Near Memory Computing AI Inference Accelerator**



- 4 CNM cores with 8KB of weight memory and 64 8b multipliers
- Supports memory-intensive batch-1, large-batch, and in-place convolution

#### **10nm Near Memory Computing Measurement Results**



- Peak throughput 170 8b TOPS @ 0.9V that scaled up with number of CNM cores
- Near-threshold voltage (NTV) operation down to 450mV improves energy efficiency by 3.1x, to 2.9 8b TOPS/W
- Variable precision improves energy efficiency by 11.4x to 33.0 1b TOPS/W

G. Chen, R. Krishnamurthy et al, IEEE European Solid-State Circuits Conference 2021

#### 10nm Binary Neural Network Inference Accelerator



- An array of 128 Memory Execution Units (MEUs) combines latch-based memory and inner-product compute in a fine-grained manner to minimize interconnect energy
- A central controller manages data flow from four 256b memory banks to the MEUs
- Two latch words per MEU enable data reuse, reducing input bandwidth by 2x
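In binary neural networks, weights and activations take values in {+1, -1}, so an inner product can be computed with XNOR and popcount instead of multipliers. The sketch below shows this commonly used formulation. It assumes the encoding +1 as bit 1 and -1 as bit 0, and it is not a description of the MEU circuit itself; `pack_bits` and `binary_dot` are illustrative names.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Inner product of two n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(result, reference)  # 2 2
```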

#### **Comparison to Previously Published BNNs**



P. Knag, R. Krishnamurthy et al, IEEE Journal of Solid-State Circuits Invited Paper, April 2021

#### Compute Near Last Level Cache (CNC)



- CNC enables fine-grained mixing of near-memory vector and GP scalar computation
- High BW access to highest capacity on-chip memory instead of RF/scratchpad

#### Compute Near Last Level Cache of RISC-V Multiprocessor



- 8-core RV64GC processor with 128 INT8 MACs near 512kB shared, distributed LLC
- CNC ISA extension with support for virtual addressing and cache coherence

G. Chen, R. Krishnamurthy et al, IEEE VLSI Circuits Symposium 2022 & JSSC Journal Invited Paper April 2023

# Intel 4 Silicon Implementation of 8-Core RISC-V



- 1.15GHz Intel 4 test-chip runs programs in C++ with inline CNC and boots Linux
- CNC circuits add 1.4% area overhead over baseline core + LLC design
- Flip-chip packaged with PLL and 32b IO to FPGA chipset

#### 8-Core RISC-V DNN Layer Performance in Intel 4



- Fully Connected Layers: up to 46× higher performance and 52× lower energy
- Convolutional Layers: up to 27× higher performance and 29× lower energy

G. Chen, R. Krishnamurthy et al, IEEE VLSI Circuits Symposium 2022 & JSSC Journal Invited Paper April 2023

#### Domain-Specific Computation Enables Workload Optimization which Drives Performance and Efficiency

- Tailor architecture by application
- Adapt algorithms to use lower-precision math formats for significant improvements in energy efficiency
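As one illustration of adapting an algorithm to a lower-precision format, the sketch below quantizes floating-point weights and activations to signed 8-bit integers, accumulates the dot product in integers, and rescales the result. Symmetric per-tensor scaling is an assumption chosen for brevity; the talk does not specify a quantization scheme, and the values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to signed 8-bit integers.

    Returns the integer values and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

weights = [0.42, -1.30, 0.07, 0.95]
activations = [1.5, -0.25, 2.0, 0.5]

qw, w_scale = quantize_int8(weights)
qa, a_scale = quantize_int8(activations)

# Integer-only accumulation, followed by a single rescale.
int_acc = sum(w * a for w, a in zip(qw, qa))
approx = int_acc * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))

print(f"exact={exact:.4f} int8 approx={approx:.4f}")
```

The integer result tracks the floating-point one closely here while using only 8-bit multiplies, which are far cheaper in energy than floating-point operations.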



# Intel Advanced Matrix Extensions (Intel AMX)

Tiled Matrix Multiplication Accelerator

#### TILES – Data Structure

- New expandable 2D register file: 8 new registers, 1KB each
- Supports basic data operators: load/store, clear, set to constant, etc.
- TILES declares state and is OS-managed via the XSAVE architecture

#### TMUL – Accelerator Operations

- Set of matrix multiplication instructions, the first operators on the TILES register file
- A MAC computation grid calculates "tiles" of data
- TMUL performs matrix multiply-add (C += A\*B) using three tile registers (T2 += T1\*T0)
- TMUL requires TILES to be present
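The C += A\*B tile operation above is the building block of a blocked matrix multiply: a larger product is assembled by accumulating tile-sized partial products. The following is a conceptual sketch only. It uses a hypothetical 2x2 tile and plain Python lists, and it does not model Intel AMX data types, tile shapes, or memory layouts.

```python
TILE = 2  # hypothetical tile edge, chosen small for readability

def get_tile(matrix, row, col):
    """Copy a TILE x TILE block starting at (row, col)."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]

def tile_mac(c, a, b):
    """One tile operation: C += A * B on TILE x TILE blocks."""
    for i in range(TILE):
        for j in range(TILE):
            for k in range(TILE):
                c[i][j] += a[i][k] * b[k][j]

def tiled_matmul(A, B):
    """Multiply square matrices whose size is a multiple of TILE."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # accumulator tile
            for k in range(0, n, TILE):
                tile_mac(acc, get_tile(A, i, k), get_tile(B, k, j))
            for di in range(TILE):
                for dj in range(TILE):
                    C[i + di][j + dj] = acc[di][dj]
    return C

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 0, 1], [0, 1, 1, 0], [2, 0, 1, 0], [0, 2, 0, 1]]
C = tiled_matmul(A, B)
naive = [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
         for i in range(4)]
print(C == naive)  # True
```

Keeping the accumulator tile resident across the inner loop mirrors the reason for a dedicated tile register file: operands are loaded once per tile rather than once per scalar multiply.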





# **Multi-precision Neural Networks Matrix Multipliers**



$$
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & a_{03} \\
a_{10} & a_{11} & a_{12} & a_{13} \\
a_{20} & a_{21} & a_{22} & a_{23} \\
a_{30} & a_{31} & a_{32} & a_{33}
\end{bmatrix}
\times
\begin{bmatrix}
b_{00} & b_{01} & b_{02} & b_{03} \\
b_{10} & b_{11} & b_{12} & b_{13} \\
b_{20} & b_{21} & b_{22} & b_{23} \\
b_{30} & b_{31} & b_{32} & b_{33}
\end{bmatrix}
=
\begin{bmatrix}
y_{00} & y_{01} & y_{02} & y_{03} \\
y_{10} & y_{11} & y_{12} & y_{13} \\
y_{20} & y_{21} & y_{22} & y_{23} \\
y_{30} & y_{31} & y_{32} & y_{33}
\end{bmatrix}
$$

Simple Neural Network

- Matrix-multiply: power, performance, and area limiter
- Large matrices with many iterations
- Specialized architectures enable higher performance and energy efficiency
- Varying numeric requirements (FP16/INT16/INT8) across applications
  - Require low overhead reconfigurable circuits
- Varying matrix sparsity across applications
  - Optimized circuits can take advantage of sparsity

#### **Variable Precision Matrix Multiply Accelerator**



- 4x4 systolic array
- Fabric reconfigures to optimize data movement in dense/sparse mode
- Reconfigurable MAC with signed/unsigned INT16/4xINT8/FP16 support
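One way to see why a 16-bit multiplier can be reconfigured into four 8-bit multipliers is that a 16x16 product decomposes into exactly four 8x8 partial products. The sketch below shows that arithmetic identity for unsigned operands only. It is an illustration of the idea, not the circuit from the cited paper, and `mul8` and `mul16_from_mul8` are hypothetical names.

```python
def mul8(x, y):
    """An 8-bit x 8-bit unsigned multiply (stand-in for one small multiplier)."""
    assert 0 <= x < 256 and 0 <= y < 256
    return x * y

def mul16_from_mul8(a, b):
    """Build a 16-bit x 16-bit unsigned product from four 8-bit products."""
    a_hi, a_lo = a >> 8, a & 0xFF
    b_hi, b_lo = b >> 8, b & 0xFF
    # INT16 mode: the four partial products are shifted and summed.
    return ((mul8(a_hi, b_hi) << 16)
            + ((mul8(a_hi, b_lo) + mul8(a_lo, b_hi)) << 8)
            + mul8(a_lo, b_lo))

def four_int8_products(pairs):
    """INT8 mode: the same four multipliers produce four independent results."""
    assert len(pairs) == 4
    return [mul8(x, y) for x, y in pairs]

product = mul16_from_mul8(51234, 60001)
print(product == 51234 * 60001)  # True
print(four_int8_products([(3, 4), (255, 255), (0, 9), (16, 16)]))
```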

M. Anders, R. Krishnamurthy et al, VLSI Circuits Symposium 2018

# 14nm Chip Micrograph and Nominal Performance



Nominal operating point: 750mV, 25°C

| Mult/Acc Mode | Frequency | Power  | Energy Efficiency |
|---------------|-----------|--------|-------------------|
| FP16/FP32     | 800MHz    | 42.7mW | 0.6TFLOPS/W       |
| INT16/INT48   | 940MHz    | 37.6mW | 0.8TOPS/W         |
| INT8/INT24    | 1.06GHz   | 47.7mW | 2.9TOPS/W         |
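The efficiency figures in the table above can be approximately reproduced from frequency and power. The sketch assumes 16 MAC units (the 4x4 array), two operations per MAC per cycle, and four INT8 lanes per MAC in INT8 mode. This counting convention is an assumption: it matches the FP16 and INT16 rows closely and gives a slightly lower value than the quoted 2.9TOPS/W for INT8, so the authors' exact convention may differ.

```python
NUM_MACS = 16     # 4x4 systolic array
OPS_PER_MAC = 2   # assumption: one multiply plus one add per MAC per cycle

# mode: (frequency in Hz, power in W, parallel lanes per MAC)
MODES = {
    "FP16/FP32": (800e6, 42.7e-3, 1),
    "INT16/INT48": (940e6, 37.6e-3, 1),
    "INT8/INT24": (1.06e9, 47.7e-3, 4),  # 4xINT8 per reconfigurable MAC
}

def tera_ops_per_watt(freq_hz, power_w, lanes):
    """Throughput divided by power, in tera-operations per second per watt."""
    ops_per_second = NUM_MACS * lanes * OPS_PER_MAC * freq_hz
    return ops_per_second / power_w / 1e12

efficiency = {mode: tera_ops_per_watt(*params) for mode, params in MODES.items()}
for mode, value in efficiency.items():
    print(f"{mode:12s} {value:.2f} Tera-ops/W")
```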

#### Matrix Multiplier Energy Efficiency Measurements



- Efficiency increases 4X from the nominal 750mV to near-threshold voltage
- Peak energy efficiency ranges from 2.97TFLOPS/W (FP16) to 11.3TOPS/W (INT8)

### The Future – Zetta Flop Systems



#### **Roadmap to Lower Voltage Operation**



Low voltage operation requires careful selection and optimization of storage elements

### Memory Challenges in AI Accelerators



- AI accelerators are built using a large array of processing elements (PEs) containing small-capacity local register files (RFs)
- Register files contribute a significant amount of power (39%) and area (35%) within the PEs

#### Static AI Register File Micrograph



S. Hsu, R. Krishnamurthy et al, IEEE VLSI Circuits Symposium 2022



Ultra-low-voltage operation at 325mV and 100°C, consuming 36.7μW at 60MHz

#### "Extreme" efficiency research



System-Wide Breakthroughs Needed Across the Board





#### intelligence Inside