# AI Semiconductor (On-Device AI): Present and Future

2024.03.26

KAIST

Hoi-Jun Yoo, ICT Chair Professor; Director, AI-PIM Center; Dean, AI Semiconductor Graduate School



## Contents

- **1. What is AI Semiconductor**
- 2. Present AI Semiconductors
- 3. Future of AI Semiconductor






## **AI & Deep Learning**

- Conventional AI
- Deep Learning (DNN): Learn by Data





### **Deep Neural Network (DNN)**



## **Neural Processing Unit (NPU)**

- Basic NPU Architecture
  - Fetch Inputs and Weights from DRAM or SRAM
  - Matrix Multiplication and Addition in PE Array
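The two steps above (fetch operands, then multiply and accumulate in the PE array) can be sketched in software. This is a minimal illustration and not the architecture of any particular chip: the function name `pe_array_matmul`, the `tile` size, and the use of list indexing to stand in for DRAM/SRAM fetches are all assumptions made for this example.

```python
def pe_array_matmul(inputs, weights, tile=2):
    """Tiled matrix multiply that mimics a small PE array (illustrative only).

    inputs  : M x K matrix (list of lists), e.g. activations
    weights : K x N matrix (list of lists)
    tile    : hypothetical PE-array edge length
    """
    m, k = len(inputs), len(inputs[0])
    n = len(weights[0])
    assert len(weights) == k, "inner dimensions must match"
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # output rows mapped onto the array
        for j0 in range(0, n, tile):      # output columns mapped onto the array
            for p0 in range(0, k, tile):  # stream operand tiles along K
                # "Fetch" one tile of inputs and weights, then let every PE
                # accumulate its partial sum (multiply-accumulate).
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += inputs[i][p] * weights[p][j]
                        out[i][j] = acc
    return out

x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0], [0, 1], [2, 2]]
y = pe_array_matmul(x, w)
print(y)  # [[7, 8], [16, 17]]
```

The tiling loop is there only to show where operand fetches would occur; the result is identical to an untiled matrix product.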



## **Evolution of DNN Processors**

- CPU: Low Performance for DNN
  - <10 Processing Cores
  - General Purpose
  - Floating Point Operation
  - SW Programmability
- GPU: High Power
  - ~1K FP PEs
  - Floating Point Operation
  - CUDA Programmability
  - Matrix Computation
- NPU: Optimized for DNN Operation
  - >10K Integer PEs
  - FP/Integer Operation
  - Convolution Operation
  - Data Reuse
#### KAIST MPIM

#### **Developments of DNN**



## **Evolution of DNN Accelerators**





## **Intelligence Revolution**










## **Trends in AI Semiconductor**

- Large Model with Low Power Consumption → On-Device AI
- Co-Optimization of SW, HW, and Domain Specific Application

#### 1. Basic

- DRAM PIM and NVM PIM
- Neuromorphic & SNN

#### 2. Domain Specific App.

- DRL, NeRF, Gen. AI NPU
- 6G, Metaverse, Digital Twin

#### 3. Large Model

- LLM (ChatGPT) Acceleration
- LMM Optimization







## Contents

- 1. What is AI Semiconductor
- **2. Present AI Semiconductor**
- 3. Future of AI Semiconductor



## **History of AI Semiconductor**



## **2007 BONE-V2 : Visual Attention**

- Implementation of "Visual Attention" on a silicon chip
  - Emphasizes the important keypoints in an image
  - Pixel-level visual attention through keypoint filtering




## **2007 World's First CNN Accelerator**



## BONE-V2 (2007 ISSCC)



- 0.13µm 8M CMOS Tech.
- 6mm x 6mm
- Power Supply
  - 1.2V: Core
  - 2.5V: I/O
- Operating Frequency
  - 200MHz for IPs
  - 400MHz for NoC
- Number of Transistors
  - 1.9M gates
  - 228kB SRAM
- Power Consumption
  - Less than 583mW (Object recognition)

#### **2009 BONE-V4: Unified Attention Model\***

- Feedforward Attention
  - → Bottom-up: Salient Image Features
- Attention-Recognition Feedback Loop
  - → Top-down: Familiar Objects




\*S. Lee et al., "Familiarity based unified visual attention model for fast and robust object recognition," Pattern Recognition, 2010

### **BONE-V4**






| Parameter           | Block      | Value                                |
|---------------------|------------|--------------------------------------|
| Technology          |            | 0.13um 1P8M Logic CMOS               |
| Die Size            |            | 50mm<sup>2</sup> (10.0mm x 5.0mm)    |
| Gates / SRAM        |            | 2.92M Gates / 612 kB                 |
| NoC IPs             |            | 51                                   |
| Power Supply        | CCL & NoC  | 1.2 V                                |
|                     | PPL        | 0.65 ~ 1.2 V                         |
| Operating Frequency | Global NoC | 400MHz (45FO4)                       |
|                     | CCL        | 200MHz (90FO4)                       |
|                     | PPL        | 50 ~ 200MHz (90FO4)                  |

#### **2009 BONE-V4: Demonstration**





## **2017 Low Power Face Recognition SoC**

- Always-On  $\rightarrow$  Ultra Low Power
  - 0.6mW Full CNN Operation
- Hybrid Face Detector
  - Face detection by CMOS image sensor
  - Combine analog & digital face detector








### **2017 Face Recognition Demo Video**





## 2017 CNN + RNN Deep Neural Network

- CNN: Static Picture Recognition
  - Face recognition, image classification...
- RNN: Temporal Video Recognition
  - Translation, speech recognition...
- CNN + RNN: CNN-extracted features → RNN input

- Previous works
  - Optimized for convolution layer only: [6], [3]
  - Optimized for FC layer and RNN only: [5]

- [3] B. Moons, SOVC 2016
- [5] S. Han, ISCA 2016
- [6] Y. Chen, ISSCC 2016

Dongjoo Shin et al., ISSCC 2017
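The "CNN-extracted features → RNN input" pipeline can be sketched as follows. This is a toy, scalar-valued illustration under stated assumptions, not the DNPU data path: `conv_feature`, `rnn_step`, `cnn_rnn`, the kernel, and the weights `w_h` and `w_x` are all hypothetical names and values chosen for this example.

```python
import math

def conv_feature(frame, kernel):
    """Toy 'CNN': sliding-window dot product, ReLU, global average pooling."""
    k = len(kernel)
    acts = []
    for i in range(len(frame) - k + 1):
        s = sum(frame[i + j] * kernel[j] for j in range(k))
        acts.append(max(0.0, s))          # ReLU
    return sum(acts) / len(acts)          # one feature value per frame

def rnn_step(h, x, w_h, w_x):
    """Toy scalar RNN cell: h' = tanh(w_h * h + w_x * x)."""
    return math.tanh(w_h * h + w_x * x)

def cnn_rnn(frames, kernel, w_h=0.5, w_x=1.0):
    """Run the per-frame feature extractor, then feed features to the RNN."""
    h = 0.0
    states = []
    for frame in frames:
        feat = conv_feature(frame, kernel)   # static, per-frame recognition
        h = rnn_step(h, feat, w_h, w_x)      # temporal integration
        states.append(h)
    return states

# Three 1-D "frames"; the kernel responds to rising edges only.
frames = [[0, 1, 2, 3], [3, 2, 1, 0], [1, 1, 1, 1]]
states = cnn_rnn(frames, kernel=[-1.0, 1.0])
```

Only the first frame contains rising edges, so the hidden state peaks there and then decays over the following frames, which is the temporal behaviour the RNN stage contributes on top of the per-frame features.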

### **2017 DNPU : Pet Robot Demonstration**



D. Shin, et al. "14.2 DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks." *ISSCC 2017*

### 2018 Unified NPU: Programmable DNN Arch.

- Unified Data Path
  - Dynamically Programmable for CNN, RNN/FC
- Support Various CNN & RNN Workload



Lee, Jinmook, et al. "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision." *ISSCC 2018*
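The cited title mentions fully-variable 1b-to-16b weight precision. One well-known way to make compute cost scale with weight bit-width is to process weights one bit-plane at a time (bit-serial). The sketch below illustrates that general idea only; it is not taken from the slides, and the function name `bit_serial_dot` and the restriction to unsigned weights are assumptions made for this example.

```python
def bit_serial_dot(activations, weights, weight_bits):
    """Dot product that consumes unsigned weights one bit-plane per step.

    The loop runs `weight_bits` times, so fewer weight bits means
    proportionally fewer steps (illustrative model, unsigned weights only).
    """
    assert 1 <= weight_bits <= 16
    assert all(0 <= w < (1 << weight_bits) for w in weights)
    total = 0
    for b in range(weight_bits):
        # Sum the activations whose weight has bit b set ...
        partial = sum(a for a, w in zip(activations, weights) if (w >> b) & 1)
        # ... and weight that partial sum by 2**b.
        total += partial << b
    return total

acts = [3, -1, 4, 2]
result_3b = bit_serial_dot(acts, [5, 0, 7, 1], weight_bits=3)
result_1b = bit_serial_dot(acts, [1, 0, 1, 1], weight_bits=1)
print(result_3b, result_1b)  # 45 9
```

With 3-bit weights the loop takes three steps; with 1-bit weights it takes one, which is the sense in which precision can be traded for throughput or energy in such a scheme.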

## **2018 UNPU : Emotion Recognition**




## **History of AI Semiconductor**



## **Inference & Training**






## **Robust Object Detection w/ DNN Training**

- HNPU-V2: Online DNN tuning to compensate for accuracy loss
- Accuracy can be recovered automatically in unexpected situations

![](_page_29_Picture_3.jpeg)

### **2021 HNPU-V2 Demonstration Video**

![](_page_30_Picture_1.jpeg)

![](_page_30_Picture_2.jpeg)

### **Generative Adversarial Network**

![](_page_31_Figure_1.jpeg)

![](_page_31_Picture_2.jpeg)

#### **2020 GANPU**

![](_page_32_Picture_1.jpeg)

![](_page_32_Picture_2.jpeg)

### **Deep Reinforcement Learning**

![](_page_33_Figure_1.jpeg)

![](_page_33_Picture_2.jpeg)

## **OmniDRL : Advanced DRL Processor**

Humanoid Robot Agent Training w/ DRL processor

![](_page_34_Figure_2.jpeg)

J. Lee, et al. "OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-mode Weight Compression and On-chip Sparse Weight Transposer," VLSI 2021

![](_page_34_Picture_4.jpeg)

## **2021 OmniDRL : Demonstration Video**

![](_page_35_Picture_1.jpeg)


![](_page_35_Picture_3.jpeg)

## Contents

- 1. What is AI Semiconductor
- 2. Present AI Semiconductor
- **3. Future of AI Semiconductor**

![](_page_36_Picture_4.jpeg)

## **History of AI Semiconductor**

![](_page_37_Figure_1.jpeg)

![](_page_37_Picture_2.jpeg)

## **Spatial Computing**

#### □ Human + CPS (Cyber Physical System)

The digitization of activities of machines, people, objects, and the environments in which they take place, to enable and optimize actions and interactions.

![](_page_38_Picture_3.jpeg)

## **Conventional 3D Modelling**

#### Manual Design w/ 3D Graphics Tool

- Expert-only
- 70-110 h to design 😕

![](_page_39_Picture_4.jpeg)

![](_page_39_Picture_5.jpeg)

#### Specialized 3D Scanning Studio

- High-cost equipment (~150 DSLR cameras)

![](_page_39_Picture_8.jpeg)

![](_page_39_Picture_9.jpeg)

#### Photogrammetry w/ Mobile Camera

- Requires feature extraction
- Fails on featureless surfaces

![](_page_39_Picture_13.jpeg)

## **NeRF 3D Modelling**

![](_page_40_Figure_1.jpeg)
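For readers unfamiliar with NeRF, the core rendering step is to composite density and color samples along each camera ray. The sketch below shows the standard volume-rendering quadrature used by NeRF-style methods; it is not the processor's implementation. In a real NeRF the densities and colors come from a trained network, whereas here they are hand-written lists, and the name `composite_ray` is hypothetical.

```python
import math

def composite_ray(sigmas, colors, deltas):
    """Composite samples along one ray, front to back.

    sigmas : volume density at each sample
    colors : (r, g, b) at each sample
    deltas : distance between consecutive samples
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0                 # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space, then an effectively opaque green surface, then a hidden blue one.
sigmas = [0.0, 1e9, 5.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
deltas = [0.1, 0.1, 0.1]
pixel, remaining = composite_ray(sigmas, colors, deltas)
print(pixel, remaining)  # [0.0, 1.0, 0.0] 0.0
```

Because this loop runs for many samples per ray and many rays per frame, it indicates why real-time NeRF rendering on mobile devices is treated as a workload for dedicated hardware.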

#### 2023 MetaVRain: 3D NeRF Processor

- AI-based real-time rendering for mobile AR/VR devices

![](_page_41_Figure_2.jpeg)

![](_page_41_Picture_3.jpeg)

### 2023 MetaVRain : YTN

![](_page_42_Figure_1.jpeg)

### 2024 NeuGPU: 3D NeRF Processor

NeRF-based Instant Modeling & Real-time Rendering Processor

![](_page_43_Figure_2.jpeg)

#### **2024 NeuGPU: Demonstration Video**

![](_page_44_Picture_1.jpeg)

![](_page_44_Picture_2.jpeg)

## **Evolution of PIM Architecture**

- Increasing integration of memory and compute units
  - Near Memory Processing  $\rightarrow$  Processing in Memory

![](_page_45_Figure_3.jpeg)

![](_page_45_Picture_4.jpeg)

## **Evolution of DNN Processor**

Evolved to Memory Centric Computing

![](_page_46_Figure_2.jpeg)

### **Advantages of PIM**

![](_page_47_Figure_1.jpeg)

## **KAIST PIM: Triple Mode Cell**

- Multi-functional 3T-2C Cell
  - Supports dynamic resource switching

![](_page_48_Figure_3.jpeg)

### DynaPlasia

• DynaPlasia (ISSCC'23) : Reconfigurable IMC

![](_page_49_Figure_2.jpeg)

S. Kim, et al. "DynaPlasia: An eDRAM In-Memory-Computing-Based Reconfigurable Spatial Accelerator with Triple-Mode Cell for Dynamic Resource Switching," ISSCC 2023

![](_page_49_Picture_4.jpeg)

### 2023 DynaPlasia

![](_page_50_Picture_1.jpeg)

![](_page_50_Picture_2.jpeg)

## **Neuromorphic/Spiking NN**

#### □ Microscopic Brain Structure or Macroscopic Brain Function

![](_page_51_Figure_2.jpeg)

![](_page_51_Picture_3.jpeg)

#### **2023 C-DNN: Complementary-DNN Processor**

Energy Efficient CNN/SNN Hybrid Processor


![](_page_52_Figure_2.jpeg)


## **KAIST C-DNN: Neuromorphic Accelerator**

- Input magnitude causes only a small performance variation in a CNN
  - More small-magnitude input data → SNN domain is more efficient
  - Less small-magnitude input data → CNN domain is more efficient
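The trade-off above suggests assigning each block of input data to whichever domain handles it more efficiently. The sketch below illustrates such a decision rule under loud assumptions: the function name `choose_domain`, both thresholds, and the per-tile granularity are hypothetical and are not taken from the C-DNN design.

```python
def choose_domain(tile, small_threshold=4, ratio_threshold=0.5):
    """Pick a compute domain for one tile of input values (illustrative rule).

    small_threshold : |value| at or below this counts as "small magnitude"
    ratio_threshold : fraction of small values above which SNN is chosen
    """
    small = sum(1 for v in tile if abs(v) <= small_threshold)
    ratio = small / len(tile)
    return "SNN" if ratio > ratio_threshold else "CNN"

tiles = [
    [0, 1, 2, 0, 3, 1],        # mostly small magnitudes
    [40, 55, 3, 70, 62, 90],   # mostly large magnitudes
]
domains = [choose_domain(t) for t in tiles]
print(domains)  # ['SNN', 'CNN']
```

A hardware implementation would make this choice from cheap statistics of the data rather than from a Python loop; the point here is only the direction of the decision.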

![](_page_53_Figure_4.jpeg)

![](_page_53_Figure_5.jpeg)

![](_page_53_Figure_6.jpeg)

### **2023 C-DNN: Demonstration Video**

![](_page_54_Picture_1.jpeg)

![](_page_54_Picture_2.jpeg)

#### **2024 C-Transformer : DNN/Spiking Transformer processor**

Motivation 

![](_page_55_Figure_2.jpeg)

![](_page_55_Figure_3.jpeg)

Figure labels: Spiking-Transformer, DNN-Transformer, Cross Attention, Feed-Forward

#### High Reconfigurability is Required

![](_page_55_Figure_5.jpeg)


## **2024 C-Transformer Architecture**

- Homogeneous DT/ST Core
  - Dynamically changed ratio of spike and non-spike domain
  - → Hybrid multiplication/accumulation unit (HMAU) is proposed!

![](_page_56_Figure_4.jpeg)

### **2024 C-Transformer Architecture**

Results of 3-Stage Compression

- Extended Sign Compression: 74~81% of parameters are reduced

![](_page_57_Figure_3.jpeg)

#### **2024 C-Transformer: Demonstration Video**

![](_page_58_Picture_1.jpeg)

![](_page_58_Picture_2.jpeg)

![](_page_59_Picture_0.jpeg)

![](_page_59_Picture_1.jpeg)

# Thank you!