Slide 1

Slide 1 text

Toward High-Precision, High-Efficiency Analog Compute-in-Memory Circuits. Kentaro Yoshioka, Assistant Professor, Keio University, Japan. Technical Committee on Integrated Circuits (ICD), 2024.

Slide 2

Slide 2 text

Today's agenda • Self-introduction • DNN accelerators − Why and when analog computing? − Compute-in-Memory (CIM) circuits • Analog CIM research in the Yoshioka Lab − An 818-4094 TOPS/W Capacitor-Reconfigured CIM Macro for Unified Acceleration of CNNs and Transformers (ISSCC'24) − OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration (ASP-DAC'24)

Slide 3

Slide 3 text

Self-introduction • 2014: graduated from Keio University • 2014-2021: Toshiba Corporation • 2017-2018: Visiting Scholar at Stanford University • 2021-: Assistant Professor at Keio University, PI of the Yoshioka Lab • Integrated circuits (LSI) − Takamaeda JST CREST (co-PI), 2021-: analog CIM circuits • 3D sensing (LiDAR) − JST PRESTO (PI), 2022-: LiDAR security for autonomous driving − Mori JST CREST (co-PI), 2023-: security of autonomous-driving systems − KAKENHI Kiban (B) (PI), 2024-: LiDAR design. Lab mascot: CSG-kun

Slide 4

Slide 4 text

Self-introduction ・2014 Graduated from Keio Univ. ・2014-2021 Toshiba Research ・2017-2018 Stanford Visiting Scholar @ Mark Horowitz Group ・2021- Keio Univ. Assistant Prof. ・Expertise: mixed-signal circuit design, LiDAR design, ML accelerators, LiDAR security. WiFi/ADCs: VLSI 2020, ISSCC 2018, JSSC 2018, ISSCC 2020, JSSC 2020. LiDAR SoCs: ISSCC 2017, ISSCC 2018, JSSC 2018, TVLSI 2019. CIM/AI accelerators: ISSCC 2024, ASP-DAC 2024

Slide 5

Slide 5 text

Self-introduction ・2014 Graduated from Keio Univ. ・2014-2021 Toshiba Research ・2017-2018 Stanford Visiting Scholar @ Mark Horowitz Group ・2021- Keio Univ. Assistant Prof. ・Expertise: mixed-signal circuit design, LiDAR design, ML accelerators, LiDAR security. WiFi/ADCs: VLSI 2020, ISSCC 2018, JSSC 2018, ISSCC 2020, JSSC 2020. LiDAR SoCs. (Mark Horowitz: co-founder of Rambus; extremely knowledgeable about memory interfaces, and also strong in computer architecture.)

Slide 6

Slide 6 text

DNN accelerators ◼ Deep learning demands enormous amounts of computation (and data movement) ◆ Object detection, face recognition, large language models… ◼ Accelerator development is very active ◼ So where is the bottleneck?

Slide 7

Slide 7 text

DNN accelerators • The CPU is the opposite of an accelerator – an organization that prioritizes general-purpose flexibility (CPU core)

Slide 8

Slide 8 text

DNN accelerators • A DNN accelerator is an architecture specialized for parallel computation – Many small ALUs are arrayed to specialize in parallel workloads such as image processing – General-purpose features are dropped (it will not run Windows) – The TPU goes as far as omitting caches entirely (CPU core vs. TPU core)

Slide 9

Slide 9 text

Key research topics for DNN accelerators • (1) Minimizing data movement on and off the chip – Actively studied in the computer-architecture community as dataflow and scheduling problems ⇨ How do we process data that does not fit into on-chip memory and compute units? Tiling by data, by channel, by layer, etc.

Slide 10

Slide 10 text

Key research topics for DNN accelerators • (2) Lowering the power of the arithmetic circuits – Digital arithmetic circuits are already highly optimized – Further power reduction is difficult – Extreme option: analog computing!? – The arithmetic precision required by DNNs is low (INT2-INT8) – At low precision, analog computing can reduce power.
ResNet-50 ImageNet top-1 accuracy with weights and activations quantized using PACT:
Precision: Full (FP32)  INT2   INT3   INT4   INT5
Top-1:     0.769        0.722  0.753  0.765  0.767
J. Choi et al., "PACT: Parameterized Clipping Activation for Quantized Neural Networks," arXiv:1805.06085
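To make the low-precision regime above concrete, here is a minimal sketch of uniform symmetric quantization applied to a toy weight tensor. This is only an illustration of the INT2-INT8 operating points, not the PACT training procedure from the cited paper; the clipping value alpha is simply assumed fixed here.

```python
import numpy as np

def quantize_uniform(x, n_bits, alpha):
    """Clip x to [-alpha, alpha] and quantize uniformly on a signed n_bits grid."""
    levels = 2 ** (n_bits - 1) - 1        # e.g. 7 positive steps for INT4 (symmetric)
    scale = alpha / levels
    q = np.round(np.clip(x, -alpha, alpha) / scale)
    return q * scale                      # de-quantized values, for error analysis

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=10_000)    # toy weight tensor
for bits in (2, 3, 4, 5, 8):
    w_q = quantize_uniform(w, bits, alpha=3 * w.std())
    rms_err = float(np.sqrt(np.mean((w - w_q) ** 2)))
    print(f"INT{bits}: RMS quantization error = {rms_err:.5f}")
```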

Slide 11

Slide 11 text

Compute-in-Memory (CIM) circuits ◼ Compute-in-Memory (CIM) is expected to relieve both bottlenecks ◆ Memory and compute units are merged ⚫ Expected to greatly reduce data movement ⚫ Intuitively, something like fusing an ultra-compact SIMD unit with a tiny register file?

Slide 12

Slide 12 text

CIM Topology. Digital CIM (DCIM): digital logic ➔ accuracy ☺; bulky circuits ➔ efficiency ☹. Analog CIM (ACIM): PVT variation ➔ accuracy ☹; compact, high throughput ➔ efficiency ☺. A digital CIM accumulates with a digital adder tree, whose area and power grow roughly with the square of the array size, limiting the number of columns. An analog CIM accumulates in the analog domain, so column counts in the thousands are feasible; on the other hand, a large DNN is needed to actually exploit such a huge column count.

Slide 13

Slide 13 text

Digital circuits vs. analog circuits • Computation with digital circuits − 1/0 signals propagate through logic gates − Pros: immune to noise; arbitrarily high precision is possible (e.g. FP128) − Cons: no remaining headroom for major power reduction • Computation with analog circuits − Handles continuous values such as 0.1, 0.3, … and suits operations such as integration − Pros: higher efficiency by exploiting continuous values − Cons: sensitive to noise; ill-suited to high-precision computation

Slide 14

Slide 14 text

Analog or digital: which is better? • When is analog computation efficient? − At high precision (>10b ENOB), analog energy increases exponentially due to thermal-noise constraints − At binary precision, digital is already efficient, so analog offers little advantage. [Ref] B. Murmann, "Mixed-Signal Computing for Deep Neural Network Inference," TVLSI 2021. (Figure: energy per operation vs. precision, from binary to ~10b.)

Slide 15

Slide 15 text

Analog or digital: which is better? • When is analog computation efficient? − The sweet spot is INT3-INT6, where analog is not limited by noise − Ideally, the analog MAC's energy grows only linearly in this region. [Ref] B. Murmann, "Mixed-Signal Computing for Deep Neural Network Inference," TVLSI 2021. (Figure: sweet spot around INT3-6.)
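The shape of this argument can be sketched with a toy model (illustrative only; the constants are arbitrary and chosen just to show the crossover, and this is not Murmann's actual model): a thermal-noise-limited analog MAC must spend energy proportional to the required SNR, roughly 4^ENOB, while below that limit it sits on a technology floor that grows only gently with precision; a digital multiplier's energy grows roughly quadratically with bit width.

```python
def analog_energy_rel(enob, e_floor=1.0, e_noise=1e-5):
    """Toy analog MAC energy: a technology floor at low precision,
    plus a thermal-noise-limited term growing as 4^ENOB at high precision."""
    return max(e_floor * enob, e_noise * 4 ** enob)

def digital_energy_rel(bits, e_gate=1.0):
    """Toy digital MAC energy: a b x b multiplier is roughly b^2 adder cells."""
    return e_gate * bits ** 2

for b in range(1, 13):
    a, d = analog_energy_rel(b), digital_energy_rel(b)
    print(f"{b:2d}b: analog {a:9.2f}   digital {d:7.2f}   digital/analog ratio {d / a:5.2f}")
```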

Slide 16

Slide 16 text

Charge-based CIM • The "multiply" is done digitally, and the "accumulation" over a vector of N elements is done in the analog domain → realizes a binary MAC − Weight memory and processing are integrated as "in-memory computing". Computation in the charge domain (Q = ΣCV): • a 2000-element vector addition is performed in a single cycle • far fewer circuit elements are needed than in a digital implementation, enabling lower power. (Figure: inputs [1:N], per-cell products W[0]·IN[0] … W[N]·IN[N], accumulated via charge and read out by an ADC.) [Ref] H. Valavi, "A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute," JSSC 2019.
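A minimal behavioral sketch of such a charge-domain binary MAC (the function name, noise level, and ADC parameters are illustrative assumptions, not the circuit in the Valavi paper): each cell forms a 1b x 1b product digitally, dumps it onto its local capacitor, charge sharing settles the shared line to the average cell voltage, and an ADC digitizes that average.

```python
import numpy as np

def charge_domain_binary_mac(inputs, weights, adc_bits=8, noise_lsb=0.3, rng=None):
    """Behavioral model: 1b x 1b multiplies in digital, accumulation by charge averaging."""
    rng = rng or np.random.default_rng()
    products = inputs & weights                   # per-cell digital AND = 1b x 1b multiply
    v_avg = products.mean()                       # charge sharing -> average cell voltage (0..1)
    code = v_avg * (2 ** adc_bits - 1)            # ideal ADC transfer curve
    code += rng.normal(0.0, noise_lsb)            # comparator / kT/C noise in LSBs (assumed)
    code = int(np.clip(np.rint(code), 0, 2 ** adc_bits - 1))
    return code * len(inputs) / (2 ** adc_bits - 1)   # scale back to an estimated popcount

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 2048)                      # 2048-element binary input vector
w = rng.integers(0, 2, 2048)                      # binary weights stored in the array
print("true MAC:", int((x & w).sum()),
      " estimated from ADC code:", round(charge_domain_binary_mac(x, w, rng=rng), 1))
```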

Slide 17

Slide 17 text

Multi-bit extension of CIM • How can we extend to multi-bit MACs? − Binary computation can be extended to arbitrary precision by "bit-serial" processing. Example: the 4b x 4b product 1010 × 0101 is broken up into 16 binary multiply-and-adds (partial products 1010, 0000, 1010, 0000, each shifted, summing to 110010). C. Eckert, "Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks," ISCA 2018.
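The decomposition on this slide can be written out directly. The sketch below (an illustrative model, not the Neural Cache implementation) rebuilds a 4b x 4b product from 16 single-bit AND terms, each weighted by the appropriate power of two, which is exactly what a bit-serial CIM schedules over multiple cycles.

```python
def bitserial_multiply(a, b, bits=4):
    """Rebuild a*b from bits*bits single-bit products (ANDs), each weighted by 2^(i+j)."""
    acc = 0
    for i in range(bits):            # bit position of a (one cycle per bit in a bit-serial CIM)
        for j in range(bits):        # bit position of b
            acc += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return acc

assert bitserial_multiply(0b1010, 0b0101) == 0b110010   # the 10 x 5 = 50 example above
print(bin(bitserial_multiply(0b1010, 0b0101)))
```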

Slide 18

Slide 18 text

Multi-bit extension of CIM • How can we extend this to multi-bit MACs? − Binary computation can be extended to arbitrary precision by "bit-serial" processing − Vectorize the bit-serial operation across the array (same 1010 × 0101 = 110010 example as the previous slide). [Ref] H. Jia, "A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing," JSSC 2020.

Slide 19

Slide 19 text

CIM research in the Yoshioka Lab. Gen. 1: establishing the CIM technology. Question: can a high-accuracy analog CIM be realized? (ISSCC'24, ASP-DAC'24.) Gen. 2: saliency-based CIM. Question: can a CIM with configurable precision be realized? (Under review.) Gen. 3: toward a CIM processor. Question: realize a CIM processor that also accounts for system-wide data movement. (Under review.)

Slide 20

Slide 20 text

An 818-4094 TOPS/W Capacitor-Reconfigured CIM Macro for Unified Acceleration of CNNs and Transformers. Kentaro Yoshioka, Keio University, Compute and Sensing Group (CSG), https://sites.google.com/keio.jp/keio-csg/

Slide 21

Slide 21 text

Background ◼ Modern edge ML workloads are diverse ⚫ CNN, Transformer, and hybrids that blend both → Can analog CIM handle all workloads? Wikipedia: Convolutional neural network. A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021

Slide 22

Slide 22 text

CNN/Transformer acceleration challenges ◼ Unique computational requirements ⚫ Transformers require high compute precision ⚫ CNNs are fine with low precision. Precision is denoted as compute SNR (CSNR): CSNR = 20·log10(Signal/Noise). (Figure: CIFAR-10 accuracy vs. CSNR for a Transformer (ViT-S) and a CNN (ResNet-20).)
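As a concrete reading of the CSNR definition above, the sketch below injects Gaussian noise into an otherwise exact MAC output and reports the resulting CSNR. This is a generic illustration; the noise model and the chosen sigma values are assumptions, not measurements from the paper.

```python
import numpy as np

def csnr_db(clean, noisy):
    """CSNR = 20*log10(||signal|| / ||noise||)."""
    noise = noisy - clean
    return 20 * np.log10(np.linalg.norm(clean) / np.linalg.norm(noise))

rng = np.random.default_rng(0)
x = rng.integers(-8, 8, size=(1000, 256)).astype(float)   # INT4-like activations
w = rng.integers(-8, 8, size=256).astype(float)           # INT4-like weights
y = x @ w                                                  # exact MAC outputs
for sigma in (2.0, 10.0, 40.0):                            # assumed analog noise std, output units
    y_noisy = y + rng.normal(0.0, sigma, y.shape)
    print(f"noise sigma = {sigma:5.1f} -> CSNR = {csnr_db(y, y_noisy):5.1f} dB")
```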

Slide 23

Slide 23 text

CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ⚫ Conventional works target CNNs with relaxed CSNR ⚫ Computational accuracy NOT a primary concern in ACIMs. CSNR = 20·log10(Signal/Noise). (Figure: CIFAR-10 accuracy vs. CSNR; prior ACIMs [Jia, JSSC2020] and [Lee, VLSI2021] sit well below our CSNR target.)

Slide 24

Slide 24 text

CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ⚫ Conventional works target CNNs with relaxed CSNR ⚫ Computational accuracy not a primary concern in ACIMs ◼ DCIMs: excel in CSNR, but require bulky adders. (Same accuracy-vs-CSNR figure as the previous slide.)

Slide 25

Slide 25 text

CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ⚫ Conventional works target CNNs with relaxed CSNR ⚫ Computational accuracy not a primary concern in ACIMs ◼ DCIMs: excel in CSNR, but require bulky adders. Can we design an ACIM supporting both CNNs and Transformers? (Same accuracy-vs-CSNR figure.)

Slide 26

Slide 26 text

Review of ACIM Architectures ◼ Current-domain ACIMs ⚫ ☺ Area, energy efficiency ⚫ ☹ Transistor current is inherently non-linear ◼ Time-domain ACIMs ⚫ ☺ Area, energy efficiency ⚫ ☹ Delay prone to mismatch. (Schematics: per-cell W[n]·I[n] products summed as currents or delays.)

Slide 27

Slide 27 text

Review of ACIM Architectures ◼ Current-domain ACIMs ⚫ ☺ Area, energy efficiency ⚫ ☹ Transistor current is inherently non-linear ◼ Time-domain ACIMs ⚫ ☺ Area, energy efficiency ⚫ ☹ Delay prone to mismatch ◼ Charge-domain ACIMs ⚫ ☺ High linearity of MOM caps ⚫ ☹ High ADC resolution → worsened area efficiency ⚫ ☹ Noise resilience → worsened energy efficiency. (Schematic: per-cell W[n]·I[n] products accumulated as charge.)

Slide 28

Slide 28 text

Our CIM Macro Concept • Basic idea: – Dynamically configure the macro to match the DNN being processed • High-CSNR Transformer mode • Low-CSNR CNN mode. (Block diagram: 1088x78 CR-CIM array of CR-CIM cells (6T SRAM + DDAC), binary-weighted groups (x1 … x128, x256, x512), SAR logic with DDAC/Reset[9:0], input IN[0], comparator clock fComp, 5-bit driver.)

Slide 29

Slide 29 text

Our CIM Macro Concept • Basic idea: – Dynamically configure the macro to match the DNN being processed • High-CSNR Transformer mode: – High-precision bit-serial CIM – 10-b ADC resolution and noise • Low-CSNR CNN mode: – Low-precision bit-parallel CIM – Achieves high efficiency. (Same block diagram as the previous slide.)

Slide 30

Slide 30 text

Our CIM Macro Concept • Basic idea: – Dynamically configure the macro to match the DNN being processed • High-CSNR Transformer mode: – High-precision bit-serial CIM – 10-b ADC resolution and noise • Low-CSNR CNN mode: – Low-precision bit-parallel CIM – Achieves high efficiency. Highlighted techniques: A. Area-efficient capacitor-reconfigured (CR) CIM B. Resource-efficient multi-bit driver C. Layer-dependent noise improvement for Transformers. (Same block diagram.)

Slide 31

Slide 31 text

Outline ◼ Background ◼ CNN/Transformer unified acceleration challenges ◼ Our CIM Macro Concept ⚫ Capacitor-Reconfigured CIM (CR-CIM) ⚫ Resource-efficient Multi-bit Driver ◼ Measurement results ◼ Conclusion

Slide 32

Slide 32 text

CR-CIM concept. Conventional: a CIM array of PEs (PE[0] … PE[N], ~x1000 columns) with a computation cap array, followed by a separate 10b ADC cap array and 10b SAR logic.

Slide 33

Slide 33 text

CR-CIM concept. Conventional: a CIM array of PEs (PE[0] … PE[N], ~x1000 columns) with a computation cap array, followed by a separate 10b ADC cap array and 10b SAR logic. The separate ADC capacitor array is area consuming ☹.

Slide 34

Slide 34 text

CR-CIM concept. Conventional: a CIM array of PEs with a computation cap array plus a separate 10b ADC cap array and SAR logic. Proposed CR-CIM: PE[0] … PE[N] share one combined compute & ADC cap array; a successive-approximation quantizer closes the SAR feedback through the 10b SAR logic. Reconfigure the cap array for dual use (computation/ADC) → an area-efficient 10-b ADC!
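A behavioral sketch of this dual-use idea (illustrative only; capacitor counts, voltages, noise level, and the switching scheme are assumptions, not the CR-CIM circuit): the same binary-weighted capacitor bank first accumulates the MAC result as a voltage, then the SAR loop reuses the bank to successively approximate that voltage to 10 bits.

```python
import numpy as np

def sar_convert(v_in, n_bits=10, vref=1.0):
    """Successive approximation of a normalized input voltage in [0, vref)."""
    code, trial = 0, 0.0
    for bit in reversed(range(n_bits)):
        step = vref / (2 ** (n_bits - bit))        # binary-weighted trial step
        if v_in >= trial + step:                   # comparator decision
            trial += step
            code |= 1 << bit
    return code

def cr_cim_column(inputs, weights, n_bits=10, rng=None):
    """Charge-share MAC on the cap array, then reuse the same array as a SAR ADC."""
    rng = rng or np.random.default_rng()
    v_mac = (inputs & weights).mean()              # MAC result as a voltage (0..1)
    v_mac += rng.normal(0.0, 0.2 / 2 ** n_bits)    # comparator / kT/C noise (assumed)
    return sar_convert(float(np.clip(v_mac, 0.0, 1.0 - 1e-9)), n_bits)

rng = np.random.default_rng(2)
x, w = rng.integers(0, 2, 1024), rng.integers(0, 2, 1024)
print("popcount:", int((x & w).sum()), " 10-b ADC code:", cr_cim_column(x, w, rng=rng))
```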

Slide 35

Slide 35 text

CR-CIM Array Implementation: computation phase. (Schematic: 6T SRAM cells holding W, inputs I driving the CIM cap array, comparator node Vcomp, SAR logic and SAR quantizer.)

Slide 36

Slide 36 text

CR-CIM Array Implementation: computation and ADC phases. (Schematics: during computation the CIM cap array accumulates the W·I result onto Vcomp; during ADC the same array is driven by DDAC/Reset[9:0] and RST while the SAR quantizer, clocked by fComp, resolves Vcomp.)

Slide 37

Slide 37 text

CR-CIM Transistor Implementation: 10T CR-CIM cell. (Schematics: a 6T SRAM plus a compute port (MP1-MP3, MN1) that couples IN, or DAC/Reset, onto capacitor C toward Vcomp; separate configurations for computation and for ADC & reset, referenced to VDD/VSS.)

Slide 38

Slide 38 text

CR-CIM Transistor Implementation: 10T CR-CIM cell. (Schematics: a 6T SRAM plus a compute port (MP1-MP3, MN1) coupling IN, or DAC/Reset, onto capacitor C toward Vcomp, in computation and ADC & reset configurations.) ☺ Supports both bit-parallel and bit-serial operation ☺ Small 10T cell

Slide 39

Slide 39 text

Resource-Efficient Multi-bit Driver • Bit-parallel operation greatly improves the efficiency in CNN mode – [Lee, VLSI2021] uses 16 off-chip-generated reference voltages → large voltage-generation overhead. (Schematic: 16 Vref buffers feeding the multi-bit driver that drives the CIM row array W0, W1, ….)

Slide 40

Slide 40 text

Resource-Efficient Multi-bit Driver • Capacitive division to generate the reference voltages – ☺ Only one reference voltage required – ☹ Voltage errors due to the varying capacitive load. (Schematic: 5-bit C-DAC (CMSB, CMSB-1, …) from a single Vref driving the CIM row array; plot: DAC LSB error vs. sum of row weights (0-60) for inputs IN[4:0], without compensation.)

Slide 41

Slide 41 text

Resource-Efficient Multi-bit Driver • Capacitive division to generate the reference voltages – ☺ A 4-b load-compensation C-DAC (CcMSB, CcMSB-1, …), switched by the precomputed weight sum, cancels the variation – The error decreases from 25 LSB to +1.2/-0.5 LSB. (Plot: DAC LSB error vs. sum of row weights, with and without compensation.)
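The sketch below models the load-dependence problem in capacitive reference generation and its compensation in a highly simplified way (capacitor values, the divider topology, and the compensation rule are made-up assumptions, not the actual driver design): the reference node is a capacitive divider, so its voltage droops as more rows (a larger weight sum) load it, and a compensation capacitor quantized to 16 steps, sized from the precomputed weight sum, restores the target division ratio up to a small residual.

```python
def divided_vref(c_load, vref=1.0, c_top=32.0, c_bot=32.0, c_comp=0.0):
    """Capacitive divider: Vout = Vref*(Ctop+Ccomp) / (Ctop+Ccomp+Cbot+Cload)."""
    return vref * (c_top + c_comp) / (c_top + c_comp + c_bot + c_load)

def ideal_comp(c_load, c_top=32.0, c_bot=32.0):
    """Compensation that exactly restores the no-load ratio Ctop/(Ctop+Cbot)."""
    return c_top * c_load / c_bot

def quantized_comp(c_load, levels=16, c_max=64.0):
    """4-b load-compensation C-DAC: the ideal value rounded to one of 16 steps."""
    step = c_max / levels
    return round(ideal_comp(c_load) / step) * step

target = divided_vref(0.0)
for load in (0, 10, 23, 37, 55):            # load grows with the sum of active row weights
    raw = divided_vref(load)
    comp = divided_vref(load, c_comp=quantized_comp(load))
    print(f"load {load:2d}: err w/o comp {target - raw:+.4f} V, w/ 4-b comp {target - comp:+.4f} V")
```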

Slide 42

Slide 42 text

CSNR boost technique. Vision Transformer architecture: image patches pass through a block of vector-mixing (a.k.a. attention) followed by a vector-wise MLP, and Transformers repeat this block multiple times; the final feature vector yields the prediction (e.g. Palace 98%, Station 2%). The vector-mixing part has a high CSNR requirement → use the CSNR boost (majority voting); the vector-wise MLP has a low CSNR requirement → normal operation.

Slide 43

Slide 43 text

CSNR boost technique. Vector-mixing (a.k.a. attention) has a high CSNR requirement → use the CSNR boost (majority voting); the vector-wise MLP has a low CSNR requirement → normal operation. Quantizer with CSNR boost (CB): the 10b SA logic can enable 6x majority voting on the comparator (EN, DCMP). In layers that need high SNR, averaging reduces the ADC noise.
Operation principle (S: single comparison, MV: majority voting), per SAR step 1-11:
Normal:         S S S S S S S S S S  -     Num. comp. 10, CSNR 25.8 dB, power 1x
w/ CSNR boost:  S S S S S S S S MV MV MV   Num. comp. 26, CSNR 31.3 dB, power 1.9x
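A toy model of the majority-voting boost (an illustration under assumed noise values, not measured silicon behavior): for the final bit decisions, where the residual is comparable to the comparator noise, taking the majority of several noisy comparisons lowers the effective decision error and therefore raises CSNR.

```python
import numpy as np

def decide(v_residual, sigma, votes, rng):
    """Comparator decision on a residual voltage, optionally with majority voting."""
    noisy = v_residual + rng.normal(0.0, sigma, votes)
    return int((noisy > 0).sum() * 2 > votes)    # majority of the sign decisions

rng = np.random.default_rng(3)
sigma = 1.0                                       # comparator noise, in units of the residual
for votes in (1, 3, 5, 7):
    residual = 0.5                                # true answer: residual > 0 -> decision 1
    errors = sum(decide(residual, sigma, votes, rng) == 0 for _ in range(20000))
    print(f"{votes} votes: decision error rate = {errors / 20000:.3%}")
```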

Slide 44

Slide 44 text

Outline ◼ Background ◼ CNN/Transformer unified acceleration challenges ◼ Our CIM Macro Concept ⚫ Capacitor-Reconfigured CIM ⚫ Resource-efficient Multi-bit Driver ◼ Measurement results ◼ Conclusion

Slide 45

Slide 45 text

Prototype chip in 65nm CMOS. (Die photo: 1088x78 CR-CIM array, CTRL, Driver, and Successive Approximation blocks; annotated dimensions 1270 µm, 320 µm, and 60 µm.)

Slide 46

Slide 46 text

CIM column characteristics. All measurements at VDD = 0.6 V with the 10-bit ADC. (Figures: readout noise (w/ CB) in LSBrms vs. ADC code, and column INL in LSB vs. ADC code.)

Slide 47

Slide 47 text

CIM column characteristics. All measurements at VDD = 0.6 V with the 10-bit ADC. (Figures: SQNR [dB] for Jia [3], Lee [5], Ours (CNN), and Ours (Trans.), showing a +22 dB improvement; CSNR [dB] vs. input & weight bit precision (4-8b) for the same designs, showing a +13 dB improvement and reaching our CSNR target for Transformers.)

Slide 48

Slide 48 text

Comparison

                               This work   [3] JSSC 2020   [4] ISSCC 2023   [5] VLSI 2021
CIM type                       Charge      Charge          Charge           Charge
Process                        65nm        65nm            12nm             28nm
Bit precision                  4-8b        1-8b            1-8b             1-5b
Application                    CNN         CNN             CNN              CNN
Peak TOPS (normalized to 1-b)  6           2.1             6.4              6.1
Peak TOPS/W (norm. to 65nm**)  4094        400             837              2496
ADC bits                       8           8               8                8
SQNR [dB]                      26.7        22              N.A.             17.5
CSNR [dB]                      16.8        17              N.A.             10.5
CIFAR-10 accuracy              91.7        92.4            N.A.             91.1

Slide 49

Slide 49 text

Comparison

                               This work (CNN / Transformer)   [3] JSSC 2020   [4] ISSCC 2023   [5] VLSI 2021
CIM type                       Charge                          Charge          Charge           Charge
Process                        65nm                            65nm            12nm             28nm
Bit precision                  4-8b                            1-8b            1-8b             1-5b
Application                    CNN / Transformer               CNN             CNN              CNN
Peak TOPS (normalized to 1-b)  6 / 1.2                         2.1             6.4              6.1
Peak TOPS/W (norm. to 65nm**)  4094 / 818                      400             837              2496
ADC bits                       8 / 10                          8               8                8
SQNR [dB]                      26.7 / 45.3                     22              N.A.             17.5
CSNR [dB]                      16.8 / 31.3                     17              N.A.             10.5
CIFAR-10 accuracy              91.7 / 95.8                     92.4            N.A.             91.1

First ACIM to achieve unified operation of CNNs and Transformers

Slide 50

Slide 50 text

Conclusion • First realization of an ACIM capable of efficient Transformer & CNN inference – Transformer mode: accurate bit-serial operation for enhanced CSNR – CNN mode: efficient bit-parallel operation • Key circuit innovations: – CR-CIM to achieve an area-efficient 10-b ADC – Resource-efficient multi-bit driver

Slide 51

Slide 51 text

Research questions • Can analog CIM be made even more accurate? • Is high compute precision really needed all the time? • → We propose a hybrid CIM that dynamically applies high-precision computation only to the important data

Slide 52

Slide 52 text

OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration. Yung-Chin Chen (1,2), Shimpei Ando (1), Daichi Fujiki (1), Shinya Takamaeda-Yamazaki (3), Kentaro Yoshioka (1). Keio Computing and Sensing Group

Slide 53

Slide 53 text

Outline ⚫ Motivation ⚫ OSA-HCIM Architecture ⚫ Software: On-the-fly Saliency Aware Precision Configuration Scheme ⚫ Hardware: OSA-HCIM Macro Architecture ⚫ Software/Hardware Co-design: OSA-HCIM Framework ⚫ Results

Slide 54

Slide 54 text

Saliency-Aware Computation ⚫ Salient: critical for the outputs; needs high accuracy ⚫ Non-salient: unimportant; can tolerate errors. (Figure: salient pixels are computed precisely, non-salient pixels are computed efficiently.)

Slide 55

Slide 55 text

Conventional CIM Topology ⚫ Both DCIM and ACIM have low flexibility due to their hard-wired circuit topology ☹. Digital CIM (DCIM): digital logic ➔ accuracy ☺; bulky circuits ➔ efficiency ☹. Analog CIM (ACIM): PVT variation ➔ accuracy ☹; compact, high throughput ➔ efficiency ☺.

Slide 56

Slide 56 text

CIM Challenge: Low Flexibility. All pixels are computed equally

Slide 57

Slide 57 text

Goal: Saliency-Aware CIM ⚫ Step 1: Identify input saliency

Slide 58

Slide 58 text

Goal: Saliency-Aware CIM ⚫ Step 2: Allocate computation resources accordingly

Slide 59

Slide 59 text

OSA-HCIM ⚫ On-the-fly Saliency-Aware Hybrid SRAM CIM ⚫ A HYBRID, DYNAMIC CIM solution ⚫ Hybrid: accurate & efficient ⚫ Dynamic: flexible for different saliencies & different tasks

Slide 60

Slide 60 text

Outline ⚫ Motivation ⚫ OSA-HCIM Architecture ⚫ Software: On-the-fly Saliency Aware Precision Configuration Scheme ⚫ Hardware: OSA-HCIM Macro Architecture ⚫ Software/Hardware Co-design: OSA-HCIM Framework ⚫ Results

Slide 61

Slide 61 text

Decompose into 1b x 1b MACs ⚫ Break a k-bit x k-bit MAC down into k² 1b x 1b MACs. (Figure: bit-plane decomposition from MSB to LSB.)
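A sketch of this decomposition, extended to a whole dot product (illustrative only): a k-bit x k-bit vector MAC is rebuilt from k² binary-vector MACs, one per (input-bit, weight-bit) pair, each weighted by 2^(i+j). The hybrid scheme in the next slides assigns each of these k² partial MACs to either the digital or the analog path.

```python
import numpy as np

def decompose_mac(x, w, bits=4):
    """Exact k-bit dot product rebuilt from k*k binary (1b x 1b) vector MACs."""
    total = 0
    for i in range(bits):                               # input bit plane
        xi = (x >> i) & 1
        for j in range(bits):                           # weight bit plane
            wj = (w >> j) & 1
            total += int((xi & wj).sum()) << (i + j)    # one binary MAC, weighted 2^(i+j)
    return total

rng = np.random.default_rng(4)
x = rng.integers(0, 16, 64)                             # unsigned INT4 inputs
w = rng.integers(0, 16, 64)                             # unsigned INT4 weights
assert decompose_mac(x, w) == int(np.dot(x, w))
print(decompose_mac(x, w))
```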

Slide 62

Slide 62 text

Partition the MACs based on their order ⚫ DCIM: computed in digital mode ⚫ ACIM: computed in analog mode. Requirements: 1. Flexible and input-dependent 2. Can be configured on the fly

Slide 63

Slide 63 text

How to Determine the Boundaries? ⚫ Heuristic: the input value is positively correlated with its saliency. This gives high flexibility for the accuracy-efficiency tradeoff.
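Continuing the toy model above, the sketch below (with an assumed noise model, not the OSA-HCIM hardware) computes the highest-order bit planes exactly ("digital") and the remaining low-order planes with additive noise ("analog"). Following the deck's convention, a larger boundary value routes more planes to the precise digital path; a smaller boundary pushes more work onto the cheaper analog path at the cost of accuracy.

```python
import numpy as np

def hybrid_mac(x, w, boundary, bits=4, analog_sigma=2.0, rng=None):
    """Bit planes in the top `boundary` orders are exact (digital); the rest are noisy (analog)."""
    rng = rng or np.random.default_rng()
    max_order = 2 * (bits - 1)
    total = 0.0
    for i in range(bits):
        for j in range(bits):
            partial = int((((x >> i) & 1) & ((w >> j) & 1)).sum())
            if i + j > max_order - boundary:
                total += partial << (i + j)                                        # digital: exact
            else:
                total += (partial + rng.normal(0.0, analog_sigma)) * 2 ** (i + j)  # analog: noisy
    return total

rng = np.random.default_rng(5)
x, w = rng.integers(0, 16, 256), rng.integers(0, 16, 256)
exact = int(np.dot(x, w))
for b in (0, 4, 7):                      # 0: fully analog, 7: fully digital (for 4b x 4b)
    err = abs(hybrid_mac(x, w, b, rng=rng) - exact)
    print(f"boundary {b}: |error| = {err:8.1f}   (exact = {exact})")
```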

Slide 64

Slide 64 text

Outline ⚫ Motivation ⚫ OSA-HCIM Architecture ⚫ Software: On-the-fly Saliency Aware Precision Configuration Scheme ⚫ Hardware: OSA-HCIM Macro Architecture ⚫ Software/Hardware Co-design: OSA-HCIM Framework ⚫ Results

Slide 65

Slide 65 text

Architecture Overview ⚫ OSA-HCIM macro: 8 Hybrid MAC Units (HMUs) and 1 On-the-fly Saliency Evaluator ⚫ Each HMU: 144 Hybrid CIM Arrays, 1 digital adder tree, 1 normalization & quantization unit, and 1 3-bit SAR ADC

Slide 66

Slide 66 text

Hybrid CIM Array (HCIMA) ⚫ Split-port 6T SRAM ⚫ LBLB reads Wb for digital CIM ⚫ LBL reads W for analog CIM ⚫ DCIM: bit-serial ⚫ ACIM: bit-parallel

Slide 67

Slide 67 text

Outline ⚫ Motivation ⚫ OSA-HCIM Architecture ⚫ Software: On-the-fly Saliency Aware Precision Configuration Scheme ⚫ Hardware: OSA-HCIM Macro Architecture ⚫ Software/Hardware Co-design: OSA-HCIM Framework ⚫ Results

Slide 68

Slide 68 text

D/A Boundary ⚫ Determine the D/A boundary for the performance tradeoff ⚫ Goal: higher saliency ➔ more digital compute

Slide 69

Slide 69 text

On-the-fly Saliency Evaluator (OSE) ⚫ Architecture of the OSE. Saliency thresholds [T0,…,Tn] determine the D/A boundary by comparing the saliency estimate S against them.
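A minimal sketch of this threshold lookup (illustrative; how the saliency score S is derived on-chip, here an MSB-only partial MAC, is an assumption rather than taken from the paper): the score is compared against the thresholds [T0,…,Tn], and the number of thresholds it exceeds sets the D/A boundary for that output pixel, so higher saliency yields a larger boundary and more digital bits, consistent with the examples that follow.

```python
import numpy as np

def msb_saliency(x, w, msb_bits=2, total_bits=4):
    """Cheap saliency proxy: a MAC using only the top MSBs of inputs and weights (assumed)."""
    shift = total_bits - msb_bits
    return int(np.dot(x >> shift, w >> shift)) << (2 * shift)

def da_boundary(saliency, thresholds):
    """Higher saliency -> larger D/A boundary -> more bit planes computed digitally."""
    return sum(saliency >= t for t in sorted(thresholds))

rng = np.random.default_rng(6)
thresholds = [2000, 6000, 12000]          # [T0, ..., Tn]: tuning knobs (assumed values)
for _ in range(4):
    x, w = rng.integers(0, 16, 256), rng.integers(0, 16, 256)
    s = msb_saliency(x, w)
    print(f"saliency estimate {s:6d} -> D/A boundary {da_boundary(s, thresholds)}")
```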

Slide 70

Slide 70 text

OSE Example 1. A pixel in the Output Activation (OA)

Slide 71

Slide 71 text

OSE Example 1. Lowest boundary value: few digital bits. Not a salient pixel ➔ no need to compute precisely

Slide 72

Slide 72 text

OSE Example 2. More digital bits. A bit more salient ➔ use a larger boundary value

Slide 73

Slide 73 text

OSE Example 3. Conduct precise computation. High-saliency pixels!

Slide 74

Slide 74 text

Accuracy-Efficiency Tradeoff Flexibility ⚫ Adjust the thresholds for the desired operating point ⚫ Desire higher accuracy ➔ set smaller thresholds ⚫ Desire higher efficiency ➔ set higher thresholds. High adaptability to a wide range of tasks

Slide 75

Slide 75 text

OSA-HCIM under different tasks. When facing harder tasks (e.g., ImageNet), OSA-HCIM can trade off efficiency to ensure accuracy.

Task                   SNR requirement   Accuracy        Accuracy drop   Energy eff. (TOPS/W)
ResNet18 @ CIFAR-100   Low               67.4%-72.1%     4.8%-0.1%       5.33-5.79
ResNet18 @ ImageNet    High              65.2%-70.8%     6.3%-0.8%       3.83-4.66

Slide 76

Slide 76 text

OSA-HCIM Summary ⚫ OSE: only 1% of the power and 1% of the area ⚫ 3-bit ADC: only 17% of the power and 6% of the area

Slide 77

Slide 77 text

Distribution of the B_D/A value ⚫ Within layers: the OSE effectively identifies salient pixels ⚫ Across layers: the OSE adapts to each layer's precision requirements

Slide 78

Slide 78 text

Accuracy-Efficiency Tradeoff ⚫ Hybrid CIM (fixed boundary) ⚫ 1.56x efficiency gain ⚫ OSA-HCIM (dynamic boundary) ⚫ Another 1.25x efficiency gain + tradeoff flexibility

Slide 79

Slide 79 text

Comparison w/ SOTA

Slide 80

Slide 80 text

Conclusion ⚫ OSA-HCIM features ⚫ SW: Saliency-Aware Precision Configuration Scheme ⚫ HW: Hybrid CIM Array with concurrent digital and analog operations ⚫ SW/HW co-design: high versatility for the accuracy-efficiency tradeoff ⚫ OSA-HCIM reaches 5.33-5.79 TOPS/W with robust accuracy (ResNet18@CIFAR100, 65nm) ⚫ OSA-HCIM is the first Saliency-Aware CIM ⚫ OSA-HCIM is the first Dynamic Hybrid CIM

Slide 81

Slide 81 text

Compute-in-Memory (CIM) circuits ◼ Can Compute-in-Memory (CIM) eliminate the communication bottleneck? ◆ Half true, half false ◆ MAC = Σ In[n] × W[n] ◆ The weights stay inside the memory ◆ The input (In) data still moves around the chip ◆ → Only the weight traffic can be eliminated

Slide 82

Slide 82 text

Compute-in-Memory (CIM) circuits ◼ Can Compute-in-Memory (CIM) eliminate the communication bottleneck? ◆ Half true, half false ◆ MAC = Σ In[n] × W[n] ◆ The weights stay inside the memory ◆ The input (In) data still moves around the chip ◆ → Only the weight traffic can be eliminated ◆ An architecture that also reduces the input traffic is needed

Slide 83

Slide 83 text

Summary • CIM circuits are attracting attention as highly efficient DNN accelerators • We explored a high-accuracy CIM architecture and also demonstrated Transformer operation • We proposed a saliency-based hybrid CIM and established a technique for dynamically varying the compute precision • Going forward, we will explore CIM processors that reduce overall memory traffic