
Toward High-Precision, High-Efficiency Analog Compute-in-Memory Circuits


Self-introduction
DNN accelerators
Why and when analog computing?
Computing-in-Memory (CIM) circuits
Yoshioka Lab's analog CIM research
An 818-4094 TOPS/W Capacitor-Reconfigured CIM Macro for Unified Acceleration of CNNs and Transformers(ISSCC’24)
OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration(ASP-DAC’24)

Yoshioka Lab (Keio CSG)

April 11, 2024

Transcript

1. Today's agenda • Self-introduction • DNN accelerators − Why and when analog computing? − Computing-in-Memory (CIM) circuits • Yoshioka Lab's analog CIM research − An 818-4094 TOPS/W Capacitor-Reconfigured CIM Macro for Unified Acceleration of CNNs and Transformers (ISSCC'24) − OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration (ASP-DAC'24). Venue: Integrated Circuits Technical Meeting (ICD), 2024.
2. Self-introduction • 2014 Keio University • 2014-2021 Toshiba Corporation • 2017-2018 Stanford Visiting Scholar • 2021- Appointed full-time Lecturer (Assistant Professor) at Keio University; PI of the Yoshioka Lab • Integrated circuits (LSI) − Takamaeda JST CREST (co-PI), 2021-, analog CIM circuits • 3D sensing (LiDAR) − JST PRESTO (PI), 2022-, autonomous-driving LiDAR security − Mori JST CREST (co-PI), 2023-, autonomous-driving system security − KAKENHI Grant-in-Aid (B) (PI), 2024-, LiDAR design. Lab mascot: CSG-kun
3. Self-introduction ・2014 Graduated from Keio Univ. ・2014-2021 Toshiba Research ・2017-2018 Stanford Visiting Scholar @ Mark Horowitz Group ・2021- Keio Univ. Assistant Prof. ・Expertise: mixed-signal circuit design, LiDAR design, ML accelerators, LiDAR security. Publications: WiFi/ADCs (VLSI 2020, ISSCC 2018, JSSC 2018, ISSCC 2020, JSSC 2020), LiDAR SoCs (ISSCC 2017, ISSCC 2018, JSSC 2018, TVLSI 2019), CIM/AI accelerators (ISSCC 2024, ASP-DAC 2024)
4. Self-introduction ・Same career summary as the previous slide, with a note on Mark Horowitz: founder of Rambus, extremely knowledgeable about memory interfaces, and also strong in computer architecture.
5. Key research challenges for DNN accelerators • (2) Lower-power arithmetic circuits – Digital arithmetic circuits are already well optimized – Further power reduction is difficult – Extreme option: analog computing!? – Required DNN arithmetic precision is low (INT2-INT8) – At low precision, analog computing can reduce power.
ResNet-50 ImageNet top-1 accuracy (weights and activations quantized with PACT): Full (FP32) 0.769 | INT2 0.722 | INT3 0.753 | INT4 0.765 | INT5 0.767
J. Choi, "PACT: Parameterized Clipping Activation for Quantized Neural Networks", arXiv:1805.06085
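To make the quantization in the table concrete, here is a minimal sketch of PACT-style clipping followed by uniform quantization; the function name and the clipping level alpha are illustrative assumptions, not the PACT training procedure itself.

```python
import numpy as np

def quantize_uniform(x, n_bits, alpha):
    """PACT-style: clip activations to [0, alpha], then quantize uniformly to n_bits levels."""
    x_clipped = np.clip(x, 0.0, alpha)
    scale = (2 ** n_bits - 1) / alpha
    return np.round(x_clipped * scale) / scale

# Toy example: INT4 activations with an assumed clipping level alpha = 6.0
acts = np.random.default_rng(0).random(8) * 8.0
print(quantize_uniform(acts, n_bits=4, alpha=6.0))
```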
6. CIM topology. Digital CIM (DCIM): digital logic ➔ accuracy ☺; bulky circuit ➔ efficiency ☹. Analog CIM (ACIM): PVT variation ➔ accuracy ☹; compact, high throughput ➔ efficiency ☺. Digital CIM accumulates with a digital adder tree, whose area and power grow with the square of the array size, limiting the number of columns. Analog CIM accumulates in the analog domain, so column counts in the thousands are feasible, but a DNN large enough to exploit such a huge column count is then needed.
7. Digital circuits vs. analog circuits • Computation with digital circuits − Propagates 1/0 signals through gates (logic circuits) − Pros: unaffected by noise; high-precision computation is possible (e.g., FP128) − Cons: little remaining potential for power reduction • Computation with analog circuits − Handles continuous values such as 0.1, 0.3, ..., and suits operations such as accumulation − Pros: higher efficiency by exploiting continuous values − Cons: vulnerable to noise and unsuited to high-precision computation
8. Analog or digital: which is better? • When is analog computation efficient? − At high precision (> 10 b ENOB), energy increases exponentially due to thermal-noise constraints − At binary precision, digital is already efficient and analog offers little advantage. [Ref] B. Murmann, "Mixed-Signal Computing for Deep Neural Network Inference", TVLSI 2021.
9. Analog or digital: which is better? • When is analog computation efficient? − The sweet spot is INT3-6, where analog is not limited by noise − Ideally, the analog MAC's energy increases only linearly in this region. [Ref] B. Murmann, "Mixed-Signal Computing for Deep Neural Network Inference", TVLSI 2021.
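As a rough illustration of the trend cited from Murmann above, the toy model below contrasts how digital and analog MAC energies scale with precision: digital roughly as B^2 (multiplier partial products), analog roughly linearly while technology-limited but about 4x per extra bit once kT/C thermal noise dominates. The constants and the assumed ~10 b crossover are arbitrary and only reproduce the qualitative shape, not the paper's numbers.

```python
import numpy as np

bits = np.arange(2, 13)

# Digital MAC: a BxB multiplier has ~B^2 partial-product adders -> energy ~ B^2 (normalized).
e_digital = bits.astype(float) ** 2

# Analog charge-domain MAC:
#  - technology-limited regime: energy grows ~linearly with B (the INT3-6 "sweet spot")
#  - thermal-noise-limited regime: kT/C noise must stay below 1 LSB, so the sampling
#    capacitor (and E = C*V^2) grows ~4x per extra bit; assumed to dominate above ~10 b
e_analog = np.maximum(bits.astype(float), 10.0 * 4.0 ** (bits - 10.0))

for b, ed, ea in zip(bits, e_digital, e_analog):
    print(f"{b:2d} b   digital ~ {ed:6.1f}   analog ~ {ea:8.1f}")
```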
10. Charge-based CIM • The "multiply" is done digitally, and the "accumulation" over a vector of length N is done in the analog domain → realizes a binary MAC − The weight memory and the processing can be integrated as "in-memory computing". Computation is performed in the charge domain (Q = ΣCV): a 2000-element vector summation is executed in a single cycle, and fewer circuit elements are needed than in a digital implementation, enabling low power. [Diagram: inputs IN[0]..IN[N] multiplied by weights W[0]..W[N], accumulated via charge on a shared line, then digitized by an ADC] [Ref] H. Valavi, "A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute", JSSC 2019.
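A behavioral sketch of the charge-domain binary MAC described above: each cell multiplies digitally (AND of a 1-b weight and a 1-b input) and dumps charge Q = C·V onto a shared column line, and an ADC digitizes the resulting voltage. The capacitor value, ADC resolution, and noise level are illustrative assumptions rather than the Valavi design.

```python
import numpy as np

def charge_domain_binary_mac(w_bits, in_bits, c_unit=1.0, vdd=1.0,
                             adc_bits=8, noise_lsb=0.3, rng=None):
    """One CIM column: 1-b multiplies (AND), charge-domain accumulation (Q = sum(C*V)),
    charge sharing across all cell caps, then ADC quantization with additive noise."""
    rng = rng or np.random.default_rng(0)
    products = w_bits & in_bits                    # 1-b multiply in every cell
    q = c_unit * vdd * products.sum()              # total charge dumped on the column
    v = q / (c_unit * len(w_bits))                 # voltage after charge sharing, in [0, vdd]
    code = v / vdd * (2 ** adc_bits - 1)           # ideal ADC transfer (normalized sum)
    code += rng.normal(0.0, noise_lsb)             # assumed ADC/comparator noise
    return int(np.clip(np.round(code), 0, 2 ** adc_bits - 1))

w = np.random.default_rng(1).integers(0, 2, 2000)  # 2000-element binary weight vector
x = np.random.default_rng(2).integers(0, 2, 2000)  # 2000-element binary input vector
print("ideal popcount:", int((w & x).sum()), " ADC code:", charge_domain_binary_mac(w, x))
```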
11. Multi-bit extension of CIM • How can we extend to multi-bit MACs? − Binary computation can be extended to arbitrary precision by "bit-serial" processing. Example: 1010 × 0101 produces partial products 1010, 0000, 1010, 0000, which sum to 110010; a 4b × 4b multiply is broken up into 16 binary multiply-and-adds. C. Eckert, "Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks", ISCA 2018.
12. Multi-bit extension of CIM • How can we extend this to multi-bit MACs? − Binary computation can be extended to arbitrary precision by "bit-serial" processing, vectorizing the bit-serial operation across the array. [Ref] H. Jia, "A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing", JSSC 2020.
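A minimal sketch of the bit-serial extension shown on these two slides: a multi-bit MAC is assembled from binary MACs, one (input-bit, weight-bit) pair per pass, each shifted by its significance. The `binary_mac` argument stands in for one analog column evaluation and is computed exactly here; values are unsigned.

```python
import numpy as np

def bit_serial_mac(inputs, weights, in_bits=4, w_bits=4,
                   binary_mac=lambda a, b: int((a & b).sum())):
    """Multi-bit MAC from binary MACs: one pass per (input-bit, weight-bit) pair."""
    acc = 0
    for i in range(in_bits):
        in_plane = (inputs >> i) & 1               # i-th bit of every input
        for j in range(w_bits):
            w_plane = (weights >> j) & 1           # j-th bit of every weight
            acc += binary_mac(in_plane, w_plane) << (i + j)
    return acc

x = np.array([10, 3, 7, 1])                        # 4-b unsigned inputs
w = np.array([5, 2, 6, 4])                         # 4-b unsigned weights
print(bit_serial_mac(x, w), "==", int(np.dot(x, w)))   # both give 102
```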
13. Yoshioka Lab's CIM research. Gen. 1: establishing the CIM technology. Question: can a high-precision analog CIM be realized? (ISSCC'24, ASP-DAC'24). Gen. 2: saliency-based CIM. Question: can a CIM with configurable precision be realized? (Under review.) Gen. 3: toward a CIM processor. Question: realize a CIM processor that also accounts for system-wide data traffic. (Under review.)
14. An 818-4094 TOPS/W Capacitor-Reconfigured CIM Macro for Unified Acceleration of CNNs and Transformers. Kentaro Yoshioka, Keio University Compute and Sensing Group (CSG), https://sites.google.com/keio.jp/keio-csg/
15. Background ◼ Modern edge ML workloads are diverse ⚫ CNNs, Transformers, and hybrids that blend both → Can analog CIM handle all of these workloads? [Figure credits: Wikipedia, "Convolutional neural network"; A. Dosovitskiy, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021]
16. CNN/Transformer acceleration challenges ◼ Unique computational requirements ⚫ Transformers require high compute precision ⚫ CNNs are fine with low precision. Precision is denoted as compute SNR: CSNR = 20 log10(Signal/Noise) [dB]. [Plot: CIFAR-10 accuracy vs. CSNR for a Transformer (ViT-S) and a CNN (ResNet-20)]
17. CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ⚫ Conventional works target CNNs with relaxed CSNR ⚫ Computational accuracy is NOT a primary concern in ACIMs. [Plot: CIFAR-10 accuracy vs. CSNR for ViT-S and ResNet-20, with [Jia, JSSC 2020], [Lee, VLSI 2021], and our CSNR target marked]
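CSNR as defined on these slides can be estimated by comparing noisy MAC outputs against an ideal reference; the sketch below assumes a simple additive-Gaussian compute-noise model with made-up signal and noise levels.

```python
import numpy as np

def csnr_db(ideal, measured):
    """CSNR = 20*log10(rms(signal) / rms(error)), following the slide's definition."""
    err = np.asarray(measured) - np.asarray(ideal)
    return 20.0 * np.log10(np.std(ideal) / np.std(err))

rng = np.random.default_rng(0)
ideal = rng.normal(0.0, 1.0, 100_000)                 # ideal MAC outputs (unit signal power)
noisy = ideal + rng.normal(0.0, 0.03, ideal.shape)    # assumed analog compute noise
print(f"CSNR = {csnr_db(ideal, noisy):.1f} dB")       # about 30 dB for this noise level
```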
18. CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ⚫ Conventional works target CNNs with relaxed CSNR ⚫ Computational accuracy is not a primary concern in ACIMs ◼ DCIMs: excel in CSNR, but require bulky adders. [Same plot as the previous slide]
19. CNN/Transformer acceleration challenges ◼ ACIMs: challenging to achieve high CSNR ◼ DCIMs: excel in CSNR, but require bulky adders. Can we design an ACIM supporting both CNNs and Transformers? [Same plot as the previous slides]
20. Review of ACIM architectures ◼ Current-domain ACIMs ⚫ ☺ Area and energy efficiency ⚫ ☹ Transistor current is inherently non-linear ◼ Time-domain ACIMs ⚫ ☺ Area and energy efficiency ⚫ ☹ Delay is prone to mismatch. [Circuit sketches: W[0]·I[0], W[1]·I[1] summed in the current and time domains]
21. Review of ACIM architectures ◼ Current-domain ACIMs ⚫ ☺ Area and energy efficiency ⚫ ☹ Transistor current is inherently non-linear ◼ Time-domain ACIMs ⚫ ☺ Area and energy efficiency ⚫ ☹ Delay is prone to mismatch ◼ Charge-domain ACIMs ⚫ ☺ High linearity of MOM caps ⚫ ☹ High ADC resolution → worsened area efficiency ⚫ ☹ Noise resilience → worsened energy efficiency
22. Our CIM macro concept • Basic idea: dynamically configure the macro to reflect the DNN being processed • High-CSNR Transformer mode • Low-CSNR CNN mode. [Block diagram: 1088x78 CIM array of CR-CIM cells (6T SRAM + DDAC), binary-weighted DDAC/Reset[9:0] branches (x1 ... x512), 5-bit driver on IN[0], fComp, and SAR logic]
23. Our CIM macro concept • Basic idea: dynamically configure the macro to reflect the DNN being processed • High-CSNR Transformer mode: high-precision bit-serial CIM with 10-b ADC resolution and noise • Low-CSNR CNN mode: low-precision bit-parallel CIM to achieve high efficiency. [Same block diagram as the previous slide]
24. Our CIM macro concept • High-CSNR Transformer mode: high-precision bit-serial CIM, 10-b ADC resolution and noise • Low-CSNR CNN mode: low-precision bit-parallel CIM, high efficiency. Key techniques: A. Area-efficient capacitor-reconfigured (CR) CIM; B. Resource-efficient multi-bit driver; C. Noise reduction adapted to the Transformer layer. [Same block diagram as the previous slides]
25. Outline ◼ Background ◼ CNN/Transformer unified acceleration challenges ◼ Our CIM macro concept ⚫ Capacitor-Reconfigured CIM (CR-CIM) ⚫ Resource-efficient multi-bit driver ◼ Measurement results ◼ Conclusion
26. CR-CIM concept. Conventional: PE[0]..PE[N] in a CIM array (x1000) drive a computation cap array, followed by a separate 10-b ADC cap array and 10-b SAR logic.
27. CR-CIM concept. Conventional: the separate computation and ADC capacitor arrays are area consuming ☹.
28. CR-CIM concept. Proposed CR-CIM: PE[0]..PE[N] share one capacitor array that serves as both the computation and the ADC cap array, closed around a successive-approximation quantizer (SAR feedback, 10-b SAR logic). Reconfiguring the cap array for dual use (computation/ADC) yields an area-efficient 10-b ADC.
29. CR-CIM array implementation: computation phase. [Schematic: 6T SRAM cells holding W, inputs I, the CIM cap array, the Vcomp node, the SAR quantizer, and the SAR logic]
30. CR-CIM array implementation: computation and ADC phases. [Schematics: the computation configuration as on the previous slide, and the ADC configuration in which the same CIM cap array is reused by the SAR quantizer via DDAC/Reset[9:0], fComp, and RST]
31. CR-CIM transistor implementation. [Schematics of the 10T CR-CIM cell (6T SRAM, cell capacitor C, Vcomp, transistors MP1-MP3 and MN1) in the computation configuration, driven by IN and W, and in the ADC & reset configuration, driven by DAC/Reset]
32. CR-CIM transistor implementation. [Same 10T CR-CIM cell schematics as the previous slide] ☺ Supports both bit-parallel and bit-serial operation ☺ Small 10T cell
33. Resource-efficient multi-bit driver • Bit-parallel operation greatly improves the efficiency in CNN mode – [Lee, VLSI 2021] uses 16 off-chip-generated reference voltages → large voltage-generation overhead. [Diagram: Vref x16 multi-bit driver feeding the CIM row array (W0, W1)]
34. Resource-efficient multi-bit driver • Capacitive division to generate the reference voltages – ☺ Only one reference voltage is required – ☹ Voltage errors arise from the varying capacitive load. [Diagram: Vref driving a 5-bit C-DAC (CMSB, CMSB-1, ..., DAC LSB) into the CIM row array (W0, W1); plot: DAC LSB error vs. sum of row weights (0-60) without compensation, for IN[4:0] = 0, 15, 30]
35. Resource-efficient multi-bit driver • Capacitive division to generate the reference voltages – ☺ A 4-b load-compensation C-DAC (CcMSB, CcMSB-1, ..., driven by the precomputed weight sum) cancels the variation – The error decreases from 25 LSB to +1.2/-0.5 LSB. [Plot: DAC LSB error vs. sum of row weights, with and without compensation]
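A behavioral toy model (not the actual driver circuit) of the issue and the fix described above: a capacitively divided reference settles to roughly Vref*C_dac/(C_dac + C_load), so it shifts as the row-array load changes with the weight sum, and a load-compensation C-DAC can add capacitance so that the denominator stays roughly constant. All capacitor values below are made up for illustration.

```python
def divided_reference(code, weight_sum, compensate=True,
                      c_unit=1.0, c_row=0.05, c_total_target=40.0):
    """Toy capacitive divider: a 5-b C-DAC (selected capacitance = code * c_unit) drives a
    row array whose load grows with the weight sum; an optional compensation cap keeps
    the total capacitance constant. Returns the output as a fraction of Vref."""
    c_dac = code * c_unit                        # selected DAC capacitance (code = IN[4:0])
    c_load = weight_sum * c_row                  # assumed load model of the CIM row array
    c_comp = max(c_total_target - c_dac - c_load, 0.0) if compensate else 0.0
    return c_dac / (c_dac + c_load + c_comp)

for wsum in (0, 30, 60):                         # output is load-dependent without compensation
    print(wsum,
          round(divided_reference(16, wsum, compensate=False), 3),
          round(divided_reference(16, wsum, compensate=True), 3))
```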
36. CSNR boost technique: Vision Transformer architecture. An image patch passes through vector mixing (a.k.a. attention) on the feature vectors and then a vector-wise MLP, and Transformers repeat this block multiple times before producing the class scores (e.g., Palace 98%, Station 2%). The attention stage has a high CSNR requirement → use the CSNR boost (majority voting); the vector-wise MLP has a low CSNR requirement → normal operation.
37. CSNR boost technique: quantizer with CSNR boost (CB). The 10-b SA logic can enable 6x majority voting on selected comparator decisions (EN, DCMP); in layers that need high SNR, this averaging reduces the ADC noise. Operation principle over SAR cycles 1-11 (S: single comparison, MV: majority voting):
Normal: S S S S S S S S S S - → 10 comparisons, CSNR 25.8 dB, power 1x
w/ CSNR boost: S S S S S S S S MV MV MV → 26 comparisons, CSNR 31.3 dB, power 1.9x
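The CSNR-boost idea (repeat the noisy comparison and take a majority vote on the critical SAR decisions) can be illustrated with the sketch below; the 6x vote count follows the slide, while the comparator noise level and the input distribution are assumed toy values.

```python
import numpy as np

def comparator(v_in, v_ref, sigma, rng):
    """One SAR comparison corrupted by input-referred Gaussian noise."""
    return (v_in + rng.normal(0.0, sigma)) > v_ref

def majority_vote(v_in, v_ref, sigma, rng, votes=6):
    """CSNR boost: repeat the comparison and take the majority decision (ties resolve low)."""
    hits = sum(comparator(v_in, v_ref, sigma, rng) for _ in range(votes))
    return hits * 2 > votes

rng = np.random.default_rng(0)
sigma, v_ref = 0.5, 0.0
inputs = rng.normal(0.0, 0.2, 10_000)          # inputs clustered near the decision threshold
single = np.mean([comparator(v, v_ref, sigma, rng) == (v > v_ref) for v in inputs])
voted  = np.mean([majority_vote(v, v_ref, sigma, rng) == (v > v_ref) for v in inputs])
print(f"correct decisions: single {single:.3f}, 6x majority vote {voted:.3f}")
```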
38. Outline ◼ Background ◼ CNN/Transformer unified acceleration challenges ◼ Our CIM macro concept ⚫ Capacitor-Reconfigured CIM ⚫ Resource-efficient multi-bit driver ◼ Measurement results ◼ Conclusion
39. Prototype chip in 65nm CMOS. [Die micrograph: 1088x78 CR-CIM array, CTRL, Driver, and Successive Approximation blocks; annotated dimensions 1270um, 320um, 60um]
40. CIM column characteristics. [Plots: readout noise (LSBrms, w/ CB) vs. ADC code, and column INL (LSB) vs. ADC code] All measurements at VDD = 0.6 V with the 10-bit ADC.
41. CIM column characteristics. [Plots: SQNR [dB] for Jia [3], Lee [5], Ours (CNN), and Ours (Transformer), showing a +22 dB improvement; CSNR [dB] vs. input & weight bit precision (4-8 b) for the same designs, showing a +13 dB improvement and reaching our CSNR target for Transformers] All measurements at VDD = 0.6 V with the 10-bit ADC.
42. Comparison
                               This work | [3] JSSC 2020 | [4] ISSCC 2023 | [5] VLSI 2021
CIM type                       Charge    | Charge        | Charge         | Charge
Process                        65nm      | 65nm          | 12nm           | 28nm
Bit precision                  4-8b      | 1-8b          | 1-8b           | 1-5b
Application                    CNN       | CNN           | CNN            | CNN
Peak TOPS (norm. to 1-b)       6         | 2.1           | 6.4            | 6.1
Peak TOPS/W (norm. to 65nm**)  4094      | 400           | 837            | 2496
ADC bits                       8         | 8             | 8              | 8
SQNR [dB]                      26.7      | 22            | N.A.           | 17.5
CSNR [dB]                      16.8      | 17            | N.A.           | 10.5
CIFAR-10 accuracy              91.7      | 92.4          | N.A.           | 91.1
43. Comparison (this work now listed as CNN mode / Transformer mode)
                               This work (CNN / Transformer) | [3] JSSC 2020 | [4] ISSCC 2023 | [5] VLSI 2021
CIM type                       Charge            | Charge | Charge | Charge
Process                        65nm              | 65nm   | 12nm   | 28nm
Bit precision                  4-8b              | 1-8b   | 1-8b   | 1-5b
Application                    CNN / Transformer | CNN    | CNN    | CNN
Peak TOPS (norm. to 1-b)       6 / 1.2           | 2.1    | 6.4    | 6.1
Peak TOPS/W (norm. to 65nm**)  4094 / 818        | 400    | 837    | 2496
ADC bits                       8 / 10            | 8      | 8      | 8
SQNR [dB]                      26.7 / 45.3       | 22     | N.A.   | 17.5
CSNR [dB]                      16.8 / 31.3       | 17     | N.A.   | 10.5
CIFAR-10 accuracy              91.7 / 95.8       | 92.4   | N.A.   | 91.1
First ACIM to achieve unified operation of CNNs and Transformers.
44. Conclusion • First realization of an ACIM capable of efficient Transformer & CNN inference – Transformer mode: accurate bit-serial operation for enhanced CSNR – CNN mode: efficient bit-parallel operation • Key circuit innovations: – CR-CIM to achieve an area-efficient 10-b ADC – Resource-efficient multi-bit driver
45. OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration. Yung-Chin Chen, Shimpei Ando, Daichi Fujiki, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka; Keio Computing and Sensing Group
46. Outline ⚫ Motivation ⚫ OSA-HCIM architecture ⚫ Software: on-the-fly saliency-aware precision configuration scheme ⚫ Hardware: OSA-HCIM macro architecture ⚫ Software/hardware co-design: OSA-HCIM framework ⚫ Results
47. Saliency-aware computation ⚫ Salient pixels: critical for the outputs; need high accuracy (compute precisely) ⚫ Non-salient pixels: unimportant; can tolerate errors (compute efficiently)
48. Conventional CIM topology ⚫ Both DCIM and ACIM have low flexibility due to their hard-wired circuit topology ☹. Digital CIM (DCIM): digital logic ➔ accuracy ☺; bulky circuit ➔ efficiency ☹. Analog CIM (ACIM): PVT variation ➔ accuracy ☹; compact, high throughput ➔ efficiency ☺
49. CIM challenge: low flexibility. All pixels are computed equally.
50. Goal: saliency-aware CIM ⚫ Step 1: identify the input saliency
51. Goal: saliency-aware CIM ⚫ Step 2: allocate computation resources accordingly
52. OSA-HCIM ⚫ On-the-fly Saliency-Aware Hybrid SRAM CIM ⚫ A hybrid, dynamic CIM solution ⚫ Hybrid: accurate & efficient ⚫ Dynamic: flexible across different saliencies and different tasks
53. Outline ⚫ Motivation ⚫ OSA-HCIM architecture ⚫ Software: on-the-fly saliency-aware precision configuration scheme ⚫ Hardware: OSA-HCIM macro architecture ⚫ Software/hardware co-design: OSA-HCIM framework ⚫ Results
54. Decompose into 1bx1b MACs ⚫ Break a k-bit x k-bit MAC down into k² 1bx1b MACs. [Diagram: grid of bit products ordered from MSB to LSB]
55. Partition the MACs based on their bit order ⚫ DCIM: computed in digital mode ⚫ ACIM: computed in analog mode. Requirements: 1. Flexible and input-dependent 2. Configurable on the fly. (A sketch of this hybrid partitioning follows below.)
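A simplified sketch of the hybrid partitioning referenced above: here only the weight bit-planes are split at a digital/analog boundary, with the low-order planes accumulated through an assumed Gaussian analog-noise model, whereas OSA-HCIM partitions the full k² grid of 1bx1b products by significance. All parameters are illustrative.

```python
import numpy as np

def hybrid_mac(inputs, weights, w_bits=8, d_a_boundary=4, analog_sigma=0.5, rng=None):
    """Hybrid MAC: weight bit positions >= d_a_boundary are accumulated exactly
    ("digital mode"); lower positions use a noisy accumulation ("analog mode")."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for j in range(w_bits):                                # weight bit position, LSB..MSB
        w_plane = (weights >> j) & 1
        partial = float(np.dot(inputs, w_plane))           # partial sum for this bit-plane
        if j < d_a_boundary:                               # low-order bits: analog, noisy
            partial += rng.normal(0.0, analog_sigma)
        total += partial * (1 << j)                        # weight by bit significance
    return total

x = np.random.default_rng(1).integers(0, 16, 64)           # 4-b inputs
w = np.random.default_rng(2).integers(0, 256, 64)          # 8-b weights
print("exact:", int(np.dot(x, w)), " hybrid (boundary=4):", round(hybrid_mac(x, w), 1))
```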
56. How to determine the boundary? ⚫ Heuristic: the input value is positively correlated with its saliency → high flexibility for the accuracy-efficiency tradeoff
57. Outline ⚫ Motivation ⚫ OSA-HCIM architecture ⚫ Software: on-the-fly saliency-aware precision configuration scheme ⚫ Hardware: OSA-HCIM macro architecture ⚫ Software/hardware co-design: OSA-HCIM framework ⚫ Results
58. Architecture overview ⚫ OSA-HCIM macro: 8 hybrid MAC units (HMUs) and 1 on-the-fly saliency evaluator ⚫ HMU: 144 hybrid CIM arrays, 1 digital adder tree, 1 normalization & quantization unit, and 1 3-bit SAR ADC
59. Hybrid CIM array (HCIMA) ⚫ Split-port 6T SRAM ⚫ LBLB reads Wb for digital CIM ⚫ LBL reads W for analog CIM ⚫ DCIM: bit-serial ⚫ ACIM: bit-parallel
60. Outline ⚫ Motivation ⚫ OSA-HCIM architecture ⚫ Software: on-the-fly saliency-aware precision configuration scheme ⚫ Hardware: OSA-HCIM macro architecture ⚫ Software/hardware co-design: OSA-HCIM framework ⚫ Results
61. D/A boundary ⚫ Determine the digital/analog boundary for the performance tradeoff ⚫ Goal: higher saliency ➔ more digital compute
62. On-the-fly saliency evaluator (OSE) ⚫ Architecture of the OSE: use saliency thresholds [T0, ..., Tn] to determine the D/A boundary:
if S < T0: BD/A = B0
elif T0 ≤ S < T1: BD/A = B1
elif T1 ≤ S < T2: BD/A = B2
...
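A runnable version of the OSE threshold logic shown on this slide; the threshold and boundary values are illustrative, not taken from the paper.

```python
import bisect

def d_a_boundary(saliency, thresholds, boundaries):
    """Map a saliency estimate S to a digital/analog boundary B_D/A.
    thresholds = [T0, ..., Tn-1] in ascending order, boundaries = [B0, ..., Bn]."""
    return boundaries[bisect.bisect_right(thresholds, saliency)]

thresholds = [8, 32, 96]          # T0..T2 (illustrative values)
boundaries = [1, 3, 5, 8]         # from mostly analog ... to fully digital (illustrative)
for s in (3, 20, 80, 200):
    print(f"S={s:3d} -> B_D/A={d_a_boundary(s, thresholds, boundaries)}")
```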
63. OSE example 1: a pixel in the output activation (OA).
64. OSE example 1: lowest boundary value, i.e., few digital bits. Not a salient pixel ➔ no need to compute it precisely.
65. OSE example 2: more digital bits. A bit more salient ➔ use a larger boundary value.
66. OSE example 3: high-saliency pixels! Conduct precise computation.
67. Accuracy-efficiency tradeoff flexibility ⚫ Adjust the thresholds for the desired operating point ⚫ For higher accuracy ➔ set smaller thresholds ⚫ For higher efficiency ➔ set higher thresholds. High adaptability to a wide range of tasks.
68. OSA-HCIM under different tasks. When facing harder tasks (e.g., ImageNet), OSA-HCIM can trade off efficiency to ensure accuracy.
Task               | SNR requirement | Accuracy     | Accuracy drop | Energy eff. (TOPS/W)
ResNet18@CIFAR100  | Low             | 67.4%~72.1%  | 4.8%~0.1%     | 5.33~5.79
ResNet18@ImageNet  | High            | 65.2%~70.8%  | 6.3%~0.8%     | 3.83~4.66
69. OSA-HCIM summary ⚫ OSE: only 1% of the power and 1% of the area ⚫ 3-bit ADC: only 17% of the power and 6% of the area
70. Distribution of the BD/A value ⚫ Within layers: the OSE effectively identifies salient pixels ⚫ Across layers: the OSE adapts to each layer's precision requirements
71. Accuracy-efficiency tradeoff ⚫ Hybrid CIM (fixed boundary): 1.56x efficiency gain ⚫ OSA-HCIM (dynamic boundary): another 1.25x efficiency gain plus tradeoff flexibility
72. Comparison with SOTA. [Comparison table not captured in the transcript]
73. Conclusion ⚫ OSA-HCIM features ⚫ SW: saliency-aware precision configuration scheme ⚫ HW: hybrid CIM array with concurrent digital and analog operations ⚫ SW/HW co-design: high versatility for the accuracy-efficiency tradeoff ⚫ OSA-HCIM reaches 5.33-5.79 TOPS/W with robust accuracy (ResNet18@CIFAR100, 65nm) ⚫ OSA-HCIM is the first saliency-aware CIM ⚫ OSA-HCIM is the first dynamic hybrid CIM
74. Compute-in-Memory (CIM) circuits ◼ Can Computing-in-Memory (CIM) eliminate the communication bottleneck? ◆ Half true, half false ◆ MAC = ΣIn[n]×W[n] ◆ The weights stay inside the memory ◆ The input data still moves around the chip ◆ → the only traffic that can be removed is the weight traffic
75. Compute-in-Memory (CIM) circuits ◼ Can Computing-in-Memory (CIM) eliminate the communication bottleneck? ◆ Half true, half false ◆ MAC = ΣIn[n]×W[n] ◆ The weights stay inside the memory ◆ The input data still moves around the chip ◆ → the only traffic that can be removed is the weight traffic ◆ An architecture that also reduces the input traffic is needed
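A back-of-the-envelope sketch of this point: if the weights stay resident in the CIM macro, the weight traffic disappears, but the input and output activations must still be moved for every layer, and for typical CNN layers they dominate. The layer shape and byte widths below are illustrative assumptions.

```python
def conv_layer_traffic(h, w, c_in, c_out, k=3, act_bytes=1, w_bytes=1):
    """Bytes moved for one conv layer: weights (loaded once) vs. input/output activations."""
    weight_bytes = k * k * c_in * c_out * w_bytes
    act_bytes_total = h * w * (c_in + c_out) * act_bytes
    return weight_bytes, act_bytes_total

# Example: an early ResNet-style layer, 56x56x64 -> 56x56x64 with 3x3 kernels
w_traffic, a_traffic = conv_layer_traffic(56, 56, 64, 64)
print(f"weight traffic: {w_traffic / 1e3:.0f} kB  (eliminated if weights stay in the CIM macro)")
print(f"activation traffic: {a_traffic / 1e3:.0f} kB  (inputs/outputs still move on/off the macro)")
```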