Will Analog Save the World? Applications and Challenges of Analog Computing

Kentaro Yoshioka, Assistant Professor, Keio University, Japan
Optimal Transport Workshop OT2023 (最適輸送研究会OT2023)

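Today's agenda
• Self-introduction
• From general-purpose processors to the age of accelerators
• Where DNN accelerators meet optimal transport
• Why, and when, analog computing?
• Analog computing research for DNNs
  − Analog computing in the charge domain
  − Analog computing in the time domain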

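Self-introduction
• 2014: Graduated from Keio University
• 2014-2021: Toshiba Corporation
• 2017-2018: Visiting Scholar, Stanford
• 2021-: Appointed Assistant Professor (専任講師) at Keio University; PI of the Yoshioka Lab
• Integrated circuits (LSI)
  − Takamaeda CREST (co-investigator), 2021-: analog CIM circuits
  − Moonshot 6 (collaborator), 2021-: analog circuits for quantum computers
• 3D sensing (LiDAR)
  − PRESTO (ICT), 2022-: LiDAR security for autonomous driving
Twitter: Kaggle: arutema47
Lab mascot: CSG-kun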

Self-introduction
• Research focus: integrated circuit design
[Figure: chip photo, 5 mm x 2.5 mm, with blocks labeled PLL + BGR, 22ch TIA for TDC, 22ch TIA for ADC, 22ch TDC, 11ch ADC (x2), Digital Circuits]

Moore's law: the evolution of integrated circuits
• World's first CPU, the Intel 4004: 2,250 transistors, 10 um process
• Moore's law: the number of integrated transistors doubles every two years
  → CPU transistor counts have grown roughly 10-million-fold
• Apple M2 Pro: 40 billion transistors, 5 nm CMOS process
Figure courtesy of K. Rupp, “42 years of Microprocessor Trend Data”, https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/ .

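Moore's law and the transistor
(Source) Nikkei Electronics, September 2017 issue
Transistor structure used in today's leading-edge LSIs. Dimensions are approaching physical limits, so performance gains are hitting a ceiling (one atom ≈ 0.1 nm).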

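Moore's law and its limits
The slowdown of Moore's law:
• CPU performance itself has saturated over the past 10 years
• CPU clock frequency has not changed either
Figure courtesy of K. Rupp, “42 years of Microprocessor Trend Data”, https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/ .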

Beyond Moore: domain specialization
• Integrated circuits are entering a new era
  – Moving away from the Moore era, which relied entirely on transistor performance
  – Into the Beyond Moore era
• From general-purpose computers (CPUs) to special-purpose computers
  – The rise of accelerator-type processors that speed up specific workloads
  – Graphics accelerators (GPUs)
    • Used for DNN training thanks to their generality
  – DNN accelerators
    • TPUs, NPUs, and many others
https://www.joc.or.jp/sports/athletics_combined.html
An Olympic-distance race totals 51.5 km (swim 1.5 km, bike 40 km, run 10 km). An Ironman-distance race totals about 226 km (swim 3.8 km, bike 180 km, run 42.195 km).

DNN accelerator example
• Architecture specialized for parallel computation
  – Many small ALUs are laid out to specialize in parallel workloads (image processing, etc.)
  – General-purpose features are dropped (it cannot run Windows)
  – The TPU does not even have a cache
[Figure: CPU core and TPU core]

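Key research challenges for accelerators
• ① Minimizing on-chip data movement
  – (This is the topic closest to optimal transport, but it is not today's main subject, sorry)
  – Actively studied in the computer architecture field as dataflow and scheduling problems
⇨ How should data be processed when it does not fit in on-chip memory or the compute units? Data partitioning, channel partitioning, layer partitioning, etc.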
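As a rough illustration of the data-partitioning question above, the sketch below computes a matrix product one tile at a time, as if only a few small tiles fit in on-chip memory. The function name `tiled_matmul` and the `tile` parameter are made up for this example; they stand in for a hypothetical buffer size and are not from the talk.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Compute a @ b one (tile x tile) block at a time.

    `tile` stands in for the capacity of a hypothetical on-chip buffer:
    only one tile of A, one of B, and one partial-sum tile of C are
    touched in each inner step.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # rows of the output
        for j in range(0, n, tile):      # columns of the output
            for p in range(0, k, tile):  # reduction dimension
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c
```

With the reduction loop innermost, each output tile stays in place while input tiles stream past it, which is one example of an output-stationary schedule. Reordering the loops changes which operand is reused and therefore how much data has to move.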

Key research challenges for accelerators
• ② Reducing the power of arithmetic circuits
  – Digital arithmetic circuits are already highly optimized
  – Further power reduction is difficult
  – Extreme option: analog computing!?
  – The arithmetic precision that DNNs require is low (INT2-INT8)
  – At low precision, analog computing can reduce power

ResNet50 ImageNet top-1 accuracy, weight + activation quantized network with PACT:
  Full (FP32)   INT2    INT3    INT4    INT5
  0.769         0.722   0.753   0.765   0.767
J. Choi, “PACT: Parameterized Clipping Activation for Quantized Neural Networks”, arXiv:1805.06085
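To make the low-precision point concrete, here is a minimal sketch of PACT-style activation quantization: clip to [0, alpha], then round uniformly to k bits. In PACT the clipping level alpha is learned during training; here it is simply passed in, and the function name `pact_quantize` is chosen for this example only.

```python
import numpy as np

def pact_quantize(x, alpha, k):
    """Clip activations to [0, alpha], then quantize uniformly to k bits."""
    levels = 2 ** k - 1                 # number of quantization steps
    y = np.clip(x, 0.0, alpha)          # clipping (alpha is learned in PACT, fixed here)
    return np.round(y * levels / alpha) * alpha / levels

# A 3-bit quantizer leaves at most 2**3 = 8 distinct activation values.
x = np.linspace(-1.0, 2.0, 1000)
print(np.unique(pact_quantize(x, alpha=1.0, k=3)).size)
```

The accuracy figures in the table above come from networks trained with quantization in the loop; applying a quantizer like this to an already trained FP32 network would generally not reproduce them.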

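Digital circuits vs. analog circuits
• Electric circuits ⇔ electronic circuits
  − Electric circuits: passive elements such as resistors, capacitors, and inductors
  − Electronic circuits: active elements, starting with transistors
    • Can amplify and store information
[Figure: circuit between VDD and GND with input (A) and output (X); signal levels High (VDD) and Low (GND)]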

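Digital circuits vs. analog circuits
• Digital circuits
  − Compute by propagating 1/0 signals through gates (logic circuits)
  − Pros: unaffected by noise; high-precision computation is possible (e.g. FP128)
  − Cons: little potential left for lowering power
• Analog circuits
  − Handle continuous values such as 0.1, 0.3, ...
  − Pros: higher efficiency by exploiting continuous values
  − Cons: sensitive to noise and poorly suited to high-precision computation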

When to analog?
• When is analog computation efficient?
  − At high precision (>9-10b), energy increases exponentially due to kT/C noise
  − At binary precision, digital is already efficient, so analog offers little advantage
  − The sweet spot is INT3-6, where analog is not limited by noise
  − Ideally, an analog MAC's energy increases linearly in this region
[Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.
[Figure: regions annotated Binary, Sweet spot INT3~6, and ~9-10b]
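The exponential growth at high precision can be seen from a simplified textbook-style bound, sketched below. It assumes the sampling capacitor is sized so that its kT/C noise power equals the quantization noise of an ideal B-bit quantizer. This is an illustrative model, not the one used in the cited reference, and `min_sampling_energy` is a name invented for this example.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K

def min_sampling_energy(bits, temperature=300.0):
    """Energy C * V_FS**2 of the smallest capacitor whose kT/C noise
    equals the quantization noise of a `bits`-bit quantizer.

    Simplified model: kT/C = (V_FS / 2**bits)**2 / 12, so
    C * V_FS**2 = 12 * k * T * 4**bits (independent of V_FS).
    """
    return 12.0 * K_BOLTZMANN * temperature * 4.0 ** bits

for b in (4, 6, 8, 10, 12):
    print(f"{b:2d} bits: {min_sampling_energy(b):.3e} J")
```

Under this assumption every extra bit costs four times the energy once thermal noise is the limit. At lower precision, practical circuits are usually bounded by other factors such as minimum capacitor size, which is one way to read the flatter region on the slide.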

How to analog?
• We cover multi-bit analog computation methods that can reach the INT3-6 sweet spot:
  − Charge-based computing
    • Aiming to replace the multiply-and-accumulate (MAC) circuit
[Figure: W[N], IN[N], +]

Charge-based computing
• “Multiply” is done digitally, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
  − Weight memory and processing can be integrated as “in-memory computing”
• Computation in the charge domain (Q = ΣCV)
  − Adds a 2000-element vector in a single cycle
  − Needs fewer circuit elements than a digital implementation, which lowers power
[Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
[Figure: inputs [1:N]; W[0], IN[0] ... W[N], IN[N]; accumulate via charge; ADC]
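A behavioral sketch of the idea above, under simplifying assumptions: operands are 0/1 bits, the 1-bit multiply is an AND gate, every unit capacitor is identical, and the ADC is ideal. The cited design may encode operands differently (for example as ±1 with XNOR); the function name and parameters here are illustrative only.

```python
import numpy as np

def charge_domain_binary_mac(w_bits, in_bits, vdd=1.0, adc_bits=8):
    """Behavioral model: digital 1-bit multiply, analog accumulate, ADC readout.

    w_bits, in_bits: arrays of 0/1 integers of equal length N.
    Each unit capacitor is charged to vdd if its 1-bit product is 1.
    Shorting all N equal capacitors together shares the charge, so the
    resulting voltage is the average of the products (Q = sum(C * V)).
    """
    w_bits = np.asarray(w_bits)
    in_bits = np.asarray(in_bits)
    products = w_bits & in_bits                  # "multiply" in digital (AND gate)
    v_shared = vdd * products.mean()             # charge sharing across equal capacitors
    code = int(np.round(v_shared / vdd * (2 ** adc_bits - 1)))  # ideal ADC
    return v_shared, code

rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=2000)
x = rng.integers(0, 2, size=2000)
v, code = charge_domain_binary_mac(w, x)
print(v * w.size, int((w & x).sum()), code)      # analog sum, exact sum, ADC code
```

The whole 2000-element accumulation happens in the single charge-sharing step; the digital side only sees the ADC code.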

Multi-bit extension
• How can we extend this to multi-bit MACs?
  − Binary computation can be extended to arbitrary precision by “bit-serial” processing
  − A 4b x 4b multiplication is broken up into 16 binary multiply-and-adds
  − The bit-serial operation is then vectorized

     1010
   x 0101
   ------
     1010
    0000
   1010
  0000
  -------
   110010

C. Eckert, “Neural cache: Bit-serial in-cache acceleration of deep neural networks”, ISCA 2018.
[Ref] H. Jia, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing”, JSSC 2020.
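The decomposition above can be sketched in a few lines. `bit_serial_multiply` reproduces the 1010 x 0101 = 110010 example using only 1-bit ANDs, shifts, and adds. `bit_serial_dot` shows the vectorized form, where the sum over the vector for each bit pair is the part an analog column would compute. Both function names are made up for this illustration and assume unsigned operands.

```python
def bit_serial_multiply(a, b, bits=4):
    """Multiply two unsigned `bits`-bit integers using only 1-bit ANDs,
    shifts, and adds (bits * bits binary products in total)."""
    total = 0
    for i in range(bits):                 # bit i of the multiplier b
        for j in range(bits):             # bit j of the multiplicand a
            a_j = (a >> j) & 1
            b_i = (b >> i) & 1
            total += (a_j & b_i) << (i + j)   # binary product weighted by 2**(i+j)
    return total

def bit_serial_dot(ws, xs, bits=4):
    """Vectorized form: for each bit pair (i, j), sum the binary products
    over the whole vector first, then apply the 2**(i+j) weight digitally."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            column_sum = sum(((w >> j) & 1) & ((x >> i) & 1) for w, x in zip(ws, xs))
            total += column_sum << (i + j)
    return total

print(bin(bit_serial_multiply(0b1010, 0b0101)))  # matches the worked example: 0b110010
```

Note that the per-bit-pair `column_sum` is a plain count of ones, so a binary MAC array can be reused unchanged for any operand width. Only the number of passes and the digital shift-and-add change.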

Pros/Cons of charge-based computing
• Pros:
  − Realizes an extremely small “in-memory computing” cell
  − Amortizes ADC cost by increasing the column size to >2000
• Cons:
  − Arithmetic precision is limited by ADC resolution
    • Tradeoff between precision and readout energy
[Figure: chip layout with a 1088x78 AR-CIM array and blocks labeled IO/Register circuits, CTRL, WL/IN, ADC, Output, Misc., Register wiring; dimension labels 1270 um, 320 um, 60 um]
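A small numeric sketch of the ADC-resolution limit, assuming an ideal uniform ADC spanning the full range of a column of binary products. The helper names are invented for this example and the numbers are not measurements from the chip.

```python
import math

def lossless_adc_bits(column_size):
    """Bits needed to distinguish every possible sum of `column_size`
    binary products (the sum ranges over 0..column_size)."""
    return math.ceil(math.log2(column_size + 1))

def adc_readout(true_sum, column_size, adc_bits):
    """Ideal uniform ADC over [0, column_size]; returns the reconstructed sum."""
    levels = 2 ** adc_bits - 1
    code = round(true_sum / column_size * levels)
    return code * column_size / levels

print(lossless_adc_bits(2000))   # bits for an exact readout of a 2000-row column
worst = max(abs(adc_readout(s, 2000, 8) - s) for s in range(2001))
print(worst)                     # worst-case error of an 8-bit readout, in units of one product
```

Longer columns amortize the ADC over more products, but an exact readout then needs more ADC bits. Using fewer bits saves readout energy at the price of the quantization error computed above.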

Time/Phase domain computing
• Target: low area and 8-bit MAC resolution
  − Realize low-cost analog computation for a wide range of applications
• Time domain approach [Miyashita, ASSCC2017]
  → Accumulates pulse length
  → Multiple DTCs required (DTC: digital-to-time converter)
• Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
  → Accumulates phase
  → Only a single DTC + gated ring oscillator (GRO)
  → Requires digital cells only; small area and scalable
[Figure: time domain approach with four DTCs; phase domain approach with a DTC and a GRO, labeled IN, Weight, Output]

PMAC: Phase domain MAC

                   Phase Domain   Digital MAC
  Resolution       1~8 bit        1~64 bit
  Norm. Area/Bit   1.2            1
  Norm. Power      0.125          1
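The phase-accumulation idea can be illustrated with a very rough behavioral model. It assumes the gated ring oscillator advances by one stage per `stage_delay` while enabled and holds its phase while gated off. The mapping from operands to pulse width in `pulses_from_operands` is purely hypothetical and the real circuit's encoding may differ; none of these names come from the cited papers.

```python
def gro_phase_mac(pulse_widths, stage_delay=1.0):
    """Behavioral model of accumulation in a gated ring oscillator (GRO).

    pulse_widths: enable-pulse durations, one per MAC term, in the same
    time unit as stage_delay. While enabled, the phase advances by one
    stage every `stage_delay`; while gated off, the phase is held, so
    successive pulses add up. The readout counts whole stage transitions.
    """
    phase = 0.0
    for t in pulse_widths:
        phase += t / stage_delay     # phase carries over between pulses
    return int(phase)                # quantized readout (stage count)

def pulses_from_operands(inputs, weights, unit=1.0):
    """Hypothetical DTC mapping: pulse width proportional to input * weight
    for unsigned operands (an assumption made for this sketch)."""
    return [unit * x * w for x, w in zip(inputs, weights)]

print(gro_phase_mac(pulses_from_operands([3, 1, 2], [2, 5, 4])))  # 3*2 + 1*5 + 2*4
print(gro_phase_mac([0.6, 0.6]))  # sub-stage residues add up before quantization
```

Because the phase is held between pulses, one oscillator reused over time can accumulate every term, and quantization happens once at the final readout rather than once per term.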

Fabricated chip
[Figure: chip photo, 35 um x 35 um GRO MAC core, with blocks labeled DTC, GRO, Read out, Asyn. Timing circuit, Output circuits, LOGO]
[Figure: test setup with FPGA board (ARTY Artix 7); signals MAC input, Trigger, MAC output, Finish signal; PC to monitor outputs]

MNIST classification results (10000 test data):
                           32FP    8b PMAC
  Validation results [%]   98.2    98.1

MNIST DNN: input layer 784, hidden layers H1-H4 with 256 units each, output layer 10
*98.2% was the limit for fully-connected NNs

Pros/Cons of PMAC
• Pros:
  − Achieves high-accuracy MAC operation within the analog computation sweet spot (INT3-6)
  − Does not require a high-precision ADC
• Cons:
  − Only supports output-stationary dataflows
    • Cannot be adapted to in-memory architectures
  − Only proven with a single MAC circuit
    • Efficiency of an entire analog accelerator is unknown
  − Supports sequential operation only; it cannot compute in parallel, so throughput is lower than the charge-based approach

Future trends in analog computing: higher compute accuracy and Transformer support
• CNNs such as ResNet are not very sensitive to compute accuracy, but architectures such as the Transformer demand an order of magnitude higher compute accuracy
• Transformers pose a compute accuracy challenge. Both ADC resolution and noise must be addressed.
[Figure: CIFAR-10 accuracy vs. compute accuracy (CSNR) [dB]; labels: Conv. Analog CIMs [2-5], Transformer (ViT-tiny)]

[Figure: processing elements PE[0], PE[1], PE[2], ..., PE[N]; CCIM[0] to CCIM[999] and CDAC[0] to CDAC[999]; annotated “Area consuming”]

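Future trends in analog computing
• The software stack is missing
  − Hardware alone cannot run the algorithm (DNN)
  − A good software stack is needed
    • Compiler, scheduler, controller
  − (Who is going to build it?)
  − Hardware and software experts need to work together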

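Conclusions and remarks
• Accelerators are attracting attention for applications such as DNNs
• Analog computing has the potential to reduce the compute power of DNN accelerators
  − Covered charge-domain analog computing
• Future research topics
  − Higher accuracy, Transformer support
  − Software stack
    • Almost untouched; a major challenge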


Yoshioka Lab (Keio CSG)
