Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

Slide 1

Slide 1 text

Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges
Kentaro Yoshioka, Assistant Professor, Keio University, Japan
IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)

Slide 2

Slide 2 text

Outline
• Background
• When and why should we go analog?
• Charge-based computing
• Phase-based computing
RiSE (Rising Star Express) Forum, IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)

Slide 3

Slide 3 text

Background
Deep learning algorithms:
• Image processing (source: Waymo.com)
• Speech recognition (source: Amazon.com)
• Sensor data analysis (source: Apple.com)

Slide 4

Slide 4 text

Why Edge Computing?
• Edge-computing pros: privacy + energy
− DRAM to chip: ~200 pJ*
− Chip to cloud (via BLE): ~10 µJ*
*Energy per 16-bit data

Slide 5

Slide 5 text

Why Edge Computing?
Use case | Energy
Sending 400 samples to the cloud | 4 mJ
Sending just anomaly detection results (normal/anomaly) | 1 µJ
Edge computing is orders of magnitude lower in energy.
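The 4 mJ entry is consistent with the per-word transfer cost quoted on the previous slide. A minimal sketch of that arithmetic, assuming each sample is one 16-bit word sent over BLE at ~10 µJ (the function and constant names are illustrative and not from the talk):

```python
import math

# Approximate per-16-bit-word transfer energies quoted on the slides.
E_DRAM_TO_CHIP_J = 200e-12   # ~200 pJ
E_CHIP_TO_CLOUD_J = 10e-6    # ~10 uJ via BLE

def cloud_upload_energy(num_samples, energy_per_word=E_CHIP_TO_CLOUD_J):
    """Energy in joules to send num_samples words (assumption: one 16-bit word per sample)."""
    return num_samples * energy_per_word

raw_upload_j = cloud_upload_energy(400)   # raw data sent to the cloud
result_only_j = 1e-6                      # 1 uJ, taken directly from the slide
ratio = raw_upload_j / result_only_j

print(f"raw upload: {raw_upload_j * 1e3:.1f} mJ")
print(f"saving: {ratio:.0f}x (about {math.log10(ratio):.1f} orders of magnitude)")
```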

Slide 6

Slide 6 text

Study of digital DNN Accelerators
• Digital accelerators
− DaDianNao
− Eyeriss
− Edge-TPU
• Mainly focused on data reuse for high efficiency
− Minimize DRAM access energy
• Data-flow processing
• Systolic arrays
[Chen, 2016] [Chen, 2014]

Slide 7

Slide 7 text

Data-reuse Techniques
• Goal: Minimize off-chip access and maximize data reuse
− Can maximize reuse of “inputs”, “weights”, or “outputs”
− Systematic analysis shows that similar efficiency can be achieved regardless of the data-reuse strategy
(Figure panels: Maximize weight reuse; Maximize output reuse)

Slide 8

Slide 8 text

Data-reuse Techniques
• Goal: Minimize off-chip access and maximize data reuse
− Can maximize reuse of “inputs”, “weights”, or “outputs”
− Systematic analysis shows that similar efficiency can be achieved regardless of the data-reuse strategy
(Figure panels: Maximize weight reuse; Maximize output reuse)
[Ref] X. Yang, “Interstellar: Using Halide's scheduling language to analyze DNN accelerators”, ASPLOS 2020.

Slide 9

Slide 9 text

Analog to the rescue?
• Digital architectures are systematically optimized
− How can we go further?
• One extreme option: Analog computing
− Required DNN arithmetic precision is low (INT2~INT8)
− Analog computation can achieve higher efficiency, if not limited by noise
ResNet-50 ImageNet top-1 accuracy, weight + activation quantized network with PACT:
Precision | Full (FP32) | INT2 | INT3 | INT4 | INT5
Top-1 | 0.769 | 0.722 | 0.753 | 0.765 | 0.767
[Ref] J. Choi, “PACT: Parameterized Clipping Activation for Quantized Neural Networks”, arXiv:1805.06085.

Slide 10

Slide 10 text

When to analog?
• When is analog computation efficient?
− At high precision (>8b), energy increases exponentially due to kT/C noise
− At binary precision, digital is already efficient, so analog offers little advantage
(Figure labels: Binary; ~8b)
[Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.

Slide 11

Slide 11 text

When to analog?
• When is analog computation efficient?
− Sweet spot is INT3-6, where analog is not limited by noise
− Ideally, analog MAC energy increases linearly in this region
(Figure label: Sweet spot INT3~6)
[Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.

Slide 12

Slide 12 text

How to analog?
• In this talk, we cover multi-bit analog computation methods that can address the INT3-6 sweet spot:
− Charge-based computing
− Phase-based computing
• Aiming to replace the Multiply-and-Accumulate (MAC) circuit
(Figure: MAC unit multiplying W[N] by IN[N] and accumulating)

Slide 13

Slide 13 text

Charge-based computing
• “Multiply” is done in the digital domain, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
− Can integrate weight memory and processing as “in-memory computing”
(Figure labels: Inputs [1:N]; W[0] IN[0] ... W[N] IN[N]; accumulate via charge; N=2304; 8b SAR)
[Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.

Slide 14

Slide 14 text

Charge-based computing
• “Multiply” is done in the digital domain, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
− Can integrate weight memory and processing as “in-memory computing”
• 1. Process IN[i]*W[i]
• 2. Store outputs as charge
• 3. Column caps are shorted to realize analog accumulation
• 4. Readout by ADC
(Figure labels: Inputs [1:N]; W[0] IN[0] ... W[N] IN[N]; accumulate via charge; N=2304; 8b SAR)
[Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
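The four steps above can be sketched as an idealized behavioral model. This is not the circuit from the cited paper: it assumes 0/1 inputs and weights multiplied with an AND gate, equal unit capacitors, a noiseless column, and an ideal ADC. The function name and parameters are hypothetical.

```python
import random

def charge_domain_binary_mac(inputs, weights, adc_bits=8, vdd=1.0):
    """Idealized column: digital 1-bit multiply, analog accumulate, ADC readout."""
    assert len(inputs) == len(weights)
    n = len(inputs)
    # 1. Binary multiply in the digital domain (AND on 0/1 values is an
    #    assumption; a real design may use XNOR on +/-1 values instead).
    products = [i & w for i, w in zip(inputs, weights)]
    # 2. Store each product as charge on a unit capacitor (1 -> VDD, 0 -> 0 V).
    cap_voltages = [p * vdd for p in products]
    # 3. Shorting equal capacitors shares charge: column voltage is the mean.
    v_column = sum(cap_voltages) / n
    # 4. Read out with an ideal ADC of adc_bits resolution.
    levels = 2 ** adc_bits - 1
    code = round(v_column / vdd * levels)
    # Map the ADC code back to an estimate of the accumulated sum.
    estimate = code / levels * n
    return code, estimate

random.seed(0)
N = 2304  # column length shown on the slide
ins = [random.randint(0, 1) for _ in range(N)]
wts = [random.randint(0, 1) for _ in range(N)]
exact = sum(i & w for i, w in zip(ins, wts))
code, estimate = charge_domain_binary_mac(ins, wts)
print(f"exact sum = {exact}, ADC code = {code}, estimate = {estimate:.1f}")
```

In this model the estimate can differ from the exact sum by several counts, because an 8-bit readout cannot resolve every possible sum of a 2304-element column.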

Slide 15

Slide 15 text

Multi-bit extension
• How can we extend to multi-bit MACs?
− Binary computation can extend to arbitrary precision by “bit-serial” processing
Example: 1010 × 0101 = 1010 + (0000 << 1) + (1010 << 2) + (0000 << 3) = 110010
A 4b × 4b multiplication is broken up into 16 binary multiply-and-adds.
[Ref] C. Eckert, “Neural cache: Bit-serial in-cache acceleration of deep neural networks”, ISCA 2018.

Slide 16

Slide 16 text

Multi-bit extension
• How can we extend this to multi-bit MACs?
− Binary computation can extend to arbitrary precision by “bit-serial” processing
Example: 1010 × 0101 = 1010 + (0000 << 1) + (1010 << 2) + (0000 << 3) = 110010
Vectorize the bit-serial operation.
[Ref] H. Jia, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing”, JSSC 2020.
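A small sketch of bit-serial processing for unsigned operands, reproducing the 1010 × 0101 example and then extending it to a vector dot product. The function names are illustrative; the inner binary MAC over the vector is the part an analog column would compute, here modeled with plain integer arithmetic.

```python
def bit_serial_multiply(a, b, bits=4):
    """Multiply two unsigned integers using only 1-bit ANDs, shifts, and adds."""
    total = 0
    binary_ops = 0
    for i in range(bits):            # bit i of b
        for j in range(bits):        # bit j of a
            partial = ((a >> j) & 1) & ((b >> i) & 1)   # 1-bit multiply
            total += partial << (i + j)                  # weighted accumulate
            binary_ops += 1
    return total, binary_ops

def bit_serial_dot(xs, ws, bits=4):
    """Dot product of unsigned vectors built from bits*bits binary MACs."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            # One binary MAC across the whole vector for bit planes (j, i).
            binary_mac = sum(((x >> j) & 1) & ((w >> i) & 1)
                             for x, w in zip(xs, ws))
            total += binary_mac << (i + j)
    return total

product, ops = bit_serial_multiply(0b1010, 0b0101)
print(bin(product), ops)
print(bit_serial_dot([10, 3, 7], [5, 2, 15]))
```

The shift-and-add of binary MAC results stays in the digital domain, so precision can be scaled by running more bit-plane passes on the same binary hardware.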

Slide 17

Slide 17 text

Pros/Cons of Charge-based computing
• Pros:
− Realizes an extremely small “in-memory computing” cell
− Amortizes ADC cost by increasing column size to >2000
@1bit, peak energy efficiency: 192 TOPS/W
@4bit, estimated efficiency ≈ 12 TOPS/W
@8bit, estimated efficiency ≈ 3 TOPS/W
• Cons:
− Arithmetic precision limited by ADC resolution
• 13-bit ADC required for a 2304 array
(Figure axis: SQNR [dB])

Slide 18

Slide 18 text

Pros/Cons of Charge-based computing
• Pros:
− Realizes an extremely small “in-memory computing” cell
− Amortizes ADC cost by increasing column size to >2000
@1bit, peak energy efficiency: 192 TOPS/W
@4bit, estimated efficiency ≈ 12 TOPS/W
@8bit, estimated efficiency ≈ 3 TOPS/W
• Cons:
− Arithmetic precision limited by ADC resolution
• 13-bit ADC required for a 2304 array
• Tradeoff between precision and readout energy
(Figure axis: SQNR [dB])
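The 4-bit and 8-bit estimates match what one gets by assuming that a b-bit × b-bit MAC costs b² binary MACs under bit-serial processing. That scaling rule is inferred from the numbers on the slide rather than stated in the talk, and the function name is illustrative.

```python
def bit_serial_efficiency(peak_1b_tops_per_w, bits):
    """Estimated TOPS/W at a given precision, assuming bits*bits binary MACs per MAC."""
    return peak_1b_tops_per_w / (bits * bits)

PEAK_1B = 192  # TOPS/W at 1 bit, from the slide
for b in (1, 4, 8):
    print(f"@{b}bit: {bit_serial_efficiency(PEAK_1B, b):g} TOPS/W")
```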

Slide 19

Slide 19 text

Time/Phase domain Computing
• Target low-area and 8-bit MAC resolution
− Realize analog computation for wide application with low cost
Time domain approach [Miyashita, ASSCC2017]
→ Accumulates pulse length
→ Multiple DTCs required
DTC: Digital-to-time converter

Slide 20

Slide 20 text

Time/Phase domain Computing
• Target low-area and 8-bit MAC resolution
− Realize analog computation for wide application with low cost
Time domain approach [Miyashita, ASSCC2017]
→ Accumulates pulse length
→ Multiple DTCs required
DTC: Digital-to-time converter
Proposed phase domain approach [Yoshioka, VLSI2018] [Toyama, ASSCC2018]
→ Accumulates phase
→ Only a single DTC + Gated Ring Oscillator (GRO)
Requires digital cells only; small area and scalable
(Figure labels: IN; Weight; DTC; Gated Ring Oscillator (GRO); Output)

Slide 21

Slide 21 text

PMAC: Phase domain MAC
• Target low-area and 8-bit MAC resolution
− Realize analog computation for wide application with low cost
Proposed phase domain approach [Yoshioka, VLSI2018] [Toyama, ASSCC2018]
→ Accumulates phase
→ Only a single DTC + Gated Ring Oscillator (GRO)
Requires digital cells only; small area and scalable
Metric | Phase Domain | Digital MAC
Resolution | 1~8 bit | 1~64 bit
Norm. Area/Bit | 1.2 | 1
Norm. Power | 0.125 | 1
(Figure labels: IN; Weight; DTC; Gated Ring Oscillator (GRO); Output)

Slide 22

Slide 22 text

PhaseMAC: Operation
1. DTC outputs a pulse corresponding to Din (pulse length: Din × tinv)
(Figure: Din → DTC → Gated Ring Oscillator (GRO), with W applied to the GRO, followed by a counter. Timing example, Seq. 1: Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; GRO phase axis up to 2π; phase saved by gating.)

Slide 23

Slide 23 text

PhaseMAC: Operation
2. GRO phase advances while DTC pulse is high
(Figure: Din → DTC → Gated Ring Oscillator (GRO), with W applied to the GRO, followed by a counter. Timing example, Seq. 1 and Seq. 2: Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; GRO phase axis up to 2π.)

Slide 24

Slide 24 text

PhaseMAC: Operation
2. GRO phase advances while DTC pulse is high
3. Phase is saved by gating → Accumulation realized
Phase = Prev. Phase + Din × W × 2π/10
(Figure: Din → DTC → Gated Ring Oscillator (GRO), with W applied to the GRO, followed by a counter. Timing example, Seq. 1 and Seq. 2: Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; GRO phase axis up to 2π; phase saved by gating.)

Slide 25

Slide 25 text

PhaseMAC: Operation
Steps 1~3 are repeated for the number of MACs. When the phase reaches 2π, this is detected by the counter.
Phase = Prev. Phase + Din × W × 2π/10
(Figure: Din → DTC → Gated Ring Oscillator (GRO), with W applied to the GRO, followed by a counter. Timing example, Seq. 1 and Seq. 2: Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; GRO phase axis up to 2π; phase saved by gating.)

Slide 26

Slide 26 text

PhaseMAC: Operation
During readout, the GRO phase and the counter value are summed with the proper weightings.
Readout logic: OUT = counter (MSB) × 10 + phase-to-digital of the GRO phase (LSB); in the example, OUT = 15.
(Figure: Din → DTC → Gated Ring Oscillator (GRO), with W applied to the GRO, followed by a counter and readout logic. Timing example, Seq. 1 and Seq. 2: Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”.)
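The PhaseMAC sequence can be summarized with a small behavioral model. It is only a sketch: phase is tracked in units where 10 corresponds to 2π (matching the ×10 readout weighting in the example), the GRO rate is taken to be proportional to W, and noise, mismatch, and phase quantization are ignored. The function name and the `units_per_cycle` parameter are hypothetical.

```python
def phase_mac(dins, weights, units_per_cycle=10):
    """Behavioral PhaseMAC model: phase accumulates Din*W, counter records 2*pi wraps."""
    phase = 0.0    # GRO phase, in units where units_per_cycle == 2*pi
    counter = 0    # number of completed 2*pi rotations
    for din, w in zip(dins, weights):
        # Steps 1-2: the DTC pulse lasts Din unit delays, and the GRO advances
        # at a rate set by W while the pulse is high.
        phase += din * w
        # Step 3: gating holds the phase; full rotations go to the counter.
        wraps = int(phase // units_per_cycle)
        counter += wraps
        phase -= wraps * units_per_cycle
    # Readout: counter is the MSB part, residual GRO phase is the LSB part.
    out = counter * units_per_cycle + phase
    return counter, phase, out

# Example values from the slides: Din = 3, 24 and W = 1, 0.5
counter, phase, out = phase_mac([3, 24], [1, 0.5])
print(counter, phase, out)
```

With the slide's inputs, the model ends with the counter at 1 and a residual phase of 5 units, so the readout is 1 × 10 + 5 = 15, in line with the OUT = 15 shown on the slide.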

Slide 27

Slide 27 text

GRO Circuit Design
• Positive and negative accumulators exist (same circuitry).
• Linear frequency tuning is achieved by choosing the number of activated inverters. [A.V. Rylyakov, ISSCC 2007]
(Figure labels: DTC; positive accumulator with LSB GRO W[3:0] and MSB GRO W[6:4]; negative accumulator; frequency-configurable GRO with W[3:0], DTCOUT, RSTB, and stages GRO[3], GRO[2], GRO[1], GRO[0])

Slide 28

Slide 28 text

Fabricated chip
(Chip photo labels: 35 µm × 35 µm GRO MAC core; DTC; GRO; readout; asynchronous timing circuit; output circuits)
(Test setup: FPGA board (ARTY Artix 7) provides the MAC input and trigger and receives the MAC output and finish signal; a PC monitors the outputs)
MNIST DNN: input layer 784, hidden layers H1~H4 of 256 each, output layer 10
MNIST classification results (10000 test data):
Precision | FP32 | 8b PMAC
Validation results [%] | 98.2 | 98.1
*98.2% was the limit for fully-connected NNs

Slide 29

Slide 29 text

Comparison Results
Metric | PMAC | Time (JSSC 2017) | Charge (ISSCC 2018) | Digital @28nm CMOS
Resolution [bit] | 8 | 1 | 1 | 8
MAC Area [um2] | 1200 | 13000 | 4600 | 900
MAC Area/Bit [um2] | 150 | 13000 | 4600 | 112
MNIST test accuracy | 98.1% | 98.4% | N.A. | 98.2%
MAC rate | 780 MHz | N.A. | 10 MHz | 800 MHz
Efficiency [TOPS/W] | 14 | 77 | 532 | 2.8
Efficiency [TOPS/W*Bit] | 112 | 77 | 532 | 22.4
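The two per-bit rows appear to be derived from the raw rows: area divided by resolution, and efficiency multiplied by resolution. The sketch below recomputes them under that assumption; the dictionary layout and function name are illustrative, and the numbers are copied from the table.

```python
# Raw values from the comparison table (resolution in bits, area in um^2).
designs = {
    "PMAC":    {"bits": 8, "area_um2": 1200,  "tops_per_w": 14},
    "Time":    {"bits": 1, "area_um2": 13000, "tops_per_w": 77},
    "Charge":  {"bits": 1, "area_um2": 4600,  "tops_per_w": 532},
    "Digital": {"bits": 8, "area_um2": 900,   "tops_per_w": 2.8},
}

def per_bit_metrics(d):
    """Return (area per bit, efficiency scaled by bit width) for one design."""
    area_per_bit = d["area_um2"] / d["bits"]
    efficiency_times_bits = d["tops_per_w"] * d["bits"]
    return area_per_bit, efficiency_times_bits

for name, d in designs.items():
    area_per_bit, eff_bits = per_bit_metrics(d)
    print(f"{name}: {area_per_bit:g} um2/bit, {eff_bits:g} TOPS/W*Bit")
```

Under this reading, the digital design's area per bit comes out as 112.5 um2, which the table lists as 112.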

Slide 30

Slide 30 text

Pros/Cons of PMAC
• Pros:
− Achieves high-accuracy MAC operation within the analog computation sweet spot (INT3-6)
− Does not require a high-precision ADC
• Cons:
− Only supports output-stationary dataflows
• Cannot be adapted to in-memory architectures
− Only proven with a single MAC circuit
• Efficiency of an entire analog accelerator is unknown

Slide 31

Slide 31 text

Conclusions and remarks
• Analog computing can be superior at INT3-6 precision. In this talk, we covered analog computing methods handling multi-bit operations:
− Charge-based computing
− Phase-based computing
• While proven to be more power-efficient than digital, challenges remain
− Flexibility, reliability, noise issues, etc.
− We need to get together with the software guys!
• Framework integration
• Open-source