Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

Kentaro Yoshioka
Assistant Professor, Keio University, Japan
IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)

Outline
• Background
• When and why should we go analog?
• Charge-based computing
• Phase-based computing
RiSE (Rising Star Express) Forum

Background
Deep learning algorithms:
• Image processing (image: Waymo.com)
• Speech recognition (image: Amazon.com)
• Sensor data analysis (image: Apple.com)

Why Edge Computing?
• Edge-computing pros:
  − Privacy + energy
  − DRAM-to-chip transfer: ~200pJ*
  − Chip-to-cloud transfer (via BLE): ~10uJ!*
*Energy per 16-bit data

Why Edge Computing?
Edge computing use case                         Energy
Sending 400 samples to the cloud                4mJ
Sending just anomaly detection results          1uJ
  (Normal/Anomaly)
Sending only the result is orders of magnitude lower in energy.
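The two energy figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes one 16-bit word per sample (the slides do not state the sample width); all constant and function names are hypothetical.

```python
# All energies are kept in picojoules so the arithmetic stays in exact integers.
PJ_PER_UJ = 1_000_000
PJ_PER_MJ = 1_000_000_000

E_DRAM_TO_CHIP_PJ = 200               # ~200pJ per 16-bit word (slide figure)
E_CHIP_TO_CLOUD_PJ = 10 * PJ_PER_UJ   # ~10uJ per 16-bit word via BLE (slide figure)
E_RESULT_ONLY_PJ = 1 * PJ_PER_UJ      # 1uJ to send only the anomaly result (slide figure)


def raw_upload_energy_pj(num_samples: int) -> int:
    """Energy to ship every sample to the cloud.

    Assumption: each sample is a single 16-bit word.
    """
    return num_samples * E_CHIP_TO_CLOUD_PJ


raw_pj = raw_upload_energy_pj(400)
print(raw_pj / PJ_PER_MJ, "mJ for 400 raw samples")
print(raw_pj // E_RESULT_ONLY_PJ, "x more than sending only the result")
print(E_CHIP_TO_CLOUD_PJ // E_DRAM_TO_CHIP_PJ, "x BLE transfer vs. DRAM access per word")
```

Under that assumption, 400 samples at ~10uJ each reproduce the 4mJ entry in the table.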

Study of digital DNN Accelerators
• Digital accelerators
  − DaDianNao
  − Eyeriss
  − Edge-TPU
• Mainly focused on data reuse for high efficiency
  − Minimize DRAM access energy
• Data-flow processing
• Systolic arrays
[Chen, 2016] [Chen, 2014]


Data-reuse Techniques
• Goal: minimize off-chip access and maximize data reuse
  − Can maximize reuse of “inputs”, “weights”, or “outputs”
  − Systematic analysis shows that similar efficiency can be achieved regardless of the data-reuse strategy
Figure panels: maximize weight reuse; maximize output reuse
[Ref] X. Yang, “Interstellar: Using Halide's scheduling language to analyze DNN accelerators” ASPLOS 2020.

Analog to the rescue?
• Digital architectures are systematically optimized
  − How can we go further?
• One extreme option: analog computing
  − Required DNN arithmetic precision is low (INT2~INT8)
  − Analog computation can achieve higher efficiency, if not limited by noise

ResNet-50 ImageNet top-1 accuracy, weight+activation quantized network with PACT:
Precision    Full (FP32)   INT2    INT3    INT4    INT5
Top-1        0.769         0.722   0.753   0.765   0.767
[Ref] J. Choi, “PACT: Parameterized Clipping Activation for Quantized Neural Networks” arXiv:1805.06085
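To make "INT2~INT5" concrete, here is a minimal sketch of a clipped uniform activation quantizer in the style described by the cited PACT paper: clip to [0, alpha], then round onto 2^k - 1 equal steps. In PACT the clipping level alpha is learned during training; here it is a fixed, hypothetical value, and the function name is also hypothetical.

```python
def pact_quantize(x: float, alpha: float, bits: int) -> float:
    """Clip x to [0, alpha] and round it onto a uniform grid of 2**bits levels."""
    steps = (1 << bits) - 1                 # number of intervals between 0 and alpha
    clipped = min(max(x, 0.0), alpha)       # PACT-style clipping activation
    return round(clipped * steps / alpha) * alpha / steps


ALPHA = 1.0  # hypothetical clipping level; learned per layer in the actual method

for bits in (2, 3, 4, 5):
    # Sweep inputs from -0.2 to 1.2 and collect the distinct quantized outputs.
    grid = sorted({pact_quantize(i / 1000, ALPHA, bits) for i in range(-200, 1201)})
    print(f"INT{bits}: {len(grid)} activation levels")
```

An INT2 activation has only four representable values, which is why such networks are candidates for low-precision analog MACs.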

When to analog?
• When is analog computation efficient?
  − At high precision (>8b), energy increases exponentially due to kT/C noise
  − Digital is already efficient at binary precision, so analog offers little advantage there
Figure axis range: binary to ~8b
[Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference” TVLSI 2021.
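The exponential growth can be illustrated with the textbook kT/C sizing rule: if the sampled thermal noise power kT/C must not exceed the quantization noise power LSB^2/12, the required capacitance (and with it the charging energy, roughly C·V^2) grows 4x per extra bit. This is a minimal sketch under assumed values (1 V full scale, 300 K) and is not taken from the cited paper; the function name is hypothetical.

```python
BOLTZMANN = 1.380649e-23  # J/K


def min_sampling_cap(bits: int, v_fs: float = 1.0, temp_k: float = 300.0) -> float:
    """Capacitance at which kT/C noise power equals quantization noise power LSB^2/12.

    Assumptions: full-scale voltage v_fs and temperature temp_k are illustrative.
    """
    lsb = v_fs / (1 << bits)
    return 12.0 * BOLTZMANN * temp_k / (lsb * lsb)


for bits in (4, 6, 8, 10, 12):
    cap = min_sampling_cap(bits)
    energy = cap * 1.0 ** 2  # C * V_FS^2, a rough proxy for the energy to charge the cap
    print(f"{bits:2d} bit: C >= {cap * 1e15:10.3f} fF, C*V^2 ~ {energy * 1e15:10.3f} fJ")
```

At low resolution the required capacitor is far below what is practical to build, so noise is not the limiting factor there.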

When to analog?
• When is analog computation efficient?
  − Sweet spot is INT3-6, where analog is not limited by noise
  − Ideally, analog MAC energy increases linearly in this region
[Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference” TVLSI 2021.

How to analog?
• In this talk, we cover multi-bit analog computation methods that can cover the INT3-6 sweet spot:
  − Charge-based computing
  − Phase-based computing
• Aiming to replace the Multiply-and-Accumulate (MAC) circuit
Figure: MAC circuit multiplying W[N] by IN[N] and accumulating (+)


Charge-based computing
• “Multiply” is done digitally, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
  − Can integrate weight memory and processing as “in-memory computing”
• Operation:
  1. Process IN[i]*W[i]
  2. Store outputs as charge
  3. Column caps are shorted to realize analog accumulation
  4. Readout by ADC
Figure: inputs [1:N], cells W[0]·IN[0] ... W[N]·IN[N], accumulation via charge, 8b SAR ADC, N=2304
[Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
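The four steps above can be sketched as an idealized behavioural model. Assumptions, none of which come from the cited design: the 1-bit multiply is modelled as an AND, all column capacitors are equal (so shorting them averages their voltages), the circuit is noiseless, and the ADC is an ideal 8-bit quantizer. Function names are hypothetical.

```python
N = 2304       # column length from the slide
ADC_BITS = 8   # 8b SAR ADC from the slide


def charge_domain_mac(inputs, weights, adc_bits=ADC_BITS):
    """Idealized binary MAC: digital multiply, charge-sharing accumulate, ADC readout."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Steps 1-2: 1-bit multiply (modelled as AND); each result charges one unit capacitor.
    products = [i & w for i, w in zip(inputs, weights)]
    # Step 3: shorting equal capacitors settles to the average voltage (normalized 0..1).
    v_shared = sum(products) / len(products)
    # Step 4: ideal ADC quantizes the shared voltage.
    full_scale = (1 << adc_bits) - 1
    adc_code = round(v_shared * full_scale)
    return adc_code, sum(products)


def column(num_ones, n=N):
    """Helper: a length-n binary vector with num_ones leading ones."""
    return [1] * num_ones + [0] * (n - num_ones)


weights = [1] * N
for ones in (0, 100, 101, N):
    code, exact = charge_domain_mac(column(ones), weights)
    print(f"exact sum {exact:4d} -> ADC code {code}")
```

In this model the exact sums 100 and 101 produce the same ADC code, which illustrates why readout resolution bounds the arithmetic precision of a long column.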

Multi-bit extension
• How can we extend to multi-bit MACs?
  − Binary computation can extend to arbitrary precision by “bit-serial” processing

Example: a 4b x 4b multiply broken up into 16 binary multiply-and-adds
      1010
    x 0101
    ------
      1010
     0000
    1010
   0000
   -------
    110010
[Ref] C. Eckert, “Neural Cache: Bit-serial in-cache acceleration of deep neural networks” ISCA 2018.

Multi-bit extension
• How can we extend this to multi-bit MACs?
  − Binary computation can extend to arbitrary precision by “bit-serial” processing
  − Vectorize the bit-serial operation (same 1010 x 0101 = 110010 example, applied across the array)
[Ref] H. Jia, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing”, JSSC 2020.
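A minimal sketch of vectorized bit-serial processing, assuming unsigned operands: every pair of bit planes goes through a binary MAC (AND plus population count over the vector), and the partial sums are recombined with shifts. For 4b x 4b this takes 16 binary passes. The function name is hypothetical, and signed formats would need extra handling that is not shown.

```python
def bit_serial_dot(inputs, weights, bits=4):
    """Dot product of unsigned `bits`-wide vectors using only binary MACs and shifts."""
    total = 0
    for in_bit in range(bits):
        # Bit plane of the inputs: one binary value per vector element.
        in_plane = [(x >> in_bit) & 1 for x in inputs]
        for w_bit in range(bits):
            w_plane = [(w >> w_bit) & 1 for w in weights]
            # Binary MAC over the whole vector (what a 1-bit analog column would compute).
            partial = sum(a & b for a, b in zip(in_plane, w_plane))
            # Digital shift-and-add restores the binary weighting of this bit pair.
            total += partial << (in_bit + w_bit)
    return total


print(bin(bit_serial_dot([0b1010], [0b0101])))   # the slide's 1010 x 0101 example
print(bit_serial_dot([3, 15, 7], [9, 2, 15]))    # a 3-element vector
```

The number of binary passes is the product of the two bit widths, so the cost grows quadratically when both operands widen together.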


Pros/Cons of Charge-based computing
• Pros:
  − Realizes an extremely small “in-memory computing” cell
  − Amortizes ADC cost by increasing column size to >2000
    • @1bit, peak energy efficiency: 192TOPS/W
    • @4bit, estimated efficiency ~= 12TOPS/W
    • @8bit, estimated efficiency ~= 3TOPS/W
• Cons:
  − Arithmetic precision limited by ADC resolution
    • 13bit ADC required for 2304 array
    • Tradeoff between precision and readout energy
[Figure: SQNR [dB]]
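The 4-bit and 8-bit estimates are consistent with dividing the 1-bit peak by the number of binary passes that bit-serial processing needs (input bits times weight bits). The slides do not state how the estimates were derived, so this is one plausible reading rather than the author's method; the names are hypothetical.

```python
PEAK_1B_TOPS_PER_W = 192  # 1-bit peak efficiency from the slide


def bit_serial_efficiency(in_bits: int, w_bits: int) -> float:
    """Estimated TOPS/W if every multi-bit MAC costs in_bits * w_bits binary MACs."""
    return PEAK_1B_TOPS_PER_W / (in_bits * w_bits)


for bits in (1, 4, 8):
    print(f"@{bits}bit: ~{bit_serial_efficiency(bits, bits):g} TOPS/W")
```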


Time/Phase domain Computing
• Target low-area and 8-bit MAC resolution
  − Realize analog computation for wide application with low cost
• Time domain approach [Miyashita, ASSCC2017]
  → Accumulates pulse length
  → Multiple DTCs required (DTC: digital-to-time converter)
• Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
  → Accumulates phase
  → Only a single DTC + gated ring oscillator (GRO)
  → Requires digital cells only; small area and scalable
Figure labels: IN, Weight, DTC, GRO, Output

PMAC: Phase domain MAC
                    Phase Domain    Digital
MAC Resolution      1~8 bit         1~64 bit
Norm. Area/Bit      1.2             1
Norm. Power         0.125           1
• Target low-area and 8-bit MAC resolution
• Phase domain approach: a single DTC + gated ring oscillator (GRO), digital cells only
[Yoshioka, VLSI2018][Toyama, ASSCC2018]

PhaseMAC: Operation
1. The DTC outputs a pulse corresponding to Din (pulse width: Din * tinv).
2. The GRO phase advances while the DTC pulse is high; the GRO frequency is set by W.
3. The phase is saved by gating → accumulation realized:
   $\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$
4. Steps 1~3 are repeated for the number of MACs. Each time the phase reaches 2π, it is detected by the counter.

Example from the figure:
            Din     W       Counter
Seq. 1      3       1       0
Seq. 2      24      0.5     1

PhaseMAC: The Operation
• During readout, the GRO phase and the counter value are summed with proper weightings.
  − Counter: MSB, weighted x10
  − GRO phase (via phase-to-digital conversion): LSB
  − Example: Din = 3, 24 and W = 1, 0.5 give OUT = 15
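The accumulate-and-readout sequence can be sketched as an idealized behavioural model. It tracks phase in LSB units, where 10 LSBs equal one full 2π rotation as in the slide's example; the real circuit's DTC and GRO non-idealities are not modelled, and the function and constant names are hypothetical.

```python
import math

LSB_PER_TURN = 10  # in the slide's example, a full 2*pi rotation corresponds to 10 LSBs


def phase_mac(dins, weights, lsb_per_turn=LSB_PER_TURN):
    """Idealized PhaseMAC: accumulate Din*W as GRO phase, count full rotations."""
    phase_lsb = 0.0   # residual GRO phase, in LSB units (phase = phase_lsb * 2*pi / lsb_per_turn)
    counter = 0       # number of times the phase has wrapped past 2*pi
    for din, w in zip(dins, weights):
        # Steps 1-2: the DTC pulse of width Din lets the GRO advance at a rate set by W.
        phase_lsb += din * w
        # Step 4: each full rotation is detected by the counter.
        wraps, phase_lsb = divmod(phase_lsb, lsb_per_turn)
        counter += int(wraps)
        # Step 3: gating holds phase_lsb until the next pulse (nothing to do in this model).
    # Readout: counter is the MSB part (weighted by lsb_per_turn), residual phase the LSB part.
    out = counter * lsb_per_turn + phase_lsb
    return out, counter, phase_lsb


out, counter, residual = phase_mac([3, 24], [1, 0.5])
print("OUT =", out, "| counter =", counter,
      "| residual phase =", residual * 2 * math.pi / LSB_PER_TURN, "rad")
```

With the figure's inputs the model wraps once during the second sequence and reads out 15, matching the counter values and OUT shown on the slides.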

GRO Circuit Design
• Positive and negative accumulators exist (same circuitry).
  − Each accumulator has an LSB GRO driven by W[3:0] and an MSB GRO driven by W[6:4]
• Frequency-configurable GRO
  − Linear frequency tuning is achieved by choosing the number of activated inverters
Figure labels: DTC, DTCOUT, W[3:0], GRO[3:0] (GRO[3], GRO[2], GRO[1], GRO[0]), RSTB
[A.V. Rylyakov, ISSCC 2007]

Fabricated chip
• GRO MAC core: 35um x 35um (DTC, GRO, readout, asynchronous timing circuit, output circuits)
• Test setup: FPGA board (ARTY Artix 7) exchanges MAC input, trigger, MAC output, and finish signal with the chip; a PC monitors the outputs
• MNIST DNN: input layer 784, hidden layers H1~H4 with 256 units each, output layer 10

MNIST classification results (10000 test data):
                          32FP    8b PMAC
Validation results [%]    98.2    98.1
*98.2% was the limit for fully-connected NNs

Comparison Results
                            PMAC       Time         Charge        Digital
                                       JSSC 2017    ISSCC 2018    @28nmCMOS
Resolution                  8          1            1             8
MAC Area [um2]              1200       13000        4600          900
MAC Area/Bit [um2]          150        13000        4600          112
MNIST test accuracy         98.1%      98.4%        N.A.          98.2%
MAC rate                    780 MHz    N.A.         10 MHz        800 MHz
Efficiency [TOPS/W]         14         77           532           2.8
Efficiency [TOPS/W*Bit]     112        77           532           22.4
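The per-bit rows follow from the raw rows: area is divided by the resolution and efficiency is multiplied by it. A minimal sketch that recomputes them from the table values; the dictionary layout and function name are hypothetical.

```python
# Raw rows copied from the comparison table.
designs = {
    "PMAC":                {"bits": 8, "area_um2": 1200,  "tops_per_w": 14},
    "Time (JSSC 2017)":    {"bits": 1, "area_um2": 13000, "tops_per_w": 77},
    "Charge (ISSCC 2018)": {"bits": 1, "area_um2": 4600,  "tops_per_w": 532},
    "Digital":             {"bits": 8, "area_um2": 900,   "tops_per_w": 2.8},
}


def normalize(design):
    """Return (area per bit in um2, efficiency in TOPS/W*Bit) for one table column."""
    area_per_bit = design["area_um2"] / design["bits"]
    efficiency_bit = design["tops_per_w"] * design["bits"]
    return area_per_bit, efficiency_bit


for name, design in designs.items():
    area_per_bit, efficiency_bit = normalize(design)
    print(f"{name:20s} area/bit = {area_per_bit:8.1f} um2, "
          f"efficiency = {efficiency_bit:6.1f} TOPS/W*Bit")
```

The digital column computes to 112.5 um2 per bit, which the table lists as 112.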

Pros/Cons of PMAC
• Pros:
  − Achieves high-accuracy MAC operation within the analog computation sweet spot (INT3-6)
  − Does not require a high-precision ADC
• Cons:
  − Only supports output-stationary dataflows
    • Cannot be adapted to in-memory architectures
  − Only proven with a single MAC circuit
    • Efficiency of an entire analog accelerator is unknown

Conclusions and remarks
• Analog computing can be superior at INT3-6 precision. In this talk, we covered analog computing methods handling multi-bit operations:
  − Charge-based computing
  − Phase-based computing
• While proven to be more power efficient than digital, challenges remain
  − Flexibility, reliability, noise issues...
  − We need to get together with the software guys!
    • Framework integration
    • Open-source
