Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

Kentaro Yoshioka
October 26, 2021

A-SSCC2021 Rising Star Express Forum (RiSE Forum)

Transcript

  1. Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

    Kentaro Yoshioka, Assistant Professor, Keio University, Japan
    IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)
  2. Outline

    • Background
    • When and why should we go analog?
    • Charge-based computing
    • Phase-based computing
    RiSE (Rising Star Express) Forum, IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)
  3. Background

    Deep learning algorithms in use: image processing (image: Waymo.com), speech recognition (image: Amazon.com), sensor data analysis (image: Apple.com)
  4. Why Edge Computing?

    • Edge-computing pros:
      • Privacy + energy
      • DRAM-Chip: ~200 pJ*
      • Chip-Cloud (via BLE): ~10 µJ!*
    *Energy per 16-bit data
  5. Why Edge Computing?

    Edge computing use case: normal/anomaly detection

    | Use case | Energy |
    |---|---|
    | Sending 400 samples to the cloud | 4 mJ |
    | Sending just anomaly detection results | 1 µJ |

    Orders of magnitude lower.
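The 4 mJ figure is consistent with the roughly 10 uJ per 16-bit transfer quoted for the chip-to-cloud link. A minimal arithmetic sketch, assuming each of the 400 samples is one 16-bit word (the slides do not state the sample width); the constant and function names are illustrative only:

```python
# Energies are kept in picojoules so the arithmetic stays in exact integers.
PJ_PER_UJ = 1_000_000
BLE_ENERGY_PER_SAMPLE_PJ = 10 * PJ_PER_UJ  # ~10 uJ per 16-bit sample (slide figure)
RESULT_ENERGY_PJ = 1 * PJ_PER_UJ           # ~1 uJ to send only the detection result


def cloud_upload_energy_pj(num_samples: int) -> int:
    """Energy to send raw samples to the cloud, assuming one 16-bit word per sample."""
    return num_samples * BLE_ENERGY_PER_SAMPLE_PJ


raw_upload_pj = cloud_upload_energy_pj(400)
savings_ratio = raw_upload_pj // RESULT_ENERGY_PJ

print(f"raw upload: {raw_upload_pj / 1e9:g} mJ")
print(f"edge processing sends {savings_ratio}x less energy over the link")
```

Under these assumptions the raw upload costs 4 mJ, about 4000 times the energy of sending only the anomaly result.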
  6. Study of digital DNN Accelerators

    • Digital accelerators
      − DaDianNao
      − Eyeriss
      − Edge-TPU
    • Mainly focused on data reuse for high efficiency
      − Minimize DRAM access energy
    • Data-flow processing
    • Systolic arrays
    [Chen, 2016] [Chen, 2014]
  7. Data-reuse Techniques

    • Goal: minimize off-chip access and maximize data reuse
      − Can maximize reuse of “inputs”, “weights”, or “outputs”
      − Systematic analysis shows that similar efficiency can be achieved, regardless of the data-reuse strategy
    Figure panels: maximize weight reuse; maximize output reuse
  8. Data-reuse Techniques

    • Goal: minimize off-chip access and maximize data reuse
      − Can maximize reuse of “inputs”, “weights”, or “outputs”
      − Systematic analysis shows that similar efficiency can be achieved, regardless of the data-reuse strategy
    Figure panels: maximize weight reuse; maximize output reuse
    [Ref] X. Yang, “Interstellar: Using Halide's scheduling language to analyze DNN accelerators”, ASPLOS 2020.
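The difference between reuse strategies comes down to loop order: which operand stays in a local register while the others stream past. A minimal sketch on a small matrix product, with a toy counter for weight fetches; the function names and the counting model are assumptions for illustration, not the analysis from the cited work:

```python
def output_stationary(x, w):
    """Each output y[m][n] stays in a local accumulator until it is complete."""
    M, K, N = len(x), len(w), len(w[0])
    y = [[0] * N for _ in range(M)]
    weight_loads = 0
    for m in range(M):
        for n in range(N):
            acc = 0                       # output held locally
            for k in range(K):
                acc += x[m][k] * w[k][n]
                weight_loads += 1         # weight fetched again for every output
            y[m][n] = acc                 # written back once
    return y, weight_loads


def weight_stationary(x, w):
    """Each weight w[k][n] is loaded once and reused across all inputs."""
    M, K, N = len(x), len(w), len(w[0])
    y = [[0] * N for _ in range(M)]
    weight_loads = 0
    for k in range(K):
        for n in range(N):
            wt = w[k][n]                  # weight held locally
            weight_loads += 1
            for m in range(M):
                y[m][n] += x[m][k] * wt   # partial sums updated in place
    return y, weight_loads


x = [[1, 2, 3], [4, 5, 6]]        # M=2 input rows, K=3
w = [[1, 0], [0, 1], [2, 2]]      # K=3, N=2
y_os, loads_os = output_stationary(x, w)
y_ws, loads_ws = weight_stationary(x, w)
print(y_os == y_ws, loads_os, loads_ws)
```

Both orders compute the same result. The weight-stationary order fetches each weight once but touches partial sums on every step, while the output-stationary order does the opposite, which is the kind of cost shift behind the trade-off the slide describes.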
  9. Analog to the rescue?

    • Digital architectures are systematically optimized
      − How can we go further?
    • One extreme option: analog computing
      − Required DNN arithmetic precision is low (INT2~INT8)
      − Analog computation can achieve higher efficiency, if not limited by noise

    ResNet-50 ImageNet top-1 accuracy, weight + activation quantized network with PACT:

    | Full (FP32) | INT2 | INT3 | INT4 | INT5 |
    |---|---|---|---|---|
    | 0.769 | 0.722 | 0.753 | 0.765 | 0.767 |

    [Ref] J. Choi, “PACT: Parameterized Clipping Activation for Quantized Neural Networks”, arXiv:1805.06085.
  10. When to analog?

    • When is analog computation efficient?
      − At high precision (>8b), energy increases exponentially due to kT/C noise
      − Digital is already efficient at binary precision, so analog offers little advantage there
    Figure labels: Binary, ~8b
    [Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.
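The exponential growth in the noise-limited regime can be illustrated with the textbook kT/C bound. This is a minimal sketch, not the model from the cited paper. It assumes the sampling capacitor is sized so that thermal noise power kT/C equals the quantization noise power LSB²/12, which gives C·V_FS² = 12·k·T·4^bits; the function name is hypothetical:

```python
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def min_sampling_energy_joules(bits: int, temperature_k: float = 300.0) -> float:
    """C * V_FS^2 for the smallest capacitor whose kT/C noise equals quantization noise.

    Setting kT/C = LSB^2 / 12 with LSB = V_FS / 2**bits yields
    C * V_FS^2 = 12 * k * T * 4**bits (independent of V_FS).
    """
    return 12.0 * K_BOLTZMANN * temperature_k * 4.0 ** bits


for bits in (4, 6, 8, 10):
    print(f"{bits}b: {min_sampling_energy_joules(bits):.3e} J")
```

Under this assumption each extra bit costs four times the energy, which is why the bound only matters once precision is high enough for thermal noise to dominate other limits.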
  11. When to analog?

    • When is analog computation efficient?
      − Sweet spot is INT3-6, where analog is not limited by noise
      − Ideally, analog MAC energy increases linearly in this region
    Figure label: Sweet spot INT3~6
    [Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.
  12. How to analog?

    • In this talk, we cover multi-bit analog computation methods that can cover the INT3-6 sweet spot:
      − Charge-based computing
      − Phase-based computing
    • Aiming to replace the Multiply-and-Accumulate (MAC) circuit
    Figure labels: W[N], IN[N], +
  13. Charge-based computing

    • “Multiply” is done in the digital domain, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
      − Can integrate weight memory and processing as “in-memory computing”
    Figure labels: Inputs [1:N]; W[0], IN[0]; W[N], IN[N]; accumulate via charge; 8b SAR; N=2304
    [Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
  14. Charge-based computing

    • “Multiply” is done in the digital domain, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
      − Can integrate weight memory and processing as “in-memory computing”
    • 1. Process IN[i]*W[i]
    • 2. Store outputs as charge
    • 3. Column caps are shorted to realize analog accumulation
    • 4. Readout by ADC
    Figure labels: Inputs [1:N]; W[0], IN[0]; W[N], IN[N]; accumulate via charge; 8b SAR; N=2304
    [Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
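The four steps can be mimicked with an idealized behavioral model. This is a sketch under stated assumptions, not the cited circuit: inputs and weights are taken as single bits in {0, 1} multiplied with AND, all cell capacitors are equal, and the ADC is ideal. The function name `charge_domain_mac` is hypothetical.

```python
def charge_domain_mac(inputs, weights, adc_bits=8, vdd=1.0):
    """Idealized binary MAC: digital 1-bit multiply, analog sum by charge sharing."""
    assert len(inputs) == len(weights)
    n = len(inputs)
    # Steps 1-2: 1-bit multiply per cell; the product is stored as charge on a unit cap.
    cell_voltages = [vdd * (i & w) for i, w in zip(inputs, weights)]
    # Step 3: shorting N equal caps averages their voltages (charge conservation).
    v_shared = sum(cell_voltages) / n
    # Step 4: ideal ADC with 2**adc_bits codes spanning 0..vdd.
    max_code = 2 ** adc_bits - 1
    code = round(v_shared / vdd * max_code)
    # Digital estimate of the true sum, rescaled from the ADC code.
    estimate = code * n / max_code
    return code, estimate


N = 2304                                  # column length from the slide
inputs = [1] * N
weights = [1] * 576 + [0] * (N - 576)     # true sum of products is 576
code, estimate = charge_domain_mac(inputs, weights)
print(code, estimate)
```

With an 8-bit readout over 2304 cells, one ADC step spans about 9 counts of the true sum, so the estimate is close to 576 but not exact. This is the precision limit discussed on the pros and cons slides.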
  15. Multi-bit extension

    • How can we extend to multi-bit MACs?
      − Binary computation can extend to arbitrary precision by “bit-serial” processing

           1010
         x 0101
         ------
           1010
          0000
         1010
        0000
        -------
         110010

    A 4b x 4b multiplication is broken up into 16 binary multiply-and-adds.
    [Ref] C. Eckert, “Neural cache: Bit-serial in-cache acceleration of deep neural networks”, ISCA 2018.
  16. Multi-bit extension

    • How can we extend this to multi-bit MACs?
      − Binary computation can extend to arbitrary precision by “bit-serial” processing

           1010
         x 0101
         ------
           1010
          0000
         1010
        0000
        -------
         110010

    Vectorize the bit-serial operation.
    [Ref] H. Jia, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing”, JSSC 2020.
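The decomposition on these slides can be written out directly. A minimal sketch for unsigned operands only (signed handling is outside its scope); the function names are hypothetical. The first function reproduces the 1010 x 0101 example, and the second applies the same idea to a whole vector, where each inner sum is one binary MAC of the kind an analog column could perform.

```python
def bit_serial_multiply(a: int, b: int, bits: int = 4):
    """Multiply two unsigned ints using only 1b x 1b products plus shift-and-add."""
    result = 0
    binary_ops = 0
    for j in range(bits):                # multiplier bit position
        b_bit = (b >> j) & 1
        for i in range(bits):            # multiplicand bit position
            a_bit = (a >> i) & 1
            result += (a_bit & b_bit) << (i + j)   # binary product, weighted by 2**(i+j)
            binary_ops += 1
    return result, binary_ops


def bit_serial_dot(xs, ws, bits: int = 4) -> int:
    """Dot product of unsigned vectors as bits*bits binary MACs over the whole vector."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            # One binary MAC across the vector for bit planes i and j.
            binary_mac = sum(((x >> i) & 1) & ((w >> j) & 1) for x, w in zip(xs, ws))
            total += binary_mac << (i + j)
    return total


product, ops = bit_serial_multiply(0b1010, 0b0101)
print(bin(product), ops)
print(bit_serial_dot([10, 3, 7], [5, 2, 15]))
```

The number of binary passes grows with the product of the two bit widths, which is the cost that multi-bit operation pays in this scheme.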
  17. Pros/Cons of Charge-based computing

    • Pros:
      − Realizes an extremely small “in-memory computing” cell
      − Amortizes ADC cost by increasing column size to >2000
        @1bit, peak energy efficiency: 192 TOPS/W
        @4bit, estimated efficiency ≈ 12 TOPS/W
        @8bit, estimated efficiency ≈ 3 TOPS/W
    • Cons:
      − Arithmetic precision limited by ADC resolution
        • 13-bit ADC required for a 2304 array
    Figure axis label: SQNR [dB]
  18. Pros/Cons of Charge-based computing

    • Pros:
      − Realizes an extremely small “in-memory computing” cell
      − Amortizes ADC cost by increasing column size to >2000
        @1bit, peak energy efficiency: 192 TOPS/W
        @4bit, estimated efficiency ≈ 12 TOPS/W
        @8bit, estimated efficiency ≈ 3 TOPS/W
    • Cons:
      − Arithmetic precision limited by ADC resolution
        • 13-bit ADC required for a 2304 array
        • Tradeoff between precision and readout energy
    Figure axis label: SQNR [dB]
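The 4-bit and 8-bit estimates appear consistent with bit-serial scaling of the 1-bit peak figure. A minimal sketch, assuming a b x b MAC costs b² binary MACs and that nothing else changes; the slides do not state how the estimates were derived, so this model and the function name are assumptions:

```python
PEAK_1B_TOPS_PER_W = 192  # 1-bit peak efficiency quoted on the slide


def bit_serial_efficiency(bits: int) -> float:
    """Assumed model: a bits x bits MAC costs bits**2 binary MACs."""
    return PEAK_1B_TOPS_PER_W / (bits * bits)


for bits in (1, 4, 8):
    print(f"{bits}b: {bit_serial_efficiency(bits):g} TOPS/W")
```

Under this assumption the model gives 12 TOPS/W at 4 bits and 3 TOPS/W at 8 bits, matching the quoted estimates.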
  19. Time/Phase domain Computing

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Time domain approach [Miyashita, ASSCC2017]
      → Accumulates pulse length
      → Multiple DTCs required
    DTC: Digital-to-time converter
  20. Time/Phase domain Computing

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Time domain approach [Miyashita, ASSCC2017]
      → Accumulates pulse length
      → Multiple DTCs required
    Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
      → Accumulates phase
      → Only a single DTC + Gated Ring Oscillator (GRO)
    Requires digital cells only; small area and scalable
    DTC: Digital-to-time converter
    Figure labels: DTC, Gated Ring Oscillator (GRO), IN, Weight, Output
  21. PMAC: Phase domain MAC

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
      → Accumulates phase
      → Only a single DTC + Gated Ring Oscillator (GRO)
    Requires digital cells only; small area and scalable

    | | Phase Domain | Digital MAC |
    |---|---|---|
    | Resolution | 1~8 bit | 1~64 bit |
    | Norm. Area/Bit | 1.2 | 1 |
    | Norm. Power | 0.125 | 1 |

    Figure labels: DTC, Gated Ring Oscillator (GRO), IN, Weight, Output
  22. PhaseMAC: Operation

    1. DTC outputs a pulse corresponding to Din (pulse width: Din × t_inv)
    Figure: DTC and Gated Ring Oscillator (GRO) timing diagram with GRO phase up to 2π; Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; Seq. 1 shown. Phase is saved by gating.
  23. PhaseMAC: Operation

    2. GRO phase advances while the DTC pulse is high
    Figure: DTC and Gated Ring Oscillator (GRO) timing diagram with GRO phase up to 2π; DTC pulse width Din × t_inv; Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; Seq. 1 and Seq. 2 shown.
  24. PhaseMAC: Operation

    2. GRO phase advances while the DTC pulse is high
    3. Phase is saved by gating → accumulation realized

    $\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$

    Figure: DTC and Gated Ring Oscillator (GRO) timing diagram with GRO phase up to 2π; DTC pulse width Din × t_inv; Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; Seq. 1 and Seq. 2 shown.
  25. PhaseMAC: Operation

    Steps 1~3 are repeated for the number of MACs. When the phase reaches 2π, this is detected by the counter.

    $\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$

    Figure: DTC and Gated Ring Oscillator (GRO) timing diagram with GRO phase up to 2π; DTC pulse width Din × t_inv; Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1”; Seq. 1 and Seq. 2 shown. Phase is saved by gating.
  26. PhaseMAC: The Operation

    During readout, the GRO phase and the counter value are summed with proper weightings.
    Figure: readout logic combines the counter (MSB, ×10) with the GRO phase (LSB, phase to digital) to give OUT = 15, for Din = “3”, “24”; W = “1”, “0.5”; counter = “0”, “1” (Seq. 1, Seq. 2).
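The operation sequence can be captured in a small behavioral model. This is a sketch under stated assumptions rather than the circuit itself. Phase is tracked in units of 2π/10, so one full rotation equals 10 units, following the scaling in the slide's example. The DTC pulse width is taken as proportional to Din and the GRO frequency as proportional to W. Only non-negative values are modeled. The class name `PhaseMacModel` is hypothetical.

```python
from fractions import Fraction

PHASE_UNITS_PER_CYCLE = 10  # one 2*pi rotation, in units of 2*pi/10


class PhaseMacModel:
    """Behavioral model of phase-domain accumulation (non-negative values only)."""

    def __init__(self):
        self.phase = Fraction(0)  # GRO phase, in units of 2*pi/10
        self.counter = 0          # number of completed 2*pi rotations

    def accumulate(self, din, weight):
        # Steps 1-2: phase advances by din * weight while the DTC pulse is high.
        self.phase += Fraction(din) * Fraction(weight)
        # Each pass through 2*pi increments the counter; the remainder is the held phase.
        wraps, self.phase = divmod(self.phase, PHASE_UNITS_PER_CYCLE)
        self.counter += int(wraps)
        # Step 3: gating holds self.phase until the next accumulate call.

    def readout(self):
        # Counter is the MSB part (weight 10), GRO phase is the LSB part.
        return self.counter * PHASE_UNITS_PER_CYCLE + self.phase


mac = PhaseMacModel()
mac.accumulate(3, 1)                 # Seq. 1 from the slide example
counter_after_seq1 = mac.counter
mac.accumulate(24, Fraction(1, 2))   # Seq. 2 from the slide example
print(mac.counter, mac.phase, mac.readout())
```

For the slide's example the counter reads 0 after the first sequence and 1 after the second, and the readout is 15, which equals 3 × 1 + 24 × 0.5.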
  27. GRO Circuit Design

    Positive and negative accumulators exist (same circuitry).
    Linear frequency tuning is achieved by choosing the number of activated inverters. [A.V. Rylyakov, ISSCC 2007]
    Figure labels: positive accumulator with LSB GRO (W[3:0]) and MSB GRO (W[6:4]); negative accumulator; DTC; frequency-configurable GRO with W[3:0], DTCOUT, GRO[3:0] (GRO[3], GRO[2], GRO[1], GRO[0]), RSTB
  28. Fabricated chip

    Chip photo labels (35 µm × 35 µm): GRO MAC core, DTC, GRO, read out, asyn. timing circuit, output circuits, logo
    Test setup labels: FPGA board (ARTY Artix 7); MAC input, trigger, MAC output, finish signal; PC to monitor outputs
    MNIST DNN: input layer 784; hidden layers H1, H2, H3, H4 with 256 each; output layer 10

    MNIST classification (10000 test data) results:

    | | 32FP | 8b PMAC |
    |---|---|---|
    | Validation results [%] | 98.2 | 98.1 |

    *98.2% was the limit for fully-connected NNs
  29. Comparison Results

    | | PMAC | Time (JSSC 2017) | Charge (ISSCC 2018) | Digital @28nm CMOS |
    |---|---|---|---|---|
    | Resolution | 8 | 1 | 1 | 8 |
    | MAC Area [um2] | 1200 | 13000 | 4600 | 900 |
    | MAC Area/Bit [um2] | 150 | 13000 | 4600 | 112 |
    | MNIST test accuracy | 98.1% | 98.4% | N.A. | 98.2% |
    | MAC rate | 780 MHz | N.A. | 10 MHz | 800 MHz |
    | Efficiency [TOPS/W] | 14 | 77 | 532 | 2.8 |
    | Efficiency [TOPS/W*Bit] | 112 | 77 | 532 | 22.4 |
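The per-bit rows follow from the raw rows by dividing area by resolution and multiplying efficiency by resolution. A minimal sketch that recomputes them from the table values; the dictionary layout and function name are illustrative only:

```python
# (resolution in bits, MAC area in um^2, efficiency in TOPS/W), copied from the table.
designs = {
    "PMAC": (8, 1200, 14.0),
    "Time (JSSC 2017)": (1, 13000, 77.0),
    "Charge (ISSCC 2018)": (1, 4600, 532.0),
    "Digital (28nm CMOS)": (8, 900, 2.8),
}


def per_bit_metrics(bits, area_um2, tops_per_w):
    """Return (area per bit, efficiency times bit) for one design."""
    return area_um2 / bits, tops_per_w * bits


normalized = {name: per_bit_metrics(*row) for name, row in designs.items()}
for name, (area_per_bit, eff_times_bit) in normalized.items():
    print(f"{name}: {area_per_bit:g} um^2/bit, {eff_times_bit:g} TOPS/W*bit")
```

The digital design works out to 112.5 um²/bit, which the table lists as 112.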
  30. Pros/Cons of PMAC

    • Pros:
      − Achieves high-accuracy MAC operation within the analog computation sweet spot (INT3-6)
      − Does not require a high-precision ADC
    • Cons:
      − Only supports output-stationary dataflows
        • Cannot adopt in-memory architectures
      − Only proven with a single MAC circuit
        • Efficiency of an entire analog accelerator is unknown
  31. Conclusions and remarks

    • Analog computing can be superior at INT3-6 precision. In this talk, we covered analog computing methods handling multi-bit operations:
      − Charge-based computing
      − Phase-based computing
    • While proven to be more power efficient than digital, challenges remain
      − Flexibility, reliability, noise issues...
      − We need to get together with the software guys!
        • Framework integration
        • Open-source