Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

A-SSCC2021 Rising Star Express Forum (RiSE Forum)

Yoshioka Lab (Keio CSG)

October 26, 2021
Transcript

  1. Analog to the Rescue? Analog Deep Learning Accelerator Aspects and Challenges

    Kentaro Yoshioka, Assistant Professor, Keio University, Japan
    IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)
  2. Outline

    • Background
    • When and why should we go analog?
    • Charge-based computing
    • Phase-based computing
    RiSE (Rising Star Express) Forum, IEEE Asian Solid-State Circuits Conference (A-SSCC 2021)
  3. Backgrounds

    Deep learning algorithms:
    • Image processing (source: Waymo.com)
    • Speech recognition (source: Amazon.com)
    • Sensor data analysis (source: Apple.com)
  4. Why Edge Computing?

    • Edge-computing pros:
      • Privacy + energy
      • DRAM-Chip: ~200 pJ*
      • Chip-Cloud (via BLE): ~10 µJ!*
    *Energy per 16-bit data
  5. Why Edge Computing?

    Use case                                  Energy
    Sending 400 samples to the cloud          4 mJ
    Sending just anomaly detection results    1 µJ

    Sending only the Normal/Anomaly result computed at the edge costs orders of magnitude less energy.
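The 4 mJ figure is consistent with the ~10 µJ per 16-bit sample quoted for the Chip-Cloud (BLE) link. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical, and the 1 µJ result-upload energy is simply taken from the table rather than derived.

```python
# Energies are kept as integer picojoules to avoid floating-point rounding.
PJ_PER_UJ = 1_000_000
PJ_PER_MJ = 1_000_000_000

BLE_ENERGY_PER_SAMPLE_PJ = 10 * PJ_PER_UJ  # ~10 uJ per 16-bit sample (slide value)
RESULT_UPLOAD_ENERGY_PJ = 1 * PJ_PER_UJ    # 1 uJ for the anomaly result (slide value)


def cloud_upload_energy_pj(num_samples):
    """Energy to send num_samples raw 16-bit samples to the cloud over BLE."""
    return num_samples * BLE_ENERGY_PER_SAMPLE_PJ


raw_energy_pj = cloud_upload_energy_pj(400)
raw_energy_mj = raw_energy_pj / PJ_PER_MJ                  # 4.0 mJ
saving_ratio = raw_energy_pj // RESULT_UPLOAD_ENERGY_PJ    # 4000x
print(f"raw upload: {raw_energy_mj} mJ, saving: {saving_ratio}x")
```

Under these numbers, sending only the result is about 4000 times cheaper than sending the raw samples.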
  6. Study of digital DNN Accelerators

    • Digital accelerators
      − DaDianNao [Chen, 2014]
      − Eyeriss [Chen, 2016]
      − Edge-TPU
    • Mainly focused on data reuse for high efficiency
      − Minimize DRAM access energy
    • Data-flow processing
    • Systolic arrays
  7. Data-reuse Techniques

    • Goal: Minimize off-chip access and maximize data reuse
      − Can maximize reuse of “inputs”, “weights”, or “outputs”
      − Systematic analysis shows that similar efficiency can be achieved, regardless of the data-reuse strategy
    Figure panels: Maximize weight reuse; Maximize output reuse
  8. Data-reuse Techniques

    • Goal: Minimize off-chip access and maximize data reuse
      − Can maximize reuse of “inputs”, “weights”, or “outputs”
      − Systematic analysis shows that similar efficiency can be achieved, regardless of the data-reuse strategy
    Figure panels: Maximize weight reuse; Maximize output reuse
    [Ref] X. Yang, “Interstellar: Using Halide's scheduling language to analyze DNN accelerators”, ASPLOS 2020.
  9. Analog to the rescue?

    • Digital architectures are systematically optimized
      − How can we go further?
    • One extreme option: Analog computing
      − Required DNN arithmetic precision is low (INT2~INT8)
      − Analog computation can achieve higher efficiency, if not limited by noise

    ResNet-50 ImageNet top-1 accuracy, weight + activation quantized network with PACT:

    Precision   Full (FP32)   INT2    INT3    INT4    INT5
    Top-1       0.769         0.722   0.753   0.765   0.767

    [Ref] J. Choi, “PACT: Parameterized Clipping Activation for Quantized Neural Networks”, arXiv:1805.06085.
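For readers unfamiliar with PACT, the forward pass of its activation quantizer can be sketched as follows. This is a minimal illustration only: in PACT the clipping level alpha is learned during training, whereas here it is a fixed argument, and the function name is hypothetical.

```python
def pact_quantize(x, alpha, bits):
    """Clip an activation to [0, alpha], then quantize it uniformly to `bits` bits."""
    clipped = min(max(x, 0.0), alpha)   # parameterized clipping
    steps = 2**bits - 1                 # number of quantization steps in [0, alpha]
    return round(clipped * steps / alpha) * alpha / steps


# INT2 example with alpha = 6.0: the only representable values are 0, 2, 4, 6.
print(pact_quantize(-1.0, 6.0, 2))  # clipped at 0
print(pact_quantize(2.2, 6.0, 2))   # snapped to the nearest level
print(pact_quantize(10.0, 6.0, 2))  # clipped at alpha
```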
  10. When to analog?

    • When is analog computation efficient?
      − At high precision (>8b), energy increases exponentially due to kT/C noise
      − Digital is efficient at binary precision, so analog offers little advantage there
    Figure labels: Binary; ~8b
    [Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.
  11. When to analog?

    • When is analog computation efficient?
      − Sweet spot is INT3-6, where analog is not limited by noise
      − Ideally, the analog MAC's energy increases linearly in this region
    Figure label: Sweet spot INT3~6
    [Ref] B. Murmann, “Mixed-Signal Computing for Deep Neural Network Inference”, TVLSI 2021.
  12. How to analog?

    • In this talk, we cover multi-bit analog computation methods that can reach the INT3-6 sweet spot:
      − Charge-based computing
      − Phase-based computing
    • Aiming to replace the Multiply-and-Accumulate (MAC) circuit
    Figure labels: W[N], IN[N], +
  13. Charge-based computing

    • “Multiply” is done in digital, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
      − Can integrate the weight memory and the processing as “in-memory computing”
    Figure labels: Inputs [1:N]; W[0], IN[0] ... W[N], IN[N]; Accumulate via charge; 8b SAR; N=2304
    [Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
  14. Charge-based computing

    • “Multiply” is done in digital, and “accumulation” over a vector of length N is done in the analog domain → realizes a binary MAC
      − Can integrate the weight memory and the processing as “in-memory computing”
    • 1. Process IN[i]*W[i]
    • 2. Store outputs as charge
    • 3. Column caps are shorted to realize analog accumulation
    • 4. Readout by ADC
    Figure labels: Inputs [1:N]; W[0], IN[0] ... W[N], IN[N]; Accumulate via charge; 8b SAR; N=2304
    [Ref] H. Valavi, “A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute”, JSSC 2019.
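The four steps can be mimicked with an idealized behavioral model. The sketch below makes several assumptions that are not from the slides: bits are encoded as 0/1 and multiplied with a logical AND, all column capacitors are equal, there is no noise or mismatch, and the ADC is ideal. The function name is hypothetical.

```python
def charge_domain_binary_mac(inputs, weights, vdd=1.0, adc_bits=8):
    """Idealized model of one column of a charge-domain binary MAC."""
    n = len(inputs)
    # 1. Process IN[i]*W[i]: a 1b x 1b multiply, modeled here as a logical AND.
    products = [i & w for i, w in zip(inputs, weights)]
    # 2. Store each product as charge on an equal-sized unit capacitor.
    cell_voltages = [vdd * p for p in products]
    # 3. Short the column caps: equal caps settle to the average voltage.
    column_voltage = sum(cell_voltages) / n
    # 4. Read out with an ideal ADC of adc_bits resolution.
    full_scale = 2**adc_bits - 1
    code = round(column_voltage / vdd * full_scale)
    return code, column_voltage


# With N = 2304 and an 8b ADC, sums that differ by one product share a code.
inputs = [1] * 2304
code_a, _ = charge_domain_binary_mac(inputs, [1] * 1000 + [0] * 1304)
code_b, _ = charge_domain_binary_mac(inputs, [1] * 1001 + [0] * 1303)
print(code_a, code_b)
```

In this toy model a column sum of 1000 and a column sum of 1001 produce the same ADC code, which illustrates why the readout resolution bounds the arithmetic precision.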
  15. Multi-bit extension

    • How can we extend to multi-bit MACs?
      − Binary computation can extend to arbitrary precision by “bit-serial” processing

          1010
        x 0101
        ------
          1010
         0000
        1010
       0000
       -------
        110010

    4b x 4b is broken up into 16 binary multiply-and-adds
    [Ref] C. Eckert, “Neural Cache: Bit-serial in-cache acceleration of deep neural networks”, ISCA 2018.
  16. Multi-bit extension

    • How can we extend this to multi-bit MACs?
      − Binary computation can extend to arbitrary precision by “bit-serial” processing

          1010
        x 0101
        ------
          1010
         0000
        1010
       0000
       -------
        110010

    Vectorize bit-serial operation
    [Ref] H. Jia, “A Programmable Heterogeneous Microprocessor Based on Bit-Scalable In-Memory Computing”, JSSC 2020.
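A minimal sketch of bit-serial processing, assuming unsigned operands (signed formats need extra handling that is not shown). The first function reproduces the 1010 x 0101 example with 16 binary multiply-and-adds; the second vectorizes the idea so that each bit pair becomes one binary MAC over the whole vector. Both function names are hypothetical.

```python
def bit_serial_multiply(a, b, bits=4):
    """Multiply two unsigned `bits`-wide integers using only 1b x 1b products."""
    total = 0
    binary_ops = 0
    for i in range(bits):
        a_bit = (a >> i) & 1
        for j in range(bits):
            b_bit = (b >> j) & 1
            total += (a_bit & b_bit) << (i + j)  # binary multiply, shifted add
            binary_ops += 1
    return total, binary_ops


def bit_serial_dot(xs, ws, bits=4):
    """Dot product built from binary MACs, one per (input bit, weight bit) pair."""
    total = 0
    for i in range(bits):
        for j in range(bits):
            # One binary MAC over the whole vector (what an analog column would do).
            binary_mac = sum(((x >> i) & 1) & ((w >> j) & 1) for x, w in zip(xs, ws))
            total += binary_mac << (i + j)       # digital shift-and-add
    return total


product, ops = bit_serial_multiply(0b1010, 0b0101)
print(format(product, "b"), ops)  # 110010 16
```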
  17. Pros/Cons of Charge-based computing

    • Pros:
      − Realize extremely small “in-memory computing” cell
      − Amortize ADC cost by increasing column size to >2000
        @1bit, peak energy efficiency: 192 TOPS/W
        @4bit, estimated efficiency ~= 12 TOPS/W
        @8bit, estimated efficiency ~= 3 TOPS/W
    • Cons:
      − Arithmetic precision limited by ADC resolution
        • 13-bit ADC required for 2304 array
    Figure axis: SQNR [dB]
  18. Pros/Cons of Charge-based computing

    • Pros:
      − Realize extremely small “in-memory computing” cell
      − Amortize ADC cost by increasing column size to >2000
        @1bit, peak energy efficiency: 192 TOPS/W
        @4bit, estimated efficiency ~= 12 TOPS/W
        @8bit, estimated efficiency ~= 3 TOPS/W
    • Cons:
      − Arithmetic precision limited by ADC resolution
        • 13-bit ADC required for 2304 array
        • Tradeoff between precision vs readout energy
    Figure axis: SQNR [dB]
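The 4-bit and 8-bit estimates match what one gets by dividing the 1-bit peak by the number of binary passes that bit-serial processing needs (weight bits times input bits). That reading is an assumption on my part, not something the slide states; the sketch below only checks that the numbers line up, and the names are hypothetical.

```python
PEAK_1BIT_TOPS_PER_W = 192.0  # 1-bit peak efficiency quoted on the slide


def estimated_efficiency(weight_bits, input_bits, peak=PEAK_1BIT_TOPS_PER_W):
    """Assumed scaling: one binary MAC pass per (weight bit, input bit) pair."""
    binary_passes = weight_bits * input_bits
    return peak / binary_passes


for bits in (1, 4, 8):
    print(f"{bits}b x {bits}b: {estimated_efficiency(bits, bits)} TOPS/W")
```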
  19. Time/Phase domain Computing

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Time domain approach [Miyashita, ASSCC2017]
      → Accumulates pulse length
      → Multiple DTC required
      DTC: Digital-to-time converter
  20. Time/Phase domain Computing

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Time domain approach [Miyashita, ASSCC2017]
      → Accumulates pulse length
      → Multiple DTC required
      DTC: Digital-to-time converter
    Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
      → Accumulates phase
      → Only single DTC + Gated Ring Oscillator (GRO)
      Requires digital cells only; small area and scalable
    Figure labels: DTC, Gated Ring Oscillator (GRO), IN, Weight, Output
  21. PMAC: Phase domain MAC

    • Target low-area and 8-bit MAC resolution
      − Realize analog computation for wide application with low cost
    Proposed phase domain approach [Yoshioka, VLSI2018][Toyama, ASSCC2018]
      → Accumulates phase
      → Only single DTC + Gated Ring Oscillator (GRO)
      Requires digital cells only; small area and scalable

                       Phase Domain   Digital MAC
    Resolution         1~8 bit        1~64 bit
    Norm. Area/Bit     1.2            1
    Norm. Power        0.125          1

    Figure labels: DTC, Gated Ring Oscillator (GRO), IN, Weight, Output
  22. PhaseMAC: Operation

    1. DTC outputs a pulse corresponding to Din
    Figure: DTC (Din, W; pulse Din*tinv) → Gated Ring Oscillator (GRO) → Counter
    Timing diagram (Seq. 1): Din “3”, “24”; W “1”, “0.5”; GRO phase (0 to 2π), phase saved by gating; Counter “0”, “1”
  23. PhaseMAC: Operation

    2. GRO phase advances while DTC pulse is high
    Figure: DTC (Din, W; pulse Din*tinv) → Gated Ring Oscillator (GRO) → Counter
    Timing diagram (Seq. 1, Seq. 2): Din “3”, “24”; W “1”, “0.5”; GRO phase (0 to 2π); Counter “0”, “1”
  24. PhaseMAC: Operation

    2. GRO phase advances while DTC pulse is high
    3. Phase is saved by gating → Accumulation realized
    $\mathrm{Phase} = \mathrm{Prev.\ Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$
    Figure: DTC (Din, W; pulse Din*tinv) → Gated Ring Oscillator (GRO) → Counter
    Timing diagram (Seq. 1, Seq. 2): Din “3”, “24”; W “1”, “0.5”; GRO phase (0 to 2π), phase saved by gating; Counter “0”, “1”
  25. PhaseMAC: Operation

    Steps 1~3 are repeated for the number of MACs. When the phase reaches 2π, the wrap is detected by the counter.
    $\mathrm{Phase} = \mathrm{Prev.\ Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$
    Figure: DTC (Din, W; pulse Din*tinv) → Gated Ring Oscillator (GRO) → Counter
    Timing diagram (Seq. 1, Seq. 2): Din “3”, “24”; W “1”, “0.5”; GRO phase (0 to 2π), phase saved by gating; Counter “0”, “1”
  26. PhaseMAC: The Operation

    During readout, the GRO phase and the counter value are summed with proper weightings.
    Readout logic: Counter (MSB) x10 + GRO phase (LSB, phase to digital) → OUT = 15
    Figure: DTC (Din, W; pulse Din*tinv) → Gated Ring Oscillator (GRO) → Counter
    Timing diagram (Seq. 1, Seq. 2): Din “3”, “24”; W “1”, “0.5”; Counter “0”, “1”
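The accumulate-and-readout sequence can be checked against the slide's example (Din = 3, 24 and W = 1, 0.5, OUT = 15) with a small behavioral model. This is a sketch under assumptions: the phase is tracked in units of 2π/10 so that one full turn equals 10, only positive weights are modeled, and noise and nonlinearity are ignored. The names are hypothetical.

```python
from fractions import Fraction

UNITS_PER_TURN = 10  # one full 2*pi turn, in units of 2*pi/10 (from the phase formula)


def phase_mac(dins, weights):
    """Behavioral model of phase-domain accumulation with a wrap counter."""
    phase = Fraction(0)  # GRO phase, in units of 2*pi/10
    counter = 0          # number of completed 2*pi turns
    for din, w in zip(dins, weights):
        # Steps 1-2: the DTC pulse lets the GRO phase advance by Din * W.
        phase += Fraction(din) * Fraction(w)
        # The counter detects each time the phase passes 2*pi.
        wraps, phase = divmod(phase, UNITS_PER_TURN)
        counter += int(wraps)
        # Step 3: gating holds the residual phase until the next input.
    # Readout: counter is the MSB part (x10), the residual phase is the LSB part.
    out = counter * UNITS_PER_TURN + phase
    return counter, phase, out


counter, phase, out = phase_mac([3, 24], [1, 0.5])
print(counter, phase, out)  # 1 5 15
```

The result, counter = 1 and residual phase = 5 (half a turn), reads out as 1 x 10 + 5 = 15, the same as 3 x 1 + 24 x 0.5.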
  27. GRO Circuit Design

    • Positive and negative accumulators exist.
    • Linear frequency tuning is achieved by choosing the number of activated inverters. [A.V. Rylyakov, ISSCC 2007]
    Figure: DTC → Positive Accumulator (LSB GRO W[3:0], MSB GRO W[6:4]); Negative Accumulator (same circuitry)
    Figure: Frequency Configurable GRO with W[3:0], DTCOUT, RSTB and GRO[3:0] (GRO[3], GRO[2], GRO[1], GRO[0])
  28. Fabricated chip

    Chip photo (35 µm x 35 µm): GRO MAC core, DTC, GRO, Read out, Asyn. timing circuit, Output circuits, LOGO
    Test setup: FPGA board (ARTY Artix 7) sends the MAC input and trigger and receives the MAC output and finish signal; PC to monitor outputs

    MNIST classification (10000 test data) results:

                              32FP    8b PMAC
    Validation Results [%]    98.2    98.1

    MNIST DNN: Input Layer 784 → H1 256 → H2 256 → H3 256 → H4 256 → Output Layer 10
    *98.2% was the limit for fully-connected NNs
  29. Comparison Results

                                PMAC       Time (JSSC 2017)   Charge (ISSCC 2018)   Digital @28nm CMOS
    Resolution                  8          1                  1                     8
    MAC Area [um2]              1200       13000              4600                  900
    MAC Area/Bit [um2]          150        13000              4600                  112
    MNIST test accuracy         98.1%      98.4%              N.A.                  98.2%
    MAC rate                    780 MHz    N.A.               10 MHz                800 MHz
    Efficiency [TOPS/W]         14         77                 532                   2.8
    Efficiency [TOPS/W*Bit]     112        77                 532                   22.4
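The two per-bit rows follow from the raw rows by dividing the area by the resolution and multiplying the efficiency by the resolution. A short sketch of that normalization, using hypothetical helper names and the table's own values:

```python
def area_per_bit(area_um2, resolution_bits):
    """MAC area normalized by resolution, in um2 per bit."""
    return area_um2 / resolution_bits


def bit_normalized_efficiency(tops_per_w, resolution_bits):
    """Efficiency scaled by resolution, in TOPS/W*Bit."""
    return tops_per_w * resolution_bits


# (name, resolution [bit], MAC area [um2], efficiency [TOPS/W]) from the table
designs = [
    ("PMAC", 8, 1200, 14),
    ("Time", 1, 13000, 77),
    ("Charge", 1, 4600, 532),
    ("Digital", 8, 900, 2.8),
]
for name, bits, area, eff in designs:
    print(name, area_per_bit(area, bits), bit_normalized_efficiency(eff, bits))
```

The digital design's area per bit comes out as 112.5 um2, which the table lists as 112.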
  30. Pros/Cons of PMAC

    • Pros:
      − Achieves high-accuracy MAC operation within the analog computation sweet spot (INT3-6)
      − Does not require a high-precision ADC
    • Cons:
      − Only supports output-stationary dataflows
        • Cannot be adapted to in-memory architectures
      − Only proven with a single MAC circuit
        • Entire analog accelerator efficiency is unknown
  31. Conclusions and remarks

    • Analog computing can be superior with INT3-6 precision. In this talk, we covered analog computing methods handling multi-bit operations:
      − Charge-based computing
      − Phase-based computing
    • While proven to be more power-efficient than digital, challenges remain
      − Flexibility, reliability, noise issues
      − We need to get together with the software guys!
        • Framework integration
        • Open-source