PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators

Symposia on VLSI Technology and Circuits
PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators
Kentaro Yoshioka (1, 2), Yosuke Toyama (1), Koichiro Ban (1), Daisuke Yashima (1), Shigeru Maya (1), Akihide Sai (1), Kohei Onizuka (1)
(1) Toshiba Corporation, Kawasaki, Japan
(2) Stanford Univ., CA

Slide 1: Outline
• Application
  – Anomaly Detection
• Accelerator design consideration
  – Goal: Clarifying the bottleneck
• PhaseMAC: Phase based MAC circuit
  – Goal: Enhancing computation efficiency

Slide 2 (Motivation): Deep Learning
• Image Processing (Waymo.com)
• Speech Recognition (Amazon.com)
• Sensor Data Analysis (Apple.com)

Slide 3 (Motivation): Modern Factories (Toyota, Mexico)
[Image source: http://motorcars.jp/]

Slide 4 (Motivation): Application: Anomaly Detection
• DRAM to chip: ~200 pJ*
  – Distance analogy: Hilton to airport
• Chip to cloud (via BLE): ~10 uJ*
  – Distance analogy: Hawaii to Japan
*Energy per 16-bit data

Slide 5 (Motivation): Application: Anomaly Detection
[Figure: edge computing; the sensor node reports Normal/Anomaly]

Use case                                 Energy
Sending 400 samples to the cloud         4 mJ
Sending just anomaly detection results   1 uJ

Sending only the results costs orders of magnitude less energy.

Battery operation pro/con:
☺ Simplified wiring
☺ Cheaper sensor installation
☹ Limited operation time
→ Low-power operation demanded
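The 4 mJ figure is consistent with the per-sample BLE cost quoted on the previous slide (~10 uJ per 16-bit sample). A minimal arithmetic sketch, assuming one 16-bit word per sample; the constant and function names are illustrative and not from the deck:

```python
# Figures taken from the deck; names are hypothetical.
ENERGY_PER_SAMPLE_J = 10e-6   # ~10 uJ per 16-bit sample, chip to cloud via BLE
SAMPLES_PER_WINDOW = 400      # samples sent per prediction window
RESULT_ENERGY_J = 1e-6        # deck's figure for sending only the detection result


def cloud_upload_energy(samples=SAMPLES_PER_WINDOW,
                        energy_per_sample=ENERGY_PER_SAMPLE_J):
    """Energy in joules to send raw samples to the cloud."""
    return samples * energy_per_sample


raw_energy = cloud_upload_energy()
saving_ratio = raw_energy / RESULT_ENERGY_J

print(f"raw upload: {raw_energy * 1e3:.1f} mJ")
print(f"raw upload costs about {saving_ratio:.0f}x the result-only case")
```

Under these assumptions, sending the raw window costs roughly three to four orders of magnitude more energy than sending the detection result.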

Slide 6 (Algorithm): Algorithm of Anomaly Detection
• Gathering anomaly data is hard!
• There can be multiple fault patterns
  – Covering all fault patterns is costly
[Figure: classifier model with outputs Normal: xx%, Anomaly1: yy%, Anomaly2: zz%, ...]

Slide 7 (Algorithm)
• Construct the model based on normal data only
• Model predicts the next 400 data points
  – State is normal: small prediction error
  – State is anomalous: large prediction error
[Figure: sensor data over normal and abnormal operation; a deep learning generative model takes 400 samples and predicts the next 400 samples]
S. Maya, “dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction,” KDD 2017 WS

Slide 8 (Algorithm)
• Anomaly score (= prediction error) of the deep learning generative model:
  $\text{Anomaly score} = \sum_{n} (\text{Meas. Input}[n] - \text{Pred.}[n])^2$
[Figure: anomaly score vs. time steps, low during normal operation and high during abnormal operation]
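The anomaly score defined on this slide is a sum of squared prediction errors over the window. A minimal sketch, assuming plain Python sequences of equal length; the function name and the toy data are hypothetical, and the deck does not state a decision threshold:

```python
def anomaly_score(measured, predicted):
    """Sum of squared differences between measured input and model prediction."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted))


# Toy 4-sample window (the deck uses 400 samples per window).
predicted = [0, 1, 0, -1]
normal_window = [0, 1, 0, -1]      # matches the prediction exactly
abnormal_window = [0, 2, 0, -2]    # deviates from the prediction

normal_score = anomaly_score(normal_window, predicted)
abnormal_score = anomaly_score(abnormal_window, predicted)
print(normal_score, abnormal_score)
```

A small score indicates that the model, trained only on normal data, predicted the window well; a large score flags a likely anomaly.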

Slide 9 (Algorithm): Algorithm of Anomaly Detection
• Use an Autoencoder to generate prediction

Model: 8 layer FC NN
Layer     Input   H1    H2    H3    H4   H5    H6    H7    Output
Neurons   400     252   152   152   52   152   252   352   400

Models                 Training time   Representation
LSTM                   ☹ Hours         ☺ Remembers previous sequences
Fully-connected (FC)   ☺ 1 min.        ☹ Only current sequence

[Figure: sensor data over normal and abnormal operation; the model takes 400 samples and predicts the next 400 samples]
S. Maya, “dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction,” KDD 2017 WS

Slide 10 (Algorithm): Algorithm of Anomaly Detection
• Operations are periodic in industrial machines
  → The fully-connected (FC) model is utilized in this work

Slide 11 (Hardware): Hardware Acceleration of FC Network
[Figure: the 8 layer FC NN model]
Virtually design an accelerator (DLA) for such algorithms to clarify:
• Where is the bottleneck?
• What can we do about it?

Slide 12 (Hardware): DLA System Design Consideration
[Figure: virtual DLA; an input matrix (batch size = N) feeds a PE array through L1 and L2 memories; weights are stored on-chip; each PE contains a MAC and a REG]
• L2 sized to store all of the weights on-chip (no DRAM)
• L1, Reg. capacity, scheduling optimized with framework https://github.com/xuanyoya/CNN-blocking

Normalized energy [X.Yang, arXiv:1606.04209]
MAC cost      1
256B RF       1
128kB SRAM    6

Slide 13 (Hardware): DLA System Design Consideration
• Completely memory dominant
  – 10x larger than computation
• Without batching, no data reuse
  – Must trade with latency

Num. parameters
Inputs    1700
Outputs   1700
Weights   180800

[Figure: memory/compute ratio vs. total parameters in model, results without batching; series: Anomaly model, ResNet, FC]

Slide 14 (Hardware): DLA System Design Consideration
• Data reuse achieved with batching, trading with latency
  – At batch size = 64, computation energy is 3x larger than memory energy
  – Memory increased by only 5%
• Now, can we reduce the computation?
  → Improving computation energy by 8x would reduce DLA power by 66%
[Figure: memory-to-computation energy ratio (log scale) vs. batch size]
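The 66% figure follows from the 3:1 computation-to-memory split at batch size 64. A short arithmetic sketch, assuming memory energy is normalized to 1 and is unchanged by the MAC improvement; the variable names are illustrative:

```python
# Normalized energies at batch size = 64 (from the slide: computation is 3x memory).
memory_energy = 1.0
compute_energy = 3.0
compute_improvement = 8.0   # hypothetical 8x more efficient MAC

total_before = memory_energy + compute_energy
total_after = memory_energy + compute_energy / compute_improvement
power_reduction = 1.0 - total_after / total_before

print(f"DLA power reduction: {power_reduction:.1%}")
```

The result is about 65.6%, which the slide rounds to 66%.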

Slide 15 (Circuit): Can we enhance computation energy 10x?
• >95% of the computation done in a DNN is Multiply-and-Accumulate (MAC)
• Digital MACs are already highly optimized
• Further improvements cannot be expected
[Figure: digital MAC multiplying W[N] by IN[N] and accumulating]

Slide 16 (Circuit): Use of Analog Computation
• Analog computation achieves higher power efficiency than digital
  – Can further scale accelerator power
• Charge based computation [Lee, ISSCC2016] [Bankman, ASSCC2016]
• Time based computation [Miyashita JSSC2014, 2017]
[Figure: time based computation built from a chain of DTCs]

Slide 17 (Circuit): Use of Analog Computation
• Analog computation achieves high power efficiency
• Issues: Area, accuracy

             Charge domain   Time domain   Digital MAC
Power/Bit    0.3             0.2           1
Resolution   1~3 bit         1 bit         1~64 bit
Area/Bit     21              250           1

• High cost MAC array
• Resolution of 1~3 bit is not enough for anomaly detection!
[Figure: anomaly score vs. computational resolution (32FP, 8B down to 1B), with the target marked]

Slide 18 (Circuit): PMAC: Phase domain MAC
• Target low-area and 8-bit MAC resolution
  – Realize analog computation for wide application with low cost

Power breakdown of digital MAC (~1 pJ/op)
Mult.   40%
Add.    10%
Reg.    50%

[Figure: digital MAC multiplying W[N] by IN[N] and accumulating]

Slide 19 (Circuit)
• Target low-area and 8-bit MAC resolution
  – Realize analog computation for wide application with low cost
• Time domain approach
  → Accumulates pulse length
  → Multiple DTCs required (DTC: digital-to-time converter)
• Proposed phase domain approach
  → Accumulates phase
  → Only a single DTC + Gated Ring Oscillator (GRO)
  → Requires digital cells only; small area and scalable
[Figure: IN drives the DTC, the DTC drives the GRO, Weight is applied to the GRO, and the GRO produces Output]

Slide 20 (Circuit)
Proposed phase domain approach compared with a digital MAC:

                  Phase Domain   Digital MAC
Resolution        8 bit          1~64 bit
Norm. Area/Bit    1.2            1
Norm. Power       0.125          1

Slide 21 (Circuit): PhaseMAC: Operation
1. DTC outputs a pulse corresponding to Din (pulse length = Din * tinv)
[Figure: Din drives the DTC and W drives the Gated Ring Oscillator (GRO), followed by a counter; timing diagram of DTC output, GRO phase (0 to 2π) and counter value for Din = "3", "24" and W = "1", "0.5"]

Slide 22 (Circuit): PhaseMAC: Operation
2. GRO phase advances while DTC pulse is high

Slide 23 (Circuit): PhaseMAC: Operation
2. GRO phase advances while DTC pulse is high
3. Phase is saved by gating → Accumulation realized
   $\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$

Slide 24 (Circuit): PhaseMAC: Operation
• Steps 1~3 are repeated for the number of MACs
• When the phase reaches 2π, it is detected by the counter

Slide 25 (Circuit): PhaseMAC: The Operation
• During readout, the GRO phase and counter value are summed with proper weightings
[Figure: readout logic; the counter value (MSB) is multiplied by 10 and added to the phase-to-digital output of the GRO (LSB); for Seq. 1 (Din = "3", W = "1") and Seq. 2 (Din = "24", W = "0.5"), OUT = 15]
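The operation walked through on Slides 21 to 25 can be mimicked with a small behavioral model. This is only a sketch under stated assumptions, not the circuit itself: phase is tracked in units where one full 2π rotation equals 10 (the scale used in the slide example), only non-negative weights are modeled, and all names (`phase_mac`, `readout`, `UNITS_PER_CYCLE`) are hypothetical.

```python
UNITS_PER_CYCLE = 10  # assumption: one 2*pi rotation equals 10 units, as in the slide example


def phase_mac(inputs, weights, units_per_cycle=UNITS_PER_CYCLE):
    """Accumulate Din * W as oscillator phase; count each full 2*pi wrap."""
    phase = 0.0    # GRO phase, kept between MACs by gating
    counter = 0    # number of completed 2*pi rotations
    for d_in, w in zip(inputs, weights):
        # The DTC pulse length is proportional to Din and the GRO frequency
        # to W, so the phase advance is proportional to Din * W.
        phase += d_in * w
        wraps, phase = divmod(phase, units_per_cycle)
        counter += int(wraps)
    return counter, phase


def readout(counter, phase, units_per_cycle=UNITS_PER_CYCLE):
    """Combine the counter (MSB) and residual phase (LSB) into the MAC result."""
    return counter * units_per_cycle + phase


# Slide example: Seq. 1 has Din = 3, W = 1; Seq. 2 has Din = 24, W = 0.5.
counter, phase = phase_mac([3, 24], [1, 0.5])
out = readout(counter, phase)
print(counter, phase, out)
```

With the slide's inputs the model ends with the counter at 1 and a residual phase of 5 units, so the readout gives 15, matching the OUT value in the figure.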

Slide 26 (Circuit): GRO Circuit Design
• Positive and negative accumulators exist (same circuitry)
  – Each has an LSB GRO set by W[3:0] and an MSB GRO set by W[6:4]
• Linear frequency tuning achieved by choosing the number of activated inverters [A.V. Rylyakov, ISSCC 2007]
[Figure: frequency configurable GRO with inputs W[3:0], DTCOUT and RSTB, and outputs GRO[3:0]]

Slide 27 (Circuit): GRO Circuit Design
• 4b GRO frequency characteristic is sufficient for 8b resolution [A.V. Rylyakov, ISSCC 2007]
[Figure: normalized frequency vs. W[code] for the 4b GRO]
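The GRO slides show the weight routed as W[6:4] to an MSB GRO and W[3:0] to an LSB GRO, with separate positive and negative accumulators. One way to picture that routing is sketched below. The sign-magnitude encoding and the function name `split_weight` are assumptions made for illustration; the deck does not state how the sign is encoded.

```python
def split_weight(w):
    """Split a signed weight into (accumulator, W[6:4], W[3:0]).

    Assumes a sign-magnitude 8-bit weight: the sign selects the positive or
    negative accumulator and the 7-bit magnitude sets the GRO frequency codes.
    """
    if not -127 <= w <= 127:
        raise ValueError("weight must fit in sign plus 7-bit magnitude")
    magnitude = abs(w)
    msb_code = (magnitude >> 4) & 0x7   # W[6:4], drives the MSB GRO
    lsb_code = magnitude & 0xF          # W[3:0], drives the LSB GRO
    accumulator = "negative" if w < 0 else "positive"
    return accumulator, msb_code, lsb_code


acc, msb_code, lsb_code = split_weight(-37)
print(acc, msb_code, lsb_code)
```

Under this assumption each code would select how many inverters are active in its GRO, which is how the slides describe the linear frequency tuning.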

Slide 28 (Circuit): Fabricated chip
• MAC core: 35 um x 35 um, containing the DTC, GRO, GRO readout, asynchronous timing circuit and output circuits
• Test setup: an FPGA board (ARTY Artix 7) supplies the MAC input and trigger and receives the MAC output and finish signal; a PC monitors the outputs

Slide 29 (Circuit): Fabricated chip
MNIST classification (10000 test data) results

                         32FP   8b PMAC
Validation results [%]   98.2   98.1

MNIST DNN: Input layer 784, H1 256, H2 256, H3 256, H4 256, Output layer 10
*98.2% was the limit for fully-connected NNs
Training done by Keras with L2 regression and dropouts

Slide 30 (Circuit): Anomaly Detection Results
• Results are from artificially generated sine-based data
  – (This is not quantitative. It would be great if we could share the source code and data; I’m working on it.)
[Figure: anomaly score for normal and abnormal data at 32FP, 8B and 3B, with the target marked]
[Figure: anomaly score vs. MAC resolution (32FP, 8B down to 1B), with the target marked]

Slide 31 (Circuit): Power Efficiency Characteristics
1) High power efficiency for inputs with a low mean value; the oscillator operates less.
2) Measured with data having sparsity = 0%. (Has zero-skipping)
• Anomaly detection: 11.6 TOPS/W
[Figure: PMAC efficiency [TOPS/W] vs. mean value of non-zero input data with regard to max. value [%]; datasets: MNIST (DNN), CIFAR10 (LeNet-like CNN), ImageNet (Inception-v3)]
*Trained with L2 norm regression
*Mean value calculated by averaging the test data activations.

Slide 32 (Circuit): Comparison Results

                          PMAC      Time (JSSC 2017)   Charge (ISSCC 2016)
Resolution                8         1                  3
MAC Area [um2]            1200      13000*             12000
MAC Area/Bit [um2]        150       13000*             4000
MNIST test accuracy       98.1%     98.4%              N.A.
Power [uW]                152       N.A.               228
MAC rate                  780 MHz   N.A.               1 GHz
Efficiency [TOPS/W]       14        77                 8.77
Efficiency [TOPS/W*Bit]   112       77                 26.3

Comparison Results (MAC Area/Bit highlighted): 26.6x improvement

Comparison Results (Efficiency [TOPS/W*Bit] highlighted): 48% improved
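The per-bit rows of the comparison table can be re-derived from the resolution, area and efficiency rows. A minimal sketch using only the tabulated (rounded) values; the dictionary layout and function name are illustrative:

```python
# Values copied from the comparison table; keys are hypothetical labels.
designs = {
    "PMAC": {"bits": 8, "area_um2": 1200, "tops_per_w": 14},
    "Time (JSSC 2017)": {"bits": 1, "area_um2": 13000, "tops_per_w": 77},
    "Charge (ISSCC 2016)": {"bits": 3, "area_um2": 12000, "tops_per_w": 8.77},
}


def per_bit_metrics(design):
    """Return (MAC area per bit in um2, efficiency in TOPS/W*Bit)."""
    area_per_bit = design["area_um2"] / design["bits"]
    efficiency_bit = design["tops_per_w"] * design["bits"]
    return area_per_bit, efficiency_bit


metrics = {name: per_bit_metrics(d) for name, d in designs.items()}
for name, (area_per_bit, efficiency_bit) in metrics.items():
    print(f"{name}: {area_per_bit:.0f} um2/bit, {efficiency_bit:.1f} TOPS/W*Bit")

# Area-per-bit ratio of the charge-domain design to PMAC.
area_ratio = metrics["Charge (ISSCC 2016)"][0] / metrics["PMAC"][0]
print(f"area/bit ratio: {area_ratio:.2f}x")
```

The area-per-bit ratio between the charge-domain design and PMAC comes out near 26.7, in line with the 26.6x improvement quoted in the deck.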

Slide 35: Conclusion
• Clarified that for anomaly detection, computation power can exceed memory power
  – Sufficient input batching must be done, trading with latency
• Proposed PhaseMAC to further scale accelerator power
  – Accumulation done by GRO; the area is 26.6x smaller than conventional art
  – Power efficiency 48% higher than conventional art
  – Demonstrated anomaly detection results for edge computing

Slide 36: Acknowledgement
• The authors would like to thank Xuan Yang, Edward Lee, Danny Bankman, Boris Murmann and Mark Horowitz for the valuable discussions.
• The authors would like to thank Daisuke Miyashita and Jun Deguchi for the valuable discussions on time domain computing.
