PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators

Slide 1

Slide 1 text

Symposia on VLSI Technology and Circuits
PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators
Kentaro Yoshioka (1, 2), Yosuke Toyama (1), Koichiro Ban (1), Daisuke Yashima (1), Shigeru Maya (1), Akihide Sai (1), Kohei Onizuka (1)
(1) Toshiba Corporation, Kawasaki, Japan
(2) Stanford Univ., CA

Slide 2

Slide 2 text

Outline
• Application
  – Anomaly Detection
• Accelerator design consideration
  – Goal: Clarifying the bottleneck
• PhaseMAC: Phase based MAC circuit
  – Goal: Enhancing computation efficiency

Slide 3

Slide 3 text

Deep Learning (section: Motivation)
• Image Processing (image: Waymo.com)
• Speech Recognition (image: Amazon.com)
• Sensor Data Analysis (image: Apple.com)

Slide 4

Slide 4 text

Modern Factories (Toyota, Mexico) (section: Motivation)
Image source: http://motorcars.jp/

Slide 5

Slide 5 text

Application: Anomaly Detection (section: Motivation)
• DRAM-Chip: ~200 pJ*
  – Analogy: Hilton to airport
• Chip-Cloud (via BLE): ~10 uJ*
  – Analogy: Hawaii to Japan
*Energy per 16-bit data

Slide 6

Slide 6 text

Application: Anomaly Detection (section: Motivation)
Figure: edge computing node that outputs Normal/Anomaly

Use case                                  Energy
Sending 400 samples to the cloud          4 mJ
Sending just anomaly detection results    1 uJ
Orders of magnitude lower!

Battery operation pro/con:
☺ Simplified wiring
☺ Cheaper sensor installation
☹ Limited operation time
→ Low-power operation demanded
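The energy figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming the 4 mJ figure is simply 400 samples times the ~10 uJ per 16-bit sample quoted for the chip-to-cloud link; the variable names are illustrative and not from the slides.

```python
# Energy figures quoted on the slides (assumed to combine linearly).
E_CLOUD_PER_SAMPLE_J = 10e-6  # ~10 uJ per 16-bit sample, chip to cloud via BLE
N_SAMPLES = 400               # samples in one window
E_RESULT_ONLY_J = 1e-6        # ~1 uJ to send only the detection result


def raw_upload_energy(n_samples, e_per_sample=E_CLOUD_PER_SAMPLE_J):
    """Energy in joules to send n_samples raw samples to the cloud."""
    return n_samples * e_per_sample


e_raw = raw_upload_energy(N_SAMPLES)   # raw upload energy per window
saving = e_raw / E_RESULT_ONLY_J       # ratio between the two use cases

print(f"raw upload: {e_raw * 1e3:.1f} mJ, saving factor: {saving:.0f}x")
```

Under these assumptions the raw upload matches the 4 mJ row of the table, and sending only the result is lower by a factor of several thousand.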

Slide 7

Slide 7 text

Algorithm of Anomaly Detection (section: Algorithm)
• Gathering anomaly data is hard!
• There can be multiple fault patterns
  – Coping with all fault patterns is costly
Figure: classifier model with outputs Normal: xx%, Anomaly1: yy%, Anomaly2: zz%, ...

Slide 8

Slide 8 text

Algorithm of Anomaly Detection (section: Algorithm)
• Construct the model based on normal data only
• Model predicts the next 400 data points
  – State is normal: small prediction error
  – State is anomaly: large prediction error
Figure: sensor data (normal operation, abnormal operation), 400 samples → Deep Learning Generative Model → prediction of the next 400 samples
Reference: S. Maya, "dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction," KDD 2017 WS

Slide 9

Slide 9 text

Algorithm of Anomaly Detection (section: Algorithm)
• Construct the model based on normal data only
• Model predicts the next 400 data points
  – State is normal: small prediction error
  – State is anomaly: large prediction error
Figure: sensor data (normal operation, abnormal operation), 400 samples → Deep Learning Generative Model → prediction of the next 400 samples
Figure: anomaly score (= prediction error) vs. time steps, for normal and abnormal operation
$\text{Anomaly score} = \sum_{n} \left(\text{Meas. Input}[n] - \text{Pred.}[n]\right)^2$
Reference: S. Maya, "dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction," KDD 2017 WS
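The anomaly score above is a sum of squared prediction errors over one window. The following is a minimal sketch of that formula in plain Python; the sine-shaped test data, the offset, and the injected disturbance are made-up illustration values, not data from the talk.

```python
import math


def anomaly_score(measured, predicted):
    """Sum of squared differences between measured input and prediction."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted))


# Hypothetical 400-sample window: the "prediction" is a clean sine wave.
N = 400
predicted = [math.sin(2 * math.pi * n / 50) for n in range(N)]

# Normal operation: measurement follows the prediction with a small offset.
normal = [p + 0.01 for p in predicted]

# Abnormal operation: a disturbance of 1.0 on 20 consecutive samples.
abnormal = list(predicted)
for n in range(100, 120):
    abnormal[n] += 1.0

score_normal = anomaly_score(normal, predicted)
score_abnormal = anomaly_score(abnormal, predicted)
print(score_normal, score_abnormal)
```

With this toy data the abnormal window scores orders of magnitude higher than the normal one, which is the separation the detection relies on. Any decision threshold would be a separate design choice that the slides do not specify.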

Slide 10

Slide 10 text

Algorithm of Anomaly Detection (section: Algorithm)
• Use an Autoencoder to generate prediction
Figure: sensor data (normal operation, abnormal operation), 400 samples → prediction of the next 400 samples
Figure: Model: 8 layer FC NN. Input 400, H1 252, H2 152, H3 152, H4 52, H5 152, H6 252, H7 352, Output 400

Models                  Training time    Representation
LSTM                    ☹ Hours          ☺ Remembers previous sequences
Fully-connected (FC)    ☺ 1 min.         ☹ Only current sequence

Reference: S. Maya, "dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction," KDD 2017 WS

Slide 11

Slide 11 text

Algorithm of Anomaly Detection (section: Algorithm)
• Use an Autoencoder to generate prediction
Figure: sensor data (normal operation, abnormal operation), 400 samples → prediction of the next 400 samples
Figure: Model: 8 layer FC NN. Input 400, H1 252, H2 152, H3 152, H4 52, H5 152, H6 252, H7 352, Output 400

Models                  Training time    Representation
LSTM                    ☹ Hours          ☺ Remembers previous sequences
Fully-connected (FC)    ☺ 1 min.         ☹ Only current sequence

Operations are periodic in industrial machines
→ FC model utilized in this work
Reference: S. Maya, "dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction," KDD 2017 WS

Slide 12

Slide 12 text

Hardware Acceleration of FC Network (section: Hardware)
Figure: Model: 8 layer FC NN. Input 400, H1 252, H2 152, H3 152, H4 52, H5 152, H6 252, H7 352, Output 400
Virtually design an accelerator (DLA) for such algorithms to clarify:
• Where is the bottleneck?
• What can we do about it?

Slide 13

Slide 13 text

DLA System Design Consideration (section: Hardware)
Figure: Virtual DLA. Input matrix (batch size = N), L2 (weights stored), L1, and a PE array; each PE contains a MAC and a register (REG)
• L2 sized to store all of the weights on-chip (no DRAM)
• L1, Reg. capacity, and scheduling optimized with the framework https://github.com/xuanyoya/CNN-blocking

Component      Normalized energy cost
MAC            1
256B RF        1
128kB SRAM     6
[X. Yang, arXiv:1606.04209]

Slide 14

Slide 14 text

DLA System Design Consideration (section: Hardware)
Figure: Virtual DLA. Input matrix (batch size = N), L2 (weights stored), L1, and a PE array; each PE contains a MAC and a register (REG)
• Completely memory dominant
  – Memory is 10x larger than computation
• Without batching, there is no data reuse
  – → Must trade with latency
Figure: Memory/Compute ratio vs. total parameters in model, results without batching (labeled points: Anomaly model, ResNet FC)

Num. Parameters
Inputs      1700
Outputs     1700
Weights     180800

Slide 15

Slide 15 text

DLA System Design Consideration (section: Hardware)
Figure: Virtual DLA. Input matrix (batch size = N), L2 (weights stored), L1, and a PE array; each PE contains a MAC and a register (REG)
Figure: Memory-Computation Energy Ratio vs. batch size (annotations: "@Batch size=64: Comp. 3x larger than memory"; "Increased memory only 5%")
• Data reuse is achieved with batching, trading with latency.
• Now, can we reduce the computation?
  → Improving computation energy by 8x would reduce DLA power by 66%
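The 66% figure can be reproduced with a two-term energy model. This is a rough sketch, assuming total DLA energy is just memory plus computation and taking the "computation 3x larger than memory" ratio at batch size 64 as given; the function and parameter names are illustrative.

```python
def dla_energy_reduction(comp_to_mem_ratio, comp_improvement):
    """Fraction of total energy saved when computation energy shrinks.

    Assumes total energy = memory + computation, with memory normalized to 1.
    """
    memory = 1.0
    compute = comp_to_mem_ratio * memory
    before = memory + compute
    after = memory + compute / comp_improvement
    return 1.0 - after / before


# Batch size 64: computation is 3x memory; improve computation by 8x.
reduction = dla_energy_reduction(comp_to_mem_ratio=3.0, comp_improvement=8.0)
print(f"total energy reduction: {reduction:.1%}")
```

Under this model the total drops from 4 units to 1.375 units, a reduction of about 66%, which agrees with the slide. The model ignores the small memory increase caused by batching.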

Slide 16

Slide 16 text

Can we improve computation energy 10x? (section: Circuit)
• More than 95% of the computation in a DNN is Multiply-and-Accumulate (MAC)
• Digital MACs are already highly optimized
• Further improvements cannot be expected
Figure: MAC unit multiplying W[N] and IN[N] and accumulating (+)

Slide 17

Slide 17 text

Use of Analog Computation (section: Circuit)
• Analog computation achieves higher power efficiency than digital
  – Can further scale accelerator power
Figure: Time based computation, using multiple DTCs [Miyashita JSSC2014, 2017]
Figure: Charge based computation [Lee, ISSCC2016] [Bankman, ASSCC2016]

Slide 18

Slide 18 text

Use of Analog Computation (section: Circuit)
• Analog computation achieves high power efficiency
• Issues: Area, accuracy

              Charge domain    Time domain    Digital MAC
Power/Bit     0.3              0.2            1
Resolution    1~3 bit          1 bit          1~64 bit
Area/Bit      21               250            1

Annotation on Area/Bit: High cost MAC array
Annotation on Resolution: 1~3 bit is not enough for anomaly detection!
Figure: anomaly score vs. computational resolution (32FP, 8B, 7B, 6B, 5B, 4B, 3B, 2B, 1B), with the target marked

Slide 19

Slide 19 text

Slide 20

Slide 20 text

PMAC: Phase domain MAC (section: Circuit)
• Target low-area and 8-bit MAC resolution
  – Realize analog computation for wide application with low cost
Time domain approach
  → Accumulates pulse length
  → Multiple DTCs required
Proposed phase domain approach
  → Accumulates phase
  → Only a single DTC + Gated Ring Oscillator (GRO)
  Requires digital cells only; small area and scalable
DTC: Digital-to-time converter
Figure: block diagram with IN, Weight, DTC, Gated Ring Oscillator (GRO), Output

Slide 21

Slide 21 text

PMAC: Phase domain MAC (section: Circuit)
• Target low-area and 8-bit MAC resolution
  – Realize analog computation for wide application with low cost
Proposed phase domain approach
  → Accumulates phase
  → Only a single DTC + Gated Ring Oscillator (GRO)
  Requires digital cells only; small area and scalable

                  Phase Domain    Digital MAC
Resolution        8 bit           1~64 bit
Norm. Area/Bit    1.2             1
Norm. Power       0.125           1

Figure: block diagram with IN, Weight, DTC, Gated Ring Oscillator (GRO), Output

Slide 22

Slide 22 text

PhaseMAC: Operation (section: Circuit)
1. DTC outputs a pulse corresponding to Din (pulse width = Din * tinv)
Figure: DTC (input Din) drives the Gated Ring Oscillator (GRO, input W), followed by a counter. Timing example: Seq. 1, with Din values "3", "24" and W values "1", "0.5"; GRO phase axis up to 2π; phase saved by gating; counter values "0", "1"

Slide 23

Slide 23 text

PhaseMAC: Operation (section: Circuit)
2. GRO phase advances while the DTC pulse is high
Figure: DTC (input Din, pulse width = Din * tinv) drives the Gated Ring Oscillator (GRO, input W), followed by a counter. Timing example: Seq. 1 and Seq. 2, with Din values "3", "24" and W values "1", "0.5"; GRO phase axis up to 2π; counter values "0", "1"

Slide 24

Slide 24 text

PhaseMAC: Operation (section: Circuit)
2. GRO phase advances while the DTC pulse is high
3. Phase is saved by gating → Accumulation realized
$\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$
Figure: DTC (input Din, pulse width = Din * tinv) drives the Gated Ring Oscillator (GRO, input W), followed by a counter. Timing example: Seq. 1 and Seq. 2, with Din values "3", "24" and W values "1", "0.5"; GRO phase axis up to 2π; phase saved by gating; counter values "0", "1"

Slide 25

Slide 25 text

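PhaseMAC: Operation (section: Circuit)
Steps 1 to 3 are repeated for the number of MACs. When the phase reaches 2π, this is detected by the counter.
$\text{Phase} = \text{Prev. Phase} + D_{in} \cdot W \cdot \frac{2\pi}{10}$
Figure: DTC (input Din, pulse width = Din * tinv) drives the Gated Ring Oscillator (GRO, input W), followed by a counter. Timing example: Seq. 1 and Seq. 2, with Din values "3", "24" and W values "1", "0.5"; GRO phase axis up to 2π; phase saved by gating; counter values "0", "1"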

Slide 26

Slide 26 text

PhaseMAC: The Operation (section: Circuit)
During readout, the GRO phase and the counter value are summed with proper weightings.
Figure: readout logic. Counter (MSB) x10 + GRO phase converted to digital (LSB); OUT = 15 for the timing example (Seq. 1 and Seq. 2, with Din values "3", "24", W values "1", "0.5", counter values "0", "1")
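The accumulate-and-readout sequence can be mimicked with a small behavioral model. This is a sketch under stated assumptions, not the circuit itself: one phase step equals 2π/10 (taken from the phase equation and the x10 counter weight on the slides), the phase-to-digital conversion truncates to whole steps, and only non-negative weights are modeled. The class and method names are hypothetical.

```python
from fractions import Fraction

# Assumed number of phase steps per 2*pi, from the "2*pi/10" step size
# and the "x10" counter weighting shown on the slides.
PHASES_PER_CYCLE = 10


class PhaseMacModel:
    """Behavioral model: phase accumulates Din * W, counter counts wraps."""

    def __init__(self, phases_per_cycle=PHASES_PER_CYCLE):
        self.phases_per_cycle = phases_per_cycle
        self.phase = Fraction(0)  # residual phase, in units of 2*pi/10
        self.counter = 0          # number of completed 2*pi wraps

    def accumulate(self, din, w):
        """Steps 1 to 3: advance the phase by Din * W, then hold it."""
        self.phase += Fraction(din) * Fraction(w)
        wraps = self.phase // self.phases_per_cycle
        self.phase = self.phase % self.phases_per_cycle
        self.counter += int(wraps)

    def readout(self):
        """Counter (MSB) weighted by 10 plus the truncated phase (LSB)."""
        return self.counter * self.phases_per_cycle + int(self.phase)


mac = PhaseMacModel()
mac.accumulate(3, 1)       # Seq. 1 from the slides
mac.accumulate(24, 0.5)    # Seq. 2 from the slides
result = mac.readout()
print(mac.counter, int(mac.phase), result)
```

For the two sequences in the timing example the model wraps once and leaves a residual of five steps, so the readout is 15, matching the OUT value on the slide. Exact rationals are used here only to keep the toy arithmetic free of rounding; the real circuit is limited by oscillator and readout non-idealities.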

Slide 27

Slide 27 text

GRO Circuit Design (section: Circuit)
• Positive and negative accumulators exist.
• Linear frequency tuning is achieved by choosing the number of activated inverters. [A.V. Rylyakov, ISSCC 2007]
Figure: the DTC drives a positive accumulator (LSB GRO controlled by W[3:0], MSB GRO controlled by W[6:4]) and a negative accumulator with the same circuitry
Figure: frequency configurable GRO. Inputs W[3:0], DTCOUT, RSTB; outputs GRO[3:0]

Slide 28

Slide 28 text

GRO Circuit Design (section: Circuit)
Figure: the DTC drives a positive accumulator (LSB GRO controlled by W[3:0], MSB GRO controlled by W[6:4]) and a negative accumulator with the same circuitry
Figure: frequency configurable GRO. Inputs W[3:0], DTCOUT, RSTB; outputs GRO[3:0] [A.V. Rylyakov, ISSCC 2007]
Figure: 4b GRO frequency characteristic, Norm. Frequency vs. W[code] (annotation: Sufficient for 8b resolution)

Slide 29

Slide 29 text

Fabricated chip (section: Circuit)
Figure: chip photo. GRO MAC core, 35 um x 35 um (blocks: DTC, GRO, readout, asynchronous timing circuit, output circuits, logo)
Figure: measurement setup. FPGA board (ARTY Artix 7) and chip, with signals MAC input, Trigger, MAC output, Finish signal; a PC monitors the outputs

Slide 30

Slide 30 text

Fabricated chip (section: Circuit)
Figure: chip photo. GRO MAC core, 35 um x 35 um (blocks: DTC, GRO, readout, asynchronous timing circuit, output circuits, logo)
Figure: measurement setup. FPGA board (ARTY Artix 7) and chip, with signals MAC input, Trigger, MAC output, Finish signal; a PC monitors the outputs
Figure: MNIST DNN. Input layer 784, H1 256, H2 256, H3 256, H4 256, Output layer 10

MNIST classification results (10000 test data)
                          32FP    8b PMAC
Validation results [%]    98.2    98.1

*98.2% was the limit for fully-connected NNs
Training done in Keras with L2 regularization and dropout

Slide 31

Slide 31 text

Anomaly Detection Results (section: Circuit)
• Results are from artificially generated sine-based data
  – (This is not quantitative. It would be great if we could share the source code and data. I'm working on it.)
Figure: anomaly score for 32FP, 8B, and 3B computation, abnormal vs. normal data, with the target marked
Figure: anomaly score vs. MAC resolution (32FP, 8B, 7B, 6B, 5B, 4B, 3B, 2B, 1B), with the target marked

Slide 32

Slide 32 text

Power Efficiency Characteristics (section: Circuit)
1) Power efficiency is high for inputs with a low mean value, because the oscillator operates less.
2) Measured with data having sparsity = 0%. (Has zero-skipping)
Figure: PMAC Efficiency [TOPS/W] vs. mean value of non-zero input data relative to max. value [%]
Datasets: MNIST (DNN), CIFAR10 (LeNet-like CNN), ImageNet (Inception-v3)
Anomaly detection: 11.6 TOPS/W
*Trained with L2 norm regularization
*Mean value calculated by averaging the test data activations.

Slide 33

Slide 33 text

Comparison Results (section: Circuit)

                             PMAC       Time JSSC 2017    Charge ISSCC 2016
Resolution                   8          1                 3
MAC Area [um2]               1200       13000*            12000
MAC Area/Bit [um2]           150        13000*            4000
MNIST test accuracy          98.1%      98.4%             N.A.
Power [uW]                   152        N.A.              228
MAC rate                     780 MHz    N.A.              1 GHz
Efficiency [TOPS/W]          14         77                8.77
Efficiency [TOPS/W*Bit]      112        77                26.3
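The two per-bit rows of the comparison follow from the other rows by simple normalization. The sketch below recomputes them, assuming Area/Bit is area divided by resolution and TOPS/W*Bit is efficiency multiplied by resolution; the dictionary layout and function name are illustrative only.

```python
# Values copied from the comparison table.
DESIGNS = {
    "PMAC": {"bits": 8, "area_um2": 1200, "tops_per_w": 14},
    "Time JSSC 2017": {"bits": 1, "area_um2": 13000, "tops_per_w": 77},
    "Charge ISSCC 2016": {"bits": 3, "area_um2": 12000, "tops_per_w": 8.77},
}


def per_bit_metrics(design):
    """Return (area per bit in um2, efficiency in TOPS/W*Bit)."""
    area_per_bit = design["area_um2"] / design["bits"]
    eff_per_bit = design["tops_per_w"] * design["bits"]
    return area_per_bit, eff_per_bit


for name, design in DESIGNS.items():
    area_per_bit, eff_per_bit = per_bit_metrics(design)
    print(f"{name}: {area_per_bit:.0f} um2/bit, {eff_per_bit:.1f} TOPS/W*Bit")

# Area-per-bit ratio between the charge-domain design and PMAC.
area_ratio = (per_bit_metrics(DESIGNS["Charge ISSCC 2016"])[0]
              / per_bit_metrics(DESIGNS["PMAC"])[0])
print(f"area/bit ratio, charge vs. PMAC: {area_ratio:.2f}")
```

The recomputed values agree with the table rows. The area-per-bit ratio against the charge-domain design comes out near 26.7, which appears to be the basis of the 26.6x area claim made later in the deck.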

Slide 34

Slide 34 text

Comparison Results (section: Circuit)
Same comparison table as the previous slide (PMAC vs. Time JSSC 2017 vs. Charge ISSCC 2016), with the annotation: 26.6x improvement

Slide 35

Slide 35 text

Comparison Results (section: Circuit)
Same comparison table as the previous slide (PMAC vs. Time JSSC 2017 vs. Charge ISSCC 2016), with the annotation: 48% improved

Slide 36

Slide 36 text

Conclusion
• Clarified that for anomaly detection, computation power can exceed memory power
  – Sufficient input batching must be done, trading with latency
• Proposed PhaseMAC to further scale accelerator power
  – Accumulation is done by a GRO; the area is 26.6x smaller than conventional art
  – Power efficiency is 48% higher than conventional art
  – Demonstrated anomaly detection results for edge computing

Slide 37

Slide 37 text

Acknowledgement
• The authors would like to thank Xuan Yang, Edward Lee, Danny Bankman, Boris Murmann and Mark Horowitz for the valuable discussions.
• The authors would like to thank Daisuke Miyashita and Jun Deguchi for the valuable discussions on time domain computing.