Slide 5 — Motivation

• Normal/anomaly edge computing use case — energy per report:
  – Sending 400 raw samples to the cloud: 4 mJ
  – Sending just the anomaly detection result: 1 µJ → orders of magnitude lower
• Battery operation pros/cons:
  ☺ Simplified wiring
  ☺ Cheaper sensor installation
  – Limited operation time → low-power operation demanded
Slide 6 — Algorithm

• Gathering anomaly data is hard!
• There can be multiple fault patterns
  – Coping with all fault patterns is costly
• A classifier model would have to output: Normal: xx%, Anomaly1: yy%, Anomaly2: zz%, …
Slide 7 — Algorithm

[S. Maya et al., "dLSTM: A New Approach for Anomaly Detection Using Deep Learning with Delayed Prediction," KDD 2017 Workshop]

• A deep-learning generative model takes the sensor data and predicts the next 400 samples
• Construct the model based on normal data only
• The model predicts the next 400 data points:
  – Normal state: small prediction error
  – Anomalous state: large prediction error
Slide 8 — Algorithm

• Anomaly score = prediction error:
  Anomaly score[n] = (Meas. Input[n] − Pred.[n])²
• Plotted over time steps, the score stays low during normal operation and rises once operation becomes abnormal
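As a sketch, the scoring rule above fits in a few lines of NumPy (the waveform and the injected fault below are illustrative, not data from the talk):

```python
import numpy as np

def anomaly_score(measured, predicted):
    """Per-sample squared prediction error: (measured[n] - predicted[n])**2."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return (measured - predicted) ** 2

# One 400-sample window, as in the talk's setup.
t = np.arange(400)
predicted = np.sin(2 * np.pi * t / 100)   # the model's prediction of the window
normal = predicted.copy()                 # normal state: matches the prediction
abnormal = predicted + 0.5 * (t >= 200)   # fault: step offset after sample 200

normal_score = anomaly_score(normal, predicted)
abnormal_score = anomaly_score(abnormal, predicted)
```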
Slide 9 — Algorithm

• Use an autoencoder to generate the prediction
• Model: 8-layer fully-connected NN (Input 400, H1 252, H2 152, H3 152, H4 52, H5 152, H6 252, H7 352, Output 400)
• Model comparison:
  – LSTM: training takes hours; ☺ remembers previous sequences
  – Fully-connected (FC): ☺ trains in ~1 min.; captures only the current sequence
Slide 10 — Algorithm

• Operations in industrial machines are periodic, so the FC model's weaker representation suffices → utilized in this work
Slide 11 — Hardware

• Network: 8-layer FC NN — Input 400, H1 252, H2 152, H3 152, H4 52, H5 152, H6 252, H7 352, Output 400
• Virtually design a deep-learning accelerator (DLA) for such algorithms to clarify:
  – Where is the bottleneck?
  – What can we do about it?
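A minimal NumPy sketch of a forward pass through this topology, assuming the hidden-layer ordering read off the slide and ReLU activations (the talk does not state the activation function; the weights here are random placeholders, not the trained model):

```python
import numpy as np

# Layer widths as read off the slide; the exact ordering of the hidden
# layers is an assumption from the figure.
widths = [400, 252, 152, 152, 52, 152, 252, 352, 400]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out)) * 0.01
           for n_in, n_out in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n_out) for n_out in widths[1:]]

def predict(window):
    """Forward pass: map the current 400-sample window to the predicted next window."""
    h = np.asarray(window, dtype=float)
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:   # ReLU on hidden layers, linear output (assumed)
            h = np.maximum(h, 0.0)
    return h

out = predict(np.zeros(400))
```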
Hardware — Design Consideration [X. Yang, arXiv:1606.04209]

• Virtual DLA: PE array (each PE contains a MAC and a register) with L1 and L2 buffers; the input matrix is processed with batch size N
• L2 is sized to store all of the weights on-chip (no DRAM)
• L1 and register capacity and the scheduling are optimized with the framework at https://github.com/xuanyoya/CNN-blocking
• Normalized access energies: MAC = 1, 256 B RF = 1, 128 kB SRAM = 6
Hardware — Results without batching

• Completely memory dominant: memory energy is ~10× larger than computation
• Without batching there is no data reuse → must trade with latency
• Anomaly model size: 1,700 inputs, 1,700 outputs, 180,800 weights
• (Plot: memory/compute energy ratio vs. total parameters for the anomaly model, ResNet, and FC layers)
Slide 14 — Hardware

• Data reuse is achieved with batching, trading with latency
  – At batch size 64, computation energy is ~3× larger than memory energy, while memory increased by only 5%
• Now, can we reduce the computation? Improving computation energy by 8× would reduce DLA power by 66%
• (Plot: memory-to-computation energy ratio vs. batch size)
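The batching argument can be captured in a first-order model using the normalized energies from the design-consideration slide (this toy model counts only weight fetches from SRAM and MAC operations, so it reproduces the trend rather than the measured ratios):

```python
E_MAC, E_SRAM = 1.0, 6.0   # normalized energies: one MAC op vs. one 128 kB SRAM access

def mem_compute_ratio(batch):
    """Each weight is fetched from SRAM once and reused for `batch` inputs."""
    compute = batch * E_MAC   # per weight: one MAC per batched input
    memory = E_SRAM           # per weight: one SRAM fetch, amortized by reuse
    return memory / compute

ratio_1 = mem_compute_ratio(1)    # no batching: memory dominates
ratio_64 = mem_compute_ratio(64)  # batching: compute dominates
```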
Slide 15 — Circuit

Can we reduce computation energy 10×?
• >95% of the computation done in a DNN is multiply-and-accumulate (MAC): Σ W[n]·IN[n]
• Digital MACs are already highly optimized, so little further improvement can be expected from digital design alone
Slide 16 — Circuit

• Analog computation achieves higher power efficiency than digital
  – Can further scale accelerator power
• Prior art: time-based computation and charge-based computation [Lee, ISSCC 2016; Bankman, A-SSCC 2016; Miyashita, JSSC 2014, 2017]
Slide 18 — Circuit

• Target: low area and 8-bit MAC resolution
  – Realize analog computation for wide applications at low cost
• Power breakdown of a digital MAC (~1 pJ/op): multiplier 40%, adder 10%, registers 50%
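A back-of-envelope estimate built from the slide's numbers (~1 pJ/op and the 40/10/50% breakdown); the assumption that one inference costs one MAC per weight, with memory traffic ignored, is mine:

```python
# Energy to run the anomaly model with digital MACs at ~1 pJ/op (slide figure).
PJ_PER_MAC = 1.0
n_weights = 180800                       # parameter count from the earlier slide
mac_energy_pj = n_weights * PJ_PER_MAC   # assumption: one MAC per weight per inference

# Split by the slide's digital-MAC power breakdown.
mult_pj = 0.40 * mac_energy_pj           # multiplier
add_pj = 0.10 * mac_energy_pj            # adder
reg_pj = 0.50 * mac_energy_pj            # registers dominate the digital MAC
```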
Slide 19 — Circuit

• Time-domain approach: accumulates pulse length → multiple DTCs (digital-to-time converters) required
• Proposed phase-domain approach: accumulates phase → only a single DTC plus a gated ring oscillator (GRO)
  – Requires only digital cells: small area and scalable
Slide 20 — Circuit

• Phase-domain MAC vs. digital MAC:
  – Resolution: 8 bit vs. 1–64 bit
  – Normalized area per bit: 1.2 vs. 1
  – Normalized power: 0.125 vs. 1
Circuit — PhaseMAC: Operation
(Running example: Seq. 1 with Din = 3, W = 1; Seq. 2 with Din = 24, W = 0.5.)

1. The DTC outputs a pulse whose width corresponds to Din (pulse width = Din × t_inv).
Slide 22 — Circuit

2. The GRO phase advances while the DTC pulse is high.
3. When the pulse ends, gating the oscillator saves the accumulated phase.
Slide 24 — Circuit

Steps 1–3 are repeated for the number of MACs; each MAC advances the phase:
  Phase = Prev. Phase + Din · W · (2π / 10)
When the phase reaches 2π, the wrap is detected by the counter (in the running example, the counter steps from "0" to "1").
Slide 25 — Circuit

4. During readout, the GRO phase (LSB, via phase-to-digital conversion) and the counter value (MSB, weighted ×10) are summed with the proper weightings by the readout logic. In the running example, counter = 1 and residual phase = 5 steps, so OUT = 1 × 10 + 5 = 15.
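Putting steps 1–4 together, a behavioral (not circuit-level) model of the phase-domain MAC might look like this, with the phase quantized to 2π/10 steps as in the slide's example:

```python
TWO_PI_STEPS = 10   # phase quantized to 2*pi/10 per step, as in the slide's example

def phase_mac(pairs):
    """Behavioral sketch of the phase-domain MAC:
    each (din, w) pair advances the ring-oscillator phase by din*w steps;
    the counter records full 2*pi wraps (MSB), the residual phase is the LSB."""
    phase = 0.0
    counter = 0
    for din, w in pairs:
        phase += din * w
        while phase >= TWO_PI_STEPS:   # phase passes 2*pi -> counter ticks
            phase -= TWO_PI_STEPS
            counter += 1
    # Readout: counter and residual phase summed with proper weighting.
    return counter * TWO_PI_STEPS + phase

# The two sequences from the slide: (Din=3, W=1) then (Din=24, W=0.5).
out = phase_mac([(3, 1.0), (24, 0.5)])   # 3*1 + 24*0.5
```

Because only the wrap count and the residual phase are stored, a single DTC/GRO pair can accumulate arbitrarily many products.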
Circuit — Test chip and measurement setup

• MAC core (35 µm × 35 µm): DTC, GRO, readout, and asynchronous timing circuit, plus output circuits
• Measurement setup: an ARTY Artix-7 FPGA board supplies the MAC inputs and trigger and collects the MAC output and finish signal; a PC monitors the outputs
Circuit — MNIST validation

• MNIST DNN: input 784 — H1 256 — H2 256 — H3 256 — H4 256 — output 10
• Training done with Keras, using L2 regularization and dropout
• Classification results on the 10,000 test images: 32-bit FP 98.2%, 8-bit PMAC 98.1%
  – 98.2% was the limit for fully-connected NNs
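The 8-bit result suggests simple post-training quantization suffices for this network. As one plausible scheme (the talk does not specify its quantizer), a generic uniform symmetric quantizer looks like:

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric post-training quantization to `bits` bits
    (a generic scheme, assumed here; not necessarily the talk's exact method)."""
    x = np.asarray(x, dtype=float)
    levels = 2 ** (bits - 1) - 1          # e.g. 127 levels per sign for 8 bits
    scale = np.max(np.abs(x)) / levels
    if scale == 0:
        return x
    return np.round(x / scale) * scale    # snap each value to the nearest level

w = np.linspace(-1.0, 1.0, 1001)          # stand-in weight distribution
w8 = quantize(w, 8)
max_err = np.max(np.abs(w - w8))          # bounded by half a quantization step
```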
Slide 30 — Circuit: anomaly detection results

• Results are from artificially generated sin-based data
  – (This is not quantitative; it would be great if we could share the source code and data. I'm working on it.)
• (Plots: anomaly scores for normal vs. abnormal data at 32FP, 8-bit, and 3-bit MAC resolution, and anomaly score vs. MAC resolution from 32FP down to 1 bit, against the target threshold)
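In the same spirit, a purely illustrative sin-based generator (the talk's actual data and generator are not public) shows how such windows could be scored:

```python
import numpy as np

def make_windows(n_windows, anomaly=False, seed=0):
    """Illustrative sin-based 400-sample windows; not the talk's dataset."""
    rng = np.random.default_rng(seed)
    t = np.arange(400)
    base = np.sin(2 * np.pi * t / 80)
    windows = []
    for _ in range(n_windows):
        x = base + 0.05 * rng.standard_normal(400)    # normal: sine + small noise
        if anomaly:
            x += 0.8 * np.sin(2 * np.pi * t / 13)     # fault: extra vibration mode
        windows.append(x)
    return np.stack(windows)

normal = make_windows(8)
abnormal = make_windows(8, anomaly=True, seed=1)

# Score against the clean sine as a stand-in for the trained model's prediction.
pred = np.sin(2 * np.pi * np.arange(400) / 80)
normal_score = np.mean((normal - pred) ** 2, axis=1)
abnormal_score = np.mean((abnormal - pred) ** 2, axis=1)
```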
Slide 31 — Circuit: measured efficiency

1) High power efficiency for input data with a low mean value: the oscillator operates less.
2) Measured with data having sparsity = 0% (the design has zero-skipping).

• Anomaly detection: 11.6 TOPS/W
• Datasets: MNIST (DNN), CIFAR-10 (LeNet-like CNN), ImageNet (Inception-v3), trained with L2 regularization; mean values calculated by averaging the test-data activations
• (Plot: PMAC efficiency [TOPS/W] vs. mean value of non-zero input data relative to the maximum value [%])
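The x-axis metric of the efficiency plot, as I read the caption, can be computed like this (the ReLU-like activation distribution is a stand-in, not the measured data):

```python
import numpy as np

def mean_nonzero_pct(acts):
    """Mean of the non-zero activations relative to the maximum, in percent —
    my reading of the plot's x-axis metric."""
    acts = np.abs(np.asarray(acts, dtype=float))
    nz = acts[acts > 0]
    return 100.0 * nz.mean() / acts.max()

# ReLU outputs are sparse and skewed toward small values, which suits an
# oscillator whose activity (and thus power) scales with the input magnitude.
rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal(10000), 0.0)   # ReLU-like activations
pct = mean_nonzero_pct(acts)
```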
Slide 35 — Conclusions

• With sufficient input batching (trading with latency), computation power can overtake memory power for anomaly detection
• Proposed PhaseMAC to further scale accelerator power
  – Accumulation is done by the GRO; the area is 26.6× smaller than conventional art
  – Power efficiency is 48% higher than conventional art
• Demonstrated anomaly detection results for edge computing
Slide 36 — Acknowledgments

• The authors would like to thank Xuan Yang, Edward Lee, Danny Bankman, Boris Murmann, and Mark Horowitz for the valuable discussions.
• The authors would like to thank Daisuke Miyashita and Jun Deguchi for the valuable discussions on time-domain computing.