Slide 1

Slide 1 text

2019 DevDay: Face Recognition Check-in Mechanism / Seungyoun Yi / NAVER AI Production Team

Slide 2

Slide 2 text

CONTENTS 1. Faster, Optimized Check-in 2. Face Engine 3. Engine to Product (Face Sign) 4. Review

Slide 3

Slide 3 text

1. Faster, Optimized Check in

Slide 4

Slide 4 text

1.1 Face Recognition (Elevator) Date: November 20-21, Place: Grand Nikko Tokyo. Before: Entrance > Take card out > Tag card > Choose floor > Take card back. After: Entrance > Face recognition > Choose floor. 1. Face recognition replaces authentication 2. 1:1, Verification

Slide 5

Slide 5 text

1.1 Face Recognition (Vending Machine) Before: Choose product > Take cash out > Put cash > Get product > Take change back. After: Face recognition > Choose product > Get product. 1. Face recognition replaces payment 2. 1:N, Recognition

Slide 6

Slide 6 text

1.1 Face Recognition Wrap-Up. Before: take a card or cash out before recognition, take the card or change back afterward, and process with a physical payment method. After: face recognition alone covers the flow. 1. Simplify UX using face recognition 2. Insert a second-verification flow when the result is risky

Slide 7

Slide 7 text

1.2 Face Check-in 1. Long lines of developers waiting to see keynotes 2. Personal information verification by e-mail and name. Check-in system: Online Registration > Line Up > Check Personal Information > Get Goods > Watch Session. Increased bottlenecks

Slide 8

Slide 8 text

1.2 Face Check-in 1. Reduce waiting time with sub-second (< 1 s) recognition speed 2. Remove the process of verifying name or personal information. Check-in system: Online Registration > Line Up > Face Recognition > Get Goods > Watch Session. Reduced bottlenecks

Slide 9

Slide 9 text

1.2 Face Check-in Can AI solve every problem with the model alone?! NEVER, NEVER, NEVER. It takes engineering in the engine and engineering in the service.

Slide 10

Slide 10 text

2. Face Engine

Slide 11

Slide 11 text

Face Recognition Pipeline: Face Feature. 1) Detection 2) Face alignment 3) Compute transform params against the reference coordinate 4) Warp 5) Canonical face 6) Extract feature (feature extraction)

Slide 12

Slide 12 text

Face Recognition Pipeline: Face Identification. Gallery (a set of face features), Probe (a face feature). 1) Nearest-neighbor retrieval 2) Get a shortlist for the probe 3) Recognize identity via simple thresholding. It's a face feature, not a face image!
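The gallery/probe retrieval above can be sketched in a few lines. This is a minimal illustration, not the engine's implementation: the `gallery` dict, cosine similarity, and the `THRESHOLD` value are all assumptions for the example.

```python
import math

THRESHOLD = 0.6  # illustrative acceptance threshold, not the engine's actual value

def cosine(a, b):
    # cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(gallery, probe, top_k=3):
    # 1) nearest-neighbor retrieval: score the probe against every gallery feature
    scored = sorted(((cosine(feat, probe), name) for name, feat in gallery.items()),
                    reverse=True)
    # 2) shortlist the top-k candidates
    shortlist = scored[:top_k]
    # 3) simple thresholding on the best match
    best_score, best_name = shortlist[0]
    return best_name if best_score >= THRESHOLD else None

gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
print(identify(gallery, [0.9, 0.1, 0.0]))  # close to alice
```

Note that only feature vectors are stored and compared, never face images, matching the privacy point on the slide.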

Slide 13

Slide 13 text

Applying Face Recognition to Service. Requirements: support multiplatform (iOS, Android, macOS, Windows, Linux), high accuracy (99%), fast speed (0.1 second). Components: Face Detection, Face Alignment, Face Recognition. Deep learning frameworks: TensorFlow, CoreML, MLKit, iOS Vision Framework.

Slide 14

Slide 14 text

Face Engine. User Applications > Face Engine frontends (C++ frontend, Python frontend, ...) > Functions: Detection, Facial Landmark, Recognition, Face Pose, Smoother, Tracking > Backend framework. For FP32 and INT8 (CPU): passing on Android, iOS, Linux, macOS, and Windows (supported).

Slide 15

Slide 15 text

Face Engine. Input: image. Output: face (people) information. Pipeline: Detector & Tracking and Pose Estimation on a resized low-resolution image; Alignment and Recognition on the high-resolution image; post-processing produces the human info: bounding box, facial landmarks, face feature, Euler angles (X, Y, Z).

Slide 16

Slide 16 text

Sharing Optimization Experience in the Face Engine. A lightweight model for high accuracy and fast speed. Inference engine optimization for more speed. Layer optimization for more, more speed. More, more, more..?

Slide 17

Slide 17 text

2.1 Lightweight Model

Slide 18

Slide 18 text

2.1.1 Lightweight Deep Learning Model: the trade-off between accuracy and speed. https://www.researchgate.net/publication/328017644_Benchmark_Analysis_of_Representative_Deep_Neural_Network

Slide 19

Slide 19 text

2.1.2 Face Detector: CPU real-time vs GPU real-time. (+: backbone network only, ++: backbone + head)

Model | Latency [ms] | Model size [MB] | Input | Device
Pelee [1] | 43.86+ | ~5.00 | 320x320 | iPhone8 (GPU)
Ours | 5.74+ | 0.14++ | 320x320 | iPhone7 (CPU)

[1] Robert J. Wang et al., Pelee: A Real-Time Object Detection System on Mobile Devices, NeurIPS, 2018

Slide 20

Slide 20 text

Our standard is real-time in a mobile CPU environment.

Slide 21

Slide 21 text

2.1.2 Face Detector

Model | mAP+ | Latency [ms] | Model size [MB] | Input | Device (engine)
FaceBoxes [1] | 96.0 | 21.40 | 3.83 | 320x240 | Xeon (pytorch)
Ours | 96.0 | 22.50 | 0.29 | 320x240 | Xeon (pytorch)

Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz. +: Mean average precision measures how accurately objects are detected (higher is better). Equal level of mAP, with more than a 10x reduction in model size. FDDB ROC curve: our model sits in the upper left.

[1] S. Zhang et al., FaceBoxes: A CPU Real-time Face Detector with High Accuracy, IJCB, 2017

Slide 22

Slide 22 text

2.1.3 Facial Landmark Detector. Metric: mean squared error (lower is better). Dataset: 300W-test [6].

Model | MSE (fullset)
DSRN [3] | 5.21
SBR [1] | 4.99
RCN-L+ELT-all [4] | 4.90
PCD-CNN [2] | 4.44
Ours (Teacher) | 3.73
ResNet50+PDB+Wing [5] | 3.60
LAB [0] | 3.49

Ours (Teacher): latency (CPU) 1.2 sec, model size 93.4 MB.

[0] Look at Boundary: A Boundary-Aware Face Alignment Algorithm, CVPR, 2018
[1] Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors, CVPR, 2018
[2] Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment, CVPR, 2018
[3] Direct Shape Regression Networks for End-to-End Face Alignment, CVPR, 2018
[4] Improving Landmark Localization with Semi-Supervised Learning, CVPR, 2018
[5] Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks, CVPR, 2018
[6] https://ibug.doc.ic.ac.uk/resources/300-W/

Slide 23

Slide 23 text

2.1.3 Facial Landmark Detector. Metric: mean squared error (lower is better).

Model | 300W-challenge [NME] | 300W-test fullset [NME] | Latency factor | Model size [MB]
Ours-WM0.125 | 5.08 | 4.53 | 1.0 | 0.4
Ours-WM0.25 | 4.89 | 4.36 | 1.96 | 1.5
Ours-WM0.5 | 4.67 | 4.28 | 5.04 | 6.1
Ours-teacher | 4.34 | 3.73 | 46.57 | 93.4

Some increase in MSE, in exchange for up to a 46x reduction in latency and a 200x reduction in model size.

Slide 24

Slide 24 text

2.1.4 Face Recognizer. Baseline model: MobileFaceNet [1].

Model | mAP | Latency [ms] | Model size [MB]
MobileFaceNet [1] | 99.18 | 11.59 | 4.00
Ours | 99.33 | 11.49 | 3.96

Measured on Qualcomm Snapdragon 820 with 4 threads.
[1] MobileFaceNets: Efficient CNNs for Accurate Real-time Face Verification on Mobile Devices, arXiv preprint 1804.07573, 2018

Slide 25

Slide 25 text

2.1.5 What to Optimize? Per-layer latency breakdown (pie charts): convolution dominates every model. Face Detector: Convolution 85%, with the rest (Interp, TSODDetectionOutput, ReLU, Concat, BinaryOp, Pooling, Softmax, Permute, Flatten, Slice, TSODPriorBox, Eltwise, Reshape, Split) at a few percent each. Facial Landmark Detector: Convolution 92%, the rest (Eltwise, ReLU, Concat, Pooling, DeconvolutionDepthWise, Split) 1-2% each. Face Recognizer: Convolution 79%, ConvolutionDepthWise 14%, with PReLU, Eltwise, Split, and Normalize making up the remainder.

Slide 26

Slide 26 text

2.2 Inference Engine

Slide 27

Slide 27 text

2.2.1 Engine Performance Measurement: convolution vs depthwise separable convolution. 3x3 convolution (input channels 16, output channels 16): # of params = 3 x 3 x 16 x 16. Depthwise separable: 3x3 depthwise conv (16 channels): # of params = 3 x 3 x 16; 1x1 pointwise conv: # of params = 1 x 1 x 16 x 16. Total: (3 x 3 x 16) + (1 x 1 x 16 x 16).

Slide 28

Slide 28 text

2.2.1 Engine Performance Measurement: convolution vs depthwise separable convolution. Image: 10 x 10 x 3 (input channels 3, output channels 3). Standard convolution [3 x 3 x 3 x 3]: total parameters 81 (3x3x3x3), total multiplications 24,300 (image elements x params). Depthwise convolution [3 x 3 x 3]: 27 parameters (3x3x3), 8,100 multiplications. Pointwise convolution [1 x 1 x 3 x 3]: 9 parameters (1x1x3x3), 2,700 multiplications. Depthwise separable total: 36 parameters, 10,800 multiplications. In general: 3 x 3 x c x c > 3 x 3 x c + (c x c).
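The counts above can be reproduced with a small helper. This sketch follows the slide's own convention of counting multiplications as (number of input image elements) x (parameters); the function name is illustrative.

```python
def conv_costs(h, w, c_in, c_out, k=3):
    """Parameter and multiplication counts for standard vs depthwise separable
    convolution, using the slide's 'image elements x parameters' convention."""
    elems = h * w * c_in                      # 10 x 10 x 3 = 300 in the example
    std_params = k * k * c_in * c_out         # standard convolution
    dw_params = k * k * c_in                  # depthwise part
    pw_params = 1 * 1 * c_in * c_out          # pointwise part
    return {
        "standard": (std_params, elems * std_params),
        "separable": (dw_params + pw_params,
                      elems * dw_params + elems * pw_params),
    }

costs = conv_costs(10, 10, 3, 3)
print(costs["standard"])   # (81, 24300)
print(costs["separable"])  # (36, 10800)
```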

Slide 29

Slide 29 text

2.2.2 Performance Comparison. Is convolution faster than depthwise separable convolution..? In PyTorch and TensorFlow 2.0 benchmarks (y: latency, log scale; lower is faster), convolution actually is faster than depthwise separable convolution.

Slide 30

Slide 30 text

Fewer parameters + less computation ≠ faster speed

Slide 31

Slide 31 text

2.2.2 Performance Comparison. 3x3 convolution vs 3x3 depthwise convolution in our engine: latency (log scale, lower is faster), swept over width/height 4-256 and channels 4-64 on iPhone 7, i7-4770hq, and Galaxy s7 (1 thread). In the engine, depthwise convolution is faster for all but the smallest inputs, and the gap grows with input size. At 256x256x64: iPhone 7, 197.65 ms (Conv) vs 21.29 ms (DWC); i7-4770hq, 718.18 ms vs 20.70 ms; Galaxy s7, 722.38 ms vs 56.07 ms.

Slide 32

Slide 32 text

2.3 Layer Optimization

Slide 33

Slide 33 text

2.3 Three Ways To Implement Convolution 1. For Loop (Direct Convolution) 2. Matrix Multiplication (GEMM) 3. Winograd Convolution

Slide 34

Slide 34 text

2.3.1 For Loop (Direct Convolution). Simply implemented: overlapping nested for-loops (vectorized with AVX/NEON).

for d := 1 to D
  for c := 1 to C
    for w := 1 to W
      for h := 1 to H
        for m := 1 to K*K
          ...
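A minimal single-channel version of those nested loops (valid padding, stride 1); the real engine adds channel and filter loops plus SIMD, so this is only a sketch of the control flow.

```python
def direct_conv2d(image, kernel):
    """Naive direct convolution (valid padding, stride 1) over one channel:
    just the nested for-loops, without the AVX/NEON vectorization."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            acc = 0.0
            for m in range(K):
                for n in range(K):
                    acc += image[i + m][j + n] * kernel[m][n]
            out[i][j] = acc
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]  # toy 2x2 kernel
print(direct_conv2d(image, kernel))  # [[6.0, 8.0], [12.0, 14.0]]
```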

Slide 35

Slide 35 text

2.3.2 GEMM-Based Convolution: conventional implementation vs optimized implementation. Conventional (im2col): reshape the K x K x C convolution weights (D filters) into a D x (K^2 C) matrix, expand the H x W feature map with im2col into a (K^2 C) x N matrix, where N ≈ (H x W) / stride, then do a single matrix multiply, accelerated by an optimized GEMM. # of multiplications: D x (K^2 C) x N. Memory usage increases. Optimized (im2row, BLAS free): cut the (K^2 C) row into the vector size of the CPU and multiply-and-add directly, repeated (D x N) times. Same # of multiplications, D x (K^2 C) x N, without materializing the full im2col buffer.
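The im2col idea can be sketched for a single filter; this also makes the memory trade-off visible, since the `cols` buffer holds K*K copies of each pixel. A toy illustration, not the engine's implementation:

```python
def im2col_conv2d(image, kernel):
    """GEMM-style convolution: unroll K x K patches into rows (im2col),
    then compute the output as one matrix product (here D=1 filter,
    so the 'GEMM' degenerates to one dot product per row)."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    oh, ow = H - K + 1, W - K + 1
    # im2col: each output position becomes one row of K*K values
    cols = [[image[i + m][j + n] for m in range(K) for n in range(K)]
            for i in range(oh) for j in range(ow)]
    flat_k = [kernel[m][n] for m in range(K) for n in range(K)]
    flat_out = [sum(a * b for a, b in zip(row, flat_k)) for row in cols]
    return [flat_out[r * ow:(r + 1) * ow] for r in range(oh)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(im2col_conv2d(image, kernel))  # [[6, 8], [12, 14]]
```

It produces the same result as the direct loop, only with the arithmetic rearranged into a shape that an optimized GEMM can chew through.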

Slide 36

Slide 36 text

2.3.3 Winograd Convolution. Baseline # of multiplications: D x (K^2 C) x N; with Winograd, the # of multiplications decreases. Reference: Song Han, cs231n lecture note, Stanford University.
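The savings are easiest to see in the 1-D F(2,3) case: two outputs of a 3-tap filter from a 4-sample tile with 4 multiplications instead of the direct 6 (the 2-D F(2x2,3x3) variant nests this to get 16 instead of 36). A sketch of the standard transform:

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): two outputs of a 3-tap filter g over a
    4-sample tile d, using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    # reference: plain sliding dot product (6 multiplications)
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
print(winograd_f23(d, g), direct_f23(d, g))  # both [-0.5, 0.0]
```

In practice the filter transform (the divisions by 2) is precomputed once per weight, so only the 4 multiplications remain on the hot path.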

Slide 37

Slide 37 text

2.3.4 Compare: Direct / GEMM / Winograd convolution. Which implementation should we use for each layer?

Slide 38

Slide 38 text

2.3.4 Compare. 3x3 convolution: direct conv vs Winograd conv. Latency (ms, log scale) by input (width, height, channel) on iPhone 7 and Galaxy s7, # of threads = 1, swept over width/height 4-118 and channels 16-128. Conclusion: when [channel > 16] and [16 < width, height < 128], Winograd convolution is faster than direct 3x3 convolution. Representative numbers (iPhone 7): 32x32x128: 9.174 ms (direct) vs 2.240 ms (Winograd); at small inputs such as 4x4x16, direct wins (0.002 vs 0.025 ms).

Slide 39

Slide 39 text

2.3.4 Compare. 3x3 convolution: direct conv vs GEMM-based conv. Latency (ms, log scale) by input (width, height, channel) on i7-4770hq, swept over width/height 4-256 and channels 4-64. Conclusion: when [width, height > 8], GEMM-based convolution is faster than direct convolution. E.g. 256x256x64: 718.178 ms (direct) vs 77.883 ms (GEMM).

Slide 40

Slide 40 text

2.3.4 Compare. 1x1 convolution: direct conv vs GEMM-based conv. Latency (ms, log scale), # of threads = 1, on iPhone 7, i7-4770hq (desktop), and Galaxy s7, swept over width/height 4-256 and channels 64-512. Conclusion: when [channel > 64], GEMM-based convolution is faster than direct convolution. E.g. 256x256x512 on iPhone 7: 4381.6 ms (direct) vs 1539.7 ms (GEMM).

Slide 41

Slide 41 text

2.3.4 Compare: our models, before vs after layer optimization. Latency (ms):

Galaxy s7 (up to 1.2x faster): MobileNet_V1 97.767 > 81.869; MobileNet_V2 64.998 > 58.373; Detector 14.981 > 14.876; Alignment 34.346 > 34.336; Recognizer 38.218 > 32.617.
iPhone 7 (up to 1.3x faster): MobileNet_V1 40.262 > 29.215; MobileNet_V2 28.38 > 22.577; Detector 6.086 > 5.618; Alignment 16.482 > 14.647; Recognizer 16.836 > 13.106.
i7-4770hq, without AVX > with AVX (up to 2.7x faster): MobileNet_V1 78.617 > 29.147; MobileNet_V2 45.513 > 26.202; Detector 11.209 > 6.619; Alignment 25.224 > 13.001; Recognizer 25.371 > 12.437.

Slide 42

Slide 42 text

2.3.5 Other: Batch Normalization Folding. Pre-act: BN > ACT > Conv. Post-act: Conv > BN > ACT. In the post-act form, the BN folds into the convolution: (Conv + BN) > ACT.

Measured post-act latency:
# of threads | Post-act (ms)
1 | 104.58
6 | 29.436
* Measured on M

Slide 43

Slide 43 text

2.3.5 Other: Batch Normalization Folding. Post-act: Conv > BN > ACT. BN folding: (Conv + BN) > ACT. Layer fusion: fold the activation in as well, (Conv + Act) as a single layer.
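The folding itself is a small algebraic rewrite: since BN is an affine transform of the conv output, its scale and shift can be baked into the conv's weight and bias ahead of time. A per-scalar sketch (real engines apply this per output channel):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution's weight and bias.
    Shown per scalar weight for clarity."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# conv output y = w*x + b, then BN(y) = gamma*(y-mean)/sqrt(var+eps) + beta
w, b = 2.0, 1.0
gamma, beta, mean, var = 1.5, 0.2, 0.4, 4.0
wf, bf = fold_bn(w, b, gamma, beta, mean, var)

def conv_bn(x):
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + 1e-5) + beta

def folded(x):
    return wf * x + bf

print(abs(conv_bn(3.0) - folded(3.0)) < 1e-9)  # True: same output, one layer fewer
```

At inference time this removes the BN layer entirely, which is where the latency win on the following slide comes from.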

Slide 44

Slide 44 text

2.3.5 Other: our models, Original > BN folding > Layer Fusion. Latency (ms):

iPhone 7: MobileNet_V1 28.592 > 27.508 > 25.508; MobileNet_V2 22.141 > 20.826 > 18.826; Detector 7.1 > 6.069 > 6.069; Alignment 14.822 > 13.99 > 13.99; Recognizer 12.625 > 12.378 > 12.378.
Galaxy s7: MobileNet_V1 83.28 > 81.44 > 81.262; MobileNet_V2 60.664 > 58.217 > 57.091; Detector 16.1 > 15.1 > 15.1; Alignment 36.407 > 36.325 > 36.325; Recognizer 32.583 > 32.299 > 32.299.
i7-4770hq: MobileNet_V1 35.98 > 31.295 > 29.065; MobileNet_V2 30.454 > 27.912 > 23.702; Detector 7.01 > 6.494 > 6.494; Alignment 12.379 > 12.313 > 12.313; Recognizer 14.241 > 11.021 > 11.021.

Up to 1.19x, up to 1.05x, and up to 1.1x faster.

Slide 45

Slide 45 text

2.4 Bonus

Slide 46

Slide 46 text

In the face detector paper, there are hidden layers that are not included in the reported speed measurements.

Slide 47

Slide 47 text

2.4.1 Optimization Not Found in the Paper. Face Detector (post-processing): 1. PriorBox layer 2. Detection output layer. https://arxiv.org/pdf/1705.02950.pdf

Slide 48

Slide 48 text

2.4.1 Optimization Not Found in the Paper

Accuracy:
NMS_TOP_K | TOP_K | Confidence | Accuracy drop [%]
5000 (base) | 750 | 0.1 | -
2500 | 750 | 0.1 | -
2500 | 750 | 0.05 | 0.0019%
1250 | 750 | 0.05 | 0.0099%
Little difference in accuracy (reasonable).

iPhone 7 (detector):
NMS_TOP_K | TOP_K | Confidence | Threads = 1 [ms]
5000 | 1000 | 0.1 | 41.09
2500 | 750 | 0.1 | 12.49
1250 | 750 | 0.05 | 6.11
41.09 ms > 6.11 ms (up to 6x faster).

Galaxy s7 (detector):
NMS_TOP_K | TOP_K | Confidence | Threads = 1 [ms] | Threads = 2 [ms]
5000 | 1000 | 0.1 | 46.52 | 23.05
2500 | 750 | 0.1 | 24.94 | 12.92
1250 | 750 | 0.05 | 17.25 | 10.33
46.518 ms > 17.253 ms (up to 2.6x faster); 23.052 ms > 10.334 ms (up to 2.2x faster).

i7-4770hq (detector):
NMS_TOP_K | TOP_K | Confidence | Threads = 1 [ms] | Threads = 2 [ms]
5000 | 1000 | 0.1 | 13.49 | 11.69
2500 | 750 | 0.1 | 8.28 | 6.61
1250 | 750 | 0.05 | 6.82 | 4.83
13.49 ms > 6.82 ms (up to 2x faster); 11.69 ms > 4.83 ms (up to 2.4x faster).
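Where those three knobs sit can be sketched in a toy detection-output stage; the IoU threshold of 0.45 is an assumption for illustration, not a value from the talk.

```python
def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def detection_output(dets, conf=0.05, nms_top_k=1250, top_k=750, iou_thresh=0.45):
    """Toy detection-output stage: confidence filter, keep only the
    nms_top_k highest-scoring boxes before NMS, at most top_k after.
    Shrinking nms_top_k is what cuts the post-processing latency."""
    dets = sorted((d for d in dets if d[1] >= conf),
                  key=lambda d: d[1], reverse=True)[:nms_top_k]
    keep = []
    for box, score in dets:
        if all(iou(box, k[0]) < iou_thresh for k in keep):
            keep.append((box, score))
        if len(keep) >= top_k:
            break
    return keep

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
print(detection_output(dets))  # the two overlapping boxes collapse into one
```

NMS cost grows with the square of the candidate count, which is why lowering NMS_TOP_K from 5000 to 1250 pays off so disproportionately.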

Slide 49

Slide 49 text

There is room for speed improvement even at the floating-point level.

Slide 50

Slide 50 text

2.4.2 Floating Point. IEEE Standard 754 floating point numbers: 1-bit sign, 8-bit exponent, 23-bit fraction (significand, mantissa). N = (−1)^s × (1.m) × 2^(e−127)

Slide 51

Slide 51 text

2.4.2 Floating Point. IEEE Standard 754 floating point numbers: 1-bit sign, 8-bit exponent, 23-bit fraction (significand, mantissa). N = (−1)^s × (1.m) × 2^(e−127)
https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html
Some special cases:
- Zero: exponent = 0, significand = 0
- NaN (not a number): exponent = 255, significand ≠ 0 (exponent = 255 with significand = 0 is ±infinity)
- Denormalized (subnormal) numbers: exponent = 0, significand ≠ 0
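The bit layout and the special cases can be inspected directly; a sketch using Python's struct module to split a float32 into its fields:

```python
import struct

def decode_float32(x):
    """Split a float32 into its IEEE 754 sign / exponent / fraction fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0 and fraction == 0:
        kind = "zero"
    elif exponent == 0:
        kind = "subnormal"        # implicit leading 0., fixed exponent -126
    elif exponent == 255:
        kind = "nan" if fraction else "inf"
    else:
        kind = "normal"           # implicit leading 1., exponent e-127
    return sign, exponent, fraction, kind

print(decode_float32(1.0))    # (0, 127, 0, 'normal')
print(decode_float32(1e-45))  # rounds to the smallest positive subnormal
```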

Slide 52

Slide 52 text

2.4.2 Floating Point: Denormalized (Subnormal) Numbers. Looking at the weights of the model, there were a lot of subnormal values.

Slide 53

Slide 53 text

Experiment: are floating-point operations slow on certain devices' chipsets when subnormal values appear? Really. Slow :(

Slide 54

Slide 54 text

2.4.2 Floating Point. These values are all very close to zero anyway. The method we can take is to replace the subnormal numbers with numbers that have a normalized representation.

Slide 55

Slide 55 text

2.4.2 Floating Point. After replacing the values, the accuracy of the model was not affected. However, the latency depended heavily on the device:

Device | Galaxy s7 | Pixel 3 | Galaxy Note 10+ | iPhone 7
Geekbench (single-core) | 341 | 489 | 756 | 744
Model latency [ms] | 159.10 | 29.71 | 57.90 | 13.71
Replace denormals [ms] | 32.97 | 29.71 | 11.73 | 13.71

Galaxy s7: 159.10 ms > 32.97 ms (up to 5x faster). Galaxy Note 10+: 57.90 ms > 11.73 ms (up to 5x faster).
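The replacement pass can be sketched as a flush-to-zero sweep over the weights. This is an illustrative helper; whether the engine flushes to zero or to the smallest normal value is an implementation choice the talk does not specify.

```python
SMALLEST_NORMAL = 2.0 ** -126  # smallest positive normal float32

def flush_denormals(weights):
    """Replace weights that would be subnormal as float32 with 0.0, so
    devices whose FPUs handle subnormals in slow microcode stay on the
    fast path. Accuracy is unaffected because these values are ~0 anyway."""
    return [0.0 if w != 0.0 and abs(w) < SMALLEST_NORMAL else w for w in weights]

weights = [0.25, 1e-40, -3e-39, 0.0, -0.5]  # 1e-40 and -3e-39 are float32-subnormal
print(flush_denormals(weights))  # [0.25, 0.0, 0.0, 0.0, -0.5]
```

Run once over the trained weights before shipping the model, so no per-inference cost is added.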

Slide 56

Slide 56 text

3. Engine to Product

Slide 57

Slide 57 text

3. Making AI into a Product? Developer + Planner + Designer = Face Sign Application?

Slide 58

Slide 58 text

3.1 Flowchart (Seungyoun Yi): Detect (face detect) > Landmark (face alignment) > Recognize

Slide 59

Slide 59 text

3.1 Flowchart 1) Register with accurate face data 2) Recognize with minimal reference time 3) Seamless interaction between server and client. Registration and Recognition each run Detect > Landmark > Recognize, split between client and server.

Slide 60

Slide 60 text

3.2 Client. Pipeline: Detect preprocessing > Landmark postprocessing > Network > User recognition (entrance). Detect preprocessing (preprocess, UX design and development): frame processing, face size, device type, device tilt. Landmark postprocessing (postprocess, UX design and development): face alignment, frame processing.

Slide 61

Slide 61 text

3.2.1 Detect Preprocessing: Frame Handling. 1. OpenCV (Laplacian) 2. Gyroscope

bool isBlurryImage(const cv::Mat &grayImage, const int threshold) {
    cv::Mat laplacian;
    cv::Laplacian(grayImage, laplacian, CV_64F);
    cv::Scalar mean;
    cv::Scalar stddev;
    cv::meanStdDev(laplacian, mean, stddev, cv::Mat());
    double variance = pow(stddev.val[0], 2);
    return (variance <= threshold);
}

Slide 62

Slide 62 text

3.2.1 Detect Preprocessing 1. Reduce eye strain 2. Change camera recognition range Preprocess, UX Design and Development

Slide 63

Slide 63 text

3.2.1 Detect Preprocessing 3. Insert Identity 4. Put Face in the Guideline Preprocess, UX Design and Development

Slide 64

Slide 64 text

3.2.2 Landmark Postprocessing: Face Size. 1. Set a minimum face size for the engine 2. Limit faces that are too small 3. Limit faces that are too big

Slide 65

Slide 65 text

3.2.2 Landmark Postprocessing: Face Size. 4. Limit faces cut off by the frame 5. Optional limit when multiple people are recognized

Slide 66

Slide 66 text

3.2.2 Landmark Postprocessing: Face Status. 1. Call the API only when movement is small 2. Set the face recognition location

Slide 67

Slide 67 text

3.2.2 Landmark Postprocessing: Face Status. 3. Face Alignment 4. Euler Angle

# compute the center of mass for each eye
leftEyeCenter = leftEyePts.mean(axis=0).astype("int")
rightEyeCenter = rightEyePts.mean(axis=0).astype("int")

# compute the angle between the eye centroids
dY = rightEyeCenter[1] - leftEyeCenter[1]
dX = rightEyeCenter[0] - leftEyeCenter[0]
angle = np.degrees(np.arctan2(dY, dX)) - 180

https://www.pyimagesearch.com/2017/05/22/face-alignment-with-opencv-and-python/

Slide 68

Slide 68 text

3.2.2 Landmark Postprocessing 5. Eye Blink 6. Wink Face Status ▲ ▲ ▲ https://www.pyimagesearch.com/2017/04/24/eye-blink-detection-opencv-python-dlib/

Slide 69

Slide 69 text

3.2.3 Speed Optimization: Network. 1. Consideration for encryption: encrypt the recognized face, the features, the network traffic, and the result 2. Find the optimal result across multi-requests. Decision points: Is the result more accurate than the threshold value? Is the detected result encrypted? Is the recognized person the same?

Slide 70

Slide 70 text

3.2.3 Speed Optimization: Network. 3. Inference size optimization: camera image (3 MB) > engine recognition image (100 KB) > server transport binary (30 KB)

Slide 71

Slide 71 text

3.2.3 Speed Optimization: Network. 4. Network size optimization (change data type): send [UInt8] binary. 15 network API test runs: 4.18 ms > 2.60 ms (up to 1.6x faster); 4.02 ms > 2.41 ms (up to 1.6x faster)
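The data-type change can be sketched as a min/max affine quantization of the float features into [UInt8], a 4x smaller payload than float32. The scheme below is illustrative; Face Sign's actual wire format is not specified in the talk.

```python
def quantize_uint8(features):
    """Quantize float features to uint8 bytes for transport, keeping the
    (lo, scale) pair so the server can dequantize. Min/max affine scheme,
    shown only as an illustration of the size/precision trade-off."""
    lo, hi = min(features), max(features)
    scale = (hi - lo) / 255.0 or 1.0   # avoid div-by-zero on constant input
    q = bytes(round((f - lo) / scale) for f in features)
    return q, lo, scale

def dequantize_uint8(q, lo, scale):
    return [lo + b * scale for b in q]

feats = [0.0, 0.5, 1.0, -1.0]
q, lo, scale = quantize_uint8(feats)
restored = dequantize_uint8(q, lo, scale)
print(len(q), max(abs(a - b) for a, b in zip(feats, restored)))
```

The round-trip error is bounded by the quantization step, which matters little for a thresholded similarity comparison while cutting the payload to a quarter.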

Slide 72

Slide 72 text

3.2.3 Speed Optimization: Network. 4. Network size optimization (gzip). 15 network API test runs, before > after: 2.60 ms > 1.85 ms (up to 1.7x faster); 2.41 ms > 1.75 ms (up to 1.7x faster)
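The two payload optimizations can be compared side by side on a toy payload. The 128-dimensional feature size and the JSON baseline are assumptions for illustration only:

```python
import gzip
import json
import struct

# a hypothetical 128-dim uint8 feature vector
feature = bytes(range(128))

json_payload = json.dumps({"feature": list(feature)}).encode()  # numbers as text
binary_payload = struct.pack("<128B", *feature)                 # raw [UInt8] binary
gz_payload = gzip.compress(json_payload)                        # gzip on the wire

# binary is the smallest; gzip also shrinks the text form considerably
print(len(json_payload), len(binary_payload), len(gz_payload))
```

In the service both tricks stack: the body is sent as compact binary, and the HTTP layer applies gzip on top of whatever text remains (headers, metadata).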

Slide 73

Slide 73 text

3.2.4 Landmark Postprocessing 1. Face Size 2. Face Status Postprocess, UX Design, and Development

Slide 74

Slide 74 text

3.2.4 Landmark Postprocessing 3. Correct UX flow based on accuracy. Postprocess, UX Design, and Development

Slide 75

Slide 75 text

3.2.5 Customization Finding Optimal Value in Application 1. Face Related Setting 2. Network Related Setting 3. Process Related Setting

Slide 76

Slide 76 text

4. Review

Slide 77

Slide 77 text

4.1 Engine Optimization. Lightweight model: knowledge distillation, neural architecture search, latency-aware design, low-resolution images. Engine optimization (to shave off even 1 ms): layer fusion, parallel framework, SIMD (AVX/NEON), memory reuse. Layer optimization, hidden bottleneck optimization, replacing subnormal numbers.

Slide 78

Slide 78 text

4.2 Service Optimization. Detect preprocessing (preprocess, UX design and development): frame processing, face size, device tilt, device type. Landmark postprocessing (postprocess, UX design and development): face alignment, frame processing. Network.

Slide 79

Slide 79 text

4.3 Mobile AI in Production Field Test 1) TechTalk 2) Internal Cafe 3) Balance Festival 4) Practice, DEVIEW 2019 5) Practice, LINE DEV DAY 2019

Slide 80

Slide 80 text

4.3 Mobile AI in Production Flows

Slide 81

Slide 81 text

4.3 Mobile AI in Production Network

Slide 82

Slide 82 text

4.4 Goal of AI Engine Development. User Convenience + Service Development = Wow

Slide 83

Slide 83 text

Thank You