Face Recognition Check-in Mechanism

Seungyoun Yi
NAVER AI Production Team
https://linedevday.linecorp.com/jp/2019/sessions/D2-5

LINE DevDay 2019

November 21, 2019

Transcript

  1. 2019 DevDay Face Recognition Check-in Mechanism > Seungyoun Yi >

    NAVER AI Production Team
  2. CONTENTS 1. Faster, optimized Check In 2. Face Engine 3.

    Engine to Product (Face Sign) 4. Review
  3. 1. Faster, Optimized Check-in

  4. 1.1 Face Recognition

    Elevator. Before: Entrance → Take card out → Tag card → Choose floor → Take card back. After: Entrance → Face recognition → Choose floor. 1. Face recognition replaces authentication 2. 1:1, Verification
  5. 1.1 Face Recognition

    Vending machine. Before: Choose product → Take cash out → Put cash → Get product → Take change back. After: Face recognition → Choose product → Get product. 1. Face recognition replaces payment 2. 1:N, Recognition
  6. 1.1 Face Recognition

    Wrap-Up: face recognition removes the steps before recognition (take card out) and after recognition (take card or change back) from a process with a physical payment method. 1. Simplify UX using face recognition 2. Insert a second-verification flow when the result is risky
  7. 1.2 Face Check-in

    1. Long lines of developers waiting to see keynotes 2. Personal-information verification process by e-mail and name. Check-in system: Online Registration → Line Up → Check Personal Information → Get Goods → Watch Session. Increased bottlenecks
  8. 1.2 Face Check-in

    1. Reduction of waiting time with less-than-1-second recognition speed 2. Remove the process of verifying name or personal information. Check-in system: Online Registration → Line Up → Face Recognition → Get Goods → Watch Session. Reduced bottlenecks
  9. 1.2 Face Check-in

    Can AI solve all problems using only the model?! NEVER, NEVER, NEVER. It takes engineering in the engine and engineering in the service.
  10. 2. Face Engine

  11. Face Recognition Pipeline

    Face feature extraction: 1) Detection 2) Face alignment 3) Compute transform params against a reference coordinate 4) Warp 5) Canonical face 6) Extract feature
  12. Face Recognition Pipeline

    Face identification: 1) Nearest-neighbor retrieval 2) Get a shortlist for the probe 3) Recognize identity via simple thresholding. Gallery: a set of face features. Probe: a face feature. It's a face feature, not a face image!!
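The retrieve-then-threshold step above can be sketched in a few lines. A minimal illustration, assuming cosine similarity between feature vectors and a hypothetical acceptance threshold of 0.6 (the engine's actual metric and threshold are not given in the talk):

```python
import math

def cosine(a, b):
    # Cosine similarity between two face-feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(probe, gallery, threshold=0.6):
    # 1) nearest-neighbor retrieval over the gallery of stored features
    # 2) accept the best match only if it clears a simple threshold
    best = max(gallery, key=lambda name: cosine(probe, gallery[name]))
    if cosine(probe, gallery[best]) >= threshold:
        return best
    return None  # unknown person
```

Note that both the gallery and the probe hold feature vectors only, matching the slide's point that no face images are stored.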
  13. Applying Face Recognition to Service

    Requirements: support multiplatform (iOS, Android, macOS, Windows, Linux), high accuracy (99%), fast speed (0.1 second). Deep learning frameworks considered: TensorFlow, CoreML, MLKit, iOS Vision Framework. Tasks: face detection, face alignment, face recognition.
  14. Face Engine

    User applications sit on frontends (C++, Python, ...) over a C++ core. Function backends: detection, facial landmark, recognition, face pose, smoother, tracking. Framework: FP32 and INT8 (CPU). Supported and passing on Android, iOS, Linux, macOS, Windows.
  15. Face Engine

    Input: image. Output: face (people) information: bounding box, facial landmarks, face feature, Euler angle (X, Y, Z). Pipeline: low-resolution image → detector & tracking, pose estimation → resize → high-resolution image → alignment, recognition → post-processing.
  16. Sharing Optimization Experience in Face Engine

    A lightweight model for high accuracy and fast speed; inference engine optimization for more speed; layer optimization for more, more speed; and more, more, more..?
  17. 2.1 Lightweight Model

  18. 2.1.1 Lightweight Deep Learning Model

    Trade-off of accuracy and speed. https://www.researchgate.net/publication/328017644_Benchmark_Analysis_of_Representative_Deep_Neural_Network
  19. 2.1.2 Face Detector

    CPU real-time vs GPU real-time. +: backbone network only. ++: backbone + head.

    Model     | Latency [ms] | Model size [MB] | Input   | Device
    Pelee [1] | 43.86+       | ~5.00           | 320x320 | iPhone8 (GPU)
    Ours      | 5.74+        | 0.14++          | 320x320 | iPhone7 (CPU)

    [1] Robert J. Wang et al., Pelee: A Real-Time Object Detection System on Mobile Devices, NeurIPS, 2018
  20. Our standard is real-time in mobile CPU environment.

  21. 2.1.2 Face Detector

    Model         | mAP+ | Latency [ms] | Model size [MB] | Input   | Device (engine)
    FaceBoxes [1] | 96.0 | 21.40        | 3.83            | 320x240 | Xeon (PyTorch)
    Ours          | 96.0 | 22.50        | 0.29            | 320x240 | Xeon (PyTorch)

    Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz. +: mean average precision, a measure of how accurately the position of the object is detected (the higher the better). Equal level of mAP, 10x reduction in model size. FDDB ROC curve: our model sits in the upper left.

    [1] S. Zhang et al., FaceBoxes: A CPU Real-time Face Detector with High Accuracy, IJCB, 2017
  22. 2.1.3 Facial Landmark Detector

    Metric: mean squared error (lower is better). Dataset: 300W-test [6].

    Model                 | MSE (full set)
    DSRN [3]              | 5.21
    SBR [1]               | 4.99
    RCN-L+ELT-all [4]     | 4.90
    PCD-CNN [2]           | 4.44
    Ours (Teacher)        | 3.73
    ResNet50+PDB+Wing [5] | 3.60
    LAB [0]               | 3.49

    Ours (Teacher): latency (CPU) 1.2 sec, model size 93.4 MB.

    [0] Look at Boundary: A Boundary-Aware Face Alignment Algorithm, CVPR, 2018
    [1] Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors, CVPR, 2018
    [2] Disentangling 3D Pose in a Dendritic CNN for Unconstrained 2D Face Alignment, CVPR, 2018
    [3] Direct Shape Regression Networks for End-to-End Face Alignment, CVPR, 2018
    [4] Improving Landmark Localization with Semi-Supervised Learning, CVPR, 2018
    [5] Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks, CVPR, 2018
    [6] https://ibug.doc.ic.ac.uk/resources/300-W/
  23. 2.1.3 Facial Landmark Detector

    Metric: mean squared error (lower is better).

    Model                   | Ours-WM0.125 | Ours-WM0.25 | Ours-WM0.5 | Ours-teacher
    300W-Challenge [NME]    | 5.08         | 4.89        | 4.67       | 4.34
    300W-test-fullset [NME] | 4.53         | 4.36        | 4.28       | 3.73
    Latency factor          | 1.0          | 1.96        | 5.04       | 46.57
    Model size [MB]         | 0.4          | 1.5         | 6.1        | 93.4

    Some increase in MSE; 46x reduction in latency; 200x reduction in model size.
  24. 2.1.4 Face Recognizer

    Baseline model: MobileFaceNet [1]. Measured on Qualcomm Snapdragon 820 with 4 threads.

    Model             | mAP   | Latency [ms] | Model size [MB]
    MobileFaceNet [1] | 99.18 | 11.59        | 4.00
    Ours              | 99.33 | 11.49        | 3.96

    [1] MobileFaceNets: Efficient CNNs for Accurate Real-time Face Verification on Mobile Devices, arXiv preprint 1804.07573, 2018
  25. 2.1.5 What to Optimize?

    Per-layer runtime share. Face Detector: Convolution 85%, Pooling 7%, with small shares for TSODDetectionOutput, ReLU, Concat, BinaryOp, Interp, Softmax, Permute, Flatten, Slice, TSODPriorBox, Eltwise, Reshape, Split. Facial Landmark Detector: Convolution 92%, plus Eltwise, ReLU, Concat, Pooling, DeconvolutionDepthWise, Split. Face Recognizer: Convolution 79%, ConvolutionDepthWise 14%, PReLU 6%, plus Eltwise, Split, Normalize. Convolution dominates in every model.
  26. 2.2 Inference Engine

  27. 2.2.1 Engine Performance Measurement

    Convolution vs depthwise separable convolution. 3x3 convolution (input channels 16, output channels 16): # of params = 3 x 3 x 16 x 16. Depthwise separable: 3x3 depthwise conv (16 channels), # of params = 3 x 3 x 16; 1x1 pointwise conv, # of params = 1 x 1 x 16 x 16. Total: (3 x 3 x 16) + (1 x 1 x 16 x 16).
  28. 2.2.1 Engine Performance Measurement

    Convolution vs depthwise separable convolution on a 10 x 10 x 3 image (input channels 3, output channels 3). Convolution [3 x 3 x 3 x 3]: total parameters 81, total multiplications 24300 (image elements x total params). Depthwise convolution [3 x 3 x 3]: parameters 27, multiplications 8100. Pointwise convolution [1 x 1 x 3 x 3]: parameters 9, multiplications 2700. Separable total: 36 parameters, 10800 multiplications. 3 x 3 x c x c > 3 x 3 x c + (c x c).
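The arithmetic on this slide can be reproduced directly. A small sketch using the slide's simplified "image elements x params" multiplication count (a real count would use the output spatial size):

```python
def conv_costs(h, w, c_in, c_out, k=3):
    # Parameter and multiplication counts: standard k x k convolution
    # vs depthwise-separable (k x k depthwise + 1 x 1 pointwise)
    image = h * w * c_in                # image elements, as on the slide
    std = k * k * c_in * c_out          # e.g. 3 x 3 x 3 x 3 = 81
    dw = k * k * c_in                   # one k x k filter per input channel
    pw = 1 * 1 * c_in * c_out           # 1 x 1 pointwise channel mixing
    return {
        "standard": (std, image * std),
        "separable": (dw + pw, image * dw + image * pw),
    }

costs = conv_costs(10, 10, 3, 3)   # the 10 x 10 x 3 example from the slide
```

Running the slide's example reproduces its numbers: 81 vs 36 parameters and 24300 vs 10800 multiplications.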
  29. 2.2.2 Performance Comparison

    Convolution is faster than depthwise separable convolution..? In PyTorch and TensorFlow 2.0 on desktop, plain convolution often is (chart: latency per layer configuration, log scale; lower is faster).
Fewer parameters, less computation ≠ faster speed

  31. 2.2.2 Performance Comparison

    [Charts and tables: Conv vs DWC latency (ms, log scale, lower is faster) over inputs from 4x4 to 256x256 with 4 to 64 channels, measured with 1 thread on iPhone 7, i7-4770HQ, and Galaxy S7.] In our engine the depthwise convolution is the faster layer almost everywhere, and the gap grows with input size: at 256x256x64 it is roughly 9x faster on iPhone 7 (197.65 vs 21.29), 35x on i7-4770HQ (718.18 vs 20.70), and 13x on Galaxy S7 (722.38 vs 56.07).
  32. 2.3 Layer Optimization

  33. 2.3 Three Ways to Implement Convolution

    1. For loop (direct convolution) 2. Matrix multiplication (GEMM) 3. Winograd convolution
  34. 2.3.1 For Loop (Direct Convolution)

    Simply implemented: overlapping multiple for-loops (vectorized with AVX/NEON). for d := 1 to D, for c := 1 to C, for w := 1 to W, for h := 1 to H, for m := 1 to K*K ...
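A minimal single-channel sketch of those nested loops (a real engine adds channel and filter loops and vectorizes the innermost loop with AVX/NEON intrinsics):

```python
def conv2d_direct(image, kernel):
    # Direct convolution: slide the k x k kernel over the image and
    # accumulate products ("valid" padding, stride 1)
    h, w, k = len(image), len(image[0]), len(kernel)
    out = [[0.0] * (w - k + 1) for _ in range(h - k + 1)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += image[y + i][x + j] * kernel[i][j]
            out[y][x] = acc
    return out
```

Each output pixel costs k*k multiply-adds, which is why the later slides focus on reorganizing exactly this loop nest.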
  35. 2.3.2 GEMM-Based Convolution

    Conventional vs optimized implementation. Conventional: im2col reshapes the H x W feature map into a (K²C) x N matrix (N ≈ (H x W) / stride²) and the D filters of K x K x C weights into a D x (K²C) matrix; the convolution becomes one matrix multiply, accelerated by an optimized GEMM, at the cost of extra memory. # of multiplications: D x (K²C) x N. Optimized: im2row cuts each (K²C) row into CPU-vector-sized chunks and repeats multiply-and-add (D x N) times, BLAS-free, with the same # of multiplications: D x (K²C) x N.
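The im2col idea can be sketched as: unroll each K x K patch into a row, flatten the filter, and reduce the convolution to a matrix product. A single-filter, single-channel illustration (a real engine stacks D filters into a weight matrix and calls an optimized GEMM):

```python
def im2col(image, k):
    # Unroll every k x k patch of the image into one row: N rows of k*k
    h, w = len(image), len(image[0])
    return [
        [image[y + i][x + j] for i in range(k) for j in range(k)]
        for y in range(h - k + 1)
        for x in range(w - k + 1)
    ]

def conv2d_gemm(image, kernel):
    # Convolution as matrix multiplication: im2col rows dotted with the
    # flattened filter, then reshaped back to the output grid
    k = len(kernel)
    flat_w = [kernel[i][j] for i in range(k) for j in range(k)]
    rows = [sum(a * b for a, b in zip(row, flat_w)) for row in im2col(image, k)]
    out_w = len(image[0]) - k + 1
    return [rows[r * out_w:(r + 1) * out_w] for r in range(len(rows) // out_w)]
```

The duplicated patch rows are the extra memory cost the slide mentions; in exchange, the whole computation becomes one cache-friendly matrix multiply.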
  36. 2.3.3 Winograd Convolution

    Compared with the D x (K²C) x N multiplications above, the number of multiplications decreases! Reference: Song Han, cs231n lecture notes, Stanford University.
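The saving is easiest to see in one dimension: Winograd F(2,3) produces two outputs of a 3-tap filter with 4 multiplications instead of the direct method's 6. A sketch (the 2-D F(2x2,3x3) case used for 3x3 convolutions nests the same transforms):

```python
def winograd_f23(d, g):
    # F(2,3): 2 outputs of a 3-tap 1-D convolution from a 4-element input
    # tile d, using 4 multiplications instead of the direct method's 6
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The filter-side terms such as (g[0] + g[1] + g[2]) / 2 can be transformed once per filter offline, so at run time only the data transform and the 4 multiplies remain.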
  37. 2.3.4 Compare

    Direct / GEMM / Winograd convolution: which layer should we use?
  38. 2.3.4 Compare

    3x3 convolution, direct vs Winograd: operation latency (ms, log scale, 1 thread) over input width/height (4 to 118) and channels (16 to 128) on iPhone 7 and Galaxy S7. For example, at 32x32x128: iPhone 7 direct 9.174 vs Winograd 2.240; Galaxy S7 direct 37.709 vs Winograd 6.572. Case [channel > 16], [16 < width, height < 128]: using Winograd convolution is faster than direct 3x3 convolution.
  39. 2.3.4 Compare

    3x3 convolution, direct vs GEMM-based: operation latency (log scale, 1 thread) over input width/height and channels on i7-4770HQ. For example, at 256x256x64: direct 718.17794 vs GEMM 77.88327. Case [width, height > 8]: using GEMM-based convolution is faster than direct convolution.
  40. 2.3.4 Compare

    1x1 convolution, direct vs GEMM-based: latency (ms, log scale, 1 thread) over input width/height (4 to 256) and channels (64 to 512) on iPhone 7, i7-4770HQ (desktop), and Galaxy S7. For example, at 128x128x512: iPhone 7 direct 1195.597 vs GEMM 251.955; i7-4770HQ 1085.760 vs 211.534; Galaxy S7 1920.443 vs 752.894. Case [channel > 64]: using GEMM-based convolution is faster than direct convolution.
  41. 2.3.4 Compare

    Applying the per-layer choices to our models (MobileNet_V1, MobileNet_V2, Detector, Alignment, Recognizer), latency in ms before vs after optimization: Galaxy S7 up to 1.2x faster (e.g. MobileNet_V1 97.767 → 81.869); iPhone 7 up to 1.3x faster (40.262 → 29.215); i7-4770HQ, without AVX vs with AVX, up to 2.7x faster (78.617 → 29.147).
  42. 2.3.5 Other

    Batch normalization folding. Pre-act ordering: BN → ACT → Conv. Post-act ordering: Conv → BN → ACT. Folding merges the pair into (Conv + BN) → ACT. Measured post-act latency by # of threads: 1 → 104.58 ms, 6 → 29.436 ms. * Measured on M
  43. 2.3.5 Other

    Batch normalization folding: post-act Conv → BN → ACT becomes (Conv + BN) → ACT after BN folding; layer fusion then merges the activation as well: (Conv + Act).
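Folding is an algebraic rewrite, so it can be verified numerically. A per-channel scalar sketch of the standard BN fold (not the engine's actual code):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold BN(conv(x)) = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
    # into a single convolution with a rescaled weight and shifted bias
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# One output channel with one weight, for illustration
w, b = 3.0, 1.0
gamma, beta, mean, var = 2.0, 0.5, 1.0, 4.0
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
```

Because the folded layer produces identical outputs, the BN layer disappears from inference entirely, which is where the latency win on the next slide comes from.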
  44. 2.3.5 Other

    Our models (MobileNet_V1, MobileNet_V2, Detector, Alignment, Recognizer), latency in ms for Original / BN folding / Layer fusion: iPhone 7 up to 1.19x faster (e.g. MobileNet_V1 28.592 → 27.508 → 25.508); Galaxy S7 up to 1.05x faster (83.28 → 81.44 → 81.262); i7-4770HQ up to 1.1x faster (35.98 → 31.295 → 29.065).
  45. 2.4 Bonus

  46. In the face detector paper, there is a hidden layer

    that is not included in the speed measurement index.
  47. 2.4.1 Optimization Not Found in the Paper Face Detector (Post

    Processing) 1. Priorbox 2. Detection output layer https://arxiv.org/pdf/1705.02950.pdf
  48. 2.4.1 Optimization Not Found in the Paper

    Accuracy impact of cutting post-processing candidates:

    NMS_TOP_K   | TOP_K | Confidence | Accuracy drop [%]
    5000 (base) | 750   | 0.1        | -
    2500        | 750   | 0.1        | -
    2500        | 750   | 0.05       | 0.0019%
    1250        | 750   | 0.05       | 0.0099%

    Little difference in accuracy (reasonable). Detector latency (NMS_TOP_K / TOP_K / Confidence):

    iPhone7 (1 thread): 5000/1000/0.1 → 41.09 ms; 2500/750/0.1 → 12.49 ms; 1250/750/0.05 → 6.11 ms. 41.09 ms → 6.11 ms (up to 6x faster).
    Galaxy S7 (1 / 2 threads): 5000/1000/0.1 → 46.52 / 23.05 ms; 2500/750/0.1 → 24.94 / 12.92 ms; 1250/750/0.05 → 17.25 / 10.33 ms. 46.518 ms → 17.253 ms (up to 2.6x faster); 23.052 ms → 10.334 ms (up to 2.2x faster).
    i7-4770HQ (1 / 2 threads): 5000/1000/0.1 → 13.49 / 11.69 ms; 2500/750/0.1 → 8.28 / 6.61 ms; 1250/750/0.05 → 6.82 / 4.83 ms. 13.49 ms → 6.82 ms (up to 2x faster); 11.69 ms → 4.83 ms (up to 2.4x faster).
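The knobs being tuned here live in the detector's post-processing: drop low-confidence boxes, cap the candidates entering NMS at NMS_TOP_K, and cap the survivors at TOP_K. A greedy-NMS sketch with those caps; the parameter names follow the slide, while the IoU threshold of 0.5 is an assumption:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detect_postprocess(boxes, scores, confidence=0.1,
                       nms_top_k=2500, top_k=750, iou_thr=0.5):
    # 1) drop low-confidence boxes  2) keep the nms_top_k best candidates
    # 3) greedy NMS                 4) cap the final detections at top_k
    cand = sorted((i for i, s in enumerate(scores) if s >= confidence),
                  key=lambda i: scores[i], reverse=True)[:nms_top_k]
    keep = []
    for i in cand:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep
```

Shrinking nms_top_k and the confidence threshold shortens the sort and the pairwise IoU loop, which is exactly why the tables above show large latency drops at nearly unchanged accuracy.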
  49. There is space for speed improvement even at floating point.

  50. 2.4.2 Floating Point

    IEEE Standard 754 floating-point numbers: 1-bit sign, 8-bit exponent, 23-bit fraction (significand, mantissa). N = (−1)^s × (1.m) × 2^(e−127)
  51. 2.4.2 Floating Point

    Some special cases: Zero: exponent = 0, significand = 0. NaN (not a number): exponent = 255, significand ≠ 0. Denormalized (subnormal) numbers: exponent = 0, significand ≠ 0. https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html
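These bit fields are easy to inspect. A sketch that reinterprets a value as IEEE 754 single precision and classifies the subnormal case:

```python
import struct

def float32_fields(x):
    # Reinterpret x as float32 bit fields: sign (1), exponent (8), fraction (23)
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def is_subnormal(x):
    # Subnormal: exponent all zeros, fraction non-zero
    _, e, m = float32_fields(x)
    return e == 0 and m != 0
```

Anything smaller in magnitude than float32's minimum normal value (about 1.18e-38) lands in the subnormal range, which is the case the next slides are concerned with.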
  52. 2.4.2 Floating Point

    Denormalized (subnormal) numbers: looking at the weights of the model, there were a lot of subnormal values.
  53. Experiment: is floating-point math slow when there are subnormal values, on the chipsets of specific devices? Really slow :(
  54. 2.4.2 Floating Point

    These values are effectively zero anyway, so the method we can take is to replace each subnormal number with a number that has a normalized representation (in practice, zero).
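A sketch of that replacement over a weight array: anything below float32's smallest normal magnitude (2^-126) is flushed to exact zero, on the assumption stated above that such values contribute nothing to the model's output:

```python
FLT_MIN = 2.0 ** -126  # smallest normalized float32 magnitude

def replace_denormals(weights):
    # Subnormal weights are effectively zero but can trigger very slow
    # arithmetic paths on some chipsets, so flush them to exact zero
    return [0.0 if 0.0 < abs(w) < FLT_MIN else w for w in weights]
```

This is a one-time offline pass over the trained weights; the model file simply ships without subnormals.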
  55. 2.4.2 Floating Point

    After replacing the values, the accuracy of the model was not affected. However, the affected devices sped up dramatically:

    Device                  | Galaxy S7 | Pixel 3 | Galaxy Note 10+ | iPhone7
    Geekbench (single-core) | 341       | 489     | 756             | 744
    Model latency [ms]      | 159.10    | 29.71   | 57.90           | 13.71
    Replace denormals [ms]  | 32.97     | 29.71   | 11.73           | 13.71

    159.10 ms → 32.97 ms (up to 5x faster); 57.90 ms → 11.73 ms (up to 5x faster).
  56. 3. Engine to Product

  57. 3. Making AI into a Product?

    Engine + developer + planner + designer = ? It took developers, planners, and designers together to turn the engine into the Face Sign application.
  58. 3.1 Flowchart

    Detect → Landmark (face alignment) → Recognize (demo subject: Seungyoun Yi)
  59. 3.1 Flowchart

    1) Register with accurate face data 2) Recognition with minimal reference time 3) Seamless interaction between server and client. Registration and Recognition flows split between client and server: Detect → Landmark → Recognize.
  60. 3.2 Client

    Pipeline: Detect preprocessing → Landmark postprocessing → User recognition (entrance). Detect preprocessing (preprocess, UX design and development): frame processing, device type, device tilt. Landmark postprocessing (postprocess, UX design and development): face size, face alignment, network, frame processing.
  61. 3.2.1 Detect Preprocessing

    Frame handling: 1. OpenCV (Laplacian) 2. Gyroscope

    // Variance of the Laplacian: a low variance means a blurry frame
    bool isBlurryImage(const cv::Mat &grayImage, const int threshold) {
        cv::Mat laplacian;
        cv::Laplacian(grayImage, laplacian, CV_64F);
        cv::Scalar mean;
        cv::Scalar stddev;
        cv::meanStdDev(laplacian, mean, stddev, cv::Mat());
        double variance = pow(stddev.val[0], 2);
        return (variance <= threshold);
    }
  62. 3.2.1 Detect Preprocessing 1. Reduce eye strain 2. Change camera

    recognition range Preprocess, UX Design and Development
  63. 3.2.1 Detect Preprocessing 3. Insert Identity 4. Put Face in

    the Guideline Preprocess, UX Design and Development
  64. 3.2.2 Landmark Postprocessing

    Face size: 1. Set a minimum face size for the engine 2. Reject faces that are too small 3. Reject faces that are too big
  65. 3.2.2 Landmark Postprocessing

    Face size: 4. Reject cut-off faces 5. Optional limit when multiple people are recognized
  66. 3.2.2 Landmark Postprocessing

    Face status: 1. Call the API only when movement is small 2. Set the face recognition location
  67. 3.2.2 Landmark Postprocessing

    Face status: 3. Face alignment 4. Euler angle

    # compute the center of mass for each eye
    leftEyeCenter = leftEyePts.mean(axis=0).astype("int")
    rightEyeCenter = rightEyePts.mean(axis=0).astype("int")
    # compute the angle between the eye centroids
    dY = rightEyeCenter[1] - leftEyeCenter[1]
    dX = rightEyeCenter[0] - leftEyeCenter[0]
    angle = np.degrees(np.arctan2(dY, dX)) - 180

    https://www.pyimagesearch.com/2017/05/22/face-alignment-with-opencv-and-python/
  68. 3.2.2 Landmark Postprocessing

    Face status: 5. Eye blink 6. Wink. https://www.pyimagesearch.com/2017/04/24/eye-blink-detection-opencv-python-dlib/
  69. 3.2.3 Speed Optimization

    Network: 1. Consideration for encryption 2. Find the optimal result across multiple requests. Each request encrypts the recognized face, the features, the network transport, and the result, and asks: is the result more accurate than the threshold value? Is the detected result encrypted? Is the recognized person the same?
  70. 3.2.3 Speed Optimization

    3. Inference size optimization: camera image (3 MB) → engine recognition image (100 KB) → server transport binary (30 KB)
  71. 3.2.3 Speed Optimization

    4. Network size optimization (change data type): send the feature as a [UInt8] binary. Over 15 network API tests: 4.18 ms → 2.60 ms (up to 1.6x faster); 4.02 ms → 2.41 ms (up to 1.6x faster)
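Shrinking the payload by sending features as UInt8 rather than Float32 can be sketched as simple min-max quantization; the actual Face Sign encoding is not described in the talk, so this is only an illustration of the 4x size reduction:

```python
def encode_feature(feature):
    # Min-max quantize a float feature vector into 1 byte per element;
    # lo and scale travel alongside the bytes for reconstruction
    lo, hi = min(feature), max(feature)
    scale = (hi - lo) / 255.0 or 1.0
    payload = bytes(round((v - lo) / scale) for v in feature)
    return payload, lo, scale

def decode_feature(payload, lo, scale):
    # Approximate reconstruction of the original floats
    return [lo + b * scale for b in payload]
```

A 128-dim float32 feature drops from 512 bytes to 128 bytes plus two floats, at the cost of a small, bounded rounding error.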
  72. 3.2.3 Speed Optimization

    4. Network size optimization (gzip): over 15 network API tests, before → after: 2.60 ms → 1.85 ms (up to 1.7x faster); 2.41 ms → 1.75 ms (up to 1.7x faster)
  73. 3.2.4 Landmark Postprocessing 1. Face Size 2. Face Status Postprocess,

    UX Design, and Development
  74. 3.2.4 Landmark Postprocessing 3. Correct UX flow based on accuracy

    Postprocess, UX Design, and Development + =
  75. 3.2.5 Customization Finding Optimal Value in Application 1. Face Related

    Setting 2. Network Related Setting 3. Process Related Setting
  76. 4. Review

  77. 4.1 Engine Optimization

    Lightweight model: knowledge distillation, neural architecture search, latency-aware design, low-resolution image. Engine optimization (optimization to reduce even 1 ms): layer fusion, parallel framework, SIMD (AVX/NEON), memory reuse. Layer optimization: hidden bottleneck optimization, replace subnormal numbers.
  78. 4.2 Service Optimization

    Detect preprocessing (preprocess, UX design and development): frame processing, device tilt, device type. Landmark postprocessing (postprocess, UX design and development): face size, face alignment, network, frame processing.
  79. 4.3 Mobile AI in Production Field Test 1) TechTalk 2)

    Internal Cafe 3) Balance Festival 4) Practice, DEVIEW 2019 5) Practice, LINE DEV DAY 2019
  80. 4.3 Mobile AI in Production Flows

  81. 4.3 Mobile AI in Production Network

  82. 4.4 Goal of AI Engine Development

    AI engine development + service development = "Wow", user convenience
  83. Thank You