
Face Recognition Check-in Mechanism


Seungyoun Yi
NAVER AI Production Team
https://linedevday.linecorp.com/jp/2019/sessions/D2-5

LINE DevDay 2019

November 21, 2019

Transcript

  1. 2019 DevDay
    Face Recognition Check-in
    Mechanism
    > Seungyoun Yi
    > NAVER AI Production Team


  2. CONTENTS
1. Faster, Optimized Check-in
    2. Face Engine
    3. Engine to Product (Face Sign)
    4. Review


3. 1. Faster, Optimized Check-in


  4. 1.1 Face Recognition
    Elevator
    Card: Entrance → Take card out → Tag card → Choose floor → Take card back
    Face: Entrance → Face recognition → Choose floor
    1. Face recognition replaces authentication
    2. 1:1, Verification


  5. 1.1 Face Recognition
    Vending Machine
    Cash: Choose product → Take cash out → Put cash → Get product → Take change back
    Face: Choose product → Face recognition → Get product
    1. Face recognition replaces payment
    2. 1:N, Recognition


  6. 1.1 Face Recognition
    Wrap-Up
    Before: Before recognition → Take card/cash out → Process with physical payment method → Take card or change back → After recognition
    After: Before recognition → Face recognition → After recognition
    1. Simplify UX using face recognition
    2. Insert a second-verification flow when the result is risky


  7. 1.2 Face Check-in
    Check-in System: Online Registration → Line Up → Check Personal Information → Get Goods → Watch Session
    1. Long lines of developers waiting to see keynotes
    2. Personal-information verification by e-mail and name
    Increased bottlenecks


  8. 1.2 Face Check-in
    Check-in System: Online Registration → Line Up → Face Recognition → Get Goods → Watch Session
    1. Reduced waiting time with sub-second recognition speed
    2. No separate step to verify name or personal information
    Reduced bottlenecks


  9. 1.2 Face Check-in
    Can AI solve all problems using only the model?!
    NEVER, NEVER, NEVER: it takes engineering in the engine and engineering in the service.


  10. 2. Face Engine


  11. Face Recognition Pipeline
    Face → Feature
    1) Detection  2) Face alignment  3) Compute transform params (reference coordinates)
    4) Warp  5) Canonical face  6) Extract feature


  12. Face Recognition Pipeline
    Face Identification
    Gallery: a set of face features
    Probe: a face feature
    1) Nearest-neighbor retrieval
    2) Get a shortlist for the probe
    3) Recognize the identity via simple thresholding
    It's a face feature, not a face image!
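The identification steps above can be sketched in a few lines of NumPy. This is a minimal illustration only: the feature dimension, cosine similarity, the names, and the 0.5 threshold are assumptions for the example, not the engine's actual values.

```python
import numpy as np

def identify(probe, gallery, names, threshold=0.5, shortlist_size=5):
    """1:N identification: nearest-neighbor retrieval over face features,
    then simple thresholding. `probe` is one L2-normalized feature vector,
    `gallery` is an (N, d) matrix of L2-normalized features."""
    sims = gallery @ probe                       # cosine similarity per gallery entry
    shortlist = np.argsort(sims)[::-1][:shortlist_size]
    best = shortlist[0]
    if sims[best] >= threshold:                  # accept only confident matches
        return names[best], float(sims[best])
    return None, float(sims[best])               # unknown face

# toy gallery of 3 identities with 4-d features (real features are much larger)
gallery = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=np.float32)
names = ["alice", "bob", "carol"]
probe = np.array([0.9, 0.1, 0.0, 0.0], dtype=np.float32)
probe /= np.linalg.norm(probe)
print(identify(probe, gallery, names))  # alice matches with high similarity
```

Note that only features are compared here, never face images, matching the point on the slide.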


  13. Applying Face Recognition to Service
    Support Multiplatform: iOS, Android, MacOS, Windows, Linux
    High Accuracy: 99% accuracy
    Fast Speed: 0.1-second speed
    Deep Learning Framework: TensorFlow, CoreML, MLKit, iOS Vision Framework
    Face Detection / Face Alignment / Face Recognition


  14. Face Engine
    User Applications
    Frontend: C++ Frontend, Python Frontend, …
    Function: Detection, Facial Landmark, Recognition, Face Pose, Smoother, Tracking
    Backend / Framework: for FP32, INT8 (CPU)
    Supporting: Android (Passing), iOS (Passing), Linux (Passing), MacOS (Passing), Windows (Passing)


  15. Face Engine
    Input: image → Output: face (people) information
    Low-resolution image: Detector & Tracking, Pose Estimation
    High-resolution image (after resize): Alignment, Recognition, Post Processing
    Human Info: Bounding Box, Facial Landmark, Face Feature, Euler Angle (X, Y, Z)


  16. Sharing Optimization Experience
    In Face Engine
Lightweight model for high accuracy and fast speed
    Inference engine optimization for more speed
    Layer optimization for more, more speed
    more, more, more..?


  17. 2.1 Lightweight Model


  18. 2.1.1 Lightweight Deep Learning Model
Trade-off between accuracy and speed
    https://www.researchgate.net/publication/328017644_Benchmark_Analysis_of_Representative_Deep_Neural_Network


  19. 2.1.2 Face Detector
    CPU Real-Time vs GPU Real-Time
    +: backbone network only, ++: backbone + head

    Model      Latency [ms]  Model size [MB]  Input    Device
    Pelee [1]  43.86+        ~5.00            320x320  iPhone 8 (GPU)
    Ours       5.74+         0.14++           320x320  iPhone 7 (CPU)

    [1] Robert J. Wang et al., Pelee: A Real-Time Object Detection System on Mobile Devices, NeurIPS, 2018


  20. Our standard is real-time in
    a mobile CPU environment.


  21. 2.1.2 Face Detector
    Model          mAP+  Latency [ms]  Model size [MB]  Input    Device (engine)
    FaceBoxes [1]  96.0  21.40         3.83             320x240  Xeon (pytorch)
    Ours           96.0  22.50         0.29             320x240  Xeon (pytorch)
    Measured on Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    +: Mean average precision measures how accurately the object's position is detected (higher is better)

    Equal level of mAP, 10x reduction in model size
    FDDB ROC curve: our model sits in the upper left

    [1] S. Zhang et al., FaceBoxes: A CPU Real-time Face Detector with High Accuracy, IJCB, 2017

  22. 2.1.3 Facial Landmark Detector
    Metric: mean squared error (lower is better)
    Dataset: 300W-test

    Model                  MSE (fullset)
    DSRN [3]               5.21
    SBR [1]                4.99
    RCN-L+ELT-all [4]      4.90
    PCD-CNN [2]            4.44
    Ours (Teacher)         3.73
    ResNet50+PDB+Wing [5]  3.60
    LAB [0]                3.49

    Ours (Teacher): latency (CPU) 1.2 sec, model size 93.4 MB

    [0] Look at Boundary: A Boundary-Aware Face Alignment Algorithm, CVPR, 2018
    [1] Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors, CVPR, 2018
    [2] Disentangling 3D Pose in A Dendritic CNN for Unconstrained 2D Face Alignment, CVPR, 2018
    [3] Direct Shape Regression Networks for End-to-End Face Alignment, CVPR, 2018
    [4] Improving Landmark Localization with Semi-Supervised Learning, CVPR, 2018
    [5] Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks, CVPR, 2018
    [6] https://ibug.doc.ic.ac.uk/resources/300-W/


  23. 2.1.3 Facial Landmark Detector
    Metric: mean squared error (lower is better)

    Model                    Ours-WM0.125  Ours-WM0.25  Ours-WM0.5  Ours-teacher
    300W-Challenge [NME]     5.08          4.89         4.67        4.34
    300W-test-fullset [NME]  4.53          4.36         4.28        3.73
    Latency-factor           1.0           1.96         5.04        46.57
    Model size [MB]          0.4           1.5          6.1         93.4

    Some increase in MSE, but a 46x reduction in latency and a 200x reduction in model size


  24. 2.1.4 Face Recognizer
    Baseline model: MobileFaceNet [1]

    Model              mAP    Latency [ms]  Model size [MB]
    MobileFaceNet [1]  99.18  11.59         4.00
    Ours               99.33  11.49         3.96

    Measured on Qualcomm Snapdragon 820 with 4 threads
    [1] MobileFaceNets: Efficient CNNs for Accurate Real-time Face Verification on Mobile Devices, arXiv preprint 1804.07573, 2018


  25. 2.1.5 What To Optimize?
    Per-layer share of inference time:
    Face Detector: Convolution 85%, TSODDetectionOutput 7%, remaining layers (ReLU, Concat, BinaryOp, Pooling, Interp, Softmax, Permute, Flatten, Slice, TSODPriorBox, Eltwise, Reshape, Split) 1-2% or less each
    Facial Landmark Detector: Convolution 92%, remaining layers (Eltwise, ReLU, Concat, Pooling, DeconvolutionDepthWise, Split) 1-2% or less each
    Face Recognizer: Convolution 79%, ConvolutionDepthWise 14%, PreLu 6%, remaining layers (Eltwise, Split, Normalize) 1% or less each


  26. 2.2 Inference Engine


  27. 2.2.1 Engine Performance Measurement
    Convolution vs depthwise separable convolution
    3x3 convolution (in 16, out 16): # of params = 3 x 3 x 16 x 16
    3x3 depthwise conv (in 16, out 16): # of params = 3 x 3 x 16
    1x1 pointwise conv: # of params = 1 x 1 x 16 x 16
    Depthwise separable total: (3 x 3 x 16) + (1 x 1 x 16 x 16)


  28. 2.2.1 Engine Performance Measurement
    Convolution vs depthwise separable convolution
    Image: 10 x 10 x 3 (input channels: 3, output channels: 3)
    Convolution [3 x 3 x 3 x 3]
    Total parameters: 81 (3x3x3x3)
    Total multiplications: 24300 (image volume x total params)
    Depthwise convolution [3 x 3 x 3]
    # of parameters: 27 (3x3x3)
    # of multiplications: 8100 (image volume x params)
    Pointwise convolution [1 x 1 x 3 x 3]
    # of parameters: 9 (1x1x3x3)
    # of multiplications: 2700 (image volume x params)
    Depthwise separable total: 36 parameters, 10800 multiplications
    3 x 3 x c x c > 3 x 3 x c + (c x c)
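The counts above can be reproduced in a couple of lines. This follows the slide's accounting, which multiplies the full 10 x 10 x 3 image volume by the parameter count; `counts` is just an illustrative helper, not engine code.

```python
def counts(params, image=(10, 10, 3)):
    """Multiplications using the slide's accounting: image volume x parameter count."""
    h, w, c = image
    return params, h * w * c * params

conv_p = 3 * 3 * 3 * 3        # standard 3x3 conv, 3 in / 3 out channels: 81 params
dw_p = 3 * 3 * 3              # 3x3 depthwise: 27 params
pw_p = 1 * 1 * 3 * 3          # 1x1 pointwise: 9 params

print(counts(conv_p))         # (81, 24300)
print(counts(dw_p + pw_p))    # (36, 10800)
```

The inequality on the slide is the same comparison in symbols: 9c² parameters for the standard convolution versus 9c + c² for the separable pair.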


  29. 2.2.2 Performance Comparison
    Convolution is faster than depthwise separable convolution..?
    [Chart: per-layer latency (log scale) in PyTorch and TensorFlow 2.0; lower is faster]


  30. Fewer parameters
    Less computation
    Faster speed


  31. 2.2.2 Performance Comparison
    Latency [ms], lower is faster

    iPhone 7
    Width Height Channel Conv DWC
    4 4 4 0.00033 0.00027
    4 4 8 0.00058 0.00037
    4 4 16 0.00194 0.00061
    4 4 32 0.00701 0.00141
    4 4 64 0.02838 0.00210
    8 8 4 0.00068 0.00042
    8 8 8 0.00213 0.00067
    8 8 16 0.00824 0.00145
    8 8 32 0.03210 0.00366
    8 8 64 0.12765 0.00967
    16 16 4 0.00232 0.00113
    16 16 8 0.00874 0.00238
    16 16 16 0.03407 0.00581
    16 16 32 0.13429 0.01642
    16 16 64 0.53917 0.06199
    32 32 4 0.00942 0.00400
    32 32 8 0.03621 0.00915
    32 32 16 0.14359 0.02347
    32 32 32 0.57033 0.06866
    32 32 64 2.26934 0.24841
    64 64 4 0.03730 0.01758
    64 64 8 0.14957 0.03801
    64 64 16 0.59770 0.09850
    64 64 32 2.29650 0.29216
    64 64 64 9.22913 0.99012
    128 128 4 0.16395 0.07094
    128 128 8 0.62052 0.15367
    128 128 16 2.43079 0.41725
    128 128 32 9.83968 1.20801
    128 128 64 43.26673 4.53675
    256 256 4 0.63575 0.27929
    256 256 8 2.43605 0.61418
    256 256 16 10.07863 1.93911
    256 256 32 44.37163 5.52157
    256 256 64 197.65414 21.28737
    i7-4770hq
    Width Height Channel Conv DWC
    4 4 4 0.00056 0.00089
    4 4 8 0.00126 0.00100
    4 4 16 0.00413 0.00139
    4 4 32 0.01567 0.00167
    4 4 64 0.06023 0.00303
    8 8 4 0.00209 0.00146
    8 8 8 0.00726 0.00211
    8 8 16 0.02689 0.00310
    8 8 32 0.10713 0.00579
    8 8 64 0.43642 0.01254
    16 16 4 0.00926 0.00454
    16 16 8 0.03473 0.00685
    16 16 16 0.14251 0.01274
    16 16 32 0.58063 0.02516
    16 16 64 2.23574 0.06161
    32 32 4 0.03867 0.01509
    32 32 8 0.15526 0.02587
    32 32 16 0.62187 0.05456
    32 32 32 2.46555 0.11272
    32 32 64 9.98108 0.25675
    64 64 4 0.16621 0.06130
    64 64 8 0.66195 0.10723
    64 64 16 2.64051 0.21645
    64 64 32 10.58819 0.49159
    64 64 64 42.13637 1.14665
    128 128 4 0.68576 0.24418
    128 128 8 2.73333 0.46256
    128 128 16 10.90763 0.87343
    128 128 32 43.65631 2.02102
    128 128 64 181.53884 5.41057
    256 256 4 2.82602 1.00030
    256 256 8 11.65558 1.89273
    256 256 16 45.63633 3.89032
    256 256 32 182.38495 8.68091
    256 256 64 718.17794 20.69565
    Galaxy s7
    Width Height Channel Conv DWC
    4 4 4 0.00424 0.00707
    4 4 8 0.01458 0.01162
    4 4 16 0.00899 0.01730
    4 4 32 0.05309 0.02340
    4 4 64 0.12638 0.02427
    8 8 4 0.00830 0.00853
    8 8 8 0.02690 0.01248
    8 8 16 0.05422 0.01827
    8 8 32 0.13496 0.03045
    8 8 64 0.41399 0.07120
    16 16 4 0.02392 0.01383
    16 16 8 0.06420 0.02331
    16 16 16 0.15563 0.04192
    16 16 32 0.46971 0.08803
    16 16 64 1.71633 0.18070
    32 32 4 0.06783 0.03369
    32 32 8 0.16102 0.05938
    32 32 16 0.50996 0.10876
    32 32 32 1.89805 0.23682
    32 32 64 7.81756 0.63089
    64 64 4 0.17441 0.09449
    64 64 8 0.55505 0.15865
    64 64 16 2.07405 0.34244
    64 64 32 8.50137 0.86897
    64 64 64 36.87440 3.00902
    128 128 4 0.53880 0.23594
    128 128 8 1.96343 0.45360
    128 128 16 8.25908 1.31925
    128 128 32 35.51822 4.09999
    128 128 64 153.75306 12.80227
    256 256 4 2.40637 1.06782
    256 256 8 10.39279 3.11999
    256 256 16 41.58188 9.03881
    256 256 32 169.87069 28.01278
    256 256 64 722.37550 56.06882
    Lower is faster!


  32. 2.3 Layer Optimization


  33. 2.3 Three Ways To Implement Convolution
    1. For loop (direct convolution)
    2. Matrix multiplication (GEMM)
    3. Winograd convolution


  34. 2.3.1 For Loop (Direct Convolution)
    Simply implemented: nested for-loops (with AVX/NEON)
    for i := 1 to D
      for j := 1 to C
        for k := 1 to W
          for l := 1 to H
            for m := 1 to K*K

  35. 2.3.2 GEMM-Based Convolution
    Conventional implementation vs optimized implementation
    Convolution weights: D filters of size K x K x C
    Conventional (im2col): reshape the weights into a D x (K²C) matrix, unroll the H x W feature map into a (K²C) x N matrix with N ≈ (H x W) / stride, then run one matrix multiply (accelerated by optimized GEMM)
    # of multiplications: D x (K²C) x N
    Memory usage increases
    Optimized (im2row, BLAS free!): cut the (K²C) vectors into the vector size of the CPU and repeat multiply-and-add (D x N) times
    # of multiplications: D x (K²C) x N
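The conventional im2col path can be sketched in NumPy. This is a minimal version for stride 1 and no padding; the engine's BLAS-free im2row variant differs, so treat this as illustrative only.

```python
import numpy as np

def conv_gemm(x, w):
    """KxK convolution expressed as a single matrix multiply.
    x: (C, H, W) input, w: (D, C, K, K) filters; stride 1, no padding."""
    C, H, W = x.shape
    D, _, K, _ = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # im2col: unroll every KxK patch into one column -> (K*K*C, N) matrix
    cols = np.empty((K * K * C, Ho * Wo), dtype=x.dtype)
    n = 0
    for i in range(Ho):
        for j in range(Wo):
            cols[:, n] = x[:, i:i + K, j:j + K].ravel()
            n += 1
    # reshape weights to D x (K*K*C) and do one GEMM: D x (K²C) x N multiplications
    out = w.reshape(D, -1) @ cols
    return out.reshape(D, Ho, Wo)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8)).astype(np.float32)
w = rng.standard_normal((4, 3, 3, 3)).astype(np.float32)
y = conv_gemm(x, w)
print(y.shape)  # (4, 6, 6)
```

The extra `cols` buffer is exactly the memory-usage increase the slide points out: each input pixel is duplicated up to K² times.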


  36. 2.3.3 Winograd Convolution
    Direct/GEMM # of multiplications: D x (K²C) x N
    With Winograd, the # of multiplications decreases!
    Reference: Song Han, cs231n lecture note, Stanford University


  37. 2.3.4 Compare
    Direct / GEMM / Winograd convolution: which should we use?


  38. 2.3.4 Compare
    Convolution 3x3 (Direct vs Winograd) operation speed (log scale) according to input image (Width, Height, Channel)
    Case [Channel > 16, 16 < Width, Height < 128]: Winograd convolution is faster than direct convolution
    Latency [ms], # of threads = 1

    iPhone 7
    Width Height Channel Conv Winograd
    4 4 16 0.002 0.025
    4 4 32 0.008 0.044
    4 4 64 0.029 0.114
    4 4 128 0.112 0.331
    8 8 16 0.008 0.020
    8 8 32 0.032 0.043
    8 8 64 0.127 0.099
    8 8 128 0.509 0.317
    16 16 16 0.034 0.078
    16 16 32 0.134 0.152
    16 16 64 0.534 0.360
    16 16 128 2.164 1.042
    32 32 16 0.142 0.141
    32 32 32 0.567 0.305
    32 32 64 2.257 0.748
    32 32 128 9.174 2.240
    64 64 16 0.592 0.656
    64 64 32 2.276 1.475
    64 64 64 9.111 3.724
    64 64 128 39.543 10.553
    96 96 16 1.356 1.606
    96 96 32 5.337 3.457
    96 96 64 22.859 9.319
    96 96 128 99.884 26.341
    118 118 16 2.057 2.244
    118 118 32 8.711 5.183
    118 118 64 37.518 12.870
    118 118 128 158.712 37.014
    Galaxy s7 (T=1)
    Width Height Channel Conv Winograd
    4 4 16 0.013 0.052
    4 4 32 0.039 0.062
    4 4 64 0.089 0.159
    4 4 128 0.337 0.574
    8 8 16 0.026 0.028
    8 8 32 0.095 0.058
    8 8 64 0.394 0.151
    8 8 128 1.410 0.566
    16 16 16 0.117 0.138
    16 16 32 0.442 0.340
    16 16 64 1.678 0.829
    16 16 128 6.519 2.476
    32 32 16 0.473 0.369
    32 32 32 1.862 0.847
    32 32 64 7.445 2.230
    32 32 128 37.709 6.572
    64 64 16 2.260 1.719
    64 64 32 8.118 4.738
    64 64 64 35.105 12.812
    64 64 128 169.529 35.936
    96 96 16 5.315 5.021
    96 96 32 18.155 12.065
    96 96 64 76.374 29.897
    96 96 128 363.084 78.233
    118 118 16 5.036 9.419
    118 118 32 21.389 21.273
    118 118 64 112.084 40.140
    118 118 128 447.813 120.284
    # of Threads = 1


  39. 2.3.4 Compare
3x3 Convolution: Direct Conv vs GEMM-based Conv
    Latency [ms], # of threads = 1

    i7-4770hq
    Width Height Channel Conv GEMM
    4 4 4 0.00056 0.00217
    4 4 8 0.00126 0.00265
    4 4 16 0.00413 0.00283
    4 4 32 0.01567 0.00447
    4 4 64 0.06023 0.00847
    8 8 4 0.00209 0.00292
    8 8 8 0.00726 0.00421
    8 8 16 0.02689 0.00668
    8 8 32 0.10713 0.01686
    8 8 64 0.43642 0.04719
    16 16 4 0.00926 0.00685
    16 16 8 0.03473 0.01226
    16 16 16 0.14251 0.02770
    16 16 32 0.58063 0.06957
    16 16 64 2.23574 0.23985
    32 32 4 0.03867 0.02165
    32 32 8 0.15526 0.04605
    32 32 16 0.62187 0.11299
    32 32 32 2.46555 0.34119
    32 32 64 9.98108 1.02423
    64 64 4 0.16621 0.08484
    64 64 8 0.66195 0.17678
    64 64 16 2.64051 0.47944
    64 64 32 10.58819 1.38857
    64 64 64 42.13637 4.49060
    128 128 4 0.68576 0.34465
    128 128 8 2.73333 0.73755
    128 128 16 10.90763 2.07025
    128 128 32 43.65631 5.86608
    128 128 64 181.53884 18.73521
    256 256 4 2.82602 1.39904
    256 256 8 11.65558 3.05527
    256 256 16 45.63633 8.61684
    256 256 32 182.38495 25.70292
    256 256 64 718.17794 77.88327
    Case [Width, Height > 8]: GEMM-based convolution is faster than direct convolution


  40. 2.3.4 Compare
1x1 Convolution: Direct Conv vs GEMM-based Conv
    Latency [ms], # of threads = 1

    iPhone 7
    Width Height Channel Conv GEMM
    4 4 64 0.004 0.003
    4 4 128 0.017 0.011
    4 4 256 0.065 0.049
    4 4 512 0.258 0.173
    8 8 64 0.014 0.020
    8 8 128 0.055 0.053
    8 8 256 0.226 0.182
    8 8 512 0.934 0.674
    16 16 64 0.056 0.063
    16 16 128 0.226 0.202
    16 16 256 0.938 0.712
    16 16 512 3.686 2.745
    32 32 64 0.224 0.241
    32 32 128 0.896 0.813
    32 32 256 3.615 3.002
    32 32 512 15.488 12.455
    64 64 64 4.414 0.960
    64 64 128 18.826 4.160
    64 64 256 73.898 13.708
    64 64 512 298.805 54.001
    128 128 64 18.640 4.589
    128 128 128 75.117 15.895
    128 128 256 299.652 59.546
    128 128 512 1195.597 251.955
    256 256 64 66.607 19.038
    256 256 128 258.776 63.814
    256 256 256 1070.993 256.399
    256 256 512 4381.642 1539.726
    i7-4770hq (Desktop)
    Width Height Channel Conv GEMM
    4 4 64 0.017 0.004
    4 4 128 0.070 0.010
    4 4 256 0.243 0.045
    4 4 512 0.921 0.164
    8 8 64 0.029 0.011
    8 8 128 0.118 0.038
    8 8 256 0.502 0.147
    8 8 512 1.987 0.560
    16 16 64 0.143 0.048
    16 16 128 0.385 0.151
    16 16 256 1.516 0.591
    16 16 512 7.803 2.605
    32 32 64 0.378 0.227
    32 32 128 1.802 0.735
    32 32 256 7.572 2.596
    32 32 512 34.361 11.497
    64 64 64 1.777 0.947
    64 64 128 8.340 3.202
    64 64 256 33.438 11.721
    64 64 512 202.632 49.073
    128 128 64 9.975 4.405
    128 128 128 54.193 13.878
    128 128 256 233.069 46.819
    128 128 512 1085.760 211.534
    256 256 64 67.563 19.370
    256 256 128 288.263 55.190
    256 256 256 1021.372 191.702
    256 256 512 4976.360 812.025
    Case [Channel > 64]: GEMM-based convolution is faster than direct convolution

    Galaxy s7
    Width Height Channel Conv GEMM
    4 4 64 0.028 0.018
    4 4 128 0.082 0.038
    4 4 256 0.206 0.123
    4 4 512 0.769 0.507
    8 8 64 0.074 0.035
    8 8 128 0.185 0.126
    8 8 256 0.680 0.480
    8 8 512 2.500 2.037
    16 16 64 0.187 0.136
    16 16 128 0.606 0.520
    16 16 256 2.274 2.062
    16 16 512 9.613 8.627
    32 32 64 1.278 0.574
    32 32 128 4.939 2.279
    32 32 256 20.686 8.985
    32 32 512 84.172 38.773
    64 64 64 5.149 2.693
    64 64 128 21.785 10.592
    64 64 256 93.905 40.704
    64 64 512 424.173 169.289
    128 128 64 32.471 12.533
    128 128 128 101.706 45.978
    128 128 256 455.670 175.253
    128 128 512 1920.443 752.894
    256 256 64 132.855 50.844
    256 256 128 606.822 182.193
    256 256 256 2362.191 806.543
    256 256 512 9522.537 3102.843


  41. 2.3.4 Compare
Our Model: latency [ms] before/after engine optimization

    Galaxy s7 (Before → After): MobileNet_V1 97.767 → 81.869, MobileNet_V2 64.998 → 58.373, Detector 14.981 → 14.876, Alignment 34.346 → 34.336, Recognizer 38.218 → 32.617 (up to 1.2x faster)
    iPhone 7 (Before → After): MobileNet_V1 40.262 → 29.215, MobileNet_V2 28.38 → 22.577, Detector 6.086 → 5.618, Alignment 16.482 → 14.647, Recognizer 16.836 → 13.106 (up to 1.3x faster)
    i7-4770hq (Without AVX → With AVX): MobileNet_V1 78.617 → 29.147, MobileNet_V2 45.513 → 26.202, Detector 11.209 → 6.619, Alignment 25.224 → 13.001, Recognizer 25.371 → 12.437 (up to 2.7x faster)


  42. 2.3.5 Other
    Batch Normalization Folding
    Pre-act: BN → ACT → Conv
    Post-act: Conv → BN → ACT
    Post-act folded: (Conv + BN) → ACT

    Measured result, post-act [ms]:
    1 thread: 104.58
    6 threads: 29.436
    * Measured on M


  43. 2.3.5 Other
    Batch Normalization Folding
    Post-act: Conv → BN → ACT
    BN folding: (Conv + BN) → ACT
    Layer fusion: (Conv + BN + Act)
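BN folding is a per-output-channel rescaling of the convolution weights and bias. A NumPy sketch using the standard inference-time BN formula follows; shapes and names are illustrative, not the engine's API.

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv.
    w: (D, C, K, K) weights, b/gamma/beta/mean/var: (D,) per output channel."""
    scale = gamma / np.sqrt(var + eps)            # per-output-channel scale
    w_folded = w * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

# check: conv + BN == folded conv, using a 1x1 conv (a plain channel mix)
rng = np.random.default_rng(1)
D, C = 4, 3
w = rng.standard_normal((D, C, 1, 1)); b = rng.standard_normal(D)
gamma = rng.standard_normal(D); beta = rng.standard_normal(D)
mean = rng.standard_normal(D); var = rng.random(D) + 0.5
x = rng.standard_normal(C)

conv = (w[:, :, 0, 0] @ x) + b
bn_out = gamma * (conv - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fold_bn(w, b, gamma, beta, mean, var)
folded_out = (wf[:, :, 0, 0] @ x) + bf
print(np.allclose(bn_out, folded_out))  # True
```

Because the fold happens once at load time, the BN layer disappears from the inference graph entirely, which is where the latency win on the next slide comes from.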


  44. 2.3.5 Other
    Our Model: latency [ms] with BN folding and layer fusion

    iPhone 7 (Original → BN folding → Layer fusion): MobileNet_V1 28.592 → 27.508 → 25.508, MobileNet_V2 22.141 → 20.826 → 18.826, Detector 7.1 → 6.069 → 6.069, Alignment 14.822 → 13.99 → 13.99, Recognizer 12.625 → 12.378 → 12.378 (up to 1.19x faster)
    Galaxy s7 (Original → BN folding → Layer fusion): MobileNet_V1 83.28 → 81.44 → 81.262, MobileNet_V2 60.664 → 58.217 → 57.091, Detector 16.1 → 15.1 → 15.1, Alignment 36.407 → 36.325 → 36.325, Recognizer 32.583 → 32.299 → 32.299 (up to 1.05x faster)
    i7-4770hq (Original → BN folding → Layer fusion): MobileNet_V1 35.98 → 31.295 → 29.065, MobileNet_V2 30.454 → 27.912 → 23.702, Detector 7.01 → 6.494 → 6.494, Alignment 12.379 → 12.313 → 12.313, Recognizer 14.241 → 11.021 → 11.021 (up to 1.1x faster)


  45. In the face detector paper,
    there are hidden layers that are not included in the
    reported speed measurements.


  46. 2.4.1 Optimization Not Found in the Paper
    Face Detector (Post Processing)
    1. Priorbox
    2. Detection output layer
    https://arxiv.org/pdf/1705.02950.pdf


  47. 2.4.1 Optimization Not Found in the Paper
    Accuracy vs post-processing parameters:
    NMS_TOP_K    TOP_K  Confidence  Accuracy drop [%]
    5000 (base)  750    0.1         -
    2500         750    0.1         -
    2500         750    0.05        0.0019%
    1250         750    0.05        0.0099%
    Little difference in accuracy (reasonable)

    iPhone 7 (Detector):
    NMS_TOP_K  TOP_K  Confidence  Threads=1 [ms]
    5000       1000   0.1         41.09
    2500       750    0.1         12.49
    1250       750    0.05        6.11
    41.09 ms ➡ 6.11 ms (up to 6x faster)

    Galaxy s7 (Detector):
    NMS_TOP_K  TOP_K  Confidence  Threads=1 [ms]  Threads=2 [ms]
    5000       1000   0.1         46.52           23.05
    2500       750    0.1         24.94           12.92
    1250       750    0.05        17.25           10.33
    46.518 ms ➡ 17.253 ms (up to 2.6x faster)
    23.052 ms ➡ 10.334 ms (up to 2.2x faster)

    i7-4770hq (Detector):
    NMS_TOP_K  TOP_K  Confidence  Threads=1 [ms]  Threads=2 [ms]
    5000       1000   0.1         13.49           11.69
    2500       750    0.1         8.28            6.61
    1250       750    0.05        6.82            4.83
    13.49 ms ➡ 6.82 ms (up to 2x faster)
    11.69 ms ➡ 4.83 ms (up to 2.4x faster)
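A sketch of how the three knobs above (confidence threshold, NMS_TOP_K, TOP_K) fit into SSD-style detector post-processing, assuming greedy NMS. The thresholds and IoU cutoff here are illustrative, not the engine's actual values.

```python
import numpy as np

def iou(box, others):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], others[:, 0]); y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2]); y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(others) - inter)

def postprocess(boxes, scores, conf=0.05, nms_top_k=1250, top_k=750, iou_thresh=0.3):
    """Confidence filter, keep the nms_top_k best candidates, greedy NMS,
    return at most top_k boxes. Shrinking the candidate pools cuts latency."""
    keep = scores >= conf                          # 1) confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    idx = np.argsort(scores)[::-1][:nms_top_k]     # 2) pre-NMS top-k
    selected = []
    while idx.size and len(selected) < top_k:      # 3) greedy NMS, post-NMS top-k
        best, rest = idx[0], idx[1:]
        selected.append(best)
        idx = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[selected], scores[selected]

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
kept_boxes, kept_scores = postprocess(boxes, scores)
print(kept_scores)  # the overlapping lower-scored box is suppressed
```

The speedups in the tables come from shrinking `nms_top_k` and raising `conf`, so far fewer candidates ever reach the NMS loop.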


  48. There is room for speed improvement
    even in floating point.


  49. 2.4.2 Floating Point
    IEEE Standard 754 Floating Point Numbers
    1 bit: sign
    8 bits: exponent
    23 bits: fraction (significand, mantissa)
    N = (−1)^s × (1.m) × 2^(e−127)


  50. 2.4.2 Floating Point
    IEEE Standard 754 Floating Point Numbers
    1 bit: sign, 8 bits: exponent, 23 bits: fraction (significand, mantissa)
    N = (−1)^s × (1.m) × 2^(e−127)
    https://www.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html
    Some special cases
    • Zero: exponent = 0, significand = 0
    • NaN (Not a Number): exponent = 255, significand ≠ 0
    • Denormalized (subnormal) numbers: exponent = 0, significand ≠ 0
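The bit layout above can be inspected with Python's struct module. This is an illustrative helper for checking the special cases, not part of the engine.

```python
import struct

def decode_float32(x):
    """Split a float32 into IEEE 754 sign (1 bit), exponent (8), fraction (23)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

print(decode_float32(1.0))    # (0, 127, 0): N = (-1)^0 x 1.0 x 2^(127-127)
print(decode_float32(1e-40))  # exponent == 0 with nonzero fraction: subnormal
```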


  51. 2.4.2 Floating Point
    Denormalized (Subnormal) Numbers
Looking at the model's weights, there were a lot of subnormal values.


  52. Experiment
    Are floating-point operations slow when subnormal values
    hit the chipset of a specific device?
    Really slow :(


  53. 2.4.2 Floating Point
    These values are essentially zero anyway.
    The method we can take is to replace each subnormal number with a value expressible in normalized form: in practice, zero.
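A sketch of that replacement with NumPy, flushing subnormal float32 weights to exact zero. `flush_subnormals` is a hypothetical helper for illustration, not the engine's API.

```python
import numpy as np

SMALLEST_NORMAL32 = np.finfo(np.float32).tiny   # smallest positive normalized float32

def flush_subnormals(weights):
    """Replace subnormal float32 values (0 < |x| < 2^-126) with zero.
    They are already almost zero, so model accuracy is unaffected."""
    w = weights.copy()
    subnormal = (w != 0) & (np.abs(w) < SMALLEST_NORMAL32)
    w[subnormal] = 0.0
    return w, int(subnormal.sum())

w = np.array([1.0, 1e-40, -1e-42, 0.0, 2e-38], dtype=np.float32)
flushed, n = flush_subnormals(w)
print(n)  # 2 subnormal weights replaced
```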


  54. 2.4.2 Floating Point
    After replacing the values, the accuracy of the model was not affected. However:

    Device                   Galaxy s7  Pixel 3  Galaxy Note 10+  iPhone 7
    Geekbench (single-core)  341        489      756              744
    Model latency [ms]       159.10     29.71    57.90            13.71
    Replace denormals [ms]   32.97      29.71    11.73            13.71

    159.10 ms ➡ 32.97 ms (up to 5x faster)
    57.90 ms ➡ 11.73 ms (up to 5x faster)


  55. 3. Engine to Product


  56. 3. Making AI Into a Product?
    Developer + Planner + Designer = ?
    Developer + Planner + Designer = Face Sign Application


  57. 3.1 Flowchart
    (Example: Seungyoun Yi) Detect → Landmark (Face Alignment) → Recognize


  58. 3.1 Flowchart
    1) Register with accurate face data
    2) Recognize with minimal reference time
    3) Seamless interaction between server and client
    Registration: Detect → Landmark → Recognize (split across client and server)
    Recognition: Detect → Landmark → Recognize (split across client and server)


  59. 3.2 Client
    User Recognition (Entrance)
    Detect preprocessing: frame processing (device type, device tilt), preprocess, UX design and development
    Landmark postprocessing: face size, face alignment, network, postprocess, UX design and development


  60. 3.2.1 Detect Preprocessing
    Frame Handling
    1. OpenCV (Laplacian)
    2. Gyroscope

    bool isBlurryImage(const cv::Mat &grayImage, const int threshold) {
        cv::Mat laplacian;
        cv::Laplacian(grayImage, laplacian, CV_64F);
        cv::Scalar mean;
        cv::Scalar stddev;
        cv::meanStdDev(laplacian, mean, stddev, cv::Mat());
        // low variance of the Laplacian means few edges, i.e. a blurry frame
        double variance = pow(stddev.val[0], 2);
        return (variance <= threshold);
    }


  61. 3.2.1 Detect Preprocessing
    1. Reduce eye strain
    2. Change camera recognition range
    Preprocess, UX Design and Development


  62. 3.2.1 Detect Preprocessing
    3. Insert Identity
    4. Put Face in the Guideline
    Preprocess, UX Design and Development


  63. 3.2.2 Landmark Postprocessing
    Face Size
    1. Set a minimum face size for the engine
    2. Limit faces that are too small
    3. Limit faces that are too big


  64. 3.2.2 Landmark Postprocessing
    Face Size
    4. Limit faces cut off by the frame
    5. Optional limit when multiple people are recognized


  65. 3.2.2 Landmark Postprocessing
    Face Status
    1. Call the API only when movement is small
    2. Set the face recognition location


  66. 3.2.2 Landmark Postprocessing
    Face Status
    3. Face Alignment
    4. Euler Angle

    import numpy as np

    # compute the center of mass for each eye
    leftEyeCenter = leftEyePts.mean(axis=0).astype("int")
    rightEyeCenter = rightEyePts.mean(axis=0).astype("int")

    # compute the angle between the eye centroids
    dY = rightEyeCenter[1] - leftEyeCenter[1]
    dX = rightEyeCenter[0] - leftEyeCenter[0]
    angle = np.degrees(np.arctan2(dY, dX)) - 180

    https://www.pyimagesearch.com/2017/05/22/face-alignment-with-opencv-and-python/


  67. 3.2.2 Landmark Postprocessing
    5. Eye Blink
    6. Wink
    Face Status

    https://www.pyimagesearch.com/2017/04/24/eye-blink-detection-opencv-python-dlib/
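The linked article detects blinks with the eye aspect ratio (EAR) computed over the six landmarks of each eye. A minimal sketch follows; the landmark coordinates here are toy values, not real detector output.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) landmark coordinates ordered p1..p6 around the eye.
    EAR drops toward 0 when the eye closes (Soukupova & Cech, 2016)."""
    a = np.linalg.norm(eye[1] - eye[5])   # vertical distance p2-p6
    b = np.linalg.norm(eye[2] - eye[4])   # vertical distance p3-p5
    c = np.linalg.norm(eye[0] - eye[3])   # horizontal distance p1-p4
    return (a + b) / (2.0 * c)

open_eye = np.array([[0, 0], [1, 1], [2, 1], [3, 0], [2, -1], [1, -1]], dtype=float)
closed_eye = np.array([[0, 0], [1, 0.1], [2, 0.1], [3, 0], [2, -0.1], [1, -0.1]], dtype=float)
print(eye_aspect_ratio(open_eye) > eye_aspect_ratio(closed_eye))  # True
```

A blink is then a short dip of the EAR below a fixed threshold across consecutive frames; a wink is the same dip on one eye only.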


  68. 3.2.3 Speed Optimization
    Network
    1. Consideration for encryption: encrypt recognized face → encrypt features → encrypt network → encrypt result
    2. Find the optimal result across multiple requests
    Checks: Is the result more accurate than the threshold value? Is the detected result encrypted? Is the recognized person the same?


  69. 3.2.3 Speed Optimization
    3. Inference Size Optimization
    Network: Camera image (3 MB) → Engine → Recognition image (100 KB) → Server transport binary (30 KB)


  70. 3.2.3 Speed Optimization
    4. Network Size Optimization (change data type)
    Network: [UInt8] binary
    15-run network API test:
    4.18 ms ➡ 2.60 ms (up to 1.6x faster)
    4.02 ms ➡ 2.41 ms (up to 1.6x faster)


  71. 3.2.3 Speed Optimization
    4. Network Size Optimization (gzip)
    Network
    15-run network API test, before ➡ after gzip:
    2.60 ms ➡ 1.85 ms (up to 1.7x faster)
    2.41 ms ➡ 1.75 ms (up to 1.7x faster)


  72. 3.2.4 Landmark Postprocessing
    1. Face Size
    2. Face Status
    Postprocess, UX Design, and Development


  73. 3.2.4 Landmark Postprocessing
    3. Correct UX flow based on accuracy
    Postprocess, UX Design, and Development


  74. 3.2.5 Customization
    Finding Optimal Value in Application
    1. Face Related Setting
    2. Network Related Setting
    3. Process Related Setting


  75. 4.1 Engine Optimization
    Lightweight model: knowledge distillation, neural architecture search, latency-aware design, low-resolution images
    Engine optimization: layer fusion, parallel framework, SIMD (AVX/NEON), memory reuse, layer optimization
    Optimization to reduce even 1 ms: hidden bottleneck optimization, replacing subnormal numbers


  76. 4.2 Service Optimization
    Detect preprocessing: frame processing (device tilt, device type), preprocess, UX design and development
    Landmark postprocessing: face size, face alignment, network, postprocess, UX design and development


  77. 4.3 Mobile AI in Production
    Field Test
    1) TechTalk
    2) Internal Cafe
    3) Balance Festival
    4) Practice, DEVIEW 2019
    5) Practice, LINE DEV DAY 2019


  78. 4.3 Mobile AI in Production
    Flows


  79. 4.3 Mobile AI in Production
    Network


  80. 4.4 Goal of AI
    Engine Development + Service Development = Wow, User Convenience
