Client-side deep learning at Mercari

Daiki Kumazawa

August 27, 2019

Transcript

  1. 1
    Confidential - Do Not Share
    Client-side deep learning at Mercari
    Daiki Kumazawa
    Software Engineer Intern
    AI Engineering Team

  2. 2
    $ whoami
    - M.S. Statistics @ Stanford (2nd year)
    - Masa-son Scholar 1st generation
    - Interests: Edge ML, NAS, Kubernetes, Econometrics, Cooking, Singing

  3. 3
    Why client-side deep learning
    We want to make selling a breeze
    Real-time UX enabled by client-side DL is an important element towards accomplishing this goal.
    - Category suggestion
    - Related item search

  4. 4
    What must happen to make DL work on edge
    There is a trade-off between:
    - Accuracy
    - Latency
    - Energy consumption
    - Model size
    → Need both algorithmic & engineering efforts to attain the targeted balance
    e.g.) “Not all ops are created equal!” [1]
    e.g.) “Programmability” of mobile GPUs [2]
    Image credits: [1], [2]

  5. 5
    Landscape of execution environment
    Facebook’s report [2] shows that SoCs and GPU specifications are diverse
    → Focus on running inference on CPUs & work on algorithmic optimization
    Image credit: [2]

  6. 6
    Designing efficient networks: Manual efforts
    Can we build an NN suited for inference on edge devices?
    e.g.) MobileNets V1, V2, and V3 ([3], [4], & [5])
    - Depthwise separable conv to reduce computation
    - Inverted residual with linear bottleneck that reduces main memory access
    - More effective non-linearity (hard swish)
    - Attention by small squeeze & excitation
    - Greedy expand ratio reduction by NetAdapt [10]
    - More lightweight final block etc...
    Image credits: [4], [5]
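
To make the first and third bullets concrete, the sketch below counts multiply-adds for a standard convolution and for its depthwise separable counterpart, following the cost model in the MobileNets paper [3], and defines hard swish as given in [5]. The function names and the layer shape in the usage lines are illustrative assumptions, not taken from the slides.

```python
def standard_conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k standard convolution on an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def hard_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear stand-in for swish."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out, h, w = 3, 256, 256, 14, 14
ratio = depthwise_separable_madds(k, c_in, c_out, h, w) / standard_conv_madds(k, c_in, c_out, h, w)
print(f"cost ratio: {ratio:.4f}")  # equals 1/c_out + 1/k**2
```

For a 3 x 3 kernel the ratio is dominated by the 1/k² term, so the separable form needs roughly one eighth to one ninth of the computation.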

  7. 7
    Designing efficient networks: Automated ways
    Can we automatically build an NN suited for inference on edge devices?
    Usual statistical learning problem:
    A model parameterized by weight $w$: $f(x; w)$
    Let $x$, $y$, and $\mathcal{L}$ denote input, output, and loss. We want to solve:
    $\min_{w} \mathbb{E}_{(x, y)}\left[\mathcal{L}(f(x; w), y)\right]$
    “Architecture search” problem:
    Let the model include $\alpha$, architecture parameters: $f(x; w, \alpha)$.
    We want to optimize for both $w$ and $\alpha$.

  8. 8
    Two influential yet costly approaches
    MnasNet [7] (RL-based)
    → Needs to sample thousands of models from the controller and to train each sampled child model from scratch
    FBNet [8] (Differentiable)
    → Incurs high GPU memory consumption because the supernet formulation uses a superposition of candidate nodes
    These approaches (and their variants), although cheaper than the original NAS [6], are still costly unless you have abundant computational resources.
    Image credits: [7], [8]
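
The memory cost of the differentiable formulation can be seen in a toy "mixed" layer: the output is a weighted sum over every candidate operation, so every candidate's activations must be computed and kept for the backward pass. The sketch below is a simplified illustration only. The three candidate ops are hypothetical, and a plain softmax over the architecture parameters is swapped in for the Gumbel-softmax sampling that FBNet [8] describes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations for one layer of the supernet.
CANDIDATE_OPS = {
    "identity": lambda x: list(x),
    "double": lambda x: [2.0 * v for v in x],
    "zero": lambda x: [0.0 for _ in x],
}


def mixed_layer(x, alpha):
    """Return the softmax(alpha)-weighted sum of all candidate outputs,
    plus the number of activation values that had to be materialized."""
    weights = softmax(alpha)
    outputs = [op(x) for op in CANDIDATE_OPS.values()]  # every candidate runs
    mixed = [
        sum(wgt * out[i] for wgt, out in zip(weights, outputs))
        for i in range(len(x))
    ]
    return mixed, len(outputs) * len(x)


mixed, held = mixed_layer([1.0, 2.0], alpha=[0.0, 0.0, 0.0])
print(mixed, held)
```

With K candidates per layer the activation footprint grows by about a factor of K compared with training a single chosen architecture, which is the overhead the slide points to.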

  9. 9
    Finding NNs more efficiently: Examples
    ChamNet [9]
    Uses Gaussian Processes as accuracy & energy predictors, so we no longer need to train the sampled models.
    +: Searching for the optimal network is fast once the GPs are trained
    -: To fit the GPs, you need to train hundreds of models, which can still be costly
    NetAdapt [10]
    Iteratively optimizes the network by reducing the number of filters.
    +: Can iterate towards the targeted latency, unlike NAS methods which often require careful tuning of the latency loss weight
    -: Do pruning algorithms attain better results than NAS? (cf. Liu et al. [11])
    Image credits: [9], [10]

  10. 10
    Our approach
    Single-Path NAS [12]
    → Reduces the architecture search cost from hundreds/thousands of GPU hours to a few hours on a TPU by using the “superkernel”
    Device  SoC generation (Snapdragon)  Model        ImageNet top-1 accuracy (%)*  Latency (ms)*
    A       845                          SPNAS        74.48                         77.90
    A       845                          MobileNetV2  71.80                         76.36
    B       808                          SPNAS        73.07                         113.92
    B       808                          MobileNetV2  71.80                         162.82
    C       670                          SPNAS        73.15                         92.14
    C       670                          MobileNetV2  71.80                         111.85
    D       801                          SPNAS        71.93                         84.65
    D       801                          MobileNetV2  71.80                         120.82
    *All results are for float32
    → SPNAS discovers architectures with a good balance between latency and accuracy compared to MobileNetV2
    Takeaway: MobileNetV3 is a strong baseline at the moment, with Single-Path NAS being a relatively cheap NAS-based model optimization option
    Image credit: [12]
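
The trade-off in the table is easier to read as per-device deltas. The short script below recomputes the accuracy gain and relative latency change of SPNAS over MobileNetV2 from the figures on the slide; only the variable and function names are introduced here.

```python
# (device, model) -> (ImageNet top-1 accuracy in %, latency in ms), copied from the slide.
RESULTS = {
    ("A", "SPNAS"): (74.48, 77.90),
    ("A", "MobileNetV2"): (71.80, 76.36),
    ("B", "SPNAS"): (73.07, 113.92),
    ("B", "MobileNetV2"): (71.80, 162.82),
    ("C", "SPNAS"): (73.15, 92.14),
    ("C", "MobileNetV2"): (71.80, 111.85),
    ("D", "SPNAS"): (71.93, 84.65),
    ("D", "MobileNetV2"): (71.80, 120.82),
}


def compare(device):
    """Accuracy gain (points) and latency change (%) of SPNAS vs. MobileNetV2."""
    acc_s, lat_s = RESULTS[(device, "SPNAS")]
    acc_m, lat_m = RESULTS[(device, "MobileNetV2")]
    return {
        "accuracy_gain": acc_s - acc_m,
        "latency_change_pct": 100.0 * (lat_s - lat_m) / lat_m,
    }


for device in "ABCD":
    c = compare(device)
    print(f"{device}: accuracy {c['accuracy_gain']:+.2f} pts, "
          f"latency {c['latency_change_pct']:+.1f}%")
```

On these numbers SPNAS is about 2% slower than MobileNetV2 on device A while gaining 2.68 points of accuracy, and roughly 18% to 30% faster on devices B to D with accuracy gains between 0.13 and 1.35 points.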

  11. 11
    Model quantization
    Intuition: Reduce computation and storage cost by representing the model with fewer bits than the usual float representation
    e.g.) 8 bits [14]
    e.g.) 1 bit (XNOR-Net, binary weight net) [15]
    Things to consider:
    - Can the entire inference be performed in fixed-point ops?
    → Not just reducing the model storage size but optimizing latency
    - Is the method effective even for parameter-efficient models?
    → Tends to be “easier” to quantize big models like AlexNet, as pointed out in [13], [14]
    Image credits: [14], [15]

  12. 12
    Our approach
    Post-training per-channel quantization of weights & activations into uint8 is a reasonable place to start
    - Usually less than 2% accuracy degradation, even for already parameter-efficient networks like MobileNets
    - Often rich API-level support from frameworks like TFLite & QNNPACK
    Image credit: [13]
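
As a rough illustration of why per-channel ranges help, the sketch below applies the affine uint8 scheme described in [13] and [14] (real value ≈ scale × (q − zero_point)) once with a single range for the whole tensor and once with a separate range per channel. This is a minimal pure-Python sketch, not the implementation used by TFLite or QNNPACK; the weight values and all function names are made up for the example.

```python
def quant_params(values, qmin=0, qmax=255):
    """Return (scale, zero_point) for an affine mapping of `values` onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:            # all-zero input: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    return [scale * (q - zero_point) for q in q_values]


def max_error(values, scale, zero_point):
    """Largest absolute round-trip error over `values`."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights: two output channels with very different ranges.
channels = [[-0.02, 0.01, 0.015], [-2.0, 0.5, 1.5]]

# Per-tensor: one (scale, zero_point) shared by all channels.
tensor_params = quant_params([v for ch in channels for v in ch])
per_tensor_err = [max_error(ch, *tensor_params) for ch in channels]

# Per-channel: each channel gets its own (scale, zero_point).
per_channel_err = [max_error(ch, *quant_params(ch)) for ch in channels]

print("per-tensor :", per_tensor_err)
print("per-channel:", per_channel_err)
```

The small-range channel is the one that benefits: with a shared range its weights collapse onto one or two integer levels, while its own range gives it the full 256 levels.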

  13. 13
    Edge ML stack: For the fastest & safest start
    [Diagram: server-side components (Cloud TPU, Kubernetes Engine, Cloud Storage, Container Registry) and the client side (TFLite); one element is marked “(Optional)”]
    - Internally developed ML platform (Lykeion) handles scheduling training jobs & storing artifacts
    - TensorFlow + TFLite for ease of integration into client devices after training models on Cloud TPUs

  14. 14
    Server side architecture
    [Diagram labels: Training jobs (MobileNetV3, Single-Path NAS); Infrastructure (Kubernetes Engine, Cloud TPU); Training images (Container Registry); Data (BigQuery, Cloud Storage); arrows labeled “Defines”, “Executed on”, and “Stores training results & models”]
    Workflow definition:
      name: trainer
      workflow:
      - name: dataset
        module: edge.dataset-puller
      - name: train
        module: edge.mobilenetv3
        args:
        - --batch_size
        - 4096
        dependencies:
        - dataset
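
The `dependencies` field in the workflow definition implies an execution order: `train` can only start once `dataset` has finished. The slides do not describe how Lykeion resolves this, so the sketch below is only one plausible reading, using a topological sort from the Python standard library. The dictionary mirrors the workflow on the slide; `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The workflow from the slide, written as a Python dict to avoid a YAML dependency.
WORKFLOW = {
    "name": "trainer",
    "workflow": [
        {"name": "dataset", "module": "edge.dataset-puller"},
        {
            "name": "train",
            "module": "edge.mobilenetv3",
            "args": ["--batch_size", "4096"],
            "dependencies": ["dataset"],
        },
    ],
}


def execution_order(config):
    """Return step names ordered so that each step follows its dependencies."""
    graph = {
        step["name"]: set(step.get("dependencies", []))
        for step in config["workflow"]
    }
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(WORKFLOW))
```

Steps without a `dependencies` entry have no predecessors and are free to run first.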

  15. 15
    Client side architecture (Android as an example)
    MediaPipe [16]
    - Can define manipulations of data as a computational graph written in a proto (.pbtxt)
    - Executed in native code, which should be preferable to writing pre/post-processing in, say, Kotlin/Java
    - Portable across various backends (iOS, Android)
    [Diagram: the view layer (relevant Activity) exchanges input/output with the controller/presenter layer, which calls the Edge ML feature: Input → Pre-process → Execute the graph → Post-process → Output]

  16. 16
    Confidential - Do Not Share
    - 1) Selecting the right graph execution engine
    - Ideal: A framework that works well across diverse backends
    - TFLite, caffe2go, SNPE, MACE etc., or compile NNs using, say, glow/tvm?
    - Need to closely follow the development of each framework
    - 2) Utilizing advances in algorithmic optimization research
    - e.g.) Gong et al. [17] proposed a promising quantization method this month
    - Straight through estimation → htan to better approximate quantized values
    Future directions (1/2)
    Image credit: [17]

    View full-size slide
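
The idea behind replacing straight-through estimation with a tanh-based function can be sketched as follows. Hard rounding is a staircase whose true gradient is zero almost everywhere, so straight-through estimation simply pretends the gradient is 1. A tanh-shaped "soft staircase" is differentiable and approaches the hard staircase as its sharpness grows. The code is an illustrative sketch in the spirit of [17], not a reproduction of the paper's exact formulation; the parameter names (`lo`, `hi`, `levels`, `k`) are assumptions.

```python
import math


def hard_quantize(x, lo, hi, levels):
    """Round x to the nearest of `levels` evenly spaced points in [lo, hi]."""
    delta = (hi - lo) / (levels - 1)
    x = min(max(x, lo), hi)
    return lo + delta * round((x - lo) / delta)


def soft_quantize(x, lo, hi, levels, k):
    """Differentiable tanh-based approximation of hard_quantize.

    Larger k gives a sharper staircase; small k is close to the identity.
    """
    delta = (hi - lo) / (levels - 1)
    x = min(max(x, lo), hi)
    i = min(int((x - lo) / delta), levels - 2)  # index of the interval holding x
    mid = lo + (i + 0.5) * delta                # midpoint of that interval
    s = 1.0 / math.tanh(0.5 * k * delta)        # rescale so interval ends map to -1 and +1
    phi = s * math.tanh(k * (x - mid))
    return lo + delta * (i + 0.5 * (phi + 1.0))


for k in (1.0, 10.0, 100.0):
    print(k, soft_quantize(0.3, -1.0, 1.0, 5, k), hard_quantize(0.3, -1.0, 1.0, 5))
```

In this sketch `k` controls how closely the soft function tracks hard rounding, so a training procedure could in principle start with a smooth function and sharpen it over time.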

  17. 17
    Future directions (2/2)
    - 3) Hosting multiple tasks more efficiently
      - Multi-task learning for deployment with fewer models (or even a single one)
      - e.g.) Optimal network subset selection with a framework like Standley et al. [18]? UberNet [19]-like parameter sharing in a single network?
    Image credits: [18], [19]

  18. 18
    Confidential - Do Not Share
    [1] Lai, Liangzhen, Naveen Suda, and Vikas Chandra. "Not all ops are created equal!." arXiv preprint arXiv:1801.04326 (2018).
    [2] Wu, Carole-Jean, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood et al. "Machine
    learning at facebook: Understanding inference at the edge." In 2019 IEEE International Symposium on High Performance
    Computer Architecture (HPCA), pp. 331-344. IEEE, 2019.
    [3] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and
    Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861
    (2017).
    [4] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "Mobilenetv2: Inverted residuals
    and linear bottlenecks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.
    [5] Howard, Andrew, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang et al. "Searching for
    mobilenetv3." arXiv preprint arXiv:1905.02244 (2019).
    [6] Zoph, Barret, and Quoc V. Le. "Neural architecture search with reinforcement learning." arXiv preprint arXiv:1611.01578 (2016).
    [7] Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. "Mnasnet:
    Platform-aware neural architecture search for mobile." In Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, pp. 2820-2828. 2019.
    [8] Wu, Bichen, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and
    Kurt Keutzer. "Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search." In Proceedings of the
    IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734-10742. 2019.
    [9] Dai, Xiaoliang, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan et al. "Chamnet: Towards
    efficient network design through platform-aware model adaptation." In Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, pp. 11398-11407. 2019.
    [10] Yang, Tien-Ju, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. "Netadapt:
    Platform-aware neural network adaptation for mobile applications." In Proceedings of the European Conference on
    Computer Vision (ECCV), pp. 285-300. 2018.
    References (1/2)

    View full-size slide

  19. 19
    References (2/2)
    [11] Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. "Rethinking the value of network pruning." arXiv preprint arXiv:1810.05270 (2018).
    [12] Stamoulis, Dimitrios, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. "Single-path nas: Designing hardware-efficient convnets in less than 4 hours." arXiv preprint arXiv:1904.02877 (2019).
    [13] Krishnamoorthi, Raghuraman. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv preprint arXiv:1806.08342 (2018).
    [14] Jacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704-2713. 2018.
    [15] Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. "Xnor-net: Imagenet classification using binary convolutional neural networks." In European Conference on Computer Vision, pp. 525-542. Springer, Cham, 2016.
    [16] Lugaresi, Camillo, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang et al. "MediaPipe: A Framework for Perceiving and Augmenting Reality." (2019).
    [17] Gong, Ruihao, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. "Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks." arXiv preprint arXiv:1908.05033 (2019).
    [18] Standley, Trevor, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. "Which Tasks Should Be Learned Together in Multi-task Learning?." arXiv preprint arXiv:1905.07553 (2019).
    [19] Kokkinos, Iasonas. "Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6129-6138. 2017.