Client-side deep learning at Mercari


Daiki Kumazawa

August 27, 2019

Transcript

  1. Client-side deep learning at Mercari

    Confidential - Do Not Share
    Daiki Kumazawa, Software Engineer Intern, AI Engineering Team
  2. $ whoami

    - M.S. Statistics @ Stanford (2nd year)
    - Masa-son Scholar, 1st generation
    - Interests: Edge ML, NAS, Kubernetes, Econometrics, Cooking, Singing
  3. Why client-side deep learning

    We want to make selling a breeze. Real-time UX enabled by client-side DL is an important element in accomplishing this goal.
    Examples: category suggestion, related item search
  4. What must happen to make DL work on edge

    There is a trade-off between:
    - Accuracy
    - Latency
    - Energy consumption
    - Model size

    → Need both algorithmic & engineering efforts to attain the targeted balance
    e.g.) “Not all ops are created equal!” [1]
    e.g.) “Programmability” of mobile GPUs [2]

    Image credits: [1], [2]
  5. Landscape of execution environment

    Facebook’s report [2] shows that SoCs and GPU specifications are diverse
    → Focus on running inference on CPUs & work on algorithmic optimization

    Image credit: [2]
  6. Designing efficient networks: Manual efforts

    Can we build an NN suited for inference on edge devices?

    e.g.) MobileNets V1, V2, and V3 ([3], [4], & [5])
    - Depthwise separable conv to reduce computation
    - Inverted residual with linear bottleneck that reduces main memory access
    - More effective non-linearity (hard swish)
    - Attention by small squeeze & excitation
    - Greedy expand ratio reduction by NetAdapt [10]
    - More lightweight final block
    etc...

    Image credits: [4], [5]
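Two of the MobileNet ideas listed on this slide are easy to check numerically. The sketch below is an illustration only, not code from the talk: it counts multiply-accumulates (MACs) for a standard versus a depthwise separable convolution, and implements hard swish in its usual form `x * ReLU6(x + 3) / 6`. The function names and the layer shape in the usage line are hypothetical, and stride 1 with "same" padding is assumed.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution (assumes stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """Piecewise-linear approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 64 -> 128 channels, 3 x 3 kernel.
    std = standard_conv_macs(56, 56, 64, 128, 3)
    sep = depthwise_separable_macs(56, 56, 64, 128, 3)
    print(f"cost ratio (separable / standard): {sep / std:.4f}")
```

The cost ratio works out to `1/c_out + 1/k^2`, so for 3 x 3 kernels the separable form needs roughly one eighth to one ninth of the computation when `c_out` is large.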
  7. Designing efficient networks: Automated ways

    Can we automatically build an NN suited for inference on edge devices?

    Usual statistical learning problem: a model $f_w$ parameterized by weights $w$. Let $x$, $y$, and $\mathcal{L}$ denote the input, output, and loss. We want to solve:

    $$\min_{w} \; \mathcal{L}\big(f_w(x),\, y\big)$$

    “Architecture search” problem: let the model also include architecture parameters $a$, giving $f_{w,a}$. We want to optimize for both $w$ and $a$:

    $$\min_{a} \, \min_{w} \; \mathcal{L}\big(f_{w,a}(x),\, y\big)$$
  8. Two influential yet costly approaches

    MnasNet [7] (RL-based)
    → Needs to sample thousands of models from the controller and to train each sampled child model from scratch

    FBNet [8] (Differentiable)
    → Incurs high GPU memory consumption because the supernet formulation uses a superposition of candidate nodes

    These approaches (and their variants), although cheaper than the original NAS [6], are still costly unless you have abundant computational resources.

    Image credits: [7], [8]
  9. Finding NNs more efficiently: Examples

    ChamNet [9]
    Uses Gaussian Processes as accuracy & energy predictors, so we no longer need to train the sampled models.
    +: Searching for the optimal network is fast once the GPs are trained
    -: To fit the GPs, you need to train hundreds of models, which can still be costly

    NetAdapt [10]
    Iteratively optimizes the network by reducing the number of filters.
    +: Can iterate towards the targeted latency, unlike NAS methods, which often require careful tuning of the latency loss weight
    -: Do pruning algorithms attain better results than NAS? (cf. Liu et al. [11])

    Image credits: [9], [10]
  10. Our approach

    Single-Path NAS [12]
    → Reduces the architecture search cost from hundreds/thousands of GPU hours to a few hours on a TPU by using the “superkernel”

    | Device | SoC generation (Snapdragon) | Model       | ImageNet top-1 accuracy (%)* | Latency (ms)* |
    |--------|-----------------------------|-------------|------------------------------|---------------|
    | A      | 845                         | SPNAS       | 74.48                        | 77.90         |
    | A      | 845                         | MobileNetV2 | 71.80                        | 76.36         |
    | B      | 808                         | SPNAS       | 73.07                        | 113.92        |
    | B      | 808                         | MobileNetV2 | 71.80                        | 162.82        |
    | C      | 670                         | SPNAS       | 73.15                        | 92.14         |
    | C      | 670                         | MobileNetV2 | 71.80                        | 111.85        |
    | D      | 801                         | SPNAS       | 71.93                        | 84.65         |
    | D      | 801                         | MobileNetV2 | 71.80                        | 120.82        |

    *All results are for float32

    → SPNAS discovers architectures with a good balance between latency and accuracy compared to MobileNetV2

    Takeaway: MobileNetV3 is a strong baseline at the moment, with Single-Path NAS being a relatively cheap NAS-based model optimization option

    Image credit: [12]
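To make the "good balance" claim concrete, the per-device differences can be computed directly from the reported numbers. The sketch below only re-arranges the figures from the slide; the helper name `compare` and the tuple layout are illustrative choices, not part of the talk.

```python
# (device, Snapdragon SoC, model, ImageNet top-1 %, latency ms), copied from the slide.
RESULTS = [
    ("A", 845, "SPNAS", 74.48, 77.90),
    ("A", 845, "MobileNetV2", 71.80, 76.36),
    ("B", 808, "SPNAS", 73.07, 113.92),
    ("B", 808, "MobileNetV2", 71.80, 162.82),
    ("C", 670, "SPNAS", 73.15, 92.14),
    ("C", 670, "MobileNetV2", 71.80, 111.85),
    ("D", 801, "SPNAS", 71.93, 84.65),
    ("D", 801, "MobileNetV2", 71.80, 120.82),
]


def compare(results):
    """Return (device, accuracy gain in points, MobileNetV2 latency / SPNAS latency)."""
    by_device = {}
    for device, _soc, model, top1, latency in results:
        by_device.setdefault(device, {})[model] = (top1, latency)
    rows = []
    for device in sorted(by_device):
        sp_acc, sp_lat = by_device[device]["SPNAS"]
        mb_acc, mb_lat = by_device[device]["MobileNetV2"]
        rows.append((device, round(sp_acc - mb_acc, 2), round(mb_lat / sp_lat, 2)))
    return rows


if __name__ == "__main__":
    for device, acc_gain, speedup in compare(RESULTS):
        print(f"device {device}: +{acc_gain} points top-1, {speedup}x latency ratio")
```

A latency ratio above 1 means the SPNAS model is faster. On devices B, C, and D the SPNAS model is both faster and more accurate; on device A it is about 2.7 points more accurate but slightly slower (77.90 ms vs. 76.36 ms).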
  11. Model quantization

    Intuition: Reduce computation and storage cost by representing the model with fewer bits than the usual float representation
    e.g.) 8 bits [14]
    e.g.) 1 bit (XNOR-Net, binary weight net) [15]

    Things to consider:
    - Can the entire inference be performed in fixed-point ops?
      → Not just reducing the model storage size but optimizing latency
    - Is the method effective even for parameter-efficient models?
      → Tends to be “easier” to quantize big models like AlexNet, as pointed out in [13], [14]

    Image credits: [14], [15]
  12. Our approach

    Post-training per-channel quantization of weights & activations into uint8 is a reasonable place to start
    - Usually less than 2% accuracy degradation, even for already parameter-efficient networks like MobileNets
    - Often rich API-level support from frameworks like TFLite & QNNPACK

    Image credit: [13]
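As a rough illustration of what per-channel uint8 quantization does, the sketch below applies the common asymmetric (affine) scheme `real ≈ scale * (q - zero_point)` to each channel independently. It is plain Python written for this transcript, not the TFLite or QNNPACK API; the function names and the toy weights are hypothetical, and real frameworks handle calibration and kernels for you.

```python
def quantize_channel(values, num_bits=8):
    """Affine-quantize one channel to unsigned integers; returns (q, scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # all-zero channel: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - lo / scale))
    q = [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_channel(q, scale, zero_point):
    """Map integers back to approximate real values."""
    return [scale * (qi - zero_point) for qi in q]


def quantize_per_channel(weights):
    """Quantize each output channel with its own scale and zero point."""
    return [quantize_channel(channel) for channel in weights]


if __name__ == "__main__":
    # Toy weights: two channels with very different dynamic ranges.
    weights = [[-1.0, 0.5, 1.0], [-0.01, 0.002, 0.01]]
    for channel, (q, scale, zp) in zip(weights, quantize_per_channel(weights)):
        print(q, scale, zp, dequantize_channel(q, scale, zp))
```

Because each channel gets its own scale, a channel with a small dynamic range is not forced onto the coarse grid of a channel with a large one, which is the motivation for quantizing per channel rather than per tensor.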
  13. Edge ML stack: For the fastest & safest start

    - Internally developed ML platform (Lykeion) handles scheduling training jobs & storing artifacts
    - TensorFlow + TFLite for ease of integration into client devices after training models on Cloud TPUs

    Diagram: Server-side (Cloud TPU, Kubernetes Engine, Cloud Storage, Container Registry), Client-side (TFLite), (Optional)
  14. Server side architecture

    Models: MobileNetV3, Single-Path NAS

    Workflow definition (defines the training jobs):

        name: trainer
        workflow:
          - name: dataset
            module: edge.dataset-puller
          - name: train
            module: edge.mobilenetv3
            args:
              - --batch_size
              - 4096
            dependencies:
              - dataset

    Diagram labels:
    - Training jobs: executed on the infrastructure
    - Infrastructure: Kubernetes Engine, Cloud TPU, Container Registry (training images)
    - Data: BigQuery, Cloud Storage (stores training results & models)
  15. Client side architecture (Android as an example)

    MediaPipe [16]
    - Can define manipulations of data as a computational graph written in a proto (.pbtxt)
    - Executed in native code, which should be preferable to writing pre/post-processing in, say, Kotlin/Java
    - Portable across various backends (iOS, Android)

    Diagram:
    - View layer (relevant Activity): input/output
    - Controller/presenter layer
    - Edge ML feature: Input → Pre-process → Execute the graph → Post-process → Output
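The three stages of the Edge ML feature (pre-process, execute the graph, post-process) can be sketched as a small function pipeline. This is plain Python for illustration only and does not use the MediaPipe API; in the design described on the slide these steps would live in a MediaPipe graph and run in native code. All names here, including the stand-in `fake_graph`, are hypothetical.

```python
import math


def preprocess(pixels):
    """Scale uint8 pixel values (0..255) to floats in [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def postprocess(logits, labels, k=3):
    """Softmax the logits and return the top-k (label, probability) pairs."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:k]


def run_feature(pixels, run_graph, labels, k=3):
    """Input -> pre-process -> execute the graph -> post-process -> output."""
    return postprocess(run_graph(preprocess(pixels)), labels, k)


if __name__ == "__main__":
    # Stand-in for the real inference engine: returns fixed logits.
    def fake_graph(_inputs):
        return [1.0, 3.0, 2.0]

    print(run_feature([0, 128, 255], fake_graph, ["a", "b", "c"], k=2))
```

Keeping the stages behind one entry point mirrors the diagram: the view and presenter layers only see input and output, while the processing steps can be swapped or moved to native code without touching them.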
  16. Future directions (1/2)

    - 1) Selecting the right graph execution engine
      - Ideal: A framework that works well across diverse backends
      - TFLite, caffe2go, SNPE, MACE etc., or compile NNs using, say, Glow/TVM?
      - Need to closely follow the development of each framework
    - 2) Utilizing advances in algorithmic optimization research
      - e.g.) Gong et al. [17] proposed a promising quantization method this month
      - Straight-through estimation → tanh to better approximate quantized values

    Image credit: [17]
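The idea behind replacing straight-through estimation with a tanh-based function can be sketched as follows. This is a simplified illustration in the spirit of [17], not the paper's exact formulation or training procedure: a hard uniform quantizer is a staircase, while the soft version joins adjacent levels with a scaled tanh whose sharpness is controlled by `k`. The function names and parameters are assumptions made for this example.

```python
import math


def hard_quantize(x, lo, hi, levels):
    """Uniform quantizer: clip to [lo, hi] and round to the nearest level."""
    delta = (hi - lo) / (levels - 1)
    x = min(max(x, lo), hi)
    return lo + delta * round((x - lo) / delta)


def soft_quantize(x, lo, hi, levels, k):
    """Smooth staircase: a scaled tanh inside each interval between adjacent levels.

    Larger k makes the curve approach the hard quantizer; smaller k makes it
    closer to the identity on [lo, hi].
    """
    delta = (hi - lo) / (levels - 1)
    x = min(max(x, lo), hi)
    i = min(int((x - lo) / delta), levels - 2)  # index of the interval containing x
    mid = lo + (i + 0.5) * delta                # midpoint of that interval
    s = 1.0 / math.tanh(0.5 * k * delta)        # scale so the ends reach -1 and +1
    phi = s * math.tanh(k * (x - mid))
    return lo + delta * (i + (phi + 1.0) / 2.0)


if __name__ == "__main__":
    for k in (1.0, 10.0, 200.0):
        print(k, soft_quantize(0.1, -1.0, 1.0, 5, k), hard_quantize(0.1, -1.0, 1.0, 5))
```

Unlike the hard quantizer, the soft version has a non-zero slope everywhere inside the range, so gradients can flow through it during training while `k` is increased to tighten the approximation.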
  17. Future directions (2/2)

    - 3) Hosting multiple tasks more efficiently
      - Multi-task learning for deployment with fewer (or even a single) model(s)
      - e.g.) Optimal network subset selection with a framework like Standley et al. [18]? UberNet [19]-like parameter sharing in a single network?

    Image credits: [18], [19]
  18. References (1/2)

    [1] Lai, Liangzhen, Naveen Suda, and Vikas Chandra. "Not all ops are created equal!." arXiv preprint arXiv:1801.04326 (2018).
    [2] Wu, Carole-Jean, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood et al. "Machine learning at facebook: Understanding inference at the edge." In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 331-344. IEEE, 2019.
    [3] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
    [4] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. "Mobilenetv2: Inverted residuals and linear bottlenecks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.
    [5] Howard, Andrew, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang et al. "Searching for mobilenetv3." arXiv preprint arXiv:1905.02244 (2019).
    [6] Zoph, Barret, and Quoc V. Le. "Neural architecture search with reinforcement learning." arXiv preprint arXiv:1611.01578 (2016).
    [7] Tan, Mingxing, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. "Mnasnet: Platform-aware neural architecture search for mobile." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828. 2019.
    [8] Wu, Bichen, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. "Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734-10742. 2019.
    [9] Dai, Xiaoliang, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan et al. "Chamnet: Towards efficient network design through platform-aware model adaptation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11398-11407. 2019.
    [10] Yang, Tien-Ju, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. "Netadapt: Platform-aware neural network adaptation for mobile applications." In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285-300. 2018.
  19. References (2/2)

    [11] Liu, Zhuang, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. "Rethinking the value of network pruning." arXiv preprint arXiv:1810.05270 (2018).
    [12] Stamoulis, Dimitrios, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. "Single-path nas: Designing hardware-efficient convnets in less than 4 hours." arXiv preprint arXiv:1904.02877 (2019).
    [13] Krishnamoorthi, Raghuraman. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv preprint arXiv:1806.08342 (2018).
    [14] Jacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704-2713. 2018.
    [15] Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. "Xnor-net: Imagenet classification using binary convolutional neural networks." In European Conference on Computer Vision, pp. 525-542. Springer, Cham, 2016.
    [16] Lugaresi, Camillo, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang et al. "MediaPipe: A Framework for Perceiving and Augmenting Reality." (2019).
    [17] Gong, Ruihao, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. "Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks." arXiv preprint arXiv:1908.05033 (2019).
    [18] Standley, Trevor, Amir R. Zamir, Dawn Chen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. "Which Tasks Should Be Learned Together in Multi-task Learning?." arXiv preprint arXiv:1905.07553 (2019).
    [19] Kokkinos, Iasonas. "Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6129-6138. 2017.