Upgrade to Pro — share decks privately, control downloads, hide ads and more …

OpenTalks.AI - Илья Жариков, Optimization of neural networks and their development

99390fdb3165e59dcda59ab1b162fa1f?s=47 opentalks2
February 04, 2021

OpenTalks.AI - Илья Жариков, Optimization of neural networks and their development



February 04, 2021


  1. Optimization of neural networks and their development Ilia Zharikov e-mail:

    zharikov.i.n@yandex.ru ods slack: @ilzhara tg: @ilzhara machine-intelligence.ru
  2. Now Why? Models development optimization • Develop models for specific

    tasks, datasets, domain and etc. in acceptable time; • Save human and computing resources during development by automating search and speeding up the evaluation of results. Model optimization • Deep neural network should be on smartphones, smart watches, smart fridges and etc. • Support of low-bit processors should be provided; • Real-time apps require fast inference; • Neural networks are overparameterized; • Large models outperform small models; • DNN has poor generalization ability.
  3. Overview Models development optimization (network design, check heuristics and etc.)

    Model optimization (training, inference, RAM, ROM and etc.) NAS Neural Architecture Search Fast model assessment methods in terms of quality Effective training methods Quantization Knowledge Distillation Pruning
  4. Model optimization

  5. Quantization

  6. Quantization Main idea Main idea - use a less accurate

    weight and activation representation in the neural network, reducing the number of bits used from 32 or 64 to 2,4,8 etc.
  7. Determine the ranges of quantization Projecting onto Discrete Levels Clamping

    Reverse projection https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd Quantization Main idea
  8. Comparison ResNet-18 on ImageNet Drop in accuracy from the corresponding

    papers. Model: ResNet-18. Dataset: ImageNet. The minimum drop in accuracy for each number of bits is highlighted in green. Weight and activations are quantized with equal number of bits, expect the first and last layers for most methods. It is hard to compare different methods due to wide variety quantization settings (w/a bits). For some methods the complete quantization pipeline is unclear. Correctness: LINEAR SYMMETRIC QUANTIZATION OF NEURAL NETWORKS FOR LOW-PRECISION INTEGER HARDWARE
  9. • Significant reduction in size • Significantly faster inference •

    Portability to mobile devices and low-bit processors • The latest methods allows for not losing accuracy on certain tasks • Good at classification problems, img-to-img problems are understudied • Lots of research with unrealizable quantization schemes • Large networks with more aggressive quantization are often better than small networks with soft quantization • Trend - aware-training quantization Quantization Outcomes
  10. Pruning

  11. Idea: zero out certain weights (grouped as filters, layers, blocks

    or not). Pruning
  12. Existing prunings methods can be divided by one of the

    following characteristics: • Automation possibility: manual or fully automatic; • Fine tuning necessity: no fine-tuning, post-training, whole training process; • Structured or unstructured; • Global pruning or per-layer pruning (training and ranking); • Target scope: ◦ filter-level; ◦ block-level; ◦ model-specific; ◦ problem-specific; ◦ platform-specific. A lot of heuristics. Poor generalization ability. Good compression ⇎ Speed-up Pruning
  13. Train model Pruning Prune weights Fine-tuning Pruning Post-training Without fine-tuning

    Final model Several iteration of pruning after fine-tuning Data
  14. Pruning Aware-Training Soft Filter Pruning for Accelerating Deep Convolutional Neural

    Networks, 2018 Channel Pruning via Automatic Structure Search, 2020 Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks, 2020 Train model Final model Data Pruning Prune model Can change architecture Train one epoch Update global pruning parameter
  15. Pruning Structure Search Channel Pruning via Automatic Structure Search, 2020

    Motivation: “The key step in channel pruning lies in finding the optimal pruned structure, i.e., channel number in each layer, rather than selecting “important” channels.” Number of filters Choose filters from pre-trained model to speed-up training Calculate score on test dataset (fitness) and change number of layers
  16. Pruning Outcomes • One of the solution of NN overparametrization

    problem. • Fine-tuning is critically important (in img-to-img problems, for example). • Pruning speed-ups NN if it changes architecture parameters. • Pruning should be considered as effective method of NN architecture hyperparameters tuning. NAS in the space = neighbourhood of original model • The more variable the output space, the more variability the network should provide. Good quality in image classification ⇏ Good quality in image segmentation and etc.
  17. Knowledge Distillation

  18. Knowledge distillation is the methods of transferring “knowledge” from one

    network to another Objectives - increase the accuracy or generalizing ability of the student model without changing the architecture. До дистилляции После Distillation Main idea
  19. • The teacher network and the student network can be

    arbitrary sizes relative to each other • GT can be used or not • Losses: -DL - loss between teacher and student outputs - SL - traditional student loss -HL - loss on hidden representations • Ensemble of teachers may be used or self-distillation effect is possible. Distillation Main scheme
  20. • Increased accuracy of small networks • The possibility of

    retraining networks in the absence of initial training data • It is used as an auxiliary operation in many tasks. • Accelerates student network convergence Distillation Outcomes
  21. Model development optimization

  22. Neural Architecture Search

  23. NAS NAS (Neural architecture search) - methods for the automatic

    design of architectures. Google Cloud’s AutoML pipeline
  24. NAS approaches First NAS: • fully train all candidate models

    • have Controller (RNN) • It takes a very long time to learn (up to 40,000 gpu-hours) • Under a new task or constrains, you need to learn everything anew Other approaches: • Differentiable NAS • Hyper-network • One-shot Learning Transferable Architectures for Scalable Image Recognition
  25. NAS one-shot and weight-sharing One-shot approach: A super network contains

    subnets that are candidate networks from the search space. The entire super network is trained, after which the candidate networks are sampled from the super network with trained weights. Weight-sharing: Subnets share common nodes, use and update the same weights. Understanding and Simplifying One-Shot Architecture Search
  26. NAS training full network ways Super Network Learning Methods: •

    Learning the entire super network right away • Random (or partially random) sampling of paths from various distributions • Consistent training of subnets, ranking them by size • Using Distillation from Large to Small Subnets • Regularization via drop-out and weight decay Train super-net Search best sub-net using trained weights, ranking Fine-tuning Final test
  27. NAS ranking Model ranking is the main stage of one-shot

    NAS, necessary for the right choice of architectures. How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search SOTA: FairNAS with Kendall Tau = 0.9487 Kendall Tau metric: s-KdT: classifies architectures with a difference in accuracy of 0.1% to one rank.
  28. NAS Outcomes • Modern NAS get SOTA results, some do

    not require fine-tuning or retraining • 40,000 gpu-hours → 4 - 200 gpu-hours • Allow you to evaluate many architectures and choose the best models according to existing restrictions and for specific tasks • One training - many ready-made models • Model ranking is an important step in one-shot NAS
  29. Performance Prediction

  30. Model’s performance prediction Picture is from Peephole: Predicting Network Performance

    Before Training
  31. Model’s performance prediction Neural network architecture Encoding stage Encoded architecture

    as feature tensor Performance predictor Final accuracy or whole learning curve Additional information (epoch number, initial part of learning curve and etc.) Accelerating Neural Architecture Search using Performance Prediction, 2017 Learning Curve Prediction with Bayesian Neural Networks, 2017 Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves, 2015 *Peephole: Predicting Network Performance Before Training, 2017 *ReNAS:Relativistic Evaluation of Neural Architecture Search, 2019
  32. Effective Training

  33. Methods comprises the following techniques: • proxy datasets creation [1];

    • different sampling strategies [2] ◦ drop-and-pick; ◦ samples importance estimation; ◦ batch formation procedures and etc.); • convergence boosting [3] ◦ gradient approximation (for example, using only gradient sign); ◦ removing gradients which are close to zero; ◦ different learning rate schedulers; ◦ modification of standard optimizers and etc; • modification of standard layers (BN, dropout and etc) [4]; • various training schemes (dynamically skip a subset of layers and etc.) [5]; • decomposition, pruning, quantization. Effective training [1] Data Proxy Generation for Fast and Efficient Neural Architecture Search, 2019 [2] Cheng et al.; Peng et al.; Zhang et al.; Weinstein et al.; Alsadi et al. [3] Ye et al.; Han et al.; Mostafa et al.; Dutta et al.; Wang et al.; Georgakopoulos et al.; Liu et al. [4] Collins et al.; Hiroshi Inoue; Yuan et al. [5] E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings, 2019
  34. Effective training E2-Train E2-Train: Training State-of-the-art CNNs with Over 80%

    Energy Savings, 2019 Using techniques: • SMD - Stochastic Mini-batch Dropping • SLU - Selective Layer Update • PSG - Predictive Sign Gradient descent through quantization Results: x5-10 energy savings with drop in accuracy < 2%
  35. Conclusions Quantization: Pruning: Knowledge Distillation: NAS: Models assessment: Effective training:

    powerful method for specific hardware (low-bit processors) good choice for complex overparameterized models used both as a compression method and as an auxiliary method in various tasks another level of models tuning (problem-level) allow to significantly reduce the time resources needed for choosing a model can be potentially applied to any models and tasks without significant drop in final accuracy
  36. Thank you for your attention Ilia Zharikov e-mail: zharikov.i.n@yandex.ru ods

    slack: @ilzhara tg: @ilzhara machine-intelligence.ru
  37. Literature Quantization *Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for

    Neural Networks, 2020 Bit Efficient Quantization for Deep Neural Networks, 2019 Data-Free Quantization Through Weight Equalization and Bias Correction, 2019 Gradient ℓ1 Regularization for Quantization Robustness, 2020 Kernel Quantization for Efficient Network Compression, 2020 *Learned Step Size Quantization, 2019 Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware, 2020 *Loss Aware Post-training Quantization, 2020 Low-bit Quantization of Neural Networks for Efficient Inference, 2019 *LSQ+: Improving low-bit quantization through learnable offsets and better initialization, 2020 Post-training 4-bit quantization of convolution networks for rapid-deployment, 2019 Post-Training Piecewise Linear Quantization for Deep Neural Networks, 2020 *Quantization Networks, 2019 *Relaxed Quantization for Discretized Neural Networks, 2020 * indicates the most promising approaches for quantization in our opinion
  38. Literature Pruning *HRank: Filter Pruning using High-Rank Feature Map, 2020

    *CUP: Cluster Pruning for Compressing Deep Neural Networks, 2019 Deep Network Pruning for Object Detection, 2019 Pruning Filters for Efficient ConvNets, 2017 What is the State of Neural Network Pruning?, 2020 Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks, 2018 ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression, 2017 Rethinking the Value of Network Pruning, 2019 *Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks, 2020 *Channel Pruning via Automatic Structure Search, 2020 Cluster Pruning: An Efficient Filter Pruning Method for Edge AI Vision Applications, 2020 * indicates the most promising approaches for pruning in our opinion
  39. Literature Knowledge Distillation *Understanding and Improving Knowledge Distillation, 2020 Transfer

    Heterogeneous Knowledge Among Peer-to-Peer Teammates: A Model Distillation Approach, 2020 *Knowledge Distillation for Incremental Learning in Semantic Segmentation, 2020 *Search to Distill: Pearls are Everywhere but not the Eyes , 2019 Similarity-Preserving Knowledge Distillation, 2019 Towards Understanding Knowledge Distillation, 2019 *Born-Again Neural Networks, 2018 Learning from Noisy Labels with Distillation, 2017 Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons, 2018 Learning Deep Representations with Probabilistic Knowledge Transfer , 2019 *The Deep Weight Prior , 2019 Correlation Congruence for Knowledge Distillation, 2019 * indicates the most promising approaches for knowledge distillation in our opinion
  40. Literature NAS *BigNAS: Scaling Up Neural Architecture Search with Big

    Single-Stage Models, 2020 *FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search, 2020 How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS, 2020 *MixPath: A Unified Approach for One-shot Neural Architecture Search, 2020 *Once-for-All: Train One Network and Specialize it for Efficient Deployment, 2020 *SCARLET-NAS: Bridging the Gap between Stability and Scalability in Weight-sharing Neural Architecture Search , 2020 Single Path One-Shot Neural Architecture Search with Uniform Sampling, 2019 Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours, 2019 Understanding and Simplifying One-Shot Architecture Search, 2018 SMASH: One-Shot Model Architecture Search through HyperNetworks, 2017 ReNAS: Relativistic Evaluation of Neural Architecture Search , 2019 PROXYLESSNAS: Direct Neural Architecture Search on Target Task and Hardware, 2019 One-Shot Neural Architecture Search via Self-Evaluated Template Network, 2019 NAS-Bench-101: Towards Reproducible Neural Architecture Search, 2019 Efficient Neural Architecture Search via Parameter Sharing, 2018 Data Proxy Generation for Fast and Efficient Neural Architecture Search, 2019 DARTS: Differentiable Architecture Search, 2019 Learning Transferable Architectures for Scalable Image Recognition, 2018 * indicates the most promising approaches in our opinion