The environmental impact of present ML models and how can we improve it

■ Event: Machine Learning Study Group (機械学習勉強会)
https://sansan.connpass.com/event/181799/

■ Talk overview
Title: The environmental impact of present ML models and how can we improve it
Speaker: DSOC R&D 李 星 (Xing Li)

▼Sansan DSOC
https://sansan-dsoc.com/

▼Twitter
https://twitter.com/SansanDSOC

Sansan DSOC

July 21, 2020

Transcript

  1. The environmental impact of present ML models and how can we improve it. DSOC R&D 李 星, 2020/07/21
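  2. ※ The content shown reflects the information available at the time of the presentation. ※ Parts of the material may have been changed or removed for publication.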

  3. Data Strategy and Operation Center. We also have materials in Japanese: https://speakerdeck.com/sansandsoc
  4. Self-introduction: 李 星 (XING LI). I joined Sansan DSOC in October 2019 as a new graduate. I am in charge of making efficient use of Sansan’s data through various machine learning methods. I am currently focusing on recommendation systems and some small, customized NLP tasks.
  5. The size of DL models is getting larger and larger… (NLP) [16]
  6. The size of DL models is getting larger and larger… (CV) [4]
  7. The training time of DL models is also increasing [17]
  8. Method (formulas from [2])
    • p_c: the average power draw (in watts) from all CPU sockets during training
    • p_r: the average power draw (in watts) from all DRAM (main memory) sockets
    • p_g: the average power draw (in watts) of a GPU during training
    • g: the number of GPUs used for training
    • t: the total training time (in hours)
    • 1.58: Power Usage Effectiveness (PUE) [13]
    • 0.954: average CO2 produced (in pounds) per kilowatt-hour of power consumed in the U.S. [14]
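The quantities on the Method slide combine as in [2]: total energy p_t = 1.58 · t · (p_c + p_r + g · p_g) / 1000 in kWh, and CO2e = 0.954 · p_t in pounds. The sketch below only restates that arithmetic; the function names and the example hardware figures are hypothetical and were not part of the talk.

```python
def total_energy_kwh(p_c, p_r, p_g, g, t_hours, pue=1.58):
    """Energy in kWh: PUE * t * (p_c + p_r + g * p_g) / 1000, with power in watts."""
    return pue * t_hours * (p_c + p_r + g * p_g) / 1000.0


def co2e_lbs(energy_kwh, lbs_per_kwh=0.954):
    """Estimated CO2-equivalent emissions in pounds for a given energy in kWh."""
    return lbs_per_kwh * energy_kwh


if __name__ == "__main__":
    # Hypothetical run: 8 GPUs at 250 W each, 100 W CPU, 25 W DRAM, 24 hours.
    energy = total_energy_kwh(p_c=100.0, p_r=25.0, p_g=250.0, g=8, t_hours=24.0)
    print(f"energy: {energy:.2f} kWh, CO2e: {co2e_lbs(energy):.2f} lbs")
```

The two constants are defaults rather than hard-coded values, so a different data center PUE or regional grid intensity can be substituted.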
  9. A vivid comparison [2]
  10. Carbon emissions of common models (table) [2]
  11. Where does the cost actually come from? [20]
  12. Two ways to make models more energy-efficient without an unacceptable loss of performance: algorithm and hardware
  13. Algorithm
    1. Mixed Precision (FP16 & FP32)
    2. Model Distillation
    3. Model Pruning
    4. Weight Quantization & Sharing
    5. Others
  14. Algorithm ─ Mixed Precision ─ Where does the “mixed” come from? Figure: mixed precision training iteration for a layer. [8]
  15. Algorithm ─ Mixed Precision ─ Performance is maintained [8]
  16. Algorithm ─ Mixed Precision ─ Mainstream Library Support. PyTorch Mixed Precision Tutorial: https://pytorch.org/docs/stable/notes/amp_examples.html TensorFlow Mixed Precision Guide: https://www.tensorflow.org/guide/mixed_precision
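Two ingredients of mixed precision training described in [8] are an FP32 master copy of the weights and loss scaling. The following is a small numeric illustration using only the Python standard library, not a training loop. The helper names, the example weight, update, gradient and the scaling factor of 1024 are all assumptions chosen for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 1) FP32 master weights: a small update is rounded away when accumulated in FP16.
weight, update = 1.0, 1e-4  # hypothetical weight and (learning rate * gradient)
fp16_sum = to_fp16(to_fp16(weight) + to_fp16(update))
fp32_sum = to_fp32(to_fp32(weight) + to_fp32(update))

# 2) Loss scaling: a tiny gradient underflows to zero in FP16 unless scaled first.
grad = 1e-8          # hypothetical gradient, below the smallest FP16 subnormal
loss_scale = 1024.0  # hypothetical scaling factor
unscaled = to_fp16(grad)
rescued = to_fp32(to_fp16(grad * loss_scale) / loss_scale)  # unscale in FP32

if __name__ == "__main__":
    print("FP16 accumulate:", fp16_sum, "| FP32 accumulate:", fp32_sum)
    print("FP16 gradient:", unscaled, "| scaled, then unscaled:", rescued)
```

In practice the library support listed on the slide handles both steps automatically; this sketch only shows why they are needed.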
  17. Algorithm ─ Model Distillation ─ Core Idea [11]
  18. Algorithm ─ Model Distillation ─ An example [9]
  19. Algorithm ─ Model Distillation ─ Useful implementations of distilled models. GitHub: https://github.com/dkozlov/awesome-knowledge-distillation
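The core idea of distillation in [9] is to train a small student on the teacher's temperature-softened output distribution, usually combined with the ordinary loss on the true label. The following is a minimal sketch of that loss for a single example. The function names, the temperature of 4 and the weighting `alpha` are illustrative assumptions, not values from the talk.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of the soft-target loss (teacher) and the hard-label loss."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The soft term is multiplied by T^2 so its gradient scale matches the hard term.
    soft_loss = cross_entropy(soft_targets, soft_student) * temperature ** 2
    hard_target = [1.0 if i == true_label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_target, softmax(student_logits))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


if __name__ == "__main__":
    teacher = [5.0, 2.0, -1.0]  # hypothetical teacher logits
    print(distillation_loss([4.0, 2.5, -0.5], teacher, true_label=0))
```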
  20. Algorithm ─ Model Pruning ─ Basic Framework & Concepts: Training, Pruning, Fine-tuning [15]
  21. Algorithm ─ Model Pruning ─ Original Training Result [15]
  22. Algorithm ─ Model Pruning ─ After only pruning [15]
  23. Algorithm ─ Model Pruning ─ Prune + Retrain [15]
  24. Algorithm ─ Model Pruning ─ Iteratively [15]
  25. Algorithm ─ Model Pruning ─ Mainstream Library Support. PyTorch Pruning Tutorial: https://pytorch.org/tutorials/intermediate/pruning_tutorial.html TensorFlow Pruning Tutorial: https://www.tensorflow.org/model_optimization/guide/pruning
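The train, prune, fine-tune framework can be illustrated with magnitude-based pruning on a flat list of weights. This is a sketch under assumptions: [15] removes connections whose weights fall below a threshold, whereas here the threshold is expressed as a target sparsity fraction for convenience, and all names and values are hypothetical.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Zero out the pruned connections."""
    return [w * m for w, m in zip(weights, mask)]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One fine-tuning step; pruned connections are kept at zero by the mask."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # hypothetical trained weights
mask = magnitude_mask(weights, sparsity=0.5)
pruned = apply_mask(weights, mask)
# Hypothetical gradients for one retraining step after pruning.
retrained = masked_sgd_step(pruned, [0.1, 0.1, -0.1, 0.1, 0.1, -0.1], mask)

if __name__ == "__main__":
    print("mask:", mask)
    print("pruned:", pruned)
    print("after one fine-tuning step:", retrained)
```

Iterative pruning repeats the same cycle with a gradually increasing sparsity, recomputing the mask after each round of fine-tuning.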
  26. Algorithm ─ Weight Quantization & Sharing ─ Core Idea [5]
  27. Algorithm ─ Weight Quantization & Sharing ─ Initialization of K-means. Three different methods for centroid initialization. Figure: distribution of weights (blue) and distribution of the codebook before (green crosses) and after fine-tuning (red dots). [5]
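Weight sharing as described in [5] clusters a layer's weights with k-means, so that each weight is stored as a small index into a shared codebook. The sketch below uses linear initialization, one of the three methods compared in [5], on a flat list of weights. The function names, the example weights and the bit-count estimate (with index width rounded up to whole bits) are assumptions for illustration; the centroid fine-tuning step is not shown.

```python
import math


def linear_init(weights, k):
    """Place k centroids evenly between the smallest and largest weight."""
    lo, hi = min(weights), max(weights)
    if k == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def nearest(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def kmeans_1d(weights, centroids, iterations=20):
    """Plain 1-D k-means; returns the codebook and one codebook index per weight."""
    centroids = list(centroids)
    for _ in range(iterations):
        assignments = [nearest(w, centroids) for w in weights]
        for j in range(len(centroids)):
            members = [w for w, a in zip(weights, assignments) if a == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = sum(members) / len(members)
    return centroids, [nearest(w, centroids) for w in weights]


weights = [-1.0, -0.9, -1.1, 0.0, 0.1, -0.1, 1.0, 0.9, 1.1]  # hypothetical layer
codebook, indices = kmeans_1d(weights, linear_init(weights, k=3))
shared = [codebook[i] for i in indices]  # weights reconstructed from the codebook

# Rough storage estimate: 32-bit floats vs. small indices plus a 32-bit codebook.
bits_before = len(weights) * 32
bits_after = len(weights) * math.ceil(math.log2(len(codebook))) + len(codebook) * 32

if __name__ == "__main__":
    print("codebook:", codebook)
    print("indices:", indices)
    print("bits:", bits_before, "->", bits_after)
```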
  28. Algorithm ─ Weight Quantization & Sharing ─ Further compression trick (Huffman Coding) [5]
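Huffman coding gives frequent symbols shorter codes, which helps when the quantized weight indices are unevenly distributed. The following is a minimal standard-library sketch of building such a code. The index sequence is a hypothetical example, and this is a generic Huffman construction rather than the exact encoder used in [5].

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code {symbol: bit string} from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical quantized weight indices with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
code = huffman_code(indices)
huffman_bits = sum(len(code[i]) for i in indices)
fixed_bits = len(indices) * 2  # 2 bits per index for 4 clusters

if __name__ == "__main__":
    print(code)
    print("fixed-width bits:", fixed_bits, "| Huffman bits:", huffman_bits)
```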
  29. Algorithm ─ Others (most of them involve redesigning the original network architecture)
    • Specially designed network architectures: ShuffleNet, MobileNet, BottleNet, SqueezeNet [6], etc.
    • Winograd Transformation
    • Low Rank Approximation
    • Binary/Ternary Net
    • …
  30. Hardware ─ In short: minimize memory access! I am neither going to nor able to discuss how to design chips (TT). But we can find out how well our models run on a specific hardware platform, and then decide whether to keep optimising the algorithm or to buy a better x(C/G/T)PU.
  31. Hardware ─ Roofline Model ─ Definition [18]
  32. Hardware ─ Roofline Model ─ How to use it [18]
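In the roofline model, attainable performance is the smaller of the peak compute rate and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below shows how this can indicate whether a kernel is memory-bound or compute-bound. The hardware figures and function names are hypothetical and do not describe any real device.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def bound_type(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Classify a kernel relative to the ridge point (peak / bandwidth)."""
    ridge = peak_flops / mem_bandwidth
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


if __name__ == "__main__":
    peak = 10e12        # hypothetical peak: 10 TFLOP/s
    bandwidth = 500e9   # hypothetical memory bandwidth: 500 GB/s
    for intensity in (4.0, 50.0):  # FLOPs per byte, hypothetical kernels
        perf = attainable_flops(peak, bandwidth, intensity)
        kind = bound_type(peak, bandwidth, intensity)
        print(f"AI={intensity}: {perf / 1e12:.1f} TFLOP/s, {kind}")
```

A memory-bound kernel benefits from reducing memory traffic (for example through pruning or quantization), while a compute-bound kernel at the roof only gets faster with fewer FLOPs or better hardware.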
  33. Hardware ─ Choose more environmentally friendly hardware ─ xPU matters [1]
  34. Hardware ─ Choose more environmentally friendly hardware ─ Location matters [1]
  35. Hardware ─ Choose more environmentally friendly hardware ─ Platform matters [2]
  36. Hardware ─ Choose more environmentally friendly hardware ─ Same platform in different locations (Amazon Web Services) [1]
  37. Summary
    If you don’t want to touch the network architecture:
    • Algorithm - Mixed Precision (FP16 & FP32)
    • Algorithm - Model Distillation
    • Algorithm - Model Pruning
    • Algorithm - Weight Quantization & Sharing
    • Hardware - Use the roofline model to help you increase energy efficiency
    • Hardware - Carefully decide the device, location and platform
    If you are able to design a new model:
    • Specially designed network architectures
    • Winograd Transformation
    • Low Rank Approximation
    • Binary/Ternary Net
    If you can start from the hardware level:
    • Unknown territory for me...
  38. Others ─ What about planting a tree? 6 trees for life ≈ 1 tonne of CO2 [19]
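The rule of thumb from [19] (about 6 trees over their lifetime per tonne of CO2) can be turned into a quick estimate. This is a rough sketch: the function name and the example emission figure are hypothetical, and the pound-to-tonne conversion is the standard approximate factor.

```python
import math

LBS_PER_TONNE = 2204.62  # approximate pounds per metric tonne
TREES_PER_TONNE = 6      # rule of thumb from [19]


def trees_to_offset(co2_lbs):
    """Number of trees (rounded up) needed to offset `co2_lbs` pounds of CO2."""
    tonnes = co2_lbs / LBS_PER_TONNE
    return math.ceil(tonnes * TREES_PER_TONNE)


if __name__ == "__main__":
    # Hypothetical training run that emitted 10,000 lbs of CO2e.
    print(trees_to_offset(10000.0), "trees")
```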
  39. Others ─ Sansan cares about it!
  40. None
  41. References
    1. Quantifying the Carbon Emissions of Machine Learning (https://arxiv.org/pdf/1910.09700.pdf)
    2. Energy and Policy Considerations for Deep Learning in NLP (https://arxiv.org/pdf/1906.02243.pdf)
    3. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (https://arxiv.org/pdf/1910.01108.pdf)
    4. Neural Network Architectures (https://towardsdatascience.com/neural-network-architectures-156e5bad51ba)
    5. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (https://arxiv.org/pdf/1510.00149.pdf)
    6. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (https://arxiv.org/pdf/1602.07360.pdf)
    7. Deep Learning Performance Documentation, Nvidia (https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#mptrain__fig1)
    8. Mixed Precision Training (https://arxiv.org/pdf/1710.03740.pdf)
    9. Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)
    10. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (https://arxiv.org/pdf/1903.12136.pdf)
    11. Knowledge Distillation: Simplified (https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764)
    12. ML CO2 Impact (https://mlco2.github.io/impact/#home)
    13. Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.
    14. EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.
    15. Learning both Weights and Connections for Efficient Neural Networks (https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf)
    16. GPT-3: The New Mighty Language Model from OpenAI (https://mc.ai/gpt-3-the-new-mighty-language-model-from-openai-2/)
    17. AI and Compute (https://openai.com/blog/ai-and-compute/)
    18. Performance Analysis (HPC Course, University of Bristol)
    19. Reduce your carbon footprint by Planting a tree (https://co2living.com/reduce-your-carbon-footprint-by-planting-a-tree/)
    20. EIE: Efficient Inference Engine on Compressed Deep Neural Network (https://arxiv.org/pdf/1602.01528.pdf)