from OCT/2019 as a new graduate. I am in charge of exploiting Sansan's data through various machine learning methods in an efficient way, currently focusing on recommendation systems and some small, customized NLP tasks. 李 星 XING LI
Estimated training energy and carbon footprint [2]:
p_t = 1.58 · t · (p_c + p_r + g · p_g) / 1000
CO2e = 0.954 · p_t
• p_c: the average power draw (in watts) from all CPU sockets during training
• p_r: the average power draw from all DRAM (main memory) sockets
• p_g: the average power draw of a GPU during training
• g: the number of GPUs used to train
• t: the total training time (in hours)
• 1.58: Power Usage Effectiveness (PUE) [13]
• 0.954: average CO2 produced (lbs per kWh) for power consumed in the U.S. [14]
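To make the formula concrete, here is a minimal sketch in Python; the power draws, GPU count and training time in the example call are made-up placeholders, not numbers from the slides.

```python
PUE = 1.58            # Power Usage Effectiveness [13]
CO2_PER_KWH = 0.954   # lbs of CO2 emitted per kWh consumed in the U.S. [14]

def co2_emissions(p_c, p_r, p_g, g, t_hours):
    """Estimated CO2 emissions (in lbs) of one training run [2].

    p_c, p_r, p_g: average power draw in watts (CPU sockets, DRAM, one GPU)
    g:             number of GPUs used to train
    t_hours:       total training time in hours
    """
    p_t = PUE * t_hours * (p_c + p_r + g * p_g) / 1000  # total energy in kWh
    return CO2_PER_KWH * p_t

# e.g. 8 GPUs at 250 W each, plus 100 W of CPU and 50 W of DRAM, for 72 hours
print(co2_emissions(p_c=100, p_r=50, p_g=250, g=8, t_hours=72))  # ≈ 233 lbs CO2
```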
Weight Quantization & Sharing ─ Initialization of K-means
[Figure: three different methods for centroid initialization ─ distribution of weights (blue) and distribution of the codebook before (green cross) and after fine-tuning (red dot).]
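The clustering step behind that figure can be sketched in a few lines of NumPy: run k-means over a layer's weights and replace every weight with its centroid, so only the cluster indices and a small codebook need to be stored. The 16-cluster codebook, the linear initialization and the layer shape below are illustrative choices in the spirit of Deep Compression [5], not the paper's exact setup.

```python
import numpy as np

def quantize_weights(weights, n_clusters=16, n_iters=20):
    """K-means weight sharing: return quantized weights, codebook and indices."""
    w = weights.ravel()
    # "linear" initialization: spread the initial centroids evenly over [min, max]
    codebook = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iters):
        # assign every weight to its nearest centroid
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        # move each centroid to the mean of the weights assigned to it
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = w[idx == k].mean()
    return codebook[idx].reshape(weights.shape), codebook, idx

layer = np.random.randn(256, 256).astype(np.float32)  # a stand-in weight matrix
shared, codebook, idx = quantize_weights(layer)
# store the 4-bit indices plus the 16-entry codebook instead of 65,536 FP32 weights
```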
(all of them involve redesigning the original network architecture)
• Specially designed network architectures: ShuffleNet, MobileNet, BottleNet, SqueezeNet [6], etc. (see the sketch below)
• Winograd Transformation
• Low Rank Approximation
• Binary/Ternary Net
• …
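As a small illustration of why such redesigned architectures are cheaper, the sketch below compares the parameter count of a standard 3×3 convolution with a MobileNet-style depthwise-separable convolution; the 64→128 channel sizes are arbitrary examples, not layers from any of the networks named above.

```python
import torch.nn as nn

cin, cout, k = 64, 128, 3

# standard 3x3 convolution: every output channel looks at every input channel
standard = nn.Conv2d(cin, cout, kernel_size=k, padding=1)

# depthwise-separable convolution: per-channel 3x3 filtering, then 1x1 channel mixing
depthwise_separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=k, padding=1, groups=cin),  # depthwise
    nn.Conv2d(cin, cout, kernel_size=1),                        # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"standard conv:            {count(standard):,} parameters")             # 73,856
print(f"depthwise separable conv: {count(depthwise_separable):,} parameters")  # 8,960
```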
to minimize memory access! Actually, I am neither going to nor able to discuss how to design chips~(TT) But we can measure how well our models run on a specific hardware platform, so that we can decide whether to keep optimising our algorithm or to buy a better x(C/G/T)PU.
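The roofline model is a quick way to do that check: a kernel's attainable performance is capped either by the device's peak compute or by its memory bandwidth times the kernel's arithmetic intensity. The peak numbers below describe a hypothetical GPU and are assumptions, not figures from the slides.

```python
PEAK_FLOPS = 14e12   # assumed peak FP32 throughput: 14 TFLOP/s
PEAK_BW    = 900e9   # assumed peak memory bandwidth: 900 GB/s

def roofline(flops, bytes_moved):
    """Attainable FLOP/s for a kernel under the roofline model."""
    intensity = flops / bytes_moved               # FLOPs per byte of memory traffic
    return min(PEAK_FLOPS, PEAK_BW * intensity), intensity

# e.g. a layer performing 2 GFLOPs while moving 100 MB of data
perf, ai = roofline(2e9, 100e6)
print(f"arithmetic intensity: {ai:.1f} FLOP/byte")
print(f"attainable perf:      {perf / 1e12:.2f} TFLOP/s "
      f"({'memory' if perf < PEAK_FLOPS else 'compute'}-bound)")
```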
If you don't want to touch the network architecture:
• Algorithm - Mixed Precision (FP16 & FP32) (sketch below)
• Algorithm - Model Distillation
• Algorithm - Model Pruning
• Algorithm - Weight Quantization & Sharing
• Hardware - Use the roofline model to help you increase energy efficiency
• Hardware - Carefully decide the device, location and platform.
If you can design a new model:
• Specially designed network architectures
• Winograd Transformation
• Low Rank Approximation
• Binary/Ternary Net
If you can start from the hardware level:
• My unknown area...
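Of the algorithm-level options above, mixed precision is often the quickest to try. Here is a minimal sketch using PyTorch's torch.cuda.amp [7][8]; the tiny linear model and random batches stand in for your own model and data loader.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 10).cuda()                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = GradScaler()   # scales the loss so FP16 gradients do not underflow

# placeholder data loader: 10 random batches
loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(10)]

for x, y in loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in FP16 where it is safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()         # backward pass on the scaled loss
    scaler.step(optimizer)                # unscales gradients, then steps the optimizer
    scaler.update()                       # adapts the loss scale for the next iteration
```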
References
1. Quantifying the Carbon Emissions of Machine Learning (https://arxiv.org/pdf/1910.09700.pdf)
2. Energy and Policy Considerations for Deep Learning in NLP (https://arxiv.org/pdf/1906.02243.pdf)
3. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (https://arxiv.org/pdf/1910.01108.pdf)
4. Neural Network Architectures (https://towardsdatascience.com/neural-network-architectures-156e5bad51ba)
5. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (https://arxiv.org/pdf/1510.00149.pdf)
6. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (https://arxiv.org/pdf/1602.07360.pdf)
7. Deep Learning Performance Documentation, NVIDIA (https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#mptrain__fig1)
8. Mixed Precision Training (https://arxiv.org/pdf/1710.03740.pdf)
9. Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)
10. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (https://arxiv.org/pdf/1903.12136.pdf)
11. Knowledge Distillation: Simplified (https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764)
12. ML CO2 Impact (https://mlco2.github.io/impact/#home)
13. Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.
14. EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.
15. Learning both Weights and Connections for Efficient Neural Networks (https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf)
16. GPT-3: The New Mighty Language Model from OpenAI (https://mc.ai/gpt-3-the-new-mighty-language-model-from-openai-2/)
17. AI and Compute (https://openai.com/blog/ai-and-compute/)
18. Performance Analysis (HPC Course, University of Bristol)
19. Reduce Your Carbon Footprint by Planting a Tree (https://co2living.com/reduce-your-carbon-footprint-by-planting-a-tree/)
20. EIE: Efficient Inference Engine on Compressed Deep Neural Network (https://arxiv.org/pdf/1602.01528.pdf)