Slide 1

Slide 1 text

The environmental impact of present ML models and how we can improve it DSOC R&D 李 星 2020/07/21

Slide 2

Slide 2 text

※ The content presented reflects information as of the time of the presentation. ※ Some material may have been modified or removed for publication.

Slide 3

Slide 3 text

We also have materials in Japanese. https://speakerdeck.com/sansandsoc

Slide 4

Slide 4 text

Self-introduction 李 星 XING LI
I joined Sansan DSOC in October 2019 as a new graduate. I am in charge of efficiently exploiting Sansan's data through various machine learning methods. I am currently focusing on recommendation systems and some small, customized NLP tasks.

Slide 5

Slide 5 text

The size of DL models is getting larger and larger… (NLP) [16]

Slide 6

Slide 6 text

The size of DL models is getting larger and larger… (CV) [4]

Slide 7

Slide 7 text

The training time of DL models is also increasing [17]

Slide 8

Slide 8 text

Method [2]
Estimated total energy consumption during training (in kWh) and the resulting CO2 emissions (in lbs):

p_t = 1.58 · t · (p_c + p_r + g · p_g) / 1000
CO2e = 0.954 · p_t

p_c: the average power draw (in watts) from all CPU sockets during training
p_r: the average power draw from all DRAM (main memory) sockets
p_g: the average power draw of a GPU during training
g: the number of GPUs used to train
t: the total training time
1.58: Power Usage Effectiveness (PUE) [13]
0.954: average pounds of CO2 produced per kWh of power consumed in the U.S. [14]
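A minimal Python sketch of this estimation; the wattages, GPU count, and training time below are hypothetical example values, not measurements from any real run:

```python
def co2_estimate(p_c, p_r, p_g, g, t_hours, pue=1.58, lbs_co2_per_kwh=0.954):
    """Estimate training emissions following the method of [2].

    p_c/p_r/p_g: average power draw (watts) of CPU sockets, DRAM, and a
    single GPU; g: number of GPUs; t_hours: total training time in hours.
    Returns (total energy in kWh, pounds of CO2e).
    """
    # p_t = 1.58 * t * (p_c + p_r + g * p_g) / 1000, with PUE = 1.58 [13]
    p_t = pue * t_hours * (p_c + p_r + g * p_g) / 1000
    co2e = lbs_co2_per_kwh * p_t  # 0.954 lbs CO2 per kWh, U.S. average [14]
    return p_t, co2e

# Hypothetical run: 8 GPUs at 250 W each, 100 W CPU, 25 W DRAM, 72 hours
kwh, lbs = co2_estimate(p_c=100, p_r=25, p_g=250, g=8, t_hours=72)
print(f"{kwh:.1f} kWh -> {lbs:.1f} lbs CO2e")
```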

Slide 9

Slide 9 text

A vivid comparison [2]

Slide 10

Slide 10 text

Carbon emissions of common models [2]

Slide 11

Slide 11 text

Where does the cost actually come from? [20]

Slide 12

Slide 12 text

Two aspects of making models more energy efficient without an unacceptable loss in performance

Slide 13

Slide 13 text

Algorithm
1. Mixed Precision (FP16 & FP32)
2. Model Distillation
3. Model Pruning
4. Weight Quantization & Sharing
5. Others

Slide 14

Slide 14 text

Algorithm ─ Mixed Precision ─ Where does the “mixed” come from?
Mixed-precision training iteration for a layer. [8]

Slide 15

Slide 15 text

Algorithm ─ Mixed Precision ─ Performance is retained [8]

Slide 16

Slide 16 text

Algorithm ─ Mixed Precision ─ Mainstream Library Support
PyTorch Mixed Precision Tutorial: https://pytorch.org/docs/stable/notes/amp_examples.html
TensorFlow Mixed Precision Guide: https://www.tensorflow.org/guide/mixed_precision
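A minimal PyTorch training loop in the style of the AMP examples linked above, assuming a CUDA device is available; the model, optimizer settings, and synthetic data are placeholders:

```python
import torch

# Placeholder model, optimizer, and synthetic data (assumes a CUDA device)
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]

scaler = torch.cuda.amp.GradScaler()  # scales the loss so FP16 gradients don't underflow

for x, y in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in FP16 where it is safe
        loss = loss_fn(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then updates FP32 master weights
    scaler.update()                   # adapts the loss scale for the next iteration
```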

Slide 17

Slide 17 text

Algorithm ─ Model Distillation ─ Core Idea [11]

Slide 18

Slide 18 text

Algorithm ─ Model Distillation ─ An example [9]
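A minimal sketch of the soft-target distillation loss in the spirit of [9]; the temperature, blending weight, and the random logits in the usage lines are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Distillation loss in the spirit of [9]: KL divergence between
    temperature-softened teacher and student distributions (scaled by T^2
    to keep gradient magnitudes comparable), blended with ordinary
    cross-entropy on the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Illustrative usage with random logits for a 10-class problem
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```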

Slide 19

Slide 19 text

Algorithm ─ Model Distillation ─ Useful implementations of distilled models
GitHub: https://github.com/dkozlov/awesome-knowledge-distillation

Slide 20

Slide 20 text

Algorithm ─ Model Pruning ─ Basic Framework & Concepts [15]
Training → Pruning → Fine-tuning

Slide 21

Slide 21 text

Algorithm ─ Model Pruning ─ Original Training Result [15]

Slide 22

Slide 22 text

Algorithm ─ Model Pruning ─ After pruning only [15]

Slide 23

Slide 23 text

Algorithm ─ Model Pruning ─ Prune + Retrain [15]

Slide 24

Slide 24 text

Algorithm ─ Model Pruning ─ Pruning iteratively [15]

Slide 25

Slide 25 text

Algorithm ─ Model Pruning ─ Mainstream Library Support
PyTorch Pruning Tutorial: https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
TensorFlow Pruning Tutorial: https://www.tensorflow.org/model_optimization/guide/pruning
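A minimal sketch of the train → prune → fine-tune flow using the torch.nn.utils.prune API from the tutorial above; the toy model and the 60% sparsity level are arbitrary example choices:

```python
import torch
import torch.nn.utils.prune as prune

# Toy two-layer model; imagine it has already been trained to convergence
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)

# Pruning step: zero out the 60% of weights with the smallest L1 magnitude
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)

# Fine-tuning would go here: the stored mask keeps pruned weights at zero
# while the surviving weights continue to learn.

# Finally, fold the masks into the weights to make the pruning permanent
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, "weight")

print(f"layer-0 sparsity: {(model[0].weight == 0).float().mean():.0%}")
```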

Slide 26

Slide 26 text

Algorithm ─ Weight Quantization & Sharing ─ Core Idea [5]

Slide 27

Slide 27 text

Algorithm ─ Weight Quantization & Sharing ─ Initialization of k-means [5]
Three different methods for centroid initialization. Distribution of weights (◼ blue) and distribution of the codebook before (× green cross) and after fine-tuning (● red dot).
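A toy sketch of the clustering step, assuming scikit-learn's KMeans; [5] compares three initializations, and this sketch implements only the linear one over [min, max]. The 4-bit codebook size and the random weight matrix are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_layer(w, bits=4):
    """Toy weight sharing in the spirit of [5]: cluster one layer's weights
    into 2^bits shared centroids (linear initialization over [min, max]),
    so only a small codebook plus a per-weight index must be stored."""
    k = 2 ** bits
    flat = w.reshape(-1, 1)
    init = np.linspace(flat.min(), flat.max(), k).reshape(-1, 1)
    km = KMeans(n_clusters=k, init=init, n_init=1).fit(flat)
    codebook = km.cluster_centers_.ravel()      # 2^bits shared float values
    indices = km.labels_.reshape(w.shape)       # small integer index per weight
    return codebook, indices

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = quantize_layer(w, bits=4)
w_quantized = codebook[idx]                     # reconstructed (quantized) layer
print("unique weight values:", np.unique(w_quantized).size)  # at most 16
```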

Slide 28

Slide 28 text

Algorithm ─ Weight Quantization & Sharing ─ Further compression trick (Huffman coding) [5]
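A self-contained sketch of Huffman-coding the quantized weight indices, as in the final compression stage of [5]; the skewed index stream is a made-up example (after pruning and quantization, index frequencies are typically very uneven, which is what makes Huffman coding pay off):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code for a sequence of symbols (e.g. the codebook
    indices from weight quantization); frequent symbols get shorter bits."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, unique tiebreaker, {symbol: bitstring})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

# Hypothetical skewed index stream over a 4-entry codebook
indices = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(code, f"-> {bits} bits vs {2 * len(indices)} bits fixed-width")
```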

Slide 29

Slide 29 text

Algorithm ─ Others (most of these involve redesigning the original network architecture)
• Specially designed network architectures: ShuffleNet, MobileNet, BottleNet, SqueezeNet [6], etc.
• Winograd Transformation
• Low Rank Approximation
• Binary/Ternary Net
• …

Slide 30

Slide 30 text

Hardware
All in one word: minimize memory access!
Actually, I am neither going to, nor able to, discuss how to design chips~(TT) But we can measure how well our models run on a specific hardware platform, and then decide whether to keep optimising the algorithm or buy a better x(C/G/T)PU.

Slide 31

Slide 31 text

Hardware ─ Roofline Model ─ Definition [18]

Slide 32

Slide 32 text

Hardware ─ Roofline Model ─ How to use it [18]
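A small sketch of the roofline calculation; the peak compute and memory bandwidth figures below are hypothetical machine parameters, not any specific device:

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline model [18]: performance is capped either by peak compute
    or by memory bandwidth times arithmetic intensity (FLOPs per byte)."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

# Hypothetical machine: 10 TFLOP/s peak, 900 GB/s memory bandwidth
peak, bw = 10_000.0, 900.0
ridge = peak / bw  # intensity above which a kernel becomes compute-bound
for ai in (1.0, 4.0, ridge, 64.0):
    bound = "memory" if ai < ridge else "compute"
    print(f"{ai:6.2f} FLOP/B -> {attainable_gflops(ai, peak, bw):8.1f} GFLOP/s ({bound}-bound)")
```

A kernel left of the ridge point wastes compute waiting on memory, which is why "minimize memory access" is the guiding rule above.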

Slide 33

Slide 33 text

Hardware ─ Choose more environmentally friendly hardware ─ xPU matters [1]

Slide 34

Slide 34 text

Hardware ─ Choose more environmentally friendly hardware ─ Location matters [1]

Slide 35

Slide 35 text

Hardware ─ Choose more environmentally friendly hardware ─ Platform matters [2]

Slide 36

Slide 36 text

Hardware ─ Choose more environmentally friendly hardware ─ Same platform in different locations (Amazon Web Services) [1]

Slide 37

Slide 37 text

Summary
If you don't want to touch the network architecture:
● Algorithm - Mixed Precision (FP16 & FP32)
● Algorithm - Model Distillation
● Algorithm - Model Pruning
● Algorithm - Weight Quantization & Sharing
● Hardware - Use the roofline model to help you increase energy efficiency
● Hardware - Carefully choose the device, location, and platform.
If you are able to design a new model:
● Specially designed network architectures
● Winograd Transformation
● Low Rank Approximation
● Binary/Ternary Net
If you can start from the hardware level:
● My unknown area...

Slide 38

Slide 38 text

Others ─ What about planting a tree? 6 trees (over their lifetime) ≈ 1 tonne of CO2 [19]
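A back-of-the-envelope conversion; the BERT figure (~1,438 lbs CO2e for training BERT_base) is the estimate reported in [2] and is used here purely as an example:

```python
# Rough offset estimate: 6 trees (over their lifetime) per tonne of CO2 [19]
LBS_PER_TONNE = 2204.62
bert_lbs = 1438            # estimated CO2e for training BERT_base, from [2]
tonnes = bert_lbs / LBS_PER_TONNE
print(f"{tonnes:.2f} t CO2e -> plant ~{tonnes * 6:.0f} trees")
```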

Slide 39

Slide 39 text

Others ─ Sansan cares about this!

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

References
1. Quantifying the Carbon Emissions of Machine Learning (https://arxiv.org/pdf/1910.09700.pdf)
2. Energy and Policy Considerations for Deep Learning in NLP (https://arxiv.org/pdf/1906.02243.pdf)
3. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (https://arxiv.org/pdf/1910.01108.pdf)
4. Neural Network Architectures (https://towardsdatascience.com/neural-network-architectures-156e5bad51ba)
5. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (https://arxiv.org/pdf/1510.00149.pdf)
6. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (https://arxiv.org/pdf/1602.07360.pdf)
7. Deep Learning Performance Documentation, NVIDIA (https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#mptrain__fig1)
8. Mixed Precision Training (https://arxiv.org/pdf/1710.03740.pdf)
9. Distilling the Knowledge in a Neural Network (https://arxiv.org/pdf/1503.02531.pdf)
10. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (https://arxiv.org/pdf/1903.12136.pdf)
11. Knowledge Distillation: Simplified (https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764)
12. ML CO2 Impact (https://mlco2.github.io/impact/#home)
13. Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.
14. EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.
15. Learning both Weights and Connections for Efficient Neural Networks (https://papers.nips.cc/paper/5784-learning-both-weights-and-connections-for-efficient-neural-network.pdf)
16. GPT-3: The New Mighty Language Model from OpenAI (https://mc.ai/gpt-3-the-new-mighty-language-model-from-openai-2/)
17. AI and Compute (https://openai.com/blog/ai-and-compute/)
18. Performance Analysis (HPC Course, University of Bristol)
19. Reduce your carbon footprint by Planting a tree (https://co2living.com/reduce-your-carbon-footprint-by-planting-a-tree/)
20. EIE: Efficient Inference Engine on Compressed Deep Neural Network (https://arxiv.org/pdf/1602.01528.pdf)