
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero at Big Data Spain 2017


GPUs in the cloud, offered as Infrastructure as a Service (IaaS), seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.

https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid


Big Data Spain

December 04, 2017

Transcript

  1. None
  2. DEEP LEARNING & MULTI GPUs Training Deep Learning Models on

    Multiple GPUs in the Cloud BEE PART OF THE CHANGE Avenida de Burgos, 16 D, 28036 Madrid hablemos@beeva.com www.beeva.com
  3. ENRIQUE OTERO enrique.otero@beeva.com @beevalabs_eom Data Scientist in BEEVA hablemos@beeva.com |

    www.beeva.com The intro: deep learning & GPUs The training: challenges & benchmarks on image classification The lessons: science, engineering, infrastructure & business
  4. WWW.BEEVA.COM BIG DATA | CLOUD COMPUTING | MACHINE INTELLIGENCE

     • INNOVATION LABS • INNOVATION SERVICES 100% • +40% annual growth rate in the last 4 years • +650 employees in Spain • +800 employees globally WE MAKE COMPLEX THINGS SIMPLE
  5. Deep Learning disruption • Computer Vision • Speech Recognition

    • Machine Translation • Ranking
  6. • more (labeled) data • more computing power • some

    tricks Why now? Source: http://yann.lecun.com/exdb/lenet/
  7. None
  8. GPU timeline (steps 1º to 4º): shaders & hacks, OpenGL & DirectX, then GPUs with Nvidia & CUDA
  9. THANK YOU GAMERS!

  10. Training times vs. accuracy Accelerating training is essential! Source: https://github.com/sailing-pmls/pmls-caffe/

    Source: Canziani et al 2017
  11. Error (loss) function & stochastic gradient descent

     • Stochastic Gradient Descent (SGD) • Mini-batch SGD Source: Andrew Ng. Source: http://www.eeng.dcu.ie/~mcguinne/
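
As an illustration of the mini-batch SGD idea on this slide, here is a minimal NumPy sketch; the loss, model and data are made-up placeholders rather than anything from the talk:

```python
import numpy as np

def minibatch_sgd(X, y, w, loss_grad, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch SGD: one parameter update per randomly drawn mini-batch,
    instead of per full pass (batch GD) or per single example (pure SGD)."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = loss_grad(w, X[idx], y[idx])   # gradient estimated on the mini-batch only
            w = w - lr * g
    return w

# Illustrative gradient of a mean-squared-error loss for a linear model
def mse_grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
```
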
  12. Data parallel vs. model parallel • Faster or larger models?

    Asynchronous vs. Synchronous • Fast or precise? Distributed training Source: https://github.com/tensorflow/models/tree/master/research/inception
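
To make the synchronous data-parallel option concrete, here is a framework-agnostic sketch: each simulated worker stands in for one GPU, computes a gradient on its own shard of the global mini-batch, the gradients are averaged (the role an all-reduce plays in a real framework), and every replica applies the identical update. The function name and arguments are purely illustrative:

```python
def sync_data_parallel_step(w, shards, loss_grad, lr):
    """One synchronous data-parallel SGD step.

    `shards` is a list of (X_shard, y_shard) pairs, one per simulated GPU.
    Averaging the per-shard gradients plays the role of the all-reduce; since
    every replica applies the same update, the model stays identical everywhere."""
    grads = [loss_grad(w, Xs, ys) for Xs, ys in shards]  # per-GPU gradients
    avg_grad = sum(grads) / len(grads)                   # "all-reduce": average across GPUs
    return w - lr * avg_grad                             # same update on every replica
```

Asynchronous training drops the averaging barrier: workers push gradients and pull parameters without waiting for each other, which raises throughput but introduces stale gradients, hence the fast-or-precise trade-off on the slide.
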
  13. (Multi-node) third party benchmarks Seems quite good!! Easy? Source: https://chainer.org

  14. (Multi-node) third party benchmarks ResNet152 (8 to 256 gpus): 95%

    to 90% efficiency AlexNet (8 to 256 gpus): 78% to 53% efficiency Source: mxnet on AWS 16 x p2.16x
  15. (Multi-node) third party benchmarks Small print: • High speed connections!

    • Synthetic data vs. real data • Bottlenecks in hard disk And more... • accuracy penalization • number of parameter servers Source: tensorflow.org Source: https://chainer.org
  16. (Multi-node) third party benchmarks So let’s start single-node multi-GPU in the cloud

  17. Let’s begin: Tesla K80 K80 GPUs on: • AWS p2:

    1, 8 & 16 ◦ ready-to-go AMIs • Azure NC: 1, 2 & 4 • Google Cloud Platform: 1 to 8 ◦ setup scripts
  18. Let’s begin: MNIST & cloud

  19. • Goal: saturate GPUs! • Bottlenecks: ◦ I/O Reads ◦

    Image pipelines ▪ Decoding ▪ Data augmentation ◦ Communications: ▪ efficient primitives: NCCL ▪ Overlap with computation ▪ QPI < PCIe < NVLink Data pipeline bottlenecks
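
One way to picture the "saturate the GPUs" advice: keep a buffer of decoded, augmented batches ready so the device never waits on the CPU. The following framework-neutral Python sketch overlaps data preparation with computation using a background thread; real frameworks expose dedicated prefetching iterators for the same purpose:

```python
import queue
import threading

def prefetching_loader(batch_iter, buffer_size=4):
    """Wrap a batch iterator so CPU-side work (I/O reads, decoding, augmentation)
    runs in a background thread while the GPU consumes already-prepared batches."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def produce():
        for batch in batch_iter:
            buf.put(batch)       # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is done:
            return
        yield batch
```
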
  20. (Internal) interconnections matter. GPU topology diagrams: Azure NC24 (4 K80) and Google n1-highmem-32 with 8 K80
  21. GPUDirect & PCIe GPUDirect support (internal) Interconnections matter

  22. GPUDirect & PCIe GPUDirect support (internal) Interconnections matter
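
To check which of these interconnects a given instance actually exposes, one option (assuming the NVIDIA driver utilities are installed, as they are on the GPU images mentioned earlier) is to print the topology matrix reported by nvidia-smi; it shows, for every GPU pair, whether traffic crosses a PCIe switch, the CPU/QPI, or NVLink, and therefore whether GPUDirect P2P is usable between them:

```python
import subprocess

# Print the GPU interconnect matrix; entries such as PIX, PHB or NV# indicate,
# per GPU pair, a shared PCIe switch, a path through the CPU, or an NVLink connection.
print(subprocess.check_output(["nvidia-smi", "topo", "-m"], text=True))
```
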

  23. None
  24. More lessons: CIFAR10 on AWS AWS p2.8x = 8x GPU

    K80 sync. data-parallel After 8 epochs... mxnet: • validation accuracy = [0.77, 0.82] tensorflow: • validation accuracy = [0.47, 0.59]
  25. Batch sizes matter • Larger batches reduce communication overhead ◦

    More throughput • But degrade convergence. ◦ Less accuracy!
  26. None
  27. Accuracy vs. throughput Empirical workaround for learning rates: • warm up:

     start small • increase over 5 epochs... • finish at #gpus x lr Scenario: • NVLink, 50Gbps Ethernet (> 15Gbps) • Caffe2 + NCCL + gloo • Synchronous SGD + Momentum Source: Facebook AI Research
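
A minimal sketch of that warm-up recipe; a linear, per-iteration ramp is one common way to implement it, and the base learning rate and iteration counts below are placeholders:

```python
def warmup_lr(epoch, it, iters_per_epoch, base_lr, num_gpus, warmup_epochs=5):
    """Start from the single-GPU learning rate and ramp linearly over the
    warm-up epochs to the linearly scaled target, #gpus x base_lr."""
    target_lr = base_lr * num_gpus
    warmup_iters = warmup_epochs * iters_per_epoch
    seen = epoch * iters_per_epoch + it
    if seen < warmup_iters:
        return base_lr + (target_lr - base_lr) * seen / warmup_iters
    return target_lr

# Example: with 8 GPUs and base_lr = 0.1 the schedule ends at 0.8 after 5 epochs.
```
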
  28. Being practical: fine tuning with MxNet Source: https://mxnet.incubator.apache.org/how_to/finetune.html

  29. Being practical: fine tuning with MxNet Scenario: • p2.8x •

    ResNet50 • batch-size: 16 x gpu • lr = lr_i x gpu • 1 epoch 94% efficiency :)
  30. Being practical: fine tuning with MxNet Scenario: • p2.8x •

    ResNet152 • batch-size: 16,32 x gpu • lr = lr_i x gpu • 1 epoch • val-acc = [0.830, 0.836] 95% efficiency :) < 1% accuracy loss
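
A sketch of this fine-tuning setup in MXNet Gluon (the tutorial linked two slides earlier uses the older Module API; the dataset, number of classes and base learning rate below are placeholders, while the 16-per-GPU batch size and the lr = lr_i x #gpus rule follow the slides):

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.model_zoo import vision

num_gpus, num_classes = 8, 10                 # p2.8x has 8 GPUs; class count is a placeholder
ctx = [mx.gpu(i) for i in range(num_gpus)]
batch_size = 16 * num_gpus                    # 16 per GPU, as on the slide
lr = 0.01 * num_gpus                          # linear scaling: lr = lr_i x #gpus

# Keep the pretrained ResNet-50 features, re-initialize only the classifier head.
pretrained = vision.resnet50_v2(pretrained=True)
net = vision.resnet50_v2(classes=num_classes)
net.features = pretrained.features
net.output.initialize(mx.init.Xavier(), ctx=ctx)
net.collect_params().reset_ctx(ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'momentum': 0.9})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

def train_one_epoch(data_loader):
    for data, label in data_loader:
        # Synchronous data parallelism: split every global batch across the GPUs.
        xs = gluon.utils.split_and_load(data, ctx_list=ctx)
        ys = gluon.utils.split_and_load(label, ctx_list=ctx)
        with autograd.record():
            losses = [loss_fn(net(x), y) for x, y in zip(xs, ys)]
        for l in losses:
            l.backward()
        trainer.step(batch_size)              # aggregates gradients from all devices
```
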
  31. What about costs?

  32. Tesla K80 price on premises Source: amazon.com November 2017

  33. Tesla K80 prices on cloud 1$/h per-second billing, only 0.3$/h

     on AWS spot market. Buying one K80 costs as much as renting one for 4000 to 12000 hours! Training ResNet50 Imagenet1K (100 epochs): 180$ to 730$ Fine-tuning (8 epochs): < 2$
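
The rent-or-buy comparison behind those numbers as a quick back-of-the-envelope calculation; the on-premises card price is an assumption reverse-engineered from the 4000 to 12000 hour range on the slide, not a quoted figure:

```python
# How many rental hours does one card purchase buy?
# Assumed K80 street price (late 2017): roughly $3600 to $4000.
for card_price in (3600.0, 4000.0):
    for label, dollars_per_hour in (("on-demand", 1.0), ("spot", 0.3)):
        hours = card_price / dollars_per_hour
        print(f"${card_price:.0f} card ~= {hours:.0f} h {label}")
```
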
  34. 2014 to 2017: from Kepler... to Volta! New! October 2017 (Source: aws.amazon.com).

     And Tesla Pascal P100 beta on Google Cloud Platform. New! September 2017: on-demand & spot
  35. Extra: NVIDIA optimized containers! New!

  36. Extra: NVIDIA Volta on AWS P3 instances! • Great performance!

     • Cost-effective (on-demand) • (still) scarce availability
  37. Summary SCIENCE Batch sizes & learning rates matter! • high batch sizes degrade convergence • linear scaling rule & warm-up

     ENGINEERING Data pipeline matters! • Data feed • Overlap computation & communications
     INFRASTRUCTURE Architecture & bandwidth matter! • Volta > Pascal > Kepler • NVLink > PCIe > (25 Gbps) Ethernet
     BUSINESS Pricing matters! • Cost-effective cloud instances in the spot market
  38. THANKS FOR YOUR TIME hablemos@beeva.com | www.beeva.com And we’re hiring!