
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero at Big Data Spain 2017


GPUs in the cloud, offered as Infrastructure as a Service (IaaS), seem like a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.

https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid


Big Data Spain

December 04, 2017

Transcript

  1. None
  2. DEEP LEARNING & MULTI GPUs Training Deep Learning Models on

    Multiple GPUs in the Cloud BEE PART OF THE CHANGE Avenida de Burgos, 16 D, 28036 Madrid hablemos@beeva.com www.beeva.com
  3. ENRIQUE OTERO enrique.otero@beeva.com @beevalabs_eom Data Scientist in BEEVA hablemos@beeva.com |

    www.beeva.com The intro: deep learning & GPUs The training: challenges & benchmarks on image classification The lessons: science, engineering, infrastructure & business
  4. WWW.BEEVA.COM BIG DATA | CLOUD COMPUTING | MACHINE INTELLIGENCE

     • INNOVATION LABS • INNOVATION SERVICES 100% • +40% annual growth rate in the last 4 years • +650 employees in Spain • +800 employees globally WE MAKE COMPLEX THINGS SIMPLE
  5. Deep Learning disruption • Computer Vision • Speech Recognition

    • Machine Translation • Ranking
  6. • more (labeled) data • more computing power • some

    tricks Why now? Source: http://yann.lecun.com/exdb/lenet/
  7. None
  8. GPU timeline (steps 1º to 4º): shaders & hacks, OpenGL & DirectX, then GPUs with Nvidia & CUDA
  9. THANK YOU GAMERS!

  10. Training times vs. accuracy Accelerating training is essential! Source: https://github.com/sailing-pmls/pmls-caffe/

    Source: Canziani et al 2017
  11. Error (loss) function & stochastic gradient descent

     • Stochastic Gradient Descent (SGD) • Mini-batch SGD Source: Andrew Ng. Source: http://www.eeng.dcu.ie/~mcguinne/
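
As an illustration of the mini-batch SGD idea on this slide, here is a minimal NumPy sketch; the loss, model and data are made-up placeholders rather than anything from the talk:

```python
import numpy as np

def minibatch_sgd(X, y, w, loss_grad, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch SGD: one parameter update per randomly drawn mini-batch,
    instead of per full pass (batch GD) or per single example (pure SGD)."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = loss_grad(w, X[idx], y[idx])   # gradient estimated on the mini-batch only
            w = w - lr * g
    return w

# Illustrative gradient of a mean-squared-error loss for a linear model
def mse_grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
```
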
  12. Data parallel vs. model parallel • Faster or larger models?

    Asynchronous vs. Synchronous • Fast or precise? Distributed training Source: https://github.com/tensorflow/models/tree/master/research/inception
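
To make the synchronous data-parallel option concrete, here is a framework-agnostic sketch: each simulated worker stands in for one GPU, computes a gradient on its own shard of the global mini-batch, the gradients are averaged (the role an all-reduce plays in a real framework), and every replica applies the identical update. The function name and arguments are purely illustrative:

```python
def sync_data_parallel_step(w, shards, loss_grad, lr):
    """One synchronous data-parallel SGD step.

    `shards` is a list of (X_shard, y_shard) pairs, one per simulated GPU.
    Averaging the per-shard gradients plays the role of the all-reduce; since
    every replica applies the same update, the model stays identical everywhere."""
    grads = [loss_grad(w, Xs, ys) for Xs, ys in shards]  # per-GPU gradients
    avg_grad = sum(grads) / len(grads)                   # "all-reduce": average across GPUs
    return w - lr * avg_grad                             # same update on every replica
```

Asynchronous training drops the averaging barrier: workers push gradients and pull parameters without waiting for each other, which raises throughput but introduces stale gradients, hence the fast-or-precise trade-off on the slide.
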
  13. (Multi-node) third party benchmarks Seems quite good!! Easy? Source: https://chainer.org

  14. (Multi-node) third party benchmarks ResNet152 (8 to 256 gpus): 95%

    to 90% efficiency AlexNet (8 to 256 gpus): 78% to 53% efficiency Source: mxnet on AWS 16 x p2.16x
  15. (Multi-node) third party benchmarks Small print: • High speed connections!

    • Synthetic data vs. real data • Bottlenecks in hard disk And more... • accuracy penalization • number of parameter servers Source: tensorflow.org Source: https://chainer.org
  16. (Multi-node) third party benchmarks So let’s start single-node multi-GPU in the cloud

  17. Let’s begin: Tesla K80 K80 GPUs on: • AWS p2:

    1, 8 & 16 ◦ ready-to-go AMIs • Azure NC: 1, 2 & 4 • Google Cloud Platform: 1 to 8 ◦ setup scripts
  18. Let’s begin: MNIST & cloud

  19. • Goal: saturate GPUs! • Bottlenecks: ◦ I/O Reads ◦

    Image pipelines ▪ Decoding ▪ Data augmentation ◦ Communications: ▪ efficient primitives: NCCL ▪ Overlap with computation ▪ QPI < PCIe < NVLink Data pipeline bottlenecks
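
One way to picture the "saturate the GPUs" advice: keep a buffer of decoded, augmented batches ready so the device never waits on the CPU. The following framework-neutral Python sketch overlaps data preparation with computation using a background thread; real frameworks expose dedicated prefetching iterators for the same purpose:

```python
import queue
import threading

def prefetching_loader(batch_iter, buffer_size=4):
    """Wrap a batch iterator so CPU-side work (I/O reads, decoding, augmentation)
    runs in a background thread while the GPU consumes already-prepared batches."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def produce():
        for batch in batch_iter:
            buf.put(batch)       # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is done:
            return
        yield batch
```
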
  20. (Internal) interconnections matter. GPU topology diagrams: Azure NC24 (4 K80) and Google n1-highmem-32 with 8 K80
  21. GPUDirect & PCIe GPUDirect support (internal) Interconnections matter

  22. GPUDirect & PCIe GPUDirect support (internal) Interconnections matter
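
To check which of these interconnects a given instance actually exposes, one option (assuming the NVIDIA driver utilities are installed, as they are on the GPU images mentioned earlier) is to print the topology matrix reported by nvidia-smi; it shows, for every GPU pair, whether traffic crosses a PCIe switch, the CPU/QPI, or NVLink, and therefore whether GPUDirect P2P is usable between them:

```python
import subprocess

# Print the GPU interconnect matrix; entries such as PIX, PHB or NV# indicate,
# per GPU pair, a shared PCIe switch, a path through the CPU, or an NVLink connection.
print(subprocess.check_output(["nvidia-smi", "topo", "-m"], text=True))
```
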

  23. None
  24. More lessons: CIFAR10 on AWS AWS p2.8x = 8x GPU

    K80 sync. data-parallel After 8 epochs... mxnet: • validation accuracy = [0.77, 0.82] tensorflow: • validation accuracy = [0.47, 0.59]
  25. Batch sizes matter • Larger batches reduce communication overhead ◦

    More throughput • But degrade convergence. ◦ Less accuracy!
  26. None
  27. Accuracy vs. throughput Empirical workaround for learning rates: • warm up:

     start small • increase over 5 epochs... • finish at #gpus x lr Scenario: • NVLink, 50Gbps Ethernet (> 15Gbps) • Caffe2 + NCCL + gloo • Synchronous SGD + Momentum Source: Facebook AI Research
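
A minimal sketch of that warm-up recipe; a linear, per-iteration ramp is one common way to implement it, and the base learning rate and iteration counts below are placeholders:

```python
def warmup_lr(epoch, it, iters_per_epoch, base_lr, num_gpus, warmup_epochs=5):
    """Start from the single-GPU learning rate and ramp linearly over the
    warm-up epochs to the linearly scaled target, #gpus x base_lr."""
    target_lr = base_lr * num_gpus
    warmup_iters = warmup_epochs * iters_per_epoch
    seen = epoch * iters_per_epoch + it
    if seen < warmup_iters:
        return base_lr + (target_lr - base_lr) * seen / warmup_iters
    return target_lr

# Example: with 8 GPUs and base_lr = 0.1 the schedule ends at 0.8 after 5 epochs.
```
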
  28. Being practical: fine tuning with MxNet Source: https://mxnet.incubator.apache.org/how_to/finetune.html

  29. Being practical: fine tuning with MxNet Scenario: • p2.8x •

    ResNet50 • batch-size: 16 x gpu • lr = lr_i x gpu • 1 epoch 94% efficiency :)
  30. Being practical: fine tuning with MxNet Scenario: • p2.8x •

    ResNet152 • batch-size: 16,32 x gpu • lr = lr_i x gpu • 1 epoch • val-acc = [0.830, 0.836] 95% efficiency :) < 1% accuracy loss
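
A sketch of this fine-tuning setup in MXNet Gluon (the tutorial linked two slides earlier uses the older Module API; the dataset, number of classes and base learning rate below are placeholders, while the 16-per-GPU batch size and the lr = lr_i x #gpus rule follow the slides):

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.model_zoo import vision

num_gpus, num_classes = 8, 10                 # p2.8x has 8 GPUs; class count is a placeholder
ctx = [mx.gpu(i) for i in range(num_gpus)]
batch_size = 16 * num_gpus                    # 16 per GPU, as on the slide
lr = 0.01 * num_gpus                          # linear scaling: lr = lr_i x #gpus

# Keep the pretrained ResNet-50 features, re-initialize only the classifier head.
pretrained = vision.resnet50_v2(pretrained=True)
net = vision.resnet50_v2(classes=num_classes)
net.features = pretrained.features
net.output.initialize(mx.init.Xavier(), ctx=ctx)
net.collect_params().reset_ctx(ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'momentum': 0.9})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

def train_one_epoch(data_loader):
    for data, label in data_loader:
        # Synchronous data parallelism: split every global batch across the GPUs.
        xs = gluon.utils.split_and_load(data, ctx_list=ctx)
        ys = gluon.utils.split_and_load(label, ctx_list=ctx)
        with autograd.record():
            losses = [loss_fn(net(x), y) for x, y in zip(xs, ys)]
        for l in losses:
            l.backward()
        trainer.step(batch_size)              # aggregates gradients from all devices
```
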
  31. What about costs?

  32. Tesla K80 price on premises Source: amazon.com November 2017

  33. Tesla K80 prices on cloud 1$/h per-second billing, only 0.3$/h

     on AWS spot market. Buying one K80 costs as much as renting one for 4000 to 12000 hours! Training ResNet50 Imagenet1K (100 epochs): 180$ to 730$ Fine-tuning (8 epochs): < 2$
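
The rent-or-buy comparison behind those numbers as a quick back-of-the-envelope calculation; the on-premises card price is an assumption reverse-engineered from the 4000 to 12000 hour range on the slide, not a quoted figure:

```python
# How many rental hours does one card purchase buy?
# Assumed K80 street price (late 2017): roughly $3600 to $4000.
for card_price in (3600.0, 4000.0):
    for label, dollars_per_hour in (("on-demand", 1.0), ("spot", 0.3)):
        hours = card_price / dollars_per_hour
        print(f"${card_price:.0f} card ~= {hours:.0f} h {label}")
```
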
  34. 2014 to 2017: from Kepler... to Volta! New! October 2017 (Source: aws.amazon.com).

     And Tesla Pascal P100 beta on Google Cloud Platform. New! September 2017: on-demand & spot
  35. Extra: NVIDIA optimized containers! New!

  36. Extra: NVIDIA Volta on AWS P3 instances! • Great performance!

     • Cost-effective (on-demand) • (still) scarce availability
  37. Summary SCIENCE Batch sizes & learning rates matter! • high batch sizes degrade convergence • linear scaling rule & warm-up

     ENGINEERING Data pipeline matters! • Data feed • Overlap computation & communications
     INFRASTRUCTURE Architecture & bandwidth matter! • Volta > Pascal > Kepler • NVLink > PCIe > (25 Gbps) Ethernet
     BUSINESS Pricing matters! • Cost-effective cloud instances in the spot market
  38. THANKS FOR YOUR TIME hablemos@beeva.com | www.beeva.com And we’re hiring!