
CodeFest 2019. Leonid Kuligin (Google) — Training machine learning models on large volumes of data

The volume of available data, like the complexity of machine learning models, is growing exponentially. We will talk about how distributed training works with TensorFlow, how data processing is organized, and which tools are available for profiling the training.

CodeFest

April 06, 2019

Transcript

  1. Leonid Kuligin
     • Moscow Institute of Physics and Technology
     • 2017 - … Google Cloud, PSO, ML engineer
     • 2016-2017 Scout GmbH, Senior Software Engineer
     • 2015-2016 HH.RU, Senior Product Manager (Search/ML)
     • 2013-2015 Yandex, Team Lead (Data production for Local search)
     https://www.linkedin.com/in/leonid-kuligin-53569544/
  2. Need for computational power
     “The amount of compute power used in the largest AI training runs has been increasing exponentially… Since 2012, this metric has grown by more than 300’000X” (D. Amodei & D. Hernandez, https://blog.openai.com/ai-and-compute/)
  3. OOM / too long training times?
     • You can either scale vertically and hit the limit
     • Or you can scale horizontally and try the distributed mode
     • TPU
  4. v2 Tensor Processing Unit
     Google-designed custom ASIC built and optimized for TensorFlow: 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory bandwidth
  5. Tensorflow
     • Some things to know about distributed Tensorflow
     • Dealing with high-level APIs
     • Profiling during the training
  6. Tensorflow “... is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms” (Tensorflow whitepaper, 2015, https://www.tensorflow.org/about/bib)
  7. Main terms
     • Master - implements a Session interface and coordinates workers
     • Workers - execute the Tensorflow graph on local devices; they are stateless
     • Parameter servers - store & update variables
  8. SGD: can this algorithm be naturally parallelized?
     • We compute the loss (not only!) for every single example in a batch
     • We aggregate the loss and start computing gradients / updating weights
  9. Data parallelism
     1. We shuffle data randomly across workers.
     2. Every worker executes a forward pass on a mini-batch.
     3. Parameter servers aggregate the results.
     4. Every worker executes a backward pass on its mini-batch and sends updates to the parameter servers.
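
     A minimal sketch of how this looks with TF 1.x parameter-server training (the host addresses are hypothetical): replica_device_setter places variables on the PS tasks, while each worker builds and runs its own copy of the graph.

       import tensorflow as tf

       # Hypothetical cluster: one parameter server, two workers.
       cluster = tf.train.ClusterSpec({
           "ps": ["ps0.example.com:2222"],
           "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
       })

       # Variables created under this scope are placed (round-robin) on the
       # PS tasks; the compute ops stay on the local worker.
       with tf.device(tf.train.replica_device_setter(cluster=cluster)):
           weights = tf.Variable(tf.zeros([784, 10]), name="weights")
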
  10. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • How many global_steps do we have?
  11. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • We have 5*100=500 global_steps
  12. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • We have 5*100=500 global_steps
     ◦ It depends on how you call optimizer.minimize()
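
     A minimal sketch of that dependency (TF 1.x, toy loss): the global_step is a shared counter, and it only advances if you pass it to minimize().

       import tensorflow as tf

       # Toy model: minimize w^2 with plain SGD.
       w = tf.Variable(5.0)
       loss = tf.square(w)
       global_step = tf.train.get_or_create_global_step()

       optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
       # Every applied update increments the shared counter by one, no
       # matter which worker ran it; omit the global_step argument and
       # the counter never moves.
       train_op = optimizer.minimize(loss, global_step=global_step)

       with tf.Session() as sess:
           sess.run(tf.global_variables_initializer())
           for _ in range(100):
               sess.run(train_op)
           print(sess.run(global_step))  # 100 steps on this single worker
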
  13. Why...
     1. ...is performance per step still the same after I doubled my cluster size?
     2. ...does tf.estimator.TrainSpec have no support for epochs?
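
     A minimal sketch addressing the second question (TF 1.x Estimator API; the toy dataset and model are illustrative): TrainSpec is bounded by max_steps, i.e. global steps, and epoch-like behavior is expressed by repeating the dataset inside the input_fn.

       import tensorflow as tf

       def train_input_fn():
           ds = tf.data.Dataset.from_tensor_slices(
               ({"x": [[1.0], [2.0], [3.0]]}, [0, 1, 1]))
           # repeat() is where "epochs" live; TrainSpec only counts steps.
           return ds.repeat().batch(2)

       estimator = tf.estimator.LinearClassifier(
           feature_columns=[tf.feature_column.numeric_column("x")])

       train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
       eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
       tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
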
  14. From single CPU to 4 workers
     1. Global_step / sec
     2. Steps / epochs until convergence
     3. Time until convergence
  15.-17. (the same table, revealed step by step)
     • sec / step*: same
     • global_step / sec: 4x more
     • global_steps until convergence: same
     • time until convergence: 4x less
  18. What about batch_size? (see the toy calculation below)
     • Batch_size is an attribute of a worker
     ◦ You want to have a lot of input files of equal size
     • But a worker keeps track of a global_step
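
     A toy calculation with hypothetical numbers to make this concrete: in asynchronous parameter-server training each global_step corresponds to one worker's mini-batch, so batch_size stays per-worker while cluster-wide throughput scales with the number of workers.

       batch_size_per_worker = 128    # hypothetical
       num_workers = 4
       steps_per_sec_per_worker = 10  # hypothetical

       # Each global_step still processes one mini-batch of 128 examples,
       # but the cluster performs 4x more global_steps per second.
       global_steps_per_sec = num_workers * steps_per_sec_per_worker   # 40
       examples_per_sec = global_steps_per_sec * batch_size_per_worker # 5120
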
  19. Data input with TF (see the sketch below)
     • tf.data.Dataset API provides a high-level API to implement data input pipelines
     ◦ FixedLengthRecordDataset, TextLineDataset, TFRecordDataset
     ◦ You can read CSV with tf.decode_csv
     • Various dataset transformations - cache, map, flat_map, decode, batch, ...
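
     A minimal sketch of such a pipeline (TF 1.x style; the file pattern and feature schema are hypothetical):

       import tensorflow as tf

       def parse_example(serialized):
           # Hypothetical schema: a 10-dim float feature and an int64 label.
           parsed = tf.parse_single_example(serialized, {
               "x": tf.FixedLenFeature([10], tf.float32),
               "y": tf.FixedLenFeature([], tf.int64),
           })
           return parsed["x"], parsed["y"]

       dataset = (tf.data.TFRecordDataset(tf.gfile.Glob("data/train-*.tfrecord"))
                  .map(parse_example)          # decode each record
                  .cache()                     # keep decoded data in memory
                  .shuffle(buffer_size=10000)
                  .batch(128)
                  .repeat())                   # "epochs" live in the pipeline
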
  20. Data randomization (see the sketch below)
     • shard according to the number of workers
     ◦ within a single batch you use data from ~N different files
     • shuffle the in-memory portion of loaded data on every worker
     • interleave on every worker means consuming data from different files within a single mini-batch
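
     A minimal sketch combining the three ideas (TF 1.x; num_workers and worker_index are hypothetical and would normally come from the cluster config, e.g. TF_CONFIG):

       import tensorflow as tf

       num_workers, worker_index = 4, 0  # hypothetical values

       files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)
       files = files.shard(num_workers, worker_index)  # each worker reads its own file subset

       dataset = (files
                  # pull records from several files at once, so one mini-batch
                  # mixes examples from ~cycle_length different files
                  .interleave(tf.data.TFRecordDataset, cycle_length=8)
                  .shuffle(buffer_size=10000)  # shuffle the in-memory portion
                  .batch(128))
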
  21. Profiling Tensorflow
     • Training:
     ◦ Tensorboard
     ◦ Dump and visualize a single step
     ◦ Trace a few steps for a deeper inspection
     ◦ * TPU profiler
     • Inference:
     ◦ Tensorflow Model Benchmarking Tool
  22. Dump a single step
     • Every step is dumped to a separate json file
     • You can manually inspect its visualization with chrome://tracing
     • Unlocks the Memory/Compute tabs in the Graph visualization with Tensorboard
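
     A minimal sketch of producing such a file (TF 1.x; the toy matmul just gives the trace something to show):

       import tensorflow as tf
       from tensorflow.python.client import timeline

       a = tf.random_normal([1000, 1000])
       b = tf.matmul(a, a)  # toy op to profile

       run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
       run_metadata = tf.RunMetadata()

       with tf.Session() as sess:
           sess.run(b, options=run_options, run_metadata=run_metadata)

           # step_stats -> JSON viewable at chrome://tracing
           tl = timeline.Timeline(run_metadata.step_stats)
           with open("step_trace.json", "w") as f:
               f.write(tl.generate_chrome_trace_format())
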
  23. Single CPU
     • Compute tab
     ◦ Ops over time and their dependencies
     • Tensor tab
     ◦ Creation/deallocation timestamps
     ◦ Shapes/memory/... as of the snapshot timestamp
     • Allocators tab
     ◦ Allocated memory over time
  24. Distributed mode
     • A timeline for every worker AND parameter server at that step
     • Allocators tab
     ◦ Allocated memory over time for CPU, GPU, cuda_host
  25. Collect a traceback (see the sketch below)
     • Collect a traceback over a sequence of steps
     • Dump as a single file (or split into subsets)
     • Inspect aggregated statistics with CLI or UI
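
     A minimal sketch using tf.contrib.tfprof from TF 1.x (the toy train_op is illustrative): ProfileContext collects profiles over a range of steps and dumps them under the given directory, ready for the CLI tool's --profile_path flag.

       import tensorflow as tf

       # Toy op to profile.
       w = tf.Variable(3.0)
       train_op = tf.train.GradientDescentOptimizer(0.1).minimize(tf.square(w))

       with tf.contrib.tfprof.ProfileContext('/tmp/profile_dir') as pctx:
           with tf.Session() as sess:
               sess.run(tf.global_variables_initializer())
               for _ in range(1000):
                   sess.run(train_op)
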
  26. Build a CLI tool
     • Build the CLI tool with bazel:
       bazel build -c opt tensorflow/core/profiler:profiler
     • Copy the dump locally
     • Launch the CLI tool:
       bazel-bin/tensorflow/core/profiler/profiler --profile_path=...
  27. Analyze your dump
     • Build visualization
     • Query for the ops consuming the most execution time/memory/…
     • Profile device placement and your model architecture (tensor shapes, memory allocated, etc.)
     ◦ Attribution to python code
     ◦ Filtering by regexp
     • Explore other options in the documentation
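
     For illustration, a query in the profiler's interactive shell for the most time-consuming ops looks roughly like this (the exact commands and flags vary across TF 1.x versions, so treat this as an assumption to check against the tfprof documentation):

       tfprof> op -select micros -order_by micros -max_depth 10
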
  28. Outcomes
     • Use high-level APIs if possible
     • Be careful when training in a distributed mode (adjust batch_size and data sharding if needed)
     • Profile your models during training in order to optimize bottlenecks (e.g., the input data pipeline)