
CodeFest 2019. Leonid Kuligin (Google) — Training machine learning models on large volumes of data

The volume of available data, like the complexity of machine learning models, is growing exponentially. We will talk about how distributed training works with TensorFlow, how data processing is organized, and which tools are available for profiling the training.

CodeFest

April 06, 2019

Transcript

  1. Leonid Kuligin
     • Moscow Institute of Physics and Technology
     • 2017 - … Google Cloud, PSO, ML engineer
     • 2016-2017 Scout GmbH, Senior Software Engineer
     • 2015-2016 HH.RU, Senior Product Manager (Search/ML)
     • 2013-2015 Yandex, Team Lead (Data production for Local search)
     https://www.linkedin.com/in/leonid-kuligin-53569544/
  2. Need for computational power
     “The amount of compute power used in the largest AI training runs has been increasing exponentially… Since 2012, this metric has grown by more than 300’000X” (D. Amodei & D. Hernandez, https://blog.openai.com/ai-and-compute/)
  3. OOM / too long training times?
     • You can either scale vertically and hit the limit
     • Or you can scale horizontally and try the distributed mode
     • TPU
  4. v2 Tensor Processing Unit
     Google-designed custom ASIC built and optimized for TensorFlow: 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory bandwidth
  5. Tensorflow
     • Some things to know about distributed Tensorflow
     • Dealing with high-level APIs
     • Profiling during the training
  6. Tensorflow “... is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms” (Tensorflow whitepaper, 2015, https://www.tensorflow.org/about/bib)
  7. Main terms
     • Master - implements a Session interface and coordinates workers
     • Workers - execute the Tensorflow graph on local devices; they are stateless
     • Parameter servers - store & update variables
  8. SGD: can this algorithm be naturally parallelized?
     • We compute the loss (not only!) for every single example in a batch
     • We aggregate the loss and start computing gradients / updating weights
  9. Data parallelism
     1. We shuffle data randomly across workers.
     2. Every worker executes a forward pass on a mini-batch.
     3. Parameter servers aggregate the results.
     4. Every worker executes a backward pass on its mini-batch and sends updates to the parameter servers.
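
     A minimal sketch of how this looks with TF 1.x parameter-server training (the host addresses are hypothetical): replica_device_setter places variables on the PS tasks, while each worker builds and runs its own copy of the graph.

       import tensorflow as tf

       # Hypothetical cluster: one parameter server, two workers.
       cluster = tf.train.ClusterSpec({
           "ps": ["ps0.example.com:2222"],
           "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
       })

       # Variables created under this scope are placed (round-robin) on the
       # PS tasks; the compute ops stay on the local worker.
       with tf.device(tf.train.replica_device_setter(cluster=cluster)):
           weights = tf.Variable(tf.zeros([784, 10]), name="weights")
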
  10. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • How many global_steps do we have?
  11. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • We have 5*100=500 global_steps
  12. What’s a global_step?
     • We run a distributed job with 5 workers
     • Each worker performs 100 steps
     • We have 5*100=500 global_steps
     ◦ It depends on how you call optimizer.minimize()
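
     A minimal sketch of that dependency (TF 1.x, toy loss): the global_step is a shared counter, and it only advances if you pass it to minimize().

       import tensorflow as tf

       # Toy model: minimize w^2 with plain SGD.
       w = tf.Variable(5.0)
       loss = tf.square(w)
       global_step = tf.train.get_or_create_global_step()

       optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
       # Every applied update increments the shared counter by one, no
       # matter which worker ran it; omit the global_step argument and
       # the counter never moves.
       train_op = optimizer.minimize(loss, global_step=global_step)

       with tf.Session() as sess:
           sess.run(tf.global_variables_initializer())
           for _ in range(100):
               sess.run(train_op)
           print(sess.run(global_step))  # 100 steps on this single worker
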
  13. Why...
     1. ...is performance per step still the same after I doubled my cluster size?
     2. ...does tf.estimator.TrainSpec have no support for epochs?
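
     A minimal sketch addressing the second question (TF 1.x Estimator API; the toy dataset and model are illustrative): TrainSpec is bounded by max_steps, i.e. global steps, and epoch-like behavior is expressed by repeating the dataset inside the input_fn.

       import tensorflow as tf

       def train_input_fn():
           ds = tf.data.Dataset.from_tensor_slices(
               ({"x": [[1.0], [2.0], [3.0]]}, [0, 1, 1]))
           # repeat() is where "epochs" live; TrainSpec only counts steps.
           return ds.repeat().batch(2)

       estimator = tf.estimator.LinearClassifier(
           feature_columns=[tf.feature_column.numeric_column("x")])

       train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
       eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
       tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
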
  14. From single CPU to 4 workers
     1. Global_step / sec
     2. Steps / epochs until convergence
     3. Time until convergence
  15.-17. (the same table, revealed step by step)
     • sec / step*: same
     • global_step / sec: 4x more
     • global_steps until convergence: same
     • time until convergence: 4x less
  18. What about batch_size? (see the toy calculation below)
     • Batch_size is an attribute of a worker
     ◦ You want to have a lot of input files of equal size
     • But a worker keeps track of a global_step
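
     A toy calculation with hypothetical numbers to make this concrete: in asynchronous parameter-server training each global_step corresponds to one worker's mini-batch, so batch_size stays per-worker while cluster-wide throughput scales with the number of workers.

       batch_size_per_worker = 128    # hypothetical
       num_workers = 4
       steps_per_sec_per_worker = 10  # hypothetical

       # Each global_step still processes one mini-batch of 128 examples,
       # but the cluster performs 4x more global_steps per second.
       global_steps_per_sec = num_workers * steps_per_sec_per_worker   # 40
       examples_per_sec = global_steps_per_sec * batch_size_per_worker # 5120
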
  19. Data input with TF (see the sketch below)
     • tf.data.Dataset API provides a high-level API to implement data input pipelines
     ◦ FixedLengthRecordDataset, TextLineDataset, TFRecordDataset
     ◦ You can read CSV with tf.decode_csv
     • Various dataset transformations - cache, map, flat_map, decode, batch, ...
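
     A minimal sketch of such a pipeline (TF 1.x style; the file pattern and feature schema are hypothetical):

       import tensorflow as tf

       def parse_example(serialized):
           # Hypothetical schema: a 10-dim float feature and an int64 label.
           parsed = tf.parse_single_example(serialized, {
               "x": tf.FixedLenFeature([10], tf.float32),
               "y": tf.FixedLenFeature([], tf.int64),
           })
           return parsed["x"], parsed["y"]

       dataset = (tf.data.TFRecordDataset(tf.gfile.Glob("data/train-*.tfrecord"))
                  .map(parse_example)          # decode each record
                  .cache()                     # keep decoded data in memory
                  .shuffle(buffer_size=10000)
                  .batch(128)
                  .repeat())                   # "epochs" live in the pipeline
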
  20. Data randomization (see the sketch below)
     • shard according to the number of workers
     ◦ within a single batch you use data from ~N different files
     • shuffle the in-memory portion of loaded data on every worker
     • interleave on every worker means consuming data from different files within a single mini-batch
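
     A minimal sketch combining the three ideas (TF 1.x; num_workers and worker_index are hypothetical and would normally come from the cluster config, e.g. TF_CONFIG):

       import tensorflow as tf

       num_workers, worker_index = 4, 0  # hypothetical values

       files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=True)
       files = files.shard(num_workers, worker_index)  # each worker reads its own file subset

       dataset = (files
                  # pull records from several files at once, so one mini-batch
                  # mixes examples from ~cycle_length different files
                  .interleave(tf.data.TFRecordDataset, cycle_length=8)
                  .shuffle(buffer_size=10000)  # shuffle the in-memory portion
                  .batch(128))
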
  21. Profiling Tensorflow
     • Training:
     ◦ Tensorboard
     ◦ Dump and visualize a single step
     ◦ Trace a few steps for a deeper inspection
     ◦ * TPU profiler
     • Inference:
     ◦ Tensorflow Model Benchmarking Tool
  22. Dump a single step
     • Every step is dumped to a separate json file
     • You can manually inspect its visualization with chrome://tracing
     • Unlocks the Memory/Compute tabs in the Graph visualization with Tensorboard
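
     A minimal sketch of producing such a file (TF 1.x; the toy matmul just gives the trace something to show):

       import tensorflow as tf
       from tensorflow.python.client import timeline

       a = tf.random_normal([1000, 1000])
       b = tf.matmul(a, a)  # toy op to profile

       run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
       run_metadata = tf.RunMetadata()

       with tf.Session() as sess:
           sess.run(b, options=run_options, run_metadata=run_metadata)

           # step_stats -> JSON viewable at chrome://tracing
           tl = timeline.Timeline(run_metadata.step_stats)
           with open("step_trace.json", "w") as f:
               f.write(tl.generate_chrome_trace_format())
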
  23. Single CPU
     • Compute tab
     ◦ Ops over time and their dependencies
     • Tensor tab
     ◦ Creation/deallocation timestamps
     ◦ Shapes/memory/... as of the snapshot timestamp
     • Allocators tab
     ◦ Allocated memory over time
  24. Distributed mode
     • A timeline for every worker AND parameter server at that step
     • Allocators tab
     ◦ Allocated memory over time for CPU, GPU, cuda_host
  25. Collect a traceback (see the sketch below)
     • Collect a traceback over a sequence of steps
     • Dump as a single file (or split into subsets)
     • Inspect aggregated statistics with CLI or UI
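
     A minimal sketch using tf.contrib.tfprof from TF 1.x (the toy train_op is illustrative): ProfileContext collects profiles over a range of steps and dumps them under the given directory, ready for the CLI tool's --profile_path flag.

       import tensorflow as tf

       # Toy op to profile.
       w = tf.Variable(3.0)
       train_op = tf.train.GradientDescentOptimizer(0.1).minimize(tf.square(w))

       with tf.contrib.tfprof.ProfileContext('/tmp/profile_dir') as pctx:
           with tf.Session() as sess:
               sess.run(tf.global_variables_initializer())
               for _ in range(1000):
                   sess.run(train_op)
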
  26. Build a CLI tool
     • Build the CLI tool with bazel:
       bazel build -c opt tensorflow/core/profiler:profiler
     • Copy the dump locally
     • Launch the CLI tool:
       bazel-bin/tensorflow/core/profiler/profiler --profile_path=...
  27. Analyze your dump
     • Build visualization
     • Query for the ops consuming the most execution time/memory/…
     • Profile device placement and your model architecture (tensor shapes, memory allocated, etc.)
     ◦ Attribution to python code
     ◦ Filtering by regexp
     • Explore other options in the documentation
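
     For illustration, a query in the profiler's interactive shell for the most time-consuming ops looks roughly like this (the exact commands and flags vary across TF 1.x versions, so treat this as an assumption to check against the tfprof documentation):

       tfprof> op -select micros -order_by micros -max_depth 10
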
  28. Outcomes
     • Use high-level APIs if possible
     • Be careful when training in a distributed mode (adjust batch_size and data sharding if needed)
     • Profile your models during training in order to optimize bottlenecks (e.g., the input data pipeline)