
Distributed TensorFlow and Clouds

Gleb Ivashkevich (independent developer) @ Moscow Python Conf 2017
"Tensorflow быстро стал одним из самых популярных фреймворков для глубокого обучения. Но несмотря на свою гибкость и мощь, в нем есть немало плохо документированных, да и просто сложных элементов. Мы разберемся с некоторыми из них: работой на нескольких графических процессорах и распределенным использованием Tensorflow.
Системы с несколькими GPU - распространенная данность и мы рассмотрим несколько вариантов использования таких систем из Tensorflow. Распределенные системы более экзотичны, поэтому мы попробуем понять, когда они действительно нужны и насколько сложно с ними работать. Во всем этом нам поможет Amazon Web Services.
Без сравнения Tensorflow с конкурентами рассказ был бы неполным, поэтому мы немного покритикуем TF (и, возможно, сделаем несколько комплиментов MXNet) и разберемся, почему несмотря на некоторые недостатки Tensorflow остается лидером".
Видео: https://conf.python.ru/raspredelennyj-tensorflow-i-oblaka/

Moscow Python Meetup

October 20, 2017

Transcript

  1. WTF? What is TensorFlow?
     - library for numerical computation
     - computation = graph
     - graph = nodes (operations) + edges (tensors)
     - works on CPU, GPU (desktop, server, mobile)
     - from Google
  2. WTF? What is TensorFlow?
     - great for deep learning: from research to deployment
     - deep learning is intrinsically parallel
     - batteries included: neural net operations, tools to compute in parallel
  3. Deep learning hardware 101
     - main workhorse: GPU, extremely efficient for parallel operations
     - basic: single machine + single GPU
     - intermediate: single machine + multiple GPUs
     - advanced: multiple machines + multiple GPUs each
  4. Why and when to go parallel
     - data is large: training takes too long
     - model is large: GPU memory is limited
     - you just have multiple GPUs (lucky you)
     - data access and transfer patterns are important
  5. TensorFlow at a glance: graphs and variables
     - computations are arranged as a graph
     - tensors "flow" between nodes
     - variables = persistent tensors
     (diagram: x and y feed an add op, which produces a tensor)

     import tensorflow as tf

     x = tf.Variable(initial_values)
     y = tf.get_variable(var_name)
     z = x + y
  6. TensorFlow at a glance: placeholders and sessions
     - placeholder: to be fed on execution
     - session manages execution, the actual graph evaluation

     import tensorflow as tf

     x = tf.placeholder(tf.float32)
     y = tf.placeholder(tf.float32)
     z = x + y

     with tf.Session() as sess:
         sess.run(tf.global_variables_initializer())
         result = z.eval(feed_dict={x: x_val, y: y_val})
  7. Faces of parallelism: naive
     Run the training script multiple times on different GPUs:

     > CUDA_VISIBLE_DEVICES=0 python train_model.py
     > CUDA_VISIBLE_DEVICES=1 python train_model.py
     …

     OK for an initial hyperparameter search
  8. Faces of parallelism: data parallel
     (diagram: model replicas on device 1 and device 2 each compute loss and
     gradients on their own batch; device 0 averages the gradients and applies
     the update)
     - multiple replicas of the model graph
     - each works on a different set of batches
     - aggregate gradients, update variables
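     A minimal sketch of this pattern in TF 1.x-style code, for illustration only:
     one model "tower" per GPU, gradients averaged on the CPU, a single variable
     update. The helpers build_loss() and next_batch() and the constant NUM_GPUS
     are placeholders, not part of the talk.

     import tensorflow as tf

     NUM_GPUS = 2                                      # assumption: two GPUs
     opt = tf.train.GradientDescentOptimizer(0.01)

     tower_grads = []
     for i in range(NUM_GPUS):
         with tf.device("/gpu:%d" % i):
             with tf.variable_scope("model", reuse=(i > 0)):   # share variables across towers
                 loss = build_loss(next_batch(i))              # hypothetical helpers
                 tower_grads.append(opt.compute_gradients(loss))

     with tf.device("/cpu:0"):                         # aggregation / sync point
         averaged = []
         for grads_and_vars in zip(*tower_grads):      # one tuple per shared variable
             grads = [g for g, _ in grads_and_vars]
             var = grads_and_vars[0][1]
             averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
         train_op = opt.apply_gradients(averaged)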
  9. Faces of parallelism: task parallel
     (diagram: sub_graph_0 and sub_graph_3 on device 0, sub_graph_1 on device 1,
     sub_graph_2 on device 2, connected by dataflow edges)
     - split the graph between devices
     - TensorFlow will handle running it in parallel
     - good for large, complex models
  10. Faces of parallelism: device placement

     import tensorflow as tf

     with tf.device("/gpu:0"):        # lives on the first GPU
         a = tf.placeholder(tf.float32)
         b = tf.placeholder(tf.float32)
         ab = tf.matmul(a, b)

     with tf.device("/gpu:1"):        # lives on the second GPU
         c = tf.placeholder(tf.float32)
         d = tf.placeholder(tf.float32)
         cd = tf.matmul(c, d)

     with tf.device("/cpu:0"):        # lives on the CPU, also a sync point
         res = ab + cd

     sess = tf.Session()
     result = sess.run(res, feed_dict={a: a_val, …})

     log_device_placement is useful to understand what is going on
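     For reference, a minimal sketch of switching on log_device_placement via the
     session config (standard TF 1.x API; the constants are only illustrative):

     import tensorflow as tf

     with tf.device("/gpu:0"):
         a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
         b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
         ab = tf.matmul(a, b)

     # log_device_placement makes TensorFlow log the device chosen for every op,
     # so manual placement can be verified.
     config = tf.ConfigProto(log_device_placement=True)
     with tf.Session(config=config) as sess:
         print(sess.run(ab))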
  11. Faces of parallelism: choice
     - large data, moderate model: data parallel (sync or async)
     - moderate data, large model: task parallel
     - large data, large model: consider distributed
     - what we do not cover: queues
  12. Faces of parallelism: distributed
     TensorFlow provides powerful (but complex) tools for distributed computing
     (diagram: a cluster with job:ps holding task:0 and task:1, and job:worker
     holding task:0 through task:3, each task bound to its own address)
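     A minimal sketch of a ClusterSpec matching the layout on the slide above:
     two parameter servers and four workers. The host names and ports are
     placeholders.

     import tensorflow as tf

     cluster = tf.train.ClusterSpec({
         "ps":     ["ps0.example.com:2222", "ps1.example.com:2222"],
         "worker": ["worker0.example.com:2222", "worker1.example.com:2222",
                    "worker2.example.com:2222", "worker3.example.com:2222"],
     })

     # Each process in the cluster starts one server for its own
     # (job_name, task_index) pair, e.g. the second worker:
     server = tf.train.Server(cluster, job_name="worker", task_index=1)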
  13. Faces of parallelism: cluster

     server:

     import tensorflow as tf

     cluster = tf.train.ClusterSpec({"ps": [], "worker": [server_addr]})
     server = tf.train.Server(cluster, job_name='worker', task_index=0)
     server.join()

     client:

     import tensorflow as tf

     with tf.Session("grpc://server_addr") as sess:
         result = sess.run(...)
  14. Faces of parallelism: cluster

     import tensorflow as tf

     # set flags etc ...

     ps = [ps_addr_0, ps_addr_1, ...]               # parameter servers
     workers = [server_addr_0, server_addr_1, ...]  # worker servers

     cluster = tf.train.ClusterSpec({"ps": ps, "worker": workers})
     server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                              task_index=FLAGS.task_idx)

     if FLAGS.job_name == "ps":
         server.join()
     elif FLAGS.job_name == "worker":
         # do work
  15. Faces of parallelism: cluster

     device placement:

     ...
     elif FLAGS.job_name == "worker":
         with tf.device("/job:ps/task:0/cpu:0"):
             x = tf.placeholder(tf.float32, ...)
             # do some stuff ...

     better way:

     ...
     elif FLAGS.job_name == "worker":
         with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                                       worker_device=this_worker)):
             x = tf.placeholder(tf.float32, ...)
             # do some stuff ...
  16. Faces of parallelism: wrap-up
     Useful stuff:
     - ClusterSpec, Server - define the cluster and a server
     - tf.train.replica_device_setter - spread variables across parameter servers
     - Supervisor - handle crashes, saving etc.
     Does it work? https://www.tensorflow.org/performance/benchmarks
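     A minimal sketch of the Supervisor pattern on a worker (TF 1.x; later
     releases replace it with MonitoredTrainingSession). It assumes cluster,
     server, train_op, global_step and FLAGS were built as on the previous
     slides; the log directory is a placeholder.

     import tensorflow as tf

     # The chief (here: worker task 0) handles checkpointing and recovery.
     sv = tf.train.Supervisor(is_chief=(FLAGS.task_idx == 0),
                              logdir="/tmp/train_logs",      # placeholder path
                              global_step=global_step)

     # managed_session restores from the last checkpoint after a crash.
     with sv.managed_session(server.target) as sess:
         while not sv.should_stop():
             _, step = sess.run([train_op, global_step])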
  17. Faces of parallelism: cloudy
     Small scale:
     - use TensorFlow-provided abstractions
     - do everything by hand
     - store data on disk as is
     - feed data manually
     Large scale:
     - use a cluster manager: Kubernetes, Mesos etc.
     - dockerized workers
     - distributed FS (HDFS, Amazon EFS)
     - run on Google Cloud ML, TensorPort?
     - run over Spark: TensorFlowOnSpark, Databricks
  18. TF vs MXNet (Gluon)
     Some MXNet pros:
     - almost as simple as Keras, but explicitly exposes CPU/GPU tensors and device contexts
     - Gluon looks promising
     Some TensorFlow pros:
     - TensorBoard, TensorFlow Serving, TF for Android
     - do whatever you want: it's not always easy, but you can
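     To illustrate the point about explicit device contexts, a minimal MXNet
     sketch (standard mx.nd API; the shapes are arbitrary):

     import mxnet as mx

     a = mx.nd.ones((2, 2), ctx=mx.gpu(0))   # this tensor lives on the first GPU
     b = mx.nd.ones((2, 2), ctx=mx.cpu())    # this one lives on the CPU
     c = a.copyto(mx.cpu()) + b              # transfers are explicit, not implicit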
  19. Further reading
     - TensorFlow white papers: https://www.tensorflow.org/about/bib
     - TensorFlow architecture: https://www.tensorflow.org/extend/architecture
     - Awesome TensorFlow: https://github.com/jtoy/awesome-tensorflow
     - TensorFlow Dev Summit 2017 videos: https://events.withgoogle.com/tensorflow-dev-summit/
     - "Learning TensorFlow", book by Tom Hope et al.
     - "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron