Slide 1

Slide 1 text

Deep Learning with Python & TensorFlow EuroPython 2016 #EuroPython

Slide 2

Slide 2 text

Ian Lewis, Developer Advocate - Google Cloud Platform, Tokyo, Japan. +Ian Lewis @IanMLewis

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Deep Learning 101

Slide 6

Slide 6 text

Diagram: pixels → Input → Hidden → Output (label) → ["cat"]

Slide 7

Slide 7 text

How do you classify these data points? A neural network can find a way to solve the problem.

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

(x,y,z,?,?,?,?,...)

Slide 10

Slide 10 text

v[x] => vector

Slide 11

Slide 11 text

m[x][y] => matrix

Slide 12

Slide 12 text

t[x][y][z][?][?]... => tensor
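Not on the slides, but a quick illustration of the idea, assuming the 2016-era TensorFlow 0.x/1.x API: a vector is a rank-1 tensor, a matrix is rank-2, and higher-rank tensors are just n-dimensional arrays.

import tensorflow as tf

v = tf.constant([1.0, 2.0, 3.0])            # vector: rank 1, shape (3,)
m = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # matrix: rank 2, shape (2, 2)
t = tf.zeros([2, 3, 4, 5])                  # tensor: rank 4, shape (2, 3, 4, 5)

print(v.get_shape(), m.get_shape(), t.get_shape())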

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Breakthroughs

Slide 16

Slide 16 text

From: Andrew Ng

Slide 17

Slide 17 text

The Inception model (GoogLeNet, 2015)

Slide 18

Slide 18 text

DNN = large matrix ops
A few GPUs >> a CPU (but it still takes hours/days to train)
A supercomputer >> a few GPUs (but you don't have a supercomputer)
You need Distributed Training

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

What's the scalability of Google Brain?
"Large Scale Distributed Systems for Training Neural Networks", NIPS 2015
○ Inception / ImageNet: 40x with 50 GPUs
○ RankBrain: 300x with 500 nodes

Slide 22

Slide 22 text

TensorFlow

Slide 23

Slide 23 text

What is TensorFlow?
● Google's open source library for machine intelligence
● tensorflow.org, launched in Nov 2015
● The second generation of Google's machine learning systems
● Used by many production ML projects

Slide 24

Slide 24 text

TensorFlow
● Operates over tensors: n-dimensional arrays
● Using a flow graph: data flow computation framework
● Flexible, intuitive construction
● Automatic differentiation
● Support for threads, queues, and asynchronous computation; distributed runtime
● Train on CPUs, GPUs
● Run wherever you like
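Not on the slide, but a minimal sketch of what "flow graph" and "automatic differentiation" look like in practice, assuming the 2016-era TensorFlow 0.x/1.x API (the variable names are illustrative only):

import tensorflow as tf

# Ops are only *described* here; nothing executes until a Session runs them.
x = tf.placeholder(tf.float32, shape=[None])    # input, fed at run time
w = tf.Variable(2.0)                            # trainable parameter
loss = tf.reduce_mean(tf.square(w * x + 1.0))   # adds nodes to the flow graph
grad_w = tf.gradients(loss, [w])[0]             # automatic differentiation

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())     # TF 0.x-era initializer
    print(sess.run([loss, grad_w], feed_dict={x: [1.0, 2.0, 3.0]}))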

Slide 25

Slide 25 text

Core TensorFlow data structures and concepts...
- Graph: a TensorFlow computation, represented as a dataflow graph
  - a collection of ops that may be executed together as a group
- Operation: a graph node that performs computation on tensors
- Tensor: a handle to one of the outputs of an Operation
  - provides a means of computing the value in a TensorFlow Session

Slide 26

Slide 26 text

Core TensorFlow data structures and concepts
- Constants: tensors whose values are fixed when the graph is built
- Placeholders: must be fed with data on execution
- Variables: modifiable tensors that live in TensorFlow's graph of interacting operations
- Session: encapsulates the environment in which Operation objects are executed and Tensor objects are evaluated
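Not on the slide, but a minimal sketch putting these four pieces together, again assuming the 2016-era TensorFlow API (names and values are illustrative):

import tensorflow as tf

a = tf.constant([1.0, 2.0, 3.0])            # Constant: fixed at graph-build time
p = tf.placeholder(tf.float32, shape=[3])   # Placeholder: must be fed on execution
v = tf.Variable(tf.zeros([3]))              # Variable: modifiable state in the graph
update = tf.assign(v, v + a * p)            # an Operation whose output is a Tensor

with tf.Session() as sess:                  # Session: where ops run, tensors get values
    sess.run(tf.initialize_all_variables())
    print(sess.run(update, feed_dict={p: [10.0, 20.0, 30.0]}))  # [10. 40. 90.]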

Slide 27

Slide 27 text

Operations, by category (with examples)
- Element-wise math ops: Add, Sub, Mul, Div, Exp, Log, Greater, Less...
- Array ops: Concat, Slice, Split, Constant, Rank, Shape...
- Matrix ops: MatMul, MatrixInverse, MatrixDeterminant...
- Stateful ops: Variable, Assign, AssignAdd...
- NN building blocks: SoftMax, Sigmoid, ReLU, Convolution2D...
- Checkpointing ops: Save, Restore
- Queue & synch ops: Enqueue, Dequeue, MutexAcquire...
- Control flow ops: Merge, Switch, Enter, Leave...
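Not on the slide: a small sketch that touches a few of these categories from Python, assuming the 2016-era API (the values are arbitrary):

import tensorflow as tf

m = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # array op: Constant
prod = tf.matmul(m, m)                      # matrix op: MatMul
act = tf.nn.relu(prod - 10.0)               # element-wise Sub + NN building block: ReLU
probs = tf.nn.softmax(act)                  # NN building block: SoftMax

with tf.Session() as sess:
    print(sess.run(probs))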

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Distributed Training with TensorFlow

Slide 30

Slide 30 text

Model Parallelism = split model, share data

Slide 31

Slide 31 text

Distributed Training

Model Parallelism
- Sub-Graph
  ● Allows fine grained application of parallelism to slow graph components
  ● Larger, more complex graph
- Full Graph
  ● Code is more similar to single process models
  ● Not necessarily as performant (large models)

Data Parallelism
- Synchronous
  ● Prevents workers from “falling behind”
  ● Workers progress at the speed of the slowest worker
- Asynchronous
  ● Workers advance as fast as they can
  ● Can result in runs that aren’t reproducible or difficult-to-debug behavior (large models)

Slide 32

Slide 32 text

Distributed Training with TensorFlow
● CPU/GPU scheduling
● Communications
  ○ Local, RPC, RDMA
  ○ 32/16/8 bit quantization
● Cost-based optimization
● Fault tolerance

Slide 33

Slide 33 text

Model Parallelism: Full Graph Replication
● Similar code runs on each worker, and workers use flags to determine their role in the cluster:

server = tf.train.Server(cluster_def, job_name=this_job_name,
                         task_index=this_task_index)
if this_job_name == 'ps':
    server.join()
elif this_job_name == 'worker':
    # cont'd
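The cluster_def, this_job_name, and this_task_index above are not defined on the slide; a hypothetical way to set them up (the flag names and addresses are assumptions, not from the talk):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string('job_name', 'worker', 'Either "ps" or "worker"')
flags.DEFINE_integer('task_index', 0, 'Index of this task within its job')
FLAGS = flags.FLAGS

# Hypothetical two-PS / two-worker cluster; in practice the addresses would
# come from your cluster manager (e.g. the Kubernetes setup shown later).
cluster_def = tf.train.ClusterSpec({
    'ps': ['ps-0:8080', 'ps-1:8080'],
    'worker': ['worker-0:8080', 'worker-1:8080'],
})
this_job_name = FLAGS.job_name
this_task_index = FLAGS.task_index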

Slide 34

Slide 34 text

Model Parallelism: Full Graph Replication
● Copies of each variable and op are deterministically assigned to parameter servers and workers:

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:{}".format(this_task_index),
        cluster=cluster_def)):
    # Build the model
    global_step = tf.Variable(0)
    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        loss, global_step=global_step)

Slide 35

Slide 35 text

Model Parallelism: Full Graph Replication
● Workers coordinate once-per-cluster tasks using a Supervisor and train independently:

sv = tf.train.Supervisor(
    is_chief=(this_task_index == 0),
    # ... training, summary, and initialization ops ...
    )

with sv.managed_session(server.target) as sess:
    step = 0
    while not sv.should_stop() and step < 1000000:
        # Run a training step asynchronously.
        _, step = sess.run([train_op, global_step])

Slide 36

Slide 36 text

Model Parallelism: Sub-Graph Replication
● Can pin operations specifically to individual nodes in the cluster:

with tf.Graph().as_default():
    losses = []
    for worker in loss_workers:
        with tf.device(worker):
            # Computationally expensive model section,
            # e.g. loss calculation
            losses.append(loss)

Slide 37

Slide 37 text

Model Parallelism: Sub-Graph Replication
● Can use a single synchronized training step, averaging losses from multiple workers:

with tf.device(master):
    losses_avg = tf.add_n(losses) / len(workers)
    train_op = tf.train.AdagradOptimizer(0.01).minimize(
        losses_avg, global_step=global_step)

with tf.Session('grpc://master.address:8080') as sess:
    step = 0
    while step < num_steps:
        _, step = sess.run([train_op, global_step])

Slide 38

Slide 38 text

Data Parallelism = split data, share model

Slide 39

Slide 39 text

Data Parallelism: Asynchronous

train_op = tf.train.AdagradOptimizer(1.0, use_locking=False).minimize(
    loss, global_step=gs)

Slide 40

Slide 40 text

Data Parallelism: Synchronous

losses = []
for worker in workers:
    with tf.device(worker):
        # expensive computation, e.g. loss
        losses.append(loss)

with tf.device(master):
    avg_loss = tf.add_n(losses) / len(workers)
    train_op = tf.train.AdagradOptimizer(1.0).minimize(
        avg_loss, global_step=gs)

Slide 41

Slide 41 text

Kubernetes as a TensorFlow Cluster Manager
Diagram: cluster topology
- jupyter-server (Jupyter Ingress :80)
- tensorboard-server (Tensorboard Ingress :6006)
- tensorflow-worker (master), reached from Jupyter over gRPC :8080
- ps-0, ps-1: tensorflow-worker pods, gRPC :8080
- worker-0, worker-1, ..., worker-14: tensorflow-worker pods, gRPC :8080

Slide 42

Slide 42 text

Why is this important?

Slide 43

Slide 43 text

“dog”

Slide 44

Slide 44 text

Tweak Me!

Slide 45

Slide 45 text

Tweak Me?!?

Slide 46

Slide 46 text

¯\_(ツ)_/¯

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Cloud Machine Learning (Cloud ML)
● Fully managed, distributed training and prediction for a custom TensorFlow graph
● Supports Regression and Classification initially
● Integrated with Cloud Dataflow and Cloud Datalab
● Limited Preview - cloud.google.com/ml

Slide 49

Slide 49 text

Cloud ML
● Jeff Dean's keynote: YouTube video
● Define a custom TensorFlow graph
● Training locally: 8.3 hours w/ 1 node
● Training in the cloud: 32 min w/ 20 nodes (15x faster)
● Prediction in the cloud at 300 reqs / sec

Slide 50

Slide 50 text

Tensor Processing Unit
● ASIC for TensorFlow
● Designed by Google
● 10x better perf / watt
● 8 bit quantization

Slide 51

Slide 51 text

Thank You
https://www.tensorflow.org/
https://cloud.google.com/ml/
http://bit.ly/tensorflow-workshop
Ian Lewis @IanMLewis