Slide 1

Slide 1 text

Tensorflow for Janitors Craft Conference Budapest 2017 Daniel Molnar @soobrosa door2door GmbH 1

Slide 2

Slide 2 text

Perspective • rounded, not complete, • slow, old, stupid and lazy, and • looking for feedback either to add or remove. 2

Slide 3

Slide 3 text

Where I'm coming from • head of data and analytics, • senior applied and data scientist, • data analyst, • head of data, • or just data janitor. 3

Slide 4

Slide 4 text

Orientation What this talk is about: - deep learning, - what a generalist can use Tensorflow for, - what it can teach us about a good product. What this talk is not about: - extinction or salvation by AI, - coding tutorial, - pitching a Google product. 4

Slide 5

Slide 5 text

Deciphering jargon via history 5

Slide 6

Slide 6 text

Adventures in CS (ca. 1999) Machine learning is a function trained (un)supervised that generalizes well but hopefully not too much (overfitting) on a dataset, ending up in fancy theses. 6

Slide 7

Slide 7 text

Technical and simplified We: • run multivariate linear regression • with a cost/loss function to optimize for (typically squared error) • with batch gradient descent. 7
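A minimal NumPy sketch of the idea on this slide (toy data, learning rate and step count are made up for illustration):

```python
import numpy as np

# Toy data: y = 3*x1 - 2*x2 + 1 plus noise (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + 0.1 * rng.normal(size=100)

# Add a bias column so the intercept is just another weight.
Xb = np.hstack([X, np.ones((100, 1))])
w = np.zeros(3)                       # start from zero weights
lr = 0.1                              # learning rate

for step in range(500):
    pred = Xb @ w                     # forward pass: multivariate linear model
    error = pred - y
    loss = (error ** 2).mean()        # squared-error cost/loss function
    grad = 2 * Xb.T @ error / len(y)  # gradient computed over the full batch
    w -= lr * grad                    # batch gradient descent update

print(w)   # should approach [3, -2, 1]
```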

Slide 8

Slide 8 text

A neuron 8

Slide 9

Slide 9 text

Neural networks 9

Slide 10

Slide 10 text

Perceptron (1958) • random start weights, • activation function is a weighted sum exceeding a threshold. 10
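A sketch of the classic perceptron as described above: random initial weights, a step activation (fire when the weighted sum exceeds the threshold), and the error-driven update rule. The AND task and the learning rate are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
w = 0.1 * rng.normal(size=2)    # random start weights
b = 0.0                         # threshold folded in as a bias term

def predict(x):
    # Activation: weighted sum exceeds the threshold -> fire (1), else 0.
    return int(np.dot(w, x) + b > 0)

# Learn logical AND (linearly separable, so a single perceptron can solve it).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
for epoch in range(20):
    for x, target in data:
        error = target - predict(x)
        w += 0.1 * error * np.array(x)   # classic perceptron update rule
        b += 0.1 * error

print([predict(x) for x, _ in data])     # expect [0, 0, 0, 1]
```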

Slide 11

Slide 11 text

Hidden layers before the AI winter ('70s) Mostly Minsky's fault: - non-linear failure (XOR), - backpropagation. 11

Slide 12

Slide 12 text

Backpropagation ('70s-'80s) • activation function is differentiable, • derivative to adjust the weights to minimize error, • chain rule to blame prior layers, • optimize with stochastic gradient descent. 12
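A hand-rolled backpropagation sketch tying the last two slides together: one hidden layer with a differentiable activation, the chain rule pushing blame back through the layers, and plain gradient updates (full batch here for brevity; SGD would use one sample or a mini-batch per step). Network size, learning rate and step count are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
# One hidden layer is enough for XOR, which a single perceptron cannot solve.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 1.0
for step in range(5000):
    # Forward pass with a differentiable activation.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule distributes the blame to prior layers.
    d_out = (out - y) * out * (1 - out)   # dLoss/dz2 for squared error
    d_h = (d_out @ W2.T) * h * (1 - h)    # dLoss/dz1 via the chain rule

    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

# Should approach [0, 1, 1, 0]; an unlucky seed can get stuck in a local minimum.
print(out.round(2).ravel())
```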

Slide 13

Slide 13 text

Deep Learning 13

Slide 14

Slide 14 text

What is it good for? • supervised: near-human-level accuracy in image classification, voice recognition, natural language processing • unsupervised: use large volumes of unstructured data to learn hierarchical models which capture the complex structure in the data, then use these models to predict properties of previously unseen data 14

Slide 15

Slide 15 text

15

Slide 16

Slide 16 text

So is this supercharged ML? Kinda yes: • large-scale neural networks with many layers, • weights can be n-dimensional arrays (tensors), • high-level way of defining prediction code or forward pass, • framework figures the derivatives. 16
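A sketch of "you define the forward pass, the framework figures the derivatives", using the TF 1.x graph API that was current at the time of the talk (data and hyperparameters are made up):

```python
import numpy as np
import tensorflow as tf   # TF 1.x-era graph API

# Placeholders for inputs; Variables for the weights the framework will adjust.
x = tf.placeholder(tf.float32, shape=[None, 2])
y = tf.placeholder(tf.float32, shape=[None, 1])
W = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1]))

# You only define the prediction code / forward pass ...
pred = tf.matmul(x, W) + b
loss = tf.reduce_mean(tf.square(pred - y))

# ... and the framework derives the gradients and the update step for you.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

X_data = np.random.rand(100, 2).astype(np.float32)
y_data = X_data @ np.array([[3.0], [-2.0]], dtype=np.float32) + 1.0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(500):
        sess.run(train_op, feed_dict={x: X_data, y: y_data})
    print(sess.run([W, b]))   # should approach [[3], [-2]] and [1]
```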

Slide 17

Slide 17 text

Who made it work? Blame Canada! According to Geoffrey Hinton, in the past: - our labeled datasets were thousands of times too small, - our computers were millions of times too slow, - we initialized the weights in a stupid way, - we used the wrong type of non-linearity. 17

Slide 18

Slide 18 text

Datasets 18

Slide 19

Slide 19 text

Speed 19

Slide 20

Slide 20 text

GPUs to scale Training is highly parallelizable linear matrix algebra. 20

Slide 21

Slide 21 text

Weights 21

Slide 22

Slide 22 text

Training jargon • regularization to avoid overfitting (dataset augmentation, early stopping, dropout layer, L1 and L2 weight penalties), • proper learning rate (both too high and too low can be bad, plus proper rate decay), • batch normalization (faster learning and higher overall accuracy). 22
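A hedged tf.keras sketch showing where these knobs live in practice (layer sizes, rates and the schedule are illustrative; exact module paths and keyword arguments have shifted between TF releases):

```python
import tensorflow as tf

# Dropout, an L2 weight penalty and batch normalization in one small model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.BatchNormalization(),   # faster learning, higher accuracy
    tf.keras.layers.Dropout(0.5),           # regularization against overfitting
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Learning rate decay: start reasonably high, shrink it as training proceeds.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.9)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=schedule),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Early stopping is another regularizer: quit when validation loss stalls.
early_stop = tf.keras.callbacks.EarlyStopping(patience=3)
# model.fit(x_train, y_train, validation_split=0.1, callbacks=[early_stop])
```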

Slide 23

Slide 23 text

Non-linearity 23

Slide 24

Slide 24 text

Activation functions • sigmoid 1/(1+e^-x) • TanH (2/(1+e^-2x))-1 • ReLU (rectified linear unit) max(0,x) • softplus ln(1+e^x) 24
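The four formulas from the slide written out as NumPy one-liners (sample inputs are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # squashes to (0, 1)

def tanh(x):
    return 2 / (1 + np.exp(-2 * x)) - 1   # same as np.tanh, squashes to (-1, 1)

def relu(x):
    return np.maximum(0, x)               # rectified linear unit

def softplus(x):
    return np.log(1 + np.exp(x))          # smooth approximation of ReLU

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, softplus):
    print(f.__name__, f(x).round(3))
```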

Slide 25

Slide 25 text

ReLU for president ReLU • is sparse and gives more robust representations, • has the best performance, • avoids the vanishing gradient problem, • its smooth approximation, softplus, is used where differentiability at zero matters. 25

Slide 26

Slide 26 text

Basic architectures 26

Slide 27

Slide 27 text

Convolutional (CNN) • traditional CV was hand-crafting, • mimics visual perception, • convolution extracts features [1], • with lots of matrix multiplication, • subsampling/pooling to reduce size and avoid overfitting. [1] LeNet-5, Yann LeCun, 1998 27
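A LeNet-flavoured tf.keras sketch of the convolution-then-pooling pattern described above (input size, filter counts and class count are illustrative, not a reproduction of LeNet-5):

```python
import tensorflow as tf

# Convolutions extract features, pooling shrinks them, a dense head classifies.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),            # e.g. MNIST digits
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),                      # subsampling/pooling
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),      # 10 classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```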

Slide 28

Slide 28 text

Recurrent (RNN) • stateful, • TDNN: time delay neural networks, • LSTM: long short-term memory, • supervised. 28
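A small tf.keras LSTM sketch: the recurrent layer carries state across the sequence, a dense layer makes the supervised prediction. Sequence length, vocabulary size and the sentiment-style task are assumptions for illustration.

```python
import tensorflow as tf

# An LSTM reads a sequence of word ids while keeping an internal memory,
# then a dense layer classifies the whole sequence.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),                  # 100 timesteps
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.LSTM(64),                             # long short-term memory
    tf.keras.layers.Dense(1, activation='sigmoid'),       # e.g. sentiment label
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```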

Slide 29

Slide 29 text

Autoencoder • reinforcement learning (deliver action on context), • DBN: Deep Belief Networks - directed, • DBM: Deep Boltzmann Machines - undirected, • unsupervised. 29
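The slide groups several unsupervised ideas; the sketch below covers only the plain autoencoder, the simplest of them (layer sizes and loss are illustrative):

```python
import tensorflow as tf

# Unsupervised sketch: the autoencoder is trained to reproduce its own input
# through a narrow bottleneck, so the bottleneck learns a compressed code.
inputs = tf.keras.layers.Input(shape=(784,))
code = tf.keras.layers.Dense(32, activation='relu')(inputs)       # encoder
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(code)  # decoder

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Note the target is the input itself -- no labels needed:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```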

Slide 30

Slide 30 text

Tensorflow 30

Slide 31

Slide 31 text

Recent major contestants • 2002 Torch (Lua) industrial, multiple GPUs, acyclic comp. graphs • 2013 Caffe (Python) academic, boilerplate-heavy • 2010 Theano (Python) academic, high-level lightweight Keras • 2011 DistBelief (Google) • 2015 Tensorflow (Python) • 2016 CNTK (C#) 31

Slide 32

Slide 32 text

TF is 18 months old • platforms: DSP, CPU (ARM, Intel), (multiple) GPU(s), TPU, • Linux, OSX, Windows, Android, iOS, Raspberry Pi, • Python, Go, Rust, Java and Haskell, • performance improvements. 32

Slide 33

Slide 33 text

Liaison with Python • API stability, • resembles NumPy more closely, • pip packages are now PyPI compliant, • high-level API includes a new tf.keras module (almost halves the boilerplate), • Sonnet, a new high-level API from DeepMind. 33

Slide 34

Slide 34 text

TF has been open source for 18 months • the most popular machine learning project on GitHub in 2016, • 16,644 commits by 789 people, • 10,031 related repos. 34

Slide 35

Slide 35 text

Tooling • TensorBoard to visualize network topology and performance, • Embedding Projector for high-level model understanding via visualization, • XLA, a domain-specific compiler for TF graphs (CPUs and GPUs), • Fold for dynamic batching, • TensorFlow Serving to serve TF models in production. 35

Slide 36

Slide 36 text

TF product choices: Tesla, not Ford • the right language, • multiple GPUs for training efficiency, • compile times are great (no config files), • high-level API, • enable community, • tooling. 36

Slide 37

Slide 37 text

Open-source models Dozens of pretrained models like: • Inception (CNN), • SyntaxNet parser (LSTM), • Parsey McParseface for English (LSTM), • Parsey's Cousins for 40 additional languages (LSTM). 37
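One way a generalist can pick up a pretrained Inception today is through tf.keras.applications; this is a hedged sketch, not the exact checkpoint release the slide refers to, and 'cat.jpg' is a placeholder path:

```python
import numpy as np
import tensorflow as tf

# Load Inception v3 with weights pretrained on ImageNet.
model = tf.keras.applications.InceptionV3(weights='imagenet')

# Classify one image (hypothetical file path).
img = tf.keras.preprocessing.image.load_img('cat.jpg', target_size=(299, 299))
x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
x = tf.keras.applications.inception_v3.preprocess_input(x)

preds = model.predict(x)
print(tf.keras.applications.inception_v3.decode_predictions(preds, top=3)[0])
```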

Slide 38

Slide 38 text

Examples CNN (perception, image recognition): recycling and cucumber sorting with RasPi, preventing skin cancer and blindness in diabetics. LSTM (translation, speech recognition): language translation. RNN (generation, time series analysis): text, image and doodle generation in style or from text. Reinforcement learning (control and play, autonomous driving): OpenAI Lab. 38

Slide 39

Slide 39 text

A good product • don't lead the pack, • well stolen is half done, • end-to-end, • ecosystem (tooling), • eat your own dogfood. 39

Slide 40

Slide 40 text

Distributed deep learning Past: centralized, shovel all to the same pit, do magic, command and control. Future: pretrain models centrally, distribute models, retrain locally, merge and manage models (SqueezeNet, ~500 kB). Gain: - efficiency, - no big data pipes, - privacy. 40

Slide 41

Slide 41 text

Federated Learning (3 weeks ago) Phones collaboratively learn a shared prediction model: • device downloads the current model, • improves it by learning from local data (retrain), • summarizes the changes to the model as a small focused update, • the update, but no data, is sent to the cloud encrypted, • averaged with other users' updates to improve the shared model. 41
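A toy sketch of the averaging idea only, not Google's actual implementation: every name, the linear model and the fake device data are made up to show the shape of the protocol (local retrain, ship only the weight delta, average on the server).

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.01):
    """Hypothetical on-device step: retrain on local data, return only the
    small, focused difference between new and old weights -- never the data."""
    w = global_weights.copy()
    for x, y in local_data:
        grad = 2 * (w @ x - y) * x           # toy linear-model gradient
        w -= lr * grad
    return w - global_weights                # the update that would be sent up

# Server side: average the updates from many devices into the shared model.
global_w = np.zeros(3)
fake_devices = [[(np.ones(3), 6.0)] * 5 for _ in range(10)]   # made-up local data
for round_ in range(20):
    updates = [local_update(global_w, d) for d in fake_devices]
    global_w += np.mean(updates, axis=0)     # federated averaging
print(global_w)
```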

Slide 42

Slide 42 text

Subject matter experts - deep learning novices • Do you really need it? • Prepare data (small data → transfer learning + domain adaptation, cover the problem space, balance classes, lower dimensionality). • Find an analogy (CNN, RNN/LSTM/GRU, RL). • Create a simple, small & easy baseline model, visualize & debug. • Fine-tune (evaluation metrics - test data, loss function - training). (Smith: Best Practices for Applying Deep Learning to Novel ..., 2017) 42
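A hedged sketch of the "small data → transfer learning" step: reuse a pretrained CNN as a frozen feature extractor and train only a small new head. The choice of MobileNetV2, the five domain classes and the training call are illustrative assumptions, not from the slides.

```python
import tensorflow as tf

# Transfer learning: keep pretrained ImageNet features, learn a new small head.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, weights='imagenet',
                                         pooling='avg')
base.trainable = False                       # freeze the pretrained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax'),   # e.g. 5 domain classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Fine-tune on the small domain dataset (placeholder tensors shown here):
# model.fit(x_small, y_small, validation_split=0.2, epochs=10)
```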

Slide 43

Slide 43 text

Training • hosted (GCMLE, Rescale, FloydHub, Indico, Bi

Slide 44

Slide 44 text

Near future? • bots go bust, • deep learning goes commodity, • AI is cleantech 2.0 for VCs, • MLaaS dies a second death, • full-stack vertical AI startups actually work. (Cross: Five AI Startup Predictions for 2017) 44

Slide 45

Slide 45 text

Major sources and read more • Andrey Kurenkov: A 'Brief' History of Neural Nets and Deep Learning, Part 1-4 • Adam Geitgey: Machine Learning is Fun! • TensorFlow and Deep Learning – Without a PhD (1 and 3 hour version) • Pete Warden: Tensorflow for Poets 45

Slide 46

Slide 46 text

Thank you! @soobrosa 46