Slide 1

Introduction to Deep Learning
Raghotham S, Nischal HP
www.unnati.xyz

Slide 2

AGENDA
- Motivation for Learning
- Introduction to Artificial Neural Networks
- Multi-Layer Perceptron
- Deep Learning in Natural Language Processing
- Impact of GPUs

Slide 3

MACHINE LEARNING

Slide 4

Machine Learning

Slide 5

What is Machine Learning? Extracting features from data to solve predictive tasks.

Slide 6

Use Cases
- Forecasting
- Recommendations
- Anomalies
- Classification
- Ranking
- Summarizing
- Decision making

Slide 7

LEARNING

Slide 8

How do we recognize digits? What is Learning?

Slide 9

Machine Learning Framework (diagram): Inputs → Model (Computation) → Outputs (e.g., an image of a digit in, "8" out)
source: http://www.slideshare.net/indicods/deep-learning-with-python-and-the-theano-library

Slide 10

Recognizing Digits – An Algorithm
Use functions that compute relevant information to solve the problem.

Slide 11

Recognizing Digits – An Algorithm
Use functions that compute relevant information to solve the problem.
k-Nearest Neighbors: for each image, find the "most similar" training image and guess its label. A minimal sketch follows.
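As a rough illustration of the nearest-neighbor idea above, here is a minimal 1-NN sketch in Python/NumPy. The function name and the toy data are invented for illustration; they are not from the deck.

```python
# Minimal 1-nearest-neighbor digit recognition, assuming images are
# flattened into vectors. "Most similar" = smallest Euclidean distance.
import numpy as np

def nearest_neighbor_label(train_images, train_labels, query_image):
    """Return the label of the training image most similar to the query."""
    distances = np.linalg.norm(train_images - query_image, axis=1)
    return train_labels[np.argmin(distances)]

# Toy usage: three 4-pixel "images" with labels 0, 1, 8.
train_images = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 1]], dtype=float)
train_labels = np.array([0, 1, 8])
print(nearest_neighbor_label(train_images, train_labels,
                             np.array([1, 0, 0, 0.9])))  # -> 8
```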

Slide 12

Recognizing Digits – An Algorithm
Hard-coded rules are brittle: mostly better than a coin toss, but far from effective and accurate.

Slide 13

LEARNING
Biological Inspiration

Slide 14

Biological Inspiration
- Connected network of neurons
- Communicate by electrical and chemical signals
- ~10^11 neurons
- ~1,000 synapses per neuron

Slide 15

Questions about Learning
- How is information detected?
- How is it stored?
- How does it influence recognition?

Slide 16

Learnings from Neuro & Cognitive Science

Slide 17

Learnings from Neuro & Cognitive Science
- Kids speak grammatically correct sentences even before they are taught formal language. Kids learn after listening to a lot of sentences.
- See/hear/feel first. Assimilate. Build the context hierarchically. Recognize. Respond.
- Associations and structural inferences. Understand context, e.g., drinking water vs. a river vs. an ocean.

Slide 18

Simplified Brain's Visual System (source: ASDM Summer School on Deep Learning 2014)
Hierarchical structure: several layers of representation, where each layer builds on top of the previous one to obtain more complex representations.

Slide 19

Lessons from Biological Learning
Importance of connectionism:
- Simple units interacting in a complex network
- Distributed representation of knowledge
- Parallelism
- Threshold mechanism for robust classification
- Mechanism for learning: adjusting synaptic weights
- Comprehending the inner structure of observed data

Slide 20

Example for Learning

Slide 21

Two Curves on a plane: classify points to a curve
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 22

Two Curves on a plane: classify points to a curve (a simple model)
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 23

Two Curves on a plane: classify points to a curve (a better model)
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 24

Two Curves on a plane: classify points to a curve (a better model, obtained by transforming to a higher dimension)
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 25

Visualization of learning the separation
source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 26

Now – a Bummer ☺ source: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Slide 27

Neural Network Building Blocks

Slide 28

No content

Slide 29

Neuron, Activation Function (diagram): INPUT → WEIGHTS → NEURON → ACTIVATION FUNCTION → OUTPUT

Slide 30

A single neuron is not enough: individual elements are weak computational elements.
Need: networked elements.

Slide 31

A Simple Neural Network source: https://en.wikipedia.org/wiki/Artificial_neural_network

Slide 32

Feed-forward Neural Network (source: https://en.wikipedia.org/wiki/Artificial_neural_network)
The network is formed by:
- an input layer of source nodes
- one or several hidden layers of processing neurons
- an output layer of processing neuron(s)
- connections only between adjacent layers; there are no feedback connections
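As a minimal sketch of the layer structure just listed, here is one forward pass through a tiny feed-forward network in NumPy. All shapes, names, and the random weights are illustrative assumptions, not from the deck.

```python
# Forward pass: input layer -> one hidden layer -> output layer,
# with connections only between adjacent layers.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input layer: 3 source nodes
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output layer: 2 neurons

h = sigmoid(W1 @ x + b1)   # hidden activations
y = sigmoid(W2 @ h + b2)   # network output
print(y)
```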

Slide 33

Activation Function (source: https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions)
The activation function of a node defines the output of that node given its inputs.

Slide 34

Perceptron (source: ASDM Summer School on Deep Learning 2014)
- Invented in 1957
- Classifies input data into one of the output classes
- If the weighted input exceeds the threshold, classify as 1; else 0
- Online learning is possible
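To make the threshold rule and the online updates concrete, here is a minimal perceptron sketch in Python. The AND task, learning rate, and epoch count are illustrative assumptions.

```python
# Perceptron: output 1 if the weighted input exceeds the threshold,
# else 0; weights are adjusted one example at a time (online learning).
import numpy as np

def predict(w, b, x):
    return 1 if w @ x + b > 0 else 0

# Learn the AND function online.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):                 # a few passes over the data
    for xi, yi in zip(X, y):
        error = yi - predict(w, b, xi)
        w += lr * error * xi        # nudge weights toward the target
        b += lr * error
print([predict(w, b, xi) for xi in X])  # -> [0, 0, 0, 1]
```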

Slide 35

Sigmoid/Logistic: σ(x) = 1 / (1 + e^(-x))
- Output is bounded between 0 and 1
- Domain: the complete set of real numbers
- Smooth and continuous
- Symmetric about its midpoint
- Derivative can be quickly calculated: σ'(x) = σ(x)(1 - σ(x))
- Strictly positive and bounded

Slide 36

Softmax: a generalization of logistic regression to multi-class classification.
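A minimal softmax sketch in NumPy: exponentiate shifted scores and normalize so the outputs form a probability distribution over classes. The example scores are invented.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # guard against overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()            # probabilities summing to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```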

Slide 37

Rectified Linear Units (ReLU)
- Cheap to compute (no exponentials)
- Faster training
- Sparser networks
- Bounded below by 0
- Monotonically non-decreasing

Slide 38

Max-out (source: ASDM Summer School on Deep Learning 2014)
- A generalization of the rectified linear unit
- Takes the max of k linear functions → piecewise linear
- At large k, it can approximate a nonlinear function
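A minimal maxout-unit sketch: the unit's output is the maximum of k affine functions of the input, giving a piecewise-linear activation. The weights, biases, and k = 4 are illustrative assumptions.

```python
import numpy as np

def maxout(x, W, b):
    """x: input vector; W: (k, d) weights; b: (k,) biases."""
    return np.max(W @ x + b)    # max over k linear pieces

rng = np.random.default_rng(1)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)  # k = 4 linear pieces
print(maxout(rng.normal(size=3), W, b))
```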

Slide 39

Learning in ANN

Slide 40

Learning in ANN - Backpropagation source: ASDM Summer School on Deep Learning 2014

Slide 41

Backpropagation Algorithm - Math source: ASDM Summer School on Deep Learning 2014 M..AA...AA…TTTT.. HHH..

Slide 42

Backpropagation Algorithm: recursively and iteratively, the weights that contributed most to the error are penalized (adjusted) the most.

Slide 43

Backpropagation Algorithm

Slide 44

Backpropagation Algorithm

Slide 45

Backpropagation Algorithm

Slide 46

Backpropagation Algorithm

Slide 47

Backpropagation Algorithm
- Computes the gradient of the loss function w.r.t. the weights
- Backpropagates the training error to generate deltas for all neurons, from the output layer back through the hidden layers
- Uses gradient descent to update the weights
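A minimal sketch of those three steps for a tiny sigmoid network trained on XOR with squared loss. The architecture (2 inputs, 3 hidden units, 1 output), learning rate, and iteration count are illustrative assumptions; this is not the exact network from the deck.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])          # XOR targets
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)       # hidden layer
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)       # output layer
lr = 1.0

for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1.T + b1)                      # hidden activations
    out = sigmoid(h @ W2.T + b2)                    # outputs
    # Backward pass: deltas flow from output layer to hidden layer.
    d_out = (out - y) * out * (1 - out)             # output-layer delta
    d_h = (d_out @ W2) * h * (1 - h)                # hidden-layer delta
    # Gradient-descent weight updates.
    W2 -= lr * d_out.T @ h
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * d_h.T @ X
    b1 -= lr * d_h.sum(axis=0)

# Typically approaches [0, 1, 1, 0]; exact convergence depends on the
# random initialization.
print(out.round(2).ravel())
```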

Slide 48

Learning in ANN – Gradient Descent
Goal: find the minimum of the loss function (minimize the error of the model).
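As a minimal illustration of "find the minimum of the loss," here is gradient descent on a one-dimensional loss L(w) = (w - 3)^2, whose minimum is at w = 3. The loss and step size are invented for illustration.

```python
def gradient(w):
    return 2 * (w - 3)      # dL/dw for L(w) = (w - 3)^2

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * gradient(w)   # step against the gradient, i.e. downhill
print(w)                    # ~3.0, the minimum of the loss
```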

Slide 49

Learning in ANN – Gradient Descent
Depending on where we start, we can end up in different places (different local minima).

Slide 50

Er… how do we compute the gradient?

Slide 51

SGD / Mini-Batch: instead of using all of the training data at once, train iteratively on "mini-batches".
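A minimal sketch of the mini-batch loop: shuffle the data each epoch, slice it into batches, and take one gradient step per batch. `update_weights` is a hypothetical placeholder for the gradient step; with batch_size = 1 this becomes the online learning of the next slide.

```python
import numpy as np

def run_epochs(X, y, batch_size, epochs, update_weights):
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)          # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            update_weights(X[idx], y[idx])        # one gradient step

# Toy usage with a dummy update that just reports batch shapes.
X, y = np.arange(10).reshape(5, 2), np.arange(5)
run_epochs(X, y, batch_size=2, epochs=1,
           update_weights=lambda xb, yb: print(xb.shape))
```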

Slide 52

Online Learning: mini-batch size is 1; weights are adjusted for every single data point.

Slide 53

Multi-Layer Perceptron
- A feedforward ANN
- Mostly sigmoid activations
- Can classify data that are not linearly separable
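A minimal MLP sketch using scikit-learn's MLPClassifier; this is one convenient option for a hands-on, though the deck's own exercise may use a different library such as Theano. It fits XOR, a task that is not linearly separable.

```python
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]                          # XOR labels
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    solver="lbfgs", max_iter=5000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))                     # ideally [0 1 1 0]
```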

Slide 54

HANDS ON - MLP

Slide 55

Overfitting (diagram: Underfitted Model vs. Good Model vs. Overfitted Model)
source: http://mathbabe.org/2012/11/20/columbia-data-science-course-week-12-predictive-modeling-data-leakage-model-evaluation/

Slide 56

Some ways to address Overfitting
- Weight decay
- L1/L2 regularization
- Suitable model architectures
- Unsupervised pre-training
- Dropout
- Data augmentation

Slide 57

Dropout: cripple the network by removing hidden units stochastically. In practice, a dropout probability of 0.5 works well.
BEFORE DROPOUT (diagram)
source: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

Slide 58

Dropout: cripple the network by removing hidden units stochastically. In practice, a dropout probability of 0.5 works well.
BEFORE DROPOUT vs. AFTER DROPOUT (diagrams)
source: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
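A minimal sketch of dropout at training time in NumPy: zero each hidden unit with probability p and scale the survivors. This is the "inverted dropout" variant (scaling at train time rather than test time), chosen here for brevity; the deck does not specify a variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    mask = rng.random(h.shape) >= p     # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)         # rescale so expectations match

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(h))  # roughly half the activations are zeroed
```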

Slide 59

Natural Language Processing

Slide 60

Automatic Generation of Cooking Recipes
Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

Slide 61

Existing ML framework for NLP: Input Text → Features (BOW, TF-IDF, etc.) → Linear Model (SVM, Softmax)
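A minimal sketch of that pipeline using scikit-learn (one possible implementation, not necessarily what the deck used): TF-IDF features feeding a linear SVM. The toy corpus and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great movie, loved it", "terrible movie, hated it",
         "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]                        # 1 = positive, 0 = negative

# Input text -> TF-IDF features -> linear model.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["loved the movie"]))    # -> [1] on this toy data
```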

Slide 62

A Sentence source: http://web.stanford.edu/class/cs224n/handouts/CS224N_DeepNLP_Week7_lecture2.pdf

Slide 63

Structure is very important
Structure matters for humor & sarcasm:
"A small crowd quietly enters the historic church" vs. "A historic crowd enters the small quietly church"

Slide 64

Representing structure is hard: the number of n-grams quickly explodes.

Slide 65

Limitations of the architectures so far
- Fixed-size input (e.g., an image)
- Fixed computational steps (e.g., the number of layers)
- Fixed-size output (e.g., probabilities of different classes)
Hierarchy is captured. But context and structure? Mostly not!

Slide 66

Solution?

Slide 67

Solution? Vectorization

Slide 68

Vectorization: convert text to numbers
- One-hot encoding
- WordNet
- Co-occurrence matrix
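A minimal one-hot encoding sketch: each word in the vocabulary maps to a vector with a single 1. The toy vocabulary reuses the earlier water/river/ocean example; note that one-hot vectors say nothing about similarity between words, which motivates Word2Vec on the next slide.

```python
import numpy as np

vocab = ["water", "river", "ocean"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("river"))  # [0. 1. 0.]
```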

Slide 69

Word2Vec
- Window-based vectorization: learns from surrounding words
- The model learns from the data itself instead of relying on any other kind of corpus
- Represents the discrete state of a word
- Similar words are clustered together
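A minimal Word2Vec sketch using the gensim library (an assumption; the deck does not name a library). This uses the gensim 4.x API, where the dimension argument is `vector_size` (older versions used `size`); the toy sentences are invented, and real training needs far more text.

```python
from gensim.models import Word2Vec

sentences = [["drink", "water"], ["river", "water", "flows"],
             ["ocean", "water", "waves"], ["drink", "juice"]]
# sg=1 selects the skip-gram variant shown on the next slide.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1)
print(model.wv["water"][:4])                   # learned vector (first 4 dims)
print(model.wv.most_similar("water", topn=2))  # nearby words in vector space
```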

Slide 70

Word2Vec - Skip Gram source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Slide 71

Word2Vec - CBoW Continuous Bag of Words source: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Slide 72

Word2Vec source: http://deeplearning4j.org/word2vec

Slide 73

Hands on Vectorization

Slide 74

Recurrent Neural Networks

Slide 75

Recurrent Neural Network

Slide 76

One-to-One Sequence: vanilla model, no RNN
source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Slide 77

One-to-Many Sequence: sequence output (e.g., recognize an image and explain it in words)
source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Slide 78

Many-to-One Sequence: sequence input (e.g., sentiment analysis of a sentence)
source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Slide 79

Many-to-Many Sequence: sequence input and sequence output (e.g., machine translation, English to Spanish)
source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Slide 80

RNN

Slide 81

RNN
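A minimal sketch of one vanilla-RNN step in NumPy: the new hidden state mixes the current input with the previous hidden state, which is how the network carries context across a sequence. All shapes and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
Wxh = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
Whh = rng.normal(size=(d_h, d_h))    # hidden-to-hidden (recurrent) weights
bh = np.zeros(d_h)

def rnn_step(x, h_prev):
    return np.tanh(Wxh @ x + Whh @ h_prev + bh)

h = np.zeros(d_h)
for x in rng.normal(size=(4, d_in)):  # a sequence of 4 input vectors
    h = rnn_step(x, h)                # h now summarizes the sequence so far
print(h)
```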

Slide 82

LSTM: Long Short-Term Memory

Slide 83

LSTM
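A minimal sketch of one LSTM step in NumPy, showing the three gates (forget, input, output) and the cell state; gating is what lets LSTMs keep information over long sequences. Biases are omitted for brevity, and the shapes and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
# One weight matrix per gate plus the candidate cell update, each
# acting on the concatenated [input, previous hidden state].
Wf, Wi, Wo, Wc = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                    # forget gate: what to erase from c
    i = sigmoid(Wi @ z)                    # input gate: what to write to c
    o = sigmoid(Wo @ z)                    # output gate: what to expose as h
    c = f * c_prev + i * np.tanh(Wc @ z)   # new cell state
    return o * np.tanh(c), c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):       # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c)
print(h)
```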

Slide 84

Effect of Dataset Size
- With 1 thousand labeled examples, RNNs perform 25-50% worse than a linear model
- With 1 million labeled examples, RNNs perform 0-30% better than a linear model
- RNNs have poor generalization properties on small datasets
- RNNs have better generalization on large datasets

Slide 85

Hands On - RNN

Slide 86

Graphics Processing Unit

Slide 87

No content

Slide 88

Impact of GPUs: compared to CPUs, 20x speedups are typical. GPUs have thousands of cores to process parallel workloads effectively.
Source: http://www.nvidia.com/object/what-is-gpu-computing.html

Slide 89

Impact of GPUs
- Accelerated computation on float32 data
- Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot
- It is difficult to parallelize dense neural networks efficiently across multiple GPUs (an active area of research)
- Copying large quantities of data to and from a device is relatively slow
- NVIDIA has released cuDNN, a CUDA library of deep learning primitives
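As a minimal sketch of GPU-accelerated computation with Theano (the library linked earlier in the deck), the script below builds a large matrix multiply, which is exactly the kind of operation the slide says GPUs speed up. Device selection happens via the THEANO_FLAGS environment variable, e.g. `THEANO_FLAGS=device=gpu,floatX=float32 python script.py`; float32 matters because GPU acceleration targets float32 data.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix("x")                 # symbolic matrix of dtype floatX
y = T.dot(x, x.T)                 # large matrix multiply: GPU-friendly
f = theano.function([x], y)       # compiled for CPU or GPU per the flags

a = np.random.rand(1000, 1000).astype(theano.config.floatX)
f(a)                              # runs on the GPU if the flags select one
```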

Slide 90

SOME DEEP LEARNING EXAMPLES

Slide 91

Video Classification

Slide 92

Image Recognition source: http://www.youtube.com

Slide 93

Image Generation – Google Inceptionism source: http://googleresearch.blogspot.in/2015/06/inceptionism-going-deeper-into-neural.html

Slide 94

Natural Language Processing: Handwriting Generator

Slide 95

Thank you
www.unnati.xyz
@unnati_xyz