
Introduction to Deep Learning & NLP - PyData London 2016


We take a crack at explaining the following topics:
1. What is deep learning?
2. Motivation: Some use cases where it has produced state-of-the-art results
3. Basic building blocks of neural networks (neuron, activation function, backpropagation algorithm, gradient descent algorithm)
4. Supervised learning (multi-layer perceptron, recurrent neural network)
5. Introduction to word2vec
6. Introduction to Recurrent Neural Networks
7. Text classification using RNN
8. Impact of GPUs (Some practical thoughts on hardware and software)

unnati_xyz

May 06, 2016

Transcript

  1. AGENDA: Motivation for learning. Introduction to Artificial Neural Networks. Multi-Layer Perceptron. Deep Learning in Natural Language Processing. Impact of GPUs.
  2. Recognizing Digits – An Algorithm: k-Nearest Neighbors. For each image, find the “most similar” image and guess that as the label. Use functions that compute relevant information to solve the problem.
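
Not from the slides – a minimal sketch of the k-nearest-neighbours approach described above, using scikit-learn's bundled handwritten-digits dataset (the library choice is an assumption):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 8x8 handwritten digits: each image is flattened into a 64-dimensional vector
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

# For each test image, look at the 5 most similar training images and
# take a majority vote over their labels
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```
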
  3. Recognizing Digits – An Algorithm: Hard-coded rules are brittle. Mostly better than a coin toss, but far from being effective and accurate.
  4. Biological Inspiration: A connected network of neurons communicating by electrical and chemical signals. ~10^11 neurons, ~1,000 synapses per neuron.
  5. Questions about Learning: How is information detected? How is it stored? How does it influence recognition?
  6. Learnings from Neuro & Cognitive Science: Kids speak grammatically correct sentences even before they are taught formal language. Kids learn after listening to a lot of sentences. See/hear/feel first. Assimilate. Build context hierarchically. Recognize. Respond. Associations and structural inferences. Understand context, e.g. drinking water vs. a river vs. the ocean.
  7. Simplified Brain’s Visual System (source: ASDM Summer School on Deep Learning 2014): A hierarchical structure with several layers of representation. Each layer builds on top of the previous one to obtain more complex representations.
  8. Lessons from Biological Learning – the Importance of Connectionism: Simple units interacting in a complex network. Distributed representation of knowledge. Parallelism. A threshold mechanism for robust classification. A mechanism for learning – adjusting synaptic weights. Comprehending the inner structure of observed data.
  9. A single neuron is not enough: Individual elements are weak computational elements. Need: networked elements.
  10. Feed-forward Neural Network (source: https://en.wikipedia.org/wiki/Artificial_neural_network): The network is formed by an input layer of source nodes, one or several hidden layers of processing neurons, and an output layer of processing neuron(s). Connections exist only between adjacent layers; there are no feedback connections.
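
A minimal NumPy sketch (not from the slides) of the forward pass through such a network; the layer sizes are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)

# Hypothetical sizes: 4 input nodes -> 5 hidden neurons -> 3 output neurons
W1, b1 = rng.randn(4, 5), np.zeros(5)   # input layer  -> hidden layer
W2, b2 = rng.randn(5, 3), np.zeros(3)   # hidden layer -> output layer

def forward(x):
    h = sigmoid(x @ W1 + b1)      # hidden activations (adjacent layers only)
    return sigmoid(h @ W2 + b2)   # output activations; no feedback connections

print(forward(rng.randn(4)))
```
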
  11. Perceptron (source: ASDM Summer School on Deep Learning 2014): Invented in 1957. Classifies input data into one of the output classes: if the weighted input is more than the threshold, classify as 1, else 0. Online learning is possible.
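
A minimal sketch of the perceptron rule described above (not from the slides); the AND-gate data and learning rate are hypothetical:

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Online learning: weights are updated after every single example."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # weighted input vs. threshold
            w += lr * (target - pred) * xi      # change weights only on mistakes
            b += lr * (target - pred)
    return w, b

# Toy, linearly separable data: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if xi @ w + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```
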
  12. Sigmoid/Logistic: Output is bounded between 0 and 1. Domain: the complete set of real numbers. A smooth and continuous function. Symmetric. The derivative can be calculated quickly. Positive, bounded, strictly increasing.
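
A short sketch of these properties (not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: smooth, defined for all real z, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """The derivative is cheap to compute: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))             # bounded between 0 and 1, 0.5 at z = 0
print(sigmoid_derivative(z))  # strictly positive, peaks at 0.25 for z = 0
```
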
  13. Rectified Linear Units: Cheap to compute (no exponentials). Faster training. Sparser networks. Bounded below by 0. Monotonically non-decreasing.
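
For comparison, a one-line sketch of the rectified linear unit (not from the slides):

```python
import numpy as np

def relu(z):
    """max(0, z): no exponentials, bounded below by 0."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # -> [0.  0.  0.  1.5]
# Negative inputs map to exact zeros, which is what makes the network sparser.
```
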
  14. Max-out (source: ASDM Summer School on Deep Learning 2014): A generalization of the rectified linear unit. The max of k linear functions -> piece-wise linear. At large k, it can approximate a nonlinear function.
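
A small sketch of a max-out unit (not from the slides); the number of linear pieces k and the input size are hypothetical:

```python
import numpy as np

def maxout(x, W, b):
    """Max of k linear functions of x: a piece-wise linear activation.
    W has shape (k, d) and b has shape (k,)."""
    return np.max(W @ x + b, axis=0)

rng = np.random.RandomState(0)
W, b = rng.randn(4, 3), rng.randn(4)   # k = 4 pieces over a 3-dimensional input
print(maxout(rng.randn(3), W, b))

# ReLU is the special case max(0, w.x + b); with more pieces the unit can
# approximate more complicated shapes.
```
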
  15. Backpropagation Algorithm: Recursively and iteratively, the weight of the neuron that contributed most to the error gets penalized the most.
  16. Backpropagation Algorithm: Computes the gradient of the loss function w.r.t. the weights. Backpropagate the training error to generate deltas for all the neurons, from the output layer back through the hidden layers. Use gradient descent to update the weights.
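
A minimal NumPy sketch of these two slides (not from the slides themselves): a one-hidden-layer network trained on XOR with backpropagation and gradient descent. The layer size, learning rate, and squared-error loss are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)

# Toy data: XOR, which a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.randn(2, 8), np.zeros((1, 8))   # input -> hidden (8 units, hypothetical)
W2, b2 = rng.randn(8, 1), np.zeros((1, 1))   # hidden -> output
lr = 0.5

for step in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: deltas flow from the output layer back to the hidden layer
    delta_out = (out - y) * out * (1 - out)      # derivative of squared error
    delta_h = (delta_out @ W2.T) * h * (1 - h)   # chain rule through W2

    # Gradient descent: nudge every weight against its gradient
    W2 -= lr * h.T @ delta_out
    b2 -= lr * delta_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ delta_h
    b1 -= lr * delta_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # typically approaches [0, 1, 1, 0]
```
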
  17. Learning in ANNs – Gradient Descent: Goal: to find the minimum of the loss function (minimize the error of the model).
  18. Learning in ANNs – Gradient Descent: Depending on where we start, we can end up in different places (different local minima).
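
A tiny sketch of that sensitivity to the starting point (not from the slides): gradient descent on a made-up non-convex loss with two minima.

```python
def loss(w):
    return w**4 - 3 * w**2 + w       # hypothetical non-convex loss

def grad(w):
    return 4 * w**3 - 6 * w + 1      # its derivative

def gradient_descent(w, lr=0.01, steps=200):
    for _ in range(steps):
        w -= lr * grad(w)            # step against the gradient
    return w

print(gradient_descent(w=-2.0))   # ends up near the minimum around w = -1.3
print(gradient_descent(w=+2.0))   # ends up near the other minimum around w = +1.1
```
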
  19. Some ways to address overfitting: Weight decay / L1-L2 regularization. Suitable model architectures. Unsupervised pre-training. Dropout. Data augmentation.
  20. Dropout: Cripple the network by removing hidden units stochastically. In practice, a drop probability of 0.5 works well. (Network before dropout; source: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html)
  21. Dropout: Cripple the network by removing hidden units stochastically. In practice, a drop probability of 0.5 works well. (Network before and after dropout; source: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html)
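
A minimal sketch of (inverted) dropout applied to a layer of hidden activations (not from the slides):

```python
import numpy as np

rng = np.random.RandomState(0)

def dropout(h, p_drop=0.5, train=True):
    """Zero each hidden unit with probability p_drop during training and scale
    the survivors, so the expected activation stays the same. No-op at test time."""
    if not train:
        return h
    mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 8))               # hypothetical hidden activations
print(dropout(h))                 # roughly half the units zeroed, the rest scaled up
print(dropout(h, train=False))    # unchanged at test time
```
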
  22. Structure is very important – it is what carries humor & sarcasm: “A small crowd quietly enters the historic church” vs. “A historic crowd enters the small quietly church”.
  23. Limitations of the architectures so far: Fixed-size input (e.g. an image). A fixed number of computational steps (e.g. the number of layers). Fixed-size output (e.g. probabilities of different classes). Hierarchy is captured, but context and structure? Mostly NOT!
  24. Word2Vec: Window-based vectorization – learn from the surrounding words. The model learns from the data itself instead of relying on any other kind of corpus. Represents the discrete state of a word as a vector. Similar words are clustered together.
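
A minimal gensim sketch of training word2vec on a toy window-based corpus (not from the slides; assumes gensim 4.x, where the parameter is vector_size rather than the older size):

```python
from gensim.models import Word2Vec

# Tiny toy corpus (hypothetical); real training needs far more sentences
sentences = [
    ["drinking", "water", "from", "a", "glass"],
    ["the", "river", "flows", "into", "the", "ocean"],
    ["the", "ocean", "is", "larger", "than", "the", "river"],
    ["she", "is", "drinking", "a", "glass", "of", "water"],
]

# `window` controls how many surrounding words are used as context
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["river"][:5])                   # a learned 50-dimensional vector
print(model.wv.most_similar("river", topn=2))  # nearby words in the embedding space
```
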
  25. One-to-Many: Sequence output (recognize an image and explain it in words). Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
  26. RNN

  27. RNN

  28. Effect of Dataset Size: With 1 thousand labeled examples, RNNs perform 25–50% worse than a linear model; with 1 million labeled examples they perform 0–30% better than a linear model. RNNs have poor generalization properties on small datasets and better generalization on large datasets.
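
A minimal Keras sketch of text classification with an RNN, as promised in the deck description (not taken from the slides; the vocabulary size, sequence length, layer sizes, and random dummy data are all hypothetical):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocab_size, max_len = 10000, 80   # hypothetical vocabulary and sequence length

# Dummy data: integer-encoded word sequences with binary labels
X = np.random.randint(1, vocab_size, size=(100, max_len))
y = np.random.randint(0, 2, size=(100,))

model = Sequential([
    Embedding(vocab_size, 64),        # word ids -> dense word vectors
    SimpleRNN(32),                    # read the sequence while carrying a hidden state
    Dense(1, activation="sigmoid"),   # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32)
```
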
  29. Impact of GPUs: Compared to CPUs, 20x speedups are typical (source: http://www.nvidia.com/object/what-is-gpu-computing.html). GPUs have thousands of cores to process parallel workloads effectively.
  30. Impact of GPUs: Accelerated computation on float32 data. Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot. It is difficult to parallelize dense neural networks across multiple GPUs efficiently (an active area of research). Copying large quantities of data to and from the device is relatively slow. NVIDIA has released cuDNN, a CUDA library of deep-learning primitives.
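
A sketch of the float32/GPU configuration in the Theano stack that was typical at the time (not from the slides; depending on the Theano version the device flag is gpu or cuda):

```python
import os
# Must be set before the first `import theano`
os.environ.setdefault("THEANO_FLAGS", "device=gpu,floatX=float32")

import theano
import theano.tensor as T

x = T.matrix("x")   # symbolic matrices use floatX, i.e. float32 here
y = T.matrix("y")
matmul = theano.function([x, y], T.dot(x, y))   # compiled for the configured device

print(theano.config.device, theano.config.floatX)
```
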