Slide 1

Deep Networks and Kernel Methods

Edgar Marca
[email protected]
Grupo de Reconocimiento de Patrones e Inteligencia Artificial Aplicada — PUC
Lima, Perú

June 18th, 2015

Slide 2

Table of Contents

Image Classification Problem

Deep Networks
  Convolutional Neural Networks
  Software
  How to start

Kernel Methods
  SVM
  The Kernel Trick
  History of Kernel Methods
  Software
  How to start

Kernels and Deep Learning
  Convolutional Kernel Networks
  Deep Fried Convnets

Slide 3

Image Classification Problem

Figure: http://www.image-net.org/

Slide 4

Deep Networks

Slide 5

Human-level control through deep reinforcement learning

▶ Volodymyr Mnih et al., Human-level control through deep reinforcement learning.

Slide 6

Convolutional Neural Networks

Slide 7

Software

▶ Torch7 — http://torch.ch/
▶ Caffe — http://caffe.berkeleyvision.org/
▶ Minerva — https://github.com/dmlc/minerva
▶ Theano — http://deeplearning.net/software/theano/

Slide 8

How to start I

▶ Deep Learning Course by Nando de Freitas — https://www.youtube.com/watch?v=PlhFWT7vAEw&list=PLjK8ddCbDMphIMSXn-w1IjyYpHU3DaUYw
▶ Alex Smola Lecture on Deep Networks — https://www.youtube.com/watch?v=xZzZb7wZ6eE
▶ Convolutional Neural Networks for Visual Recognition — http://vision.stanford.edu/teaching/cs231n/
▶ Deep Learning, Spring 2015 — http://cilvr.cs.nyu.edu/doku.php?id=courses:deeplearning2015:start

Slide 9

How to start II

▶ Deep Learning for Natural Language Processing — http://cs224d.stanford.edu/
▶ Applied Deep Learning for Computer Vision with Torch — http://torch.ch/docs/cvpr15.html
▶ Deep Learning, an MIT Press book in preparation — http://www.iro.umontreal.ca/~bengioy/dlbook/
▶ Reading List — http://deeplearning.net/reading-list/

Slide 10

Kernel Methods

Slide 11

Linear Support Vector Machine

Figure: Linear Support Vector Machine: the decision surface ⟨w, x⟩ + b = 0 with the margin hyperplanes ⟨w, x⟩ + b = 1 and ⟨w, x⟩ + b = −1.

Slide 12

Linear SVM - Primal Problem

Given a linearly separable training set D = {(x1, y1), (x2, y2), ..., (xl, yl)} ⊂ Rⁿ × {+1, −1}, we can compute the maximum-margin decision surface ⟨w∗, x⟩ = b∗ by solving the convex program

$$
(P)\quad \min_{w,b}\ \phi(w, b) = \frac{1}{2}\langle w, w\rangle
\quad\text{subject to}\quad \langle w, y_i x_i\rangle \ge 1 + y_i b \ \text{ for all } (x_i, y_i) \in D. \tag{1}
$$

1. The objective function does not depend on b.
2. The offset b appears only in the constraints.
3. The number of constraints equals the number of training points.
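
To make (P) concrete, here is a minimal sketch (not part of the original slides) that solves the primal numerically, assuming cvxpy as a dependency; the toy dataset and variable names are illustrative:

```python
# A minimal sketch: solving the hard-margin primal (P) with cvxpy.
# cvxpy is an assumed dependency; the toy dataset is illustrative.
import cvxpy as cp
import numpy as np

# Linearly separable toy data in R^2 with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape

w = cp.Variable(n)  # normal vector of the decision surface
b = cp.Variable()   # offset: appears only in the constraints

# Objective: (1/2) <w, w>.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
# One constraint per training point: <w, y_i x_i> >= 1 + y_i b.
constraints = [y[i] * (X[i] @ w) >= 1 + y[i] * b for i in range(l)]

cp.Problem(objective, constraints).solve()
print("w* =", w.value, "b* =", b.value)
```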

Slide 13

Linear SVM - Dual Problem

$$
(DP)\quad \max_{\alpha}\ h(\alpha) = \max_{\alpha}\left(\sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\,\alpha_j\,y_i\,y_j\,\langle x_i, x_j\rangle\right)
\quad\text{subject to}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\ \ \alpha_i \ge 0 \text{ for } i = 1, \dots, l.
$$

The offset b∗ is computed in terms of w∗ as follows:

$$
b_+ = \min\{\langle w^*, x\rangle \mid (x, y) \in D,\ y = +1\}, \qquad
b_- = \max\{\langle w^*, x\rangle \mid (x, y) \in D,\ y = -1\},
$$

and then b∗ = (b+ + b−)/2.

The training vectors associated with αi > 0 are called support vectors.
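
In practice one rarely solves (DP) by hand. As a sketch, scikit-learn's SVC exposes the quantities that appear above; the very large C is an assumption used to approximate the hard-margin problem, and the dataset is illustrative:

```python
# A sketch: inspecting the dual solution (DP) via scikit-learn.
# C=1e6 approximates the hard margin; the data is illustrative.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors (alpha_i > 0).
print("support vectors:", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)

# Recover w* = sum_i alpha_i y_i x_i from the dual variables.
w_star = clf.dual_coef_ @ clf.support_vectors_
print("w* =", w_star)

# Note: scikit-learn's decision function is <w, x> + intercept_, so
# intercept_ plays the role of -b* in the slides' sign convention.
print("intercept_ =", clf.intercept_)
```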

Slide 14

Linear Support Vector Machine

$$
\hat{f}(x) = \operatorname{sign}\left(\sum_{i=1}^{l}\alpha_i^*\, y_i\, \langle x_i, x\rangle - b^*\right)
$$
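
This decision function transcribes directly into NumPy; a brief sketch, assuming alphas, xs, ys and b_star come from a solved dual problem:

```python
# A sketch of the linear SVM decision function above; alphas, xs, ys
# and b_star are assumed to come from a solved dual problem (DP).
import numpy as np

def f_hat(x, alphas, xs, ys, b_star):
    """sign( sum_i alpha_i* y_i <x_i, x> - b* )"""
    return np.sign(np.sum(alphas * ys * (xs @ x)) - b_star)
```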

Slide 15

The Kernel Trick

Slide 16

Motivation

▶ How can we separate data that is not linearly separable?
▶ How can we use algorithms that work for linearly separable data and depend only on inner products?

Slide 17

R to R² Case

How can we separate the two classes?

Figure: Separating the two classes of points by transforming them with φ(x) = (x, x²) into a higher-dimensional space where the data is separable.
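
A few lines of NumPy make the picture concrete; in this sketch (with illustrative data) no threshold separates the classes in R, while the second coordinate of φ(x) = (x, x²) does:

```python
# A sketch of the R -> R^2 map phi(x) = (x, x^2); the data is illustrative.
import numpy as np

x_pos = np.array([-0.5, 0.0, 0.5])        # class +1, clustered around 0
x_neg = np.array([-2.0, -1.5, 1.5, 2.0])  # class -1, on both sides of it

def phi(x):
    return np.stack([x, x ** 2], axis=1)

# No single threshold on x separates the classes, but after phi the
# second coordinate x^2 does: below 1 for class +1, above 1 for class -1.
print(phi(x_pos)[:, 1])  # [0.25 0.   0.25]
print(phi(x_neg)[:, 1])  # [4.   2.25 2.25 4.  ]
```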

Slide 18

R² to R³ Case

Figure: Data which is not linearly separable.

Slide 19

R² to R³ Case: A Simulation

Figure: SVM with polynomial kernel visualization.
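
A simulation along these lines can be reproduced with scikit-learn; in this sketch the ring-shaped dataset (make_circles) and all parameters are illustrative choices:

```python
# A sketch: SVM with a polynomial kernel on ring-shaped data.
# make_circles and all parameters are illustrative choices.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A degree-2 polynomial kernel corresponds to a quadratic feature map,
# which is enough to separate a class enclosed by another.
clf = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```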

Slide 20

Idea

Figure: φ is a non-linear mapping from the input space to the feature space.

Slide 21

Non-Linear Support Vector Machine

Now we can use a non-linear map φ to carry the data from the input space into a higher-dimensional feature space.

$$
\hat{f}(x) = \operatorname{sign}\left(\sum_{i=1}^{l}\alpha_i^*\, y_i\, \langle \varphi(x_i), \varphi(x)\rangle - b^*\right)
$$

Slide 22

Definition 3.1 (Kernel)

Let X be a non-empty set. A function k : X × X → K is called a kernel on X if and only if there exist a Hilbert space H and a mapping Φ : X → H such that for all s, t ∈ X it holds that

$$
k(t, s) := \langle \Phi(t), \Phi(s)\rangle_{H}. \tag{2}
$$

The function Φ is called a feature map and H a feature space of k.

Slide 23

Example 3.2

Consider X = R and the function k defined by

$$
k(s, t) = st = \left\langle \begin{bmatrix} s/\sqrt{2} \\ s/\sqrt{2} \end{bmatrix}, \begin{bmatrix} t/\sqrt{2} \\ t/\sqrt{2} \end{bmatrix} \right\rangle,
$$

where the feature maps are Φ(s) = s and Φ̃(s) = (s/√2, s/√2)ᵀ and the feature spaces are H = R and H̃ = R², respectively.
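
A two-line numeric check of Example 3.2 (the values of s and t are arbitrary), confirming that both feature maps realize the same kernel k(s, t) = st:

```python
# A numeric check of Example 3.2; s and t are arbitrary values.
import numpy as np

s, t = 3.0, -1.5

def phi_tilde(u):
    return np.array([u / np.sqrt(2), u / np.sqrt(2)])

# <phi_tilde(s), phi_tilde(t)> = st/2 + st/2 = st = k(s, t).
assert np.isclose(phi_tilde(s) @ phi_tilde(t), s * t)
```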

Slide 24

Non-Linear Support Vector Machines

Using the kernel trick, we can replace ⟨φ(xi), φ(x)⟩ by a kernel k(xi, x):

$$
\hat{f}(x) = \operatorname{sign}\left(\sum_{i=1}^{l}\alpha_i^*\, y_i\, k(x_i, x) - b^*\right)
$$
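
The kernelized decision function also transcribes directly. In this sketch the Gaussian (RBF) kernel is one illustrative choice for k (the slide leaves k generic), and all inputs are assumed to come from a solved dual problem:

```python
# A sketch of the kernelized decision function. The Gaussian kernel is
# one illustrative choice for k; alphas, xs, ys and b_star are assumed
# to come from a solved dual problem.
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def f_hat(x, alphas, xs, ys, b_star, k=gaussian_kernel):
    """sign( sum_i alpha_i* y_i k(x_i, x) - b* )"""
    s = sum(a * yi * k(xi, x) for a, yi, xi in zip(alphas, ys, xs))
    return np.sign(s - b_star)
```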

Slide 25

History of Kernel Methods

Slide 26

Timeline

Table: Timeline of Support Vector Machines Algorithm Development

1909 • Mercer's Theorem — James Mercer, "Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations".
1950 • Moore-Aronszajn Theorem — Nachman Aronszajn, "Reproducing Kernel Hilbert Spaces".
1964 • Geometrical interpretation of kernels as inner products in a feature space — Aizerman, Braverman and Rozonoer, "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning".
1964 • Original SVM algorithm — Vladimir Vapnik and Alexey Chervonenkis, "A Note on One Class of Perceptrons".

Slide 27

Timeline

Table: Timeline of Support Vector Machines Algorithm Development (continued)

1965 • Cover's Theorem — Thomas Cover, "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition".
1992 • Support Vector Machines — Bernhard Boser, Isabelle Guyon and Vladimir Vapnik, "A Training Algorithm for Optimal Margin Classifiers".
1995 • Soft-margin Support Vector Machines — Corinna Cortes and Vladimir Vapnik, "Support-Vector Networks".

Slide 28

Software

▶ LibSVM — https://www.csie.ntu.edu.tw/~cjlin/libsvm/
▶ SVMLight — http://svmlight.joachims.org/
▶ Scikit-learn — http://scikit-learn.org/stable/modules/svm.html

Slide 29

How to start

▶ Introduction to Support Vector Machines — https://beta.oreilly.com/learning/intro-to-svm
▶ Lutz H. Hamel, Knowledge Discovery with Support Vector Machines.
▶ John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis.

Slide 30

Kernels and Deep Learning

Slide 31

Kernels and Deep Learning

▶ Julien Mairal et al., Convolutional Kernel Networks.
▶ Zichao Yang et al., Deep Fried Convnets.

Slide 32

Convolutional Kernel Networks

Slide 33

Deep Fried Convnets

▶ Quoc Viet Le et al., Fastfood: Approximate Kernel Expansions in Loglinear Time.

Slide 34

Questions?

Slide 35

Thanks