Perceptron,
Support Vector Machine, and
Passive Aggressive Algorithm.
Sorami Hisamoto
14 May 2013, PG
Disclaimer
This material gives a brief impression of how these algorithms work.
This material may contain wrong explanations and errors.
Please refer to other materials for more detailed and reliable information.
What is “Machine Learning”?
“Field of study that gives computers the ability to learn
without being explicitly programmed.” Arthur Samuel, 1959.
Types of Machine Learning Algorithms (by the property of the data)
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Active Learning
- ...
Types of Machine Learning Algorithms (by the property of the problem)
- Binary Classification
- Regression
- Multi-class Classification
- Sequential Labeling
- Learning to Rank
- ...
Types of Machine Learning Algorithms (by the parameter optimization strategy)
- Batch Learning
- Online Learning
Linear binary classification
Given X, predict Y.
Output Y is binary; e.g. +1 or -1.
e.g. X: an email. Y: spam or not.
1. Get features from the input.
2. Calculate the inner product of the feature vector and the weights.
3. If the result ≧ 0, the output is +1; else -1.
Implementing a linear binary classifier
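The slide's own code is shown as an image and is not reproduced in this text; as a rough sketch of the three steps above, prediction with a given weight vector might look like the following (extract_features is a hypothetical placeholder):

import numpy as np

def extract_features(x):
    # Hypothetical placeholder: turn the raw input (e.g. an email) into a feature vector.
    return np.asarray(x, dtype=float)

def predict(weights, x):
    # 1. get features, 2. take the inner product with the weights,
    # 3. output +1 if the score is >= 0, else -1.
    score = np.dot(weights, extract_features(x))
    return 1 if score >= 0 else -1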
How do we learn the weights?
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label = +1 and prediction = -1, add the feature vector to the weights.
  - If label = -1 and prediction = +1, subtract the feature vector from the weights.
Hinge loss:
loss(w, x, y) = max(0, -ywx)
Stochastic gradient descent:
∂loss/∂w = -yx if ywx < 0, else 0
(x: input vector, w: weight vector, y: correct label, +1 or -1)
The two update procedures can be summed up as one: w = w + y * x (when the prediction is wrong).
Learning hyperplane: Illustrated
(Figure from http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138 .)
Implementing a perceptron
loss(w, x, y) = max(0, -ywx)
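A minimal training-loop sketch of the update described above (this is not the slide's original code; the data format and the number of epochs are assumptions):

import numpy as np

def train_perceptron(samples, dim, epochs=10):
    # samples: list of (NumPy feature vector, label) pairs, labels are +1 or -1.
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in samples:
            y_hat = 1 if np.dot(w, x) >= 0 else -1
            if y_hat != y:
                w += y * x   # the two update cases combined into one: w = w + y * x
    return w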
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
SVM (Support Vector Machine) [Vapnik & Cortes 1995]
- Perceptron, plus ...
  - Margin maximizing.
  - (Kernel).
Which one looks better, and why?
All 3 classify correctly, but the middle one seems the best.
Margin maximizing
(Figure: the support vectors and the margin.)
“Vapnik–Chervonenkis theory” ensures that if the margin is maximized,
classification performance on unknown data will be maximized.
SVM’s loss function:
max(0, λ - ywx) + α * w^2 / 2
- If the prediction is correct BUT the score < λ, we still get a penalty.
- As w^2 becomes smaller, the margin (λ / w^2) becomes bigger.
margin(w) = λ / w^2
(x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.)
For a detailed explanation, refer to other materials;
e.g. [ݪ 2011, p.34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
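As an illustration of the loss function above (not the original implementation), one stochastic gradient step might look like this; the learning rate eta is an assumed extra hyperparameter:

import numpy as np

def svm_sgd_step(w, x, y, lam=1.0, alpha=0.01, eta=0.1):
    # Loss for one sample: max(0, lam - y * w.x) + alpha * |w|^2 / 2.
    if y * np.dot(w, x) < lam:
        grad = -y * x + alpha * w   # margin violated: hinge part is active
    else:
        grad = alpha * w            # only the regularizer, which shrinks w (bigger margin)
    return w - eta * grad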
Soft-margin
- Sometimes it is impossible to separate the data linearly.
- → Soft-margin:
  - Permit violations of the margin.
  - If the margin is negative, give a penalty.
  - Minimize the penalty and maximize the margin.
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
Passive Aggressive Algorithm
PA [Crammer+ 2006]
- Passive: if the prediction is correct, do nothing.
- Aggressive: if the prediction is wrong, minimally change the weights so that the sample is classified correctly.
Passive & Aggressive: Illustrated
(Figure from http://kazoo04.hatenablog.com/entry/2012/12/20/000000 : if classified correctly, do nothing; otherwise, move minimally to classify correctly.)
Implementing PA
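Since the slide's code is an image, here is a rough sketch of the basic PA update (hinge loss with margin 1; PA-I / PA-II additionally bound the step size with a hyperparameter C):

import numpy as np

def pa_update(w, x, y):
    # Passive: zero loss, keep w. Aggressive: move w just enough so that
    # this sample gets a score of at least 1 in the correct direction.
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w
    tau = loss / np.dot(x, x)   # step size of the minimal update
    return w + tau * y * x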
PA vs. Perceptron & SVM
- PA always classifies the last-seen sample correctly after its update.
- Perceptron & SVM do not, as their update size is constant.
- → PA seems more efficient, but it is more sensitive to noise than Perceptron & SVM.
PA, or MIRA?
MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003]
PA (Passive Aggressive Algorithm) [Crammer+ 2006]
“... MIRA only handles the linearly (separable) case; PA [2] is its extension.
Moreover, most studies that claim to use MIRA actually use [2].”
“... the original MIRA differs in that it minimizes the norm of the parameters after the update,
rather than the size of the update.”
[தᖒ 2009]
Expansion of PA
- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, it is likely to be more reliable (hence update it less).
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...
Used for Gmail’s “priority inbox”.
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
... and which one should we use? (1) [ಙӬ 2012, p.286]
1. Perceptron
   - Easiest to implement; a baseline for the other algorithms.
2. SVM with FOBOS optimization
   - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough ... (next slide)
... and which one should we use? (2) [ಙӬ 2012, p.286]
- If the learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If the accuracy is not enough, first pinpoint the cause.
  - Is the difference between the numbers of positive and negative examples large?
    - Treat the smaller class specially, e.g. give it a larger margin.
    - Reconstruct the training data so that the positive and negative sets are about the same size.
  - Too noisy?
    - Reconsider the data and the features.
  - Difficult to classify linearly?
    - Devise better features.
    - Use a non-linear classifier.