
# Perceptron, Support Vector Machine, and Passive Aggressive Algorithm.

Original presentation at Computational Linguistics Lab, Nara Institute of Science and Technology. May 14, 2013

## Transcript

1. Perceptron,
Support Vector Machine, and
Passive Aggressive Algorithm.
Sorami Hisamoto
14 May 2013, PG

2. Disclaimer
This material gives a brief impression of how these algorithms work.
It may contain wrong explanations and errors;
please refer to other sources for more detailed and reliable information.

3. What is “Machine Learning”?
“Field of study that gives computers the ability to learn
without being explicitly programmed.” Arthur Samuel, 1959.

4. Types of Machine Learning Algorithms
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Active Learning
- ...
Grouped by the nature of the data.

5. Types of Machine Learning Algorithms
- Binary Classification
- Regression
- Multi-class Classification
- Sequence Labeling
- Learning to Rank
- ...
Grouped by the type of problem.

6. Types of Machine Learning Algorithms
- Batch Learning
- Online Learning
Grouped by the parameter optimization strategy.

7.–14. Linear binary classification
Given X, predict Y.
Output Y is binary; e.g. +1 or -1.
e.g. X: an email. Y: spam or not.
1. Extract features from the input.
2. Compute the inner product of the feature vector and the weight vector.
3. If the result ≥ 0, output +1; otherwise output -1.

15.–16. Implementing a linear binary classifier
How do we learn
the weights?
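
The code shown on these slides is not preserved in the transcript. A minimal Python sketch of the three prediction steps above might look like this (the feature names and weight values are made up for illustration):

```python
def predict(weights, features):
    """Linear binary classifier: sign of the inner product of weights and features."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return +1 if score >= 0 else -1

# Hypothetical spam example: bag-of-words features for an email.
features = {"viagra": 1.0, "free": 2.0, "meeting": 0.0}
weights = {"viagra": 1.5, "free": 0.8, "meeting": -1.0}
print(predict(weights, features))  # -> +1, i.e. classified as spam
```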

17.–18. Perceptron / Support Vector Machine / Passive Aggressive Algorithm

19.–21. Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label = +1 and prediction = -1, add the feature vector to the weights.
  - If label = -1 and prediction = +1, subtract the feature vector from the weights.
Hinge loss (perceptron form):
loss(w, x, y) = max(0, -y w·x)
∂loss(w, x, y) / ∂w = -y x if y w·x ≤ 0, else 0
x: input vector, w: weight vector, y: correct label (+1 or -1).
The last two cases combine into a single update rule: w ← w + y x.

22. Learning the hyperplane: illustrated
Figure from http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138 .

23.–27. Implementing a perceptron
loss(w, x, y) = max(0, -y w·x)
∂loss(w, x, y) / ∂w = -y x (when the loss is positive; 0 otherwise)
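
The slide's code is again not preserved; a sketch of the whole training loop under the update rule above, reusing the dict-of-features representation from the earlier sketch:

```python
def train_perceptron(data, epochs=10):
    """Train perceptron weights. data: list of (features, label), label in {+1, -1}."""
    w = {}
    for _ in range(epochs):
        for x, y in data:
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            if y * score <= 0:  # mistake: loss is positive, so apply w <- w + y * x
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + y * v
    return w
```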

28.–29. Perceptron / Support Vector Machine / Passive Aggressive Algorithm


31. SVM [Cortes & Vapnik 1995]
Support Vector Machine: the perceptron, plus
- margin maximization,
- (optionally) kernels.

32.–33. Which one looks better, and why?
All 3 classify correctly, but the middle one seems the best.

34.–41. Margin maximization
Vapnik–Chervonenkis theory gives bounds suggesting that the larger the margin,
the better the classification performance on unseen data.
(Figure: the support vectors are the training points closest to the hyperplane;
the margin is the distance between them and the hyperplane.)
SVM's loss function:
loss(w, x, y) = max(0, λ - y w·x) + α ‖w‖² / 2
- If the prediction is correct BUT the score is below λ, there is still a penalty.
- margin(w) = λ / ‖w‖, so as ‖w‖ becomes smaller, the margin becomes bigger.
x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.
For a detailed explanation, refer to other materials;
e.g. [Suhara 2011, pp.34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
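
As a sanity check on the margin formula (the deck states it without derivation), it follows from the standard point-to-hyperplane distance:

```latex
% Distance from a point x to the hyperplane w \cdot x = 0:
d(x) = \frac{\lvert w \cdot x \rvert}{\lVert w \rVert}
% Support vectors sit at the functional margin y\,(w \cdot x) = \lambda, hence
\operatorname{margin}(w) = \frac{\lambda}{\lVert w \rVert}
% Shrinking \lVert w \rVert (encouraged by the regularizer \tfrac{\alpha}{2}\lVert w \rVert^2)
% therefore widens the margin.
```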

42. Perceptron and SVM

|            | loss(w, x, y)                  | ∂loss(w, x, y) / ∂w            |
|------------|--------------------------------|--------------------------------|
| Perceptron | max(0, -y w·x)                 | -y x (0 if no loss)            |
| SVM        | max(0, λ - y w·x) + α ‖w‖² / 2 | -y x + α w (α w if no loss)    |

x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.

43.–47. Implementing SVM
loss(w, x, y) = max(0, λ - y w·x) + α ‖w‖² / 2
∂loss(w, x, y) / ∂w = -y x + α w (when the hinge term is positive; α w otherwise)
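
The implementation slides are not preserved; a sketch of stochastic sub-gradient descent on this objective, in the style of the perceptron sketch above (the learning rate eta is an added assumption, not from the deck):

```python
def train_svm_sgd(data, lam=1.0, alpha=0.01, eta=0.1, epochs=10):
    """Sub-gradient descent on max(0, lam - y * w.x) + alpha * ||w||^2 / 2."""
    w = {}
    for _ in range(epochs):
        for x, y in data:
            score = sum(w.get(f, 0.0) * v for f, v in x.items())
            for f in w:  # regularizer part of the gradient: alpha * w
                w[f] -= eta * alpha * w[f]
            if y * score < lam:  # margin violated: hinge part contributes -y * x
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w
```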

48. Soft margin
- Sometimes the data cannot be linearly separated.
- → Soft margin:
  - Permit violations of the margin.
  - If the margin is violated, add a penalty.
  - Minimize the penalty while maximizing the margin.
- The hinge term max(0, λ - y w·x) above is exactly this penalty.

49.–50. Perceptron / Support Vector Machine / Passive Aggressive Algorithm

51.–53. PA [Crammer+ 2006]
Passive Aggressive Algorithm:
- Passive: if the prediction is correct, do nothing.
- Aggressive: if the prediction is wrong, minimally update the weights so that the current example is classified correctly.

54.–56. Passive & Aggressive: Illustrated
Figure from http://kazoo04.hatenablog.com/entry/2012/12/20/000000 .
- Passive case: do nothing.
- Aggressive case: move the hyperplane minimally, just enough to classify the example correctly.

57. Implementing PA
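
The slide's code is not preserved. In [Crammer+ 2006] the update w ← w + τ y x chooses the smallest τ that gives the current example a margin of 1; a sketch, reusing the dict representation above:

```python
def pa_update(w, x, y):
    """Passive Aggressive update: w <- w + tau * y * x,
    with tau = loss / ||x||^2 and loss = max(0, 1 - y * w.x)."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    loss = max(0.0, 1.0 - y * score)
    if loss > 0.0:  # aggressive: minimal step that reaches margin 1
        tau = loss / sum(v * v for v in x.values())
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + tau * y * v
    return w  # passive: unchanged when loss == 0
```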

58. PA vs. Perceptron & SVM
- PA always correctly classifies the example it has just seen.
- Perceptron and SVM do not, since their update size is fixed.
- → PA looks more efficient, but it is more sensitive to noise than Perceptron and SVM.

59.–61. PA, or MIRA?
MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003]
PA (Passive Aggressive Algorithm) [Crammer+ 2006]
“... MIRA only handles linearly separable problems, and PA is its extension.
Moreover, most studies that claim to use MIRA actually use [...].”
“... the original MIRA differs in that it minimizes not the size of the model update,
but the norm of the parameters after the update.”
[Nakazawa 2009]

62.–63. Extensions of PA
- PA-I, PA-II: used for Gmail's “priority inbox” (step sizes sketched below).
- Confidence-Weighted learning (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, it is likely to be more reliable (hence update it less).
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...
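
PA-I and PA-II differ from plain PA only in how the step size τ is bounded by an aggressiveness hyperparameter C; the formulas below are from [Crammer+ 2006] (the function shape is an illustrative sketch):

```python
def pa_step_size(loss, x_sq_norm, C=1.0, variant="PA"):
    """Step size tau for the PA variants of [Crammer+ 2006]."""
    if variant == "PA":    # unbounded step: exactly reaches margin 1
        return loss / x_sq_norm
    if variant == "PA-I":  # step clipped at C: more tolerant to noise
        return min(C, loss / x_sq_norm)
    if variant == "PA-II": # step smoothed by C
        return loss / (x_sq_norm + 1.0 / (2.0 * C))
    raise ValueError(f"unknown variant: {variant}")
```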

64. Perceptron / Support Vector Machine / Passive Aggressive Algorithm

65.–68. ... and which one should we use? (1) [Tokunaga 2012, p.286]
1. Perceptron
   - Easiest to implement; a baseline for the other algorithms.
2. SVM with FOBOS optimization
   - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough ... (next slide)

69.–73. ... and which one should we use? (2) [Tokunaga 2012, p.286]
- If the learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If the accuracy is not enough, first pinpoint the cause:
  - Large imbalance between the numbers of positive and negative examples?
    - Give the smaller class special treatment, e.g. a larger margin.
    - Or rebuild the training data so the positive and negative sets are about the same size.
  - Too noisy?
    - Reconsider the data and features.
  - Difficult to classify linearly?
    - Devise better features.
    - Or use a non-linear classifier.

74. Software packages
- Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted.
- LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- ...

75. For further study: Books
- “日本語入力を支える技術” (The Technology Behind Japanese Input), Tokunaga, 2012.
  Chapter 5, Appendices 4 & 5.
- “言語処理のための機械学習入門” (Introduction to Machine Learning for Language Processing), Takamura, 2010.
  Chapter 4.
- “わかりやすいパターン認識” (Easy-to-Understand Pattern Recognition), Ishii et al., 1998.
  Chapters 2 & 3.
- “An Introduction to Support Vector Machines”, Cristianini & Shawe-Taylor, 2000.
  (Also available in Japanese translation: “サポートベクターマシン入門”.)

76. For further study: on the Web - slides and articles
- “数式を一切使用しないSVMの話” (A Talk on SVM Without Any Equations), rti.
  http://prezi.com/9cozgxlearff/svmsvm/
- “パーセプトロンアルゴリズム” (The Perceptron Algorithm), Graham Neubig.
  http://www.phontron.com/slides/nlp-programming-ja-03-perceptron.pdf
- “パーセプトロンで楽しい仲間がぽぽぽぽ〜ん” (Fun Friends Pop Up with the Perceptron), Suhara, 2011.
  http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
- “統計的機械学習入門 5. サポートベクターマシン” (Introduction to Statistical Machine Learning 5: Support Vector Machines), Nakagawa.
  http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/SML1/kernel1.pdf
- “MIRA (Margin Infused Relaxed Algorithm)”, Nakazawa, 2009.
  http://nlp.ist.i.kyoto-u.ac.jp/member/nakazawa/pubdb/other/MIRA.pdf

77. For further study: on the Web - blog articles
- “テキストマイニングのための機械学習超入門 二夜目 パーセプトロン” (A Super Introduction to Machine Learning for Text Mining, Night 2: The Perceptron), AntiBayesian.
  http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138
- “機械学習超入門II” (Super Introduction to Machine Learning II: Master the PA Method Used in Gmail's Priority Inbox in 30 Minutes), EchizenBlog-Zwei.
  http://d.hatena.ne.jp/echizen_tm/20110120/1295547335
- “機械学習超入門III” (Super Introduction to Machine Learning III: Learn the Basics by Building a Perceptron in 30 Minutes), EchizenBlog-Zwei.
  http://d.hatena.ne.jp/echizen_tm/20110606/1307378609
- “機械学習超入門IV” (Super Introduction to Machine Learning IV: Even an SVM Can Be Built in 30 Minutes), EchizenBlog-Zwei.
  http://d.hatena.ne.jp/echizen_tm/20110627/1309188711