
# Perceptron, Support Vector Machine, and Passive Aggressive Algorithm.

Original presentation at Computational Linguistics Lab, Nara Institute of Science and Technology.

May 14, 2013

## Transcript

1. ### Perceptron, Support Vector Machine, and Passive Aggressive Algorithm. Sorami Hisamoto

14 May 2013, PG
2. ### Disclaimer

This material gives a brief impression of how these algorithms work. It may contain wrong explanations and errors. Please refer to other materials for more detailed and reliable information.
3. ### What is “Machine Learning”?

“Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959.)
4. ### Types of Machine Learning Algorithms: by the property of the data

- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Active Learning
- ...
5. ### Types of Machine Learning Algorithms: by the property of the problem

- Binary Classification
- Regression
- Multi-class Classification
- Sequential Labeling
- Learning to Rank
- ...
6. ### Types of Machine Learning Algorithms: by the parameter optimization strategy

- Batch Learning
- Online Learning

9. ### Linear binary classification

Given X, predict Y. Output Y is binary, e.g. +1 or -1. For example: X is an email; Y is whether it is spam or not.
14. ### Linear binary classification

Given X, predict Y. Output Y is binary, e.g. +1 or -1. For example: X is an email; Y is whether it is spam or not. To classify: 1. Extract features from the input. 2. Calculate the inner product of the feature vector and the weight vector. 3. If the result ≧ 0, output +1; otherwise output -1.
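The three classification steps above can be sketched in a few lines. The bag-of-words feature extraction, the weight values, and the example emails below are hypothetical illustrations, not taken from the slides:

```python
# A minimal sketch of the three steps: extract features, take the inner
# product with the weight vector, and threshold at zero.

def extract_features(email):
    # Step 1: hypothetical bag-of-words features from the input text.
    return {word: 1.0 for word in email.lower().split()}

def predict(weights, features):
    # Step 2: inner product of the feature vector and the weights.
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    # Step 3: threshold at zero.
    return 1 if score >= 0 else -1

weights = {"viagra": 2.0, "meeting": -1.5}
print(predict(weights, extract_features("free viagra now")))        # 1
print(predict(weights, extract_features("project meeting today")))  # -1
```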

16. ### Implementing a linear binary classifier

How do we learn the weights?

21. ### Perceptron [Rosenblatt 1957]

For every sample: if the prediction is correct, do nothing; if label = +1 and prediction = -1, add the feature vector to the weights; if label = -1 and prediction = +1, subtract the feature vector from the weights. The two update cases can be summed up as one: w ← w + y·x. This is stochastic gradient descent on the hinge loss, loss(w, x, y) = max(0, -ywx), whose gradient on a misclassified sample is ∂loss(w, x, y) / ∂w = -yx. (x: input vector, w: weight vector, y: correct label, +1 or -1.)
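The single combined update rule (w ← w + y·x on a mistake) can be sketched as follows, assuming numeric feature vectors; the toy dataset is hypothetical:

```python
# A minimal perceptron sketch: on a wrong (or zero-score) prediction,
# add y*x to the weights, which covers both the +1 and -1 cases.
import numpy as np

def perceptron_train(samples, epochs=10):
    w = np.zeros(len(samples[0][0]))
    for _ in range(epochs):
        for x, y in samples:
            x = np.asarray(x, dtype=float)
            if y * w.dot(x) <= 0:      # misclassified sample
                w += y * x             # single update covering both cases
    return w

# Toy linearly separable data: the label follows the first coordinate.
data = [([1.0, 0.2], 1), ([0.8, -0.5], 1), ([-1.0, 0.3], -1), ([-0.7, -0.2], -1)]
w = perceptron_train(data)
print(all(np.sign(w @ np.array(x)) == y for x, y in data))  # True
```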

27. ### Implementing a perceptron

loss(w, x, y) = max(0, -ywx); ∂loss(w, x, y) / ∂w = -yx (when the loss is positive; 0 otherwise).

31. ### SVM [Cortes & Vapnik 1995]

Support Vector Machine: a perceptron, plus margin maximization (and, optionally, kernels).
32. ### Which one looks better, and why?

All three classify correctly, but the middle one seems the best.
41. ### Margin maximization

Vapnik–Chervonenkis theory suggests that maximizing the margin improves classification performance on unknown data. The points closest to the decision boundary are the support vectors, and their distance to it is the margin. SVM’s loss function: max(0, λ - ywx) + α·w²/2, with margin(w) = λ / w². If the prediction is correct but the score is below λ, there is still a penalty; and as w² becomes smaller, the margin (λ / w²) becomes bigger. (x: input vector, w: weight vector, y: correct label +1 or -1, λ and α: hyperparameters.) For a detailed explanation, refer to other materials, e.g. [Suhara 2011, pp. 34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
42. ### Perceptron and SVM

|  | Perceptron | SVM |
| --- | --- | --- |
| loss(w, x, y) | max(0, -ywx) | max(0, λ - ywx) + α·w²/2 |
| ∂loss(w, x, y) / ∂w | -yx | -yx + αw |

x: input vector, w: weight vector, y: correct label (+1 or -1), λ and α: hyperparameters.

47. ### Implementing SVM

loss(w, x, y) = max(0, λ - ywx) + α·w²/2; ∂loss(w, x, y) / ∂w = -yx + αw (the -yx term applies only when the margin is violated, i.e. ywx < λ; otherwise only the regularizer term αw remains).
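An SGD sketch of this loss and gradient; the values of λ, α, the learning rate, the epoch count, and the toy data are illustrative assumptions, not values from the slides:

```python
# Minimal SGD on the SVM loss max(0, lam - y*w.x) + alpha*w^2/2:
# when the margin is violated, the gradient is -y*x + alpha*w,
# otherwise only the regularizer alpha*w remains.
import numpy as np

def svm_sgd(samples, lam=1.0, alpha=0.01, lr=0.1, epochs=100):
    w = np.zeros(len(samples[0][0]))
    for _ in range(epochs):
        for x, y in samples:
            x = np.asarray(x, dtype=float)
            if y * w.dot(x) < lam:        # margin violated
                grad = -y * x + alpha * w
            else:                         # margin satisfied
                grad = alpha * w
            w -= lr * grad
    return w

data = [([1.0, 0.2], 1), ([0.8, -0.5], 1), ([-1.0, 0.3], -1), ([-0.7, -0.2], -1)]
w = svm_sgd(data)
print(all(np.sign(w @ np.array(x)) == y for x, y in data))  # True
```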
48. ### Soft-margin

Sometimes the data is impossible to separate linearly. → Soft-margin: permit violations of the margin; if the margin is violated, give a penalty; then minimize the penalty while maximizing the margin.

53. ### PA [Crammer+ 2006]

Passive Aggressive Algorithm. Passive: if the prediction is correct, do nothing. Aggressive: if the prediction is wrong, update the weights minimally so that the sample is classified correctly: change as little as possible, yet classify correctly.
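The passive and aggressive cases can be sketched with the basic PA update, which on a margin violation takes the smallest step τ·y·x restoring a margin of 1, with τ = loss / ‖x‖². The margin of 1 and the toy samples are illustrative assumptions (the PA-I/PA-II variants additionally cap or slacken τ):

```python
# A minimal sketch of the basic PA update.
import numpy as np

def pa_update(w, x, y):
    loss = max(0.0, 1.0 - y * w.dot(x))
    if loss > 0.0:                  # aggressive: minimal correcting step
        tau = loss / x.dot(x)
        w = w + tau * y * x
    return w                        # passive: unchanged when loss == 0

samples = [(np.array([1.0, 0.2]), 1), (np.array([-1.0, 0.3]), -1)]
w = np.zeros(2)
for x, y in samples:
    w = pa_update(w, x, y)

# By construction, the last-seen sample now satisfies the margin of 1.
x_last, y_last = samples[-1]
print(y_last * w.dot(x_last) > 0.999)  # True
```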

56. ### Passive & Aggressive: Illustrated

Figure from http://kazoo04.hatenablog.com/entry/2012/12/20/000000. Passive: do nothing. Aggressive: move the weights minimally to classify correctly.

58. ### PA vs. Perceptron & SVM

PA always classifies the last-seen sample correctly after its update. Perceptron and SVM do not, as their update size is constant. PA therefore seems more efficient, but it is more sensitive to noise than Perceptron and SVM.
61. ### PA, or MIRA?

MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003] vs. PA (Passive Aggressive Algorithm) [Crammer+ 2006]. “... MIRA only handles linearly separable problems; PA is its extension. Also, most research that claims to use MIRA actually uses [PA].” https://twitter.com/taku910/status/243760585030901761 “... the original MIRA differs in that it minimizes the norm of the parameters after the update, rather than the size of the model update.” [Nakazawa 2009]
63. ### Expansions of PA

- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]: if a feature appeared frequently in the past, it is likely to be more reliable (hence update it less); fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]: more tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...

Used for Gmail’s “Priority Inbox”.

65. ### ... and which one should we use? (1) [Tokunaga 2012, p.286]

1. Perceptron: easiest to implement; a baseline for the other algorithms.
2. SVM with FOBOS optimization: almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression.
4. If that is not enough... (next slide)
69. ### ... and which one should we use? (2) [Tokunaga 2012, p.286]

- If learning speed is not enough: try PA, CW, AROW, etc., but be aware that they are sensitive to noise.
- If accuracy is not enough: first pinpoint the cause.
  - Large imbalance between the numbers of positive and negative examples? Give special treatment (e.g. a larger margin) to the smaller class, or reconstruct the training data so that the positive and negative sets are about the same size.
  - Too noisy? Reconsider the data and features.
  - Difficult to classify linearly? Devise better features, or use a non-linear classifier.
74. ### Software packages

- OLL https://code.google.com/p/oll/wiki/OllMainJa (Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted)
- LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- AROW++ https://code.google.com/p/arowpp/
- ...
75. ### For further study: Books

- “日本語入力を支える技術” (The Technology Behind Japanese Input), Takuya Tokunaga, 2012. Chapter 5, Appendices 4 & 5.
- “言語処理のための機械学習入門” (Introduction to Machine Learning for Language Processing), Daiya Takamura, 2010. Chapter 4.
- “わかりやすいパターン認識” (Pattern Recognition Made Clear), Ken’ichiro Ishii et al., 1998. Chapters 2 & 3.
- “An Introduction to Support Vector Machines”, Cristianini & Shawe-Taylor, 2000. Also available in Japanese translation (“サポートベクターマシン入門”).
76. ### For further study: on the Web (slides and articles)

- “数式を一切使用しないSVMの話” (SVM without a single equation), rti. http://prezi.com/9cozgxlearff/svmsvm/
- “パーセプトロンアルゴリズム” (The Perceptron Algorithm), Graham Neubig. http://www.phontron.com/slides/nlp-programming-ja-03-perceptron.pdf
- “パーセプトロンで楽しい仲間がぽぽぽぽ〜ん” (a playful introduction to the perceptron), Yoshihiko Suhara, 2011. http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
- “統計的機械学習入門 5. サポートベクターマシン” (Introduction to Statistical Machine Learning, 5: Support Vector Machines), Hiroshi Nakagawa. http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/SML1/kernel1.pdf
- “MIRA (Margin Infused Relaxed Algorithm)”, Toshiaki Nakazawa, 2009. http://nlp.ist.i.kyoto-u.ac.jp/member/nakazawa/pubdb/other/MIRA.pdf
77. ### For further study: on the Web (blog articles)

- “A Super-Introduction to Machine Learning for Text Mining, Night 2: The Perceptron”, あんちべ! http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138
- “Machine Learning Super-Introduction II: Master the PA method, also used in Gmail’s Priority Inbox, in 30 minutes”, EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110120/1295547335
- “Machine Learning Super-Introduction III: Learn the basics of machine learning by building a perceptron in 30 minutes”, EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110606/1307378609
- “Machine Learning Super-Introduction IV: Even an SVM can be built in 30 minutes”, EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110627/1309188711