Disclaimer
This material gives a brief impression of how these algorithms work. It may contain wrong explanations and errors. Please refer to other materials for more detailed and reliable information.
What is “Machine Learning”?
“Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
Linear binary classification
- Given X, predict Y. The output Y is binary, e.g. +1 or -1.
- Example: X is an email, Y is whether it is spam or not.
1. Get features from the input.
2. Calculate the inner product of the feature vector and the weights.
3. If the result ≥ 0, output +1; otherwise output -1.
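To make these three steps concrete, here is a minimal Python sketch (not from the original slides; the feature values and weights are made up for illustration):

```python
import numpy as np

def predict(w, x):
    """Linear binary classifier: sign of the inner product of weights and features."""
    score = np.dot(w, x)             # step 2: inner product
    return 1 if score >= 0 else -1   # step 3: threshold at zero

# Hypothetical spam example: features could be counts of suspicious words.
x = np.array([3.0, 0.0, 1.0])        # step 1: feature vector extracted from an email
w = np.array([0.8, -0.2, 0.5])       # weights learned beforehand
print(predict(w, x))                 # -> +1 (classified as spam in this toy setup)
```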
Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label = +1 and prediction = -1, add the feature vector to the weights.
  - If label = -1 and prediction = +1, subtract the feature vector from the weights.
- The two update rules sum up to one: w ← w + y * x.
- Hinge loss: loss(w, x, y) = max(0, -y w·x)
- Stochastic gradient descent: when the loss is positive, ∂loss/∂w = -y x, which gives exactly the update w ← w + y x.
- x: input vector, w: weight vector, y: correct label (+1 or -1).
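A minimal sketch of this update loop, treating it as SGD on the loss above (illustrative code, not from the slides; the toy data is made up):

```python
import numpy as np

def perceptron_train(samples, n_features, epochs=10):
    """Train a perceptron: w <- w + y*x whenever a sample is misclassified."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in samples:              # y is +1 or -1
            if y * np.dot(w, x) <= 0:     # loss max(0, -y*w.x) is active
                w += y * x                # gradient step: -d(loss)/dw = y*x
    return w

# Toy linearly separable data (made up for illustration).
samples = [(np.array([2.0, 1.0]), +1),
           (np.array([-1.0, -2.0]), -1),
           (np.array([1.5, 0.5]), +1)]
w = perceptron_train(samples, n_features=2)
print(w, [int(np.sign(np.dot(w, x))) for x, _ in samples])
```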
Margin maximizing
- Vapnik–Chervonenkis theory suggests that the larger the margin, the better the classifier can be expected to perform on unknown data.
- (Figure) The samples closest to the decision boundary are the support vectors; the distance between them and the boundary is the margin.
- SVM's loss function: max(0, λ - y w·x) + α ‖w‖² / 2
  - If the prediction is correct but the score is below λ, the sample still gets a penalty.
  - The margin is proportional to λ / ‖w‖, so as ‖w‖ becomes smaller the margin becomes bigger; the α ‖w‖² / 2 term pushes ‖w‖ down.
- x: input vector, w: weight vector, y: correct label (+1 or -1), λ and α: hyperparameters.
- For a detailed explanation, refer to other materials, e.g. [ݪ 2011, p.34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
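A rough sketch of minimizing this loss with per-sample subgradient steps (a simplified, Pegasos-style scheme; the learning rate, hyperparameters, and data are assumptions for illustration):

```python
import numpy as np

def svm_sgd(samples, n_features, lam=1.0, alpha=0.01, lr=0.1, epochs=100):
    """SGD on max(0, lam - y*w.x) + alpha*||w||^2/2, one subgradient step per sample."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in samples:
            if y * np.dot(w, x) < lam:   # inside the margin or misclassified: hinge term is active
                grad = -y * x + alpha * w
            else:                        # outside the margin: only the regularizer contributes
                grad = alpha * w
            w -= lr * grad
    return w

samples = [(np.array([2.0, 1.0]), +1),
           (np.array([-1.0, -2.0]), -1)]
w = svm_sgd(samples, n_features=2)
print(w, [y * np.dot(w, x) for x, y in samples])  # final scores y*w.x for each sample
```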
Soft-margin
- Sometimes the data cannot be linearly separated.
- → Soft-margin: permit violations of the margin.
- If a sample violates the margin, give it a penalty.
- Minimize the total penalty while maximizing the margin.
PA (Passive Aggressive Algorithm) [Crammer+ 2006]
- Passive: if the prediction is correct, do nothing.
- Aggressive: if the prediction is wrong, update the weights so the sample is classified correctly, while changing them as little as possible.
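A minimal sketch of the basic PA update, the closed-form step without the C parameter of PA-I/PA-II (illustrative code, not from the slides; lam defaults to 1 as in the standard formulation):

```python
import numpy as np

def pa_update(w, x, y, lam=1.0):
    """Passive-Aggressive update: smallest change to w such that y*w.x >= lam."""
    loss = max(0.0, lam - y * np.dot(w, x))
    if loss == 0.0:
        return w                     # passive: already correct with enough margin
    tau = loss / np.dot(x, x)        # aggressive: step size from the closed-form solution
    return w + tau * y * x
```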
PA vs. Perceptron & SVM
- After its update, PA always classifies the most recently seen sample correctly.
- Perceptron and SVM do not, because their update size is fixed.
- → PA seems more efficient, but it is more sensitive to noise than Perceptron and SVM.
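A small numeric check of this claim on one misclassified example (the numbers are made up for illustration):

```python
import numpy as np

# One misclassified example: y = +1 but the current score is strongly negative.
w = np.array([-3.0, 0.0])
x = np.array([1.0, 0.5])
y = +1

# Perceptron update (fixed-size step): may still misclassify the same example.
w_perc = w + y * x
print(y * np.dot(w_perc, x))   # -1.75: still wrong after the update

# PA update (step size chosen so that this example is fixed exactly).
tau = max(0.0, 1.0 - y * np.dot(w, x)) / np.dot(x, x)
w_pa = w + tau * y * x
print(y * np.dot(w_pa, x))     # 1.0: classified correctly with margin 1
```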
Expansion of PA
- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, it is likely to be more reliable, hence it is updated less.
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...
- Used for Gmail's “priority inbox”.
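To give a flavor of the confidence-weighted idea, here is a rough sketch of a diagonal AROW-style update; the diagonal covariance and the hyperparameter r = 1 are simplifying assumptions for illustration, not code from the slides or the papers:

```python
import numpy as np

def arow_diag_update(mu, sigma, x, y, r=1.0):
    """One diagonal AROW-style update: confident (low-variance) features move less."""
    margin = y * np.dot(mu, x)
    if margin >= 1.0:
        return mu, sigma                      # confident and correct: no update
    conf = np.dot(sigma * x, x)               # x^T Sigma x with diagonal Sigma
    beta = 1.0 / (conf + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * y * (sigma * x)         # low-variance features get smaller steps
    sigma = sigma - beta * (sigma * x) ** 2   # variance shrinks for the features just used
    return mu, sigma

# Toy usage (made-up data): start with mean 0 and unit variance per feature.
mu, sigma = np.zeros(2), np.ones(2)
mu, sigma = arow_diag_update(mu, sigma, np.array([1.0, 0.5]), +1)
print(mu, sigma)
```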
... and which one should we use? (1) [ಙӬ 2012, p.286]
1. Perceptron
  - Easiest to implement; use it as a baseline for the other algorithms.
2. SVM with FOBOS optimization
  - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough... (next slide)
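A rough sketch of what “SVM with FOBOS optimization” can look like: a hinge-loss subgradient step followed by FOBOS's proximal step, here with an L1 regularizer (the choice of L1 and all constants are assumptions for illustration, not from the slides):

```python
import numpy as np

def svm_fobos_step(w, x, y, lr=0.1, l1=0.01):
    """One FOBOS-style step: hinge-loss subgradient, then L1 soft-thresholding."""
    # Forward step: subgradient of the hinge loss max(0, 1 - y*w.x).
    if y * np.dot(w, x) < 1.0:
        w = w + lr * y * x
    # Backward (proximal) step: closed-form soft-thresholding for the L1 regularizer.
    return np.sign(w) * np.maximum(0.0, np.abs(w) - lr * l1)

# Toy usage: repeatedly feed two made-up samples.
w = np.zeros(2)
for x, y in [(np.array([2.0, 1.0]), +1), (np.array([-1.0, -2.0]), -1)] * 50:
    w = svm_fobos_step(w, x, y)
print(w)
```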
... and which one should we use? (2) [ಙӬ 2012, p.286]
- If the learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If the accuracy is not enough, first pinpoint the cause.
  - Large imbalance between the numbers of positive and negative examples?
    - Give special treatment to the smaller class, e.g. a larger margin.
    - Or rebuild the training data so the positive and negative classes are about the same size.
  - Too noisy?
    - Reconsider the data and the features.
  - Difficult to classify linearly?
    - Devise better features.
    - Use a non-linear classifier.