Perceptron,
Support Vector Machine, and
Passive Aggressive Algorithm.
Sorami Hisamoto
14 May 2013, PG
Disclaimer
This material gives a brief impression of how these algorithms work.
This material may contain wrong explanations and errors.
Please refer to other materials for more detailed and reliable information.
What is “Machine Learning”?
“Field of study that gives computers the ability to learn
without being explicitly programmed.” Arthur Samuel, 1959.
Types of Machine Learning Algorithms (by the property of the data)
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Active Learning
- ...
Types of Machine Learning Algorithms (by the property of the problem)
- Binary Classification
- Regression
- Multi-class Classification
- Sequential Labeling
- Learning to Rank
- ...
Types of Machine Learning Algorithms (by the parameter optimization strategy)
- Batch Learning
- Online Learning
Linear binary classification
Given X, predict Y.
Output Y is binary; e.g. +1 or -1.
e.g. X: an email. Y: spam or not.
1. Get features from the input.
2. Calculate the inner product of the feature vector and the weights.
3. If the result ≧ 0, the output is +1; else -1.
Implementing a linear binary classifier
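The slide's own code is shown as an image and is not reproduced in this text; as a rough sketch of the three steps above, prediction with a given weight vector might look like the following (extract_features is a hypothetical placeholder):

import numpy as np

def extract_features(x):
    # Hypothetical placeholder: turn the raw input (e.g. an email) into a feature vector.
    return np.asarray(x, dtype=float)

def predict(weights, x):
    # 1. get features, 2. take the inner product with the weights,
    # 3. output +1 if the score is >= 0, else -1.
    score = np.dot(weights, extract_features(x))
    return 1 if score >= 0 else -1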
How do we learn the weights?
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label = +1 and prediction = -1, add the feature vector to the weights.
  - If label = -1 and prediction = +1, subtract the feature vector from the weights.
Hinge loss:
loss(w, x, y) = max(0, -ywx)
Stochastic gradient descent:
∂loss/∂w = -yx if ywx < 0, else 0
(x: input vector, w: weight vector, y: correct label, +1 or -1)
The two update procedures can be summed up as one: w = w + y * x (when the prediction is wrong).
Learning hyperplane: Illustrated
(Figure from http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138 .)
Implementing a perceptron
loss(w, x, y) = max(0, -ywx)
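A minimal training-loop sketch of the update described above (this is not the slide's original code; the data format and the number of epochs are assumptions):

import numpy as np

def train_perceptron(samples, dim, epochs=10):
    # samples: list of (NumPy feature vector, label) pairs, labels are +1 or -1.
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in samples:
            y_hat = 1 if np.dot(w, x) >= 0 else -1
            if y_hat != y:
                w += y * x   # the two update cases combined into one: w = w + y * x
    return w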
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
SVM (Support Vector Machine) [Vapnik & Cortes 1995]
- Perceptron, plus ...
  - Margin maximizing.
  - (Kernel).
Which one looks better, and why?
All 3 classify correctly, but the middle one seems the best.
Margin maximizing
(Figure: the support vectors and the margin.)
“Vapnik–Chervonenkis theory” ensures that if the margin is maximized,
classification performance on unknown data will be maximized.
SVM’s loss function:
max(0, λ - ywx) + α * w^2 / 2
- If the prediction is correct BUT the score < λ, we still get a penalty.
- As w^2 becomes smaller, the margin (λ / w^2) becomes bigger.
margin(w) = λ / w^2
(x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.)
For a detailed explanation, refer to other materials;
e.g. [ݪ 2011, p.34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
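As an illustration of the loss function above (not the original implementation), one stochastic gradient step might look like this; the learning rate eta is an assumed extra hyperparameter:

import numpy as np

def svm_sgd_step(w, x, y, lam=1.0, alpha=0.01, eta=0.1):
    # Loss for one sample: max(0, lam - y * w.x) + alpha * |w|^2 / 2.
    if y * np.dot(w, x) < lam:
        grad = -y * x + alpha * w   # margin violated: hinge part is active
    else:
        grad = alpha * w            # only the regularizer, which shrinks w (bigger margin)
    return w - eta * grad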
Soft-margin
- Sometimes it is impossible to separate the data linearly.
- → Soft-margin:
  - Permit violations of the margin.
  - If the margin is negative, give a penalty.
  - Minimize the penalty and maximize the margin.
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
Passive Aggressive Algorithm
PA [Crammer+ 2006]
- Passive: if the prediction is correct, do nothing.
- Aggressive: if the prediction is wrong, minimally change the weights so that the sample is classified correctly.
Passive & Aggressive: Illustrated
(Figure from http://kazoo04.hatenablog.com/entry/2012/12/20/000000 : if classified correctly, do nothing; otherwise, move minimally to classify correctly.)
Implementing PA
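Since the slide's code is an image, here is a rough sketch of the basic PA update (hinge loss with margin 1; PA-I / PA-II additionally bound the step size with a hyperparameter C):

import numpy as np

def pa_update(w, x, y):
    # Passive: zero loss, keep w. Aggressive: move w just enough so that
    # this sample gets a score of at least 1 in the correct direction.
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w
    tau = loss / np.dot(x, x)   # step size of the minimal update
    return w + tau * y * x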
PA vs. Perceptron & SVM
- PA always classifies the last-seen sample correctly after its update.
- Perceptron & SVM do not, as their update size is constant.
- → PA seems more efficient, but it is more sensitive to noise than Perceptron & SVM.
PA, or MIRA?
MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003]
PA (Passive Aggressive Algorithm) [Crammer+ 2006]
“... MIRA only handles the linearly (separable) case; PA [2] is its extension.
Moreover, most studies that claim to use MIRA actually use [2].”
“... the original MIRA differs in that it minimizes the norm of the parameters after the update,
rather than the size of the update.”
[தᖒ 2009]
Expansion of PA
- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, it is likely to be more reliable (hence update it less).
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...
Used for Gmail’s “priority inbox”.
Perceptron / Support Vector Machine / Passive Aggressive Algorithm
... and which one should we use? (1) [ಙӬ 2012, p.286]
1. Perceptron
   - Easiest to implement; a baseline for the other algorithms.
2. SVM with FOBOS optimization
   - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough ... (next slide)
... and which one should we use? (2) [ಙӬ 2012, p.286]
- If the learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If the accuracy is not enough, first pinpoint the cause.
  - Is the difference between the numbers of positive and negative examples large?
    - Treat the smaller class specially, e.g. give it a larger margin.
    - Reconstruct the training data so that the positive and negative sets are about the same size.
  - Too noisy?
    - Reconsider the data and the features.
  - Difficult to classify linearly?
    - Devise better features.
    - Use a non-linear classifier.