Slide 1

Slide 1 text

Perceptron, Support Vector Machine, and Passive Aggressive Algorithm. Sorami Hisamoto 14 May 2013, PG

Slide 2

Slide 2 text

Disclaimer: This material gives a brief impression of how these algorithms work. It may contain wrong explanations and errors. Please refer to other materials for more detailed and reliable information.

Slide 3

Slide 3 text

What is “Machine Learning”? “Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)

Slide 4

Slide 4 text

Types of Machine Learning Algorithms (by the property of the data)
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
- Active Learning
- ...

Slide 5

Slide 5 text

Types of Machine Learning Algorithms (by the property of the problem)
- Binary Classification
- Regression
- Multi Class Classification
- Sequential Labeling
- Learning to Rank
- ...

Slide 6

Slide 6 text

Types of Machine Learning Algorithms (by the parameter optimization strategy)
- Batch Learning
- Online Learning

Slide 10

Slide 10 text

Linear binary classification
Given X, predict Y. Output Y is binary; e.g. +1 or -1.
e.g. X - email. Y - spam or not.
1. Get features from the input.
2. Calculate the inner product of the feature vector and the weights.
3. If the result ≥ 0, the output is +1; otherwise -1.
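
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bag-of-words features, the function names (`extract_features`, `score`, `predict`), and the hand-picked weights are hypothetical and do not come from the slides.

```python
from collections import Counter


def extract_features(text):
    """Step 1: bag-of-words features, token -> count (an assumed feature choice)."""
    return Counter(text.lower().split())


def score(weights, features):
    """Step 2: inner product of the sparse feature vector and the weights."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def predict(weights, features):
    """Step 3: +1 if the score is >= 0, otherwise -1."""
    return 1 if score(weights, features) >= 0 else -1


# Hand-picked weights for illustration only; in practice they are learned.
weights = {"free": 1.5, "winner": 2.0, "meeting": -1.0, "report": -1.5}

spam_label = predict(weights, extract_features("Free prize for the winner"))
ham_label = predict(weights, extract_features("Meeting report attached"))
```

With these made-up weights, words such as "free" push the score up (+1, spam) and words such as "meeting" push it down (-1, not spam). Words without a weight contribute nothing.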

Slide 16

Slide 16 text

Implementing a linear binary classifier
How do we learn the weights?

Slide 17

Slide 17 text

- Perceptron
- Support Vector Machine
- Passive Aggressive Algorithm

Slide 21

Slide 21 text

Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label=+1 and prediction=-1, add the feature vector to the weights.
  - If label=-1 and prediction=+1, subtract the feature vector from the weights.
- Summing up these 2 procedures as 1: on a mistake, w ← w + y * x.
Hinge loss (with zero margin): loss(w, x, y) = max(0, -y w·x)
Stochastic gradient descent method: ∂loss / ∂w = -yx if y w·x < 0 (misclassified), otherwise 0. Stepping against this gradient gives exactly the update w ← w + y * x.
x: input vector, w: weight vector, y: correct label (+1 or -1)

Slide 22

Slide 22 text

Learning hyperplane: Illustrated
(Figure; its source was not preserved.)

Slide 27

Slide 27 text

Implementing a perceptron
loss(w, x, y) = max(0, -y w·x)
∂loss(w, x, y) / ∂w = -yx (when the sample is misclassified; 0 otherwise)
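
The code shown on the original slide did not survive, so the following is a minimal sketch of a perceptron trainer rather than the author's implementation. It assumes dense feature vectors stored as Python lists with a constant bias feature in position 0; the names `dot`, `predict`, `train_perceptron`, and the toy data are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x):
    """+1 if the score is >= 0, otherwise -1."""
    return 1 if dot(w, x) >= 0 else -1


def train_perceptron(samples, epochs=10):
    """samples: list of (x, y) pairs with y in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            if predict(w, x) != y:
                # Mistake: step against the gradient -y*x, i.e. w <- w + y*x.
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


# Toy, linearly separable data; the first component is a constant bias feature.
samples = [
    ([1.0, 2.0, 2.0], 1),
    ([1.0, 1.0, 3.0], 1),
    ([1.0, -1.0, -2.0], -1),
    ([1.0, -2.0, -1.0], -1),
]
w = train_perceptron(samples)
```

On this toy data a single update is enough: the third sample is misclassified by the zero vector, and subtracting it from the weights separates all four points.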

Slide 28

Slide 28 text

- Perceptron
- Support Vector Machine
- Passive Aggressive Algorithm

Slide 31

Slide 31 text

Support Vector Machine (SVM) [Cortes & Vapnik 1995]
- Perceptron, plus ...
  - Margin maximizing.
  - (Kernel).

Slide 32

Slide 32 text

Which one looks better, and why?
All 3 classify correctly, but the middle one seems the best.

Slide 41

Slide 41 text

Margin maximizing
(Figure labels: support vector, margin.)
“Vapnik–Chervonenkis theory” motivates this: its generalization bounds suggest that a larger margin tends to give better classification performance on unknown data, although this is a bound and not a guarantee of the best possible performance.
SVM’s loss function: max(0, λ - y w·x) + α * ‖w‖^2 / 2
- If the prediction is correct BUT the score y w·x < λ, the sample still gets a penalty.
- margin(w) = λ / ‖w‖, so as ‖w‖^2 becomes smaller, the margin becomes bigger.
x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.
For a detailed explanation, refer to other materials; e.g. [Suhara 2011, pp. 34-39]
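
A short worked derivation may help connect the size of w to the margin. It uses the standard point-to-hyperplane distance for a hyperplane through the origin and is an added illustration, not part of the original slides.

```latex
% Distance from a point x to the hyperplane w \cdot x = 0:
d(x) = \frac{\lvert w \cdot x \rvert}{\lVert w \rVert}

% A sample that sits exactly on the margin boundary satisfies y\,(w \cdot x) = \lambda, so
\mathrm{margin}(w) = \frac{\lambda}{\lVert w \rVert}

% For a fixed \lambda, enlarging the margin is therefore the same as shrinking the norm:
\max_{w}\ \mathrm{margin}(w)
\;\Longleftrightarrow\; \min_{w}\ \lVert w \rVert
\;\Longleftrightarrow\; \min_{w}\ \tfrac{1}{2}\lVert w \rVert^{2}
```

The squared norm is preferred in the loss only because it is differentiable everywhere and has the simple gradient w; it has the same minimizer as the norm itself.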

Slide 42

Slide 42 text

Perceptron and SVM
           | loss(w, x, y)                      | ∂loss(w, x, y) / ∂w
Perceptron | max(0, -y w·x)                     | -yx if y w·x < 0, else 0
SVM        | max(0, λ - y w·x) + α * ‖w‖^2 / 2  | -yx + αw if y w·x < λ, else αw
x: input vector, w: weight vector, y: correct label (+1 or -1), λ & α: hyperparameters.

Slide 47

Slide 47 text

Implementing SVM
loss(w, x, y) = max(0, λ - y w·x) + α * ‖w‖^2 / 2
∂loss(w, x, y) / ∂w = -yx + αw (when y w·x < λ; otherwise only αw)
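
The code from the original slide is missing, so here is a hedged sketch of a linear SVM trained with plain stochastic gradient descent on the loss above. It is not the FOBOS variant mentioned later in the deck. The fixed learning rate `eta`, the default hyperparameter values, the function names, and the toy data are all assumptions made for illustration.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def svm_sgd_step(w, x, y, lam=1.0, alpha=0.01, eta=0.1):
    """One SGD step on max(0, lam - y*w.x) + alpha*|w|^2/2."""
    violated = y * dot(w, x) < lam  # inside the margin or misclassified
    grad = [alpha * wi - (y * xi if violated else 0.0) for wi, xi in zip(w, x)]
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def train_svm(samples, epochs=100, **hyper):
    """samples: list of (x, y) pairs with y in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            w = svm_sgd_step(w, x, y, **hyper)
    return w


# Toy, linearly separable data; the first component is a constant bias feature.
samples = [
    ([1.0, 2.0, 2.0], 1),
    ([1.0, 1.0, 3.0], 1),
    ([1.0, -1.0, -2.0], -1),
    ([1.0, -2.0, -1.0], -1),
]
w = train_svm(samples)
```

Compared with the perceptron, the only differences are the margin test (`< lam` instead of a sign check) and the `alpha * wi` shrinkage applied on every step, which matches the remark that the two implementations are almost the same.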

Slide 48

Slide 48 text

Soft-margin
- Sometimes it is impossible to linearly separate the data.
- → Soft-margin
  - Permit violations of the margin.
  - If the margin is negative, give a penalty.
  - Minimize the penalty and maximize the margin.

Slide 49

Slide 49 text

- Perceptron
- Support Vector Machine
- Passive Aggressive Algorithm

Slide 53

Slide 53 text

Passive Aggressive Algorithm (PA) [Crammer+ 2006]
- Passive: If the prediction is correct, do nothing.
- Aggressive: If the prediction is wrong, minimally update the weights so that the sample is classified correctly.
The update rule does two things: it minimally changes the weights ... and it correctly classifies the sample.
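
The formula on the original slide was lost in extraction. As a reference point, the update can be written as follows, using the formulation of the cited [Crammer+ 2006] paper. The symbols τ and C and the margin threshold of 1 are that paper's notation, not the slides'. Note that in this formulation the aggressive step is also triggered by a correct prediction whose margin is below 1.

```latex
% Hinge loss with unit margin on the current sample (x_t, y_t):
\ell_t = \max\bigl(0,\ 1 - y_t\,(w_t \cdot x_t)\bigr)

% "Minimally change ... and correctly classify":
w_{t+1} = \operatorname*{arg\,min}_{w}\ \tfrac{1}{2}\lVert w - w_t \rVert^{2}
\quad \text{subject to} \quad \max\bigl(0,\ 1 - y_t\,(w \cdot x_t)\bigr) = 0

% Closed-form solution:
w_{t+1} = w_t + \tau_t\, y_t\, x_t,
\qquad \tau_t = \frac{\ell_t}{\lVert x_t \rVert^{2}}

% Soft variants with aggressiveness parameter C:
\text{PA-I: } \tau_t = \min\!\Bigl(C,\ \frac{\ell_t}{\lVert x_t \rVert^{2}}\Bigr)
\qquad
\text{PA-II: } \tau_t = \frac{\ell_t}{\lVert x_t \rVert^{2} + \frac{1}{2C}}
```

When the loss is already zero, τ is zero and the weights stay put, which is the passive case.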

Slide 56

Slide 56 text

Passive & Aggressive: Illustrated
(Figure; its source was not preserved.)
- Passive: Do nothing.
- Aggressive: Move minimally to classify correctly.

Slide 57

Slide 57 text

Implementing PA
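
The slide's code is missing, so this is a minimal sketch of a single PA update, assuming the unit-margin hinge loss and closed-form step size from [Crammer+ 2006]. The function names and the optional `C` argument (which switches to the PA-I variant) are illustrative choices, not the author's code.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=None):
    """One Passive Aggressive update; pass C to cap the step size (PA-I)."""
    loss = max(0.0, 1.0 - y * dot(w, x))
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return w  # passive: margin already satisfied (or x is the zero vector)
    tau = loss / norm_sq  # smallest step that brings the loss to zero
    if C is not None:
        tau = min(C, tau)  # PA-I: limit how aggressive a single update can be
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = pa_update([0.0, 0.0], [1.0, 2.0], -1)
margin_after = -1 * dot(w, [1.0, 2.0])
```

After the uncapped update the sample sits exactly on the margin (`margin_after` is 1 up to floating-point rounding), which is the sense in which PA always classifies the last-seen example correctly. Capping the step with `C` gives up that guarantee in exchange for less sensitivity to a single noisy label.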

Slide 58

Slide 58 text

PA vs. Perceptron & SVM
- PA always classifies the most recently seen example correctly.
- Perceptron & SVM do not guarantee this, as their update size is constant.
- → PA seems more efficient, but it is less robust to noise than Perceptron & SVM.

Slide 60

Slide 60 text

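PA, or MIRA?
- MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003]
- PA (Passive Aggressive Algorithm) [Crammer+ 2006]
“... MIRA only handles linearly separable problems, and PA [2] was developed as its extension. Also, most studies that claim to use MIRA actually use [2].”
“... The original MIRA differs in that it minimizes the norm of the parameters after the update, rather than the size of the model update.”
[Nakazawa 2009] (quotes translated from Japanese)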

Slide 63

Slide 63 text

Expansion of PA
- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, its weight is likely to be more reliable (hence it is updated less).
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Wang+ 2012]
- ...
PA is used for Gmail’s “priority inbox”.

Slide 64

Slide 64 text

- Perceptron
- Support Vector Machine
- Passive Aggressive Algorithm

Slide 65

Slide 65 text

... and which one should we use? (1) [Tokunaga 2012, p.286]
1. Perceptron
   - Easiest to implement; serves as a baseline for other algorithms.
2. SVM with FOBOS optimization
   - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough... (next slide)

Slide 69

Slide 69 text

... and which one should we use? (2) [Tokunaga 2012, p.286]
- If learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If accuracy is not enough, first pinpoint the cause.
  - The numbers of positive and negative examples differ greatly.
    - Give special treatment to the smaller class, e.g. give it a larger margin.
    - Reconstruct the training data so that the positive and negative classes are about the same size.
  - Too noisy.
    - Reconsider the data and features.
  - Difficult to classify linearly?
    - Devise better features.
    - Use a non-linear classifier.

Slide 74

Slide 74 text

Software packages
- OLL
  - Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted.
- LIBSVM
- AROW++
- ...

Slide 75

Slide 75 text

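For further study: Books
- “日本語入力を支える技術” (The Technology Behind Japanese Input), Hiroyuki Tokunaga (徳永拓之), 2012. Chapter 5, Appendix 4 & 5.
- “言語処理のための機械学習入門” (Introduction to Machine Learning for Language Processing), Hiroya Takamura (高村大也), 2010. Chapter 4.
- “わかりやすいパターン認識” (Pattern Recognition Made Easy), Kenichiro Ishii (石井健一郎)+, 1998. Chapter 2 & 3.
- “An Introduction to Support Vector Machines”, Cristianini & Shawe-Taylor, 2000.
- “サポートベクターマシン入門” (Introduction to Support Vector Machines), Cristianini & Shawe-Taylor. Japanese edition, known as the “red book”.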

Slide 76

Slide 76 text

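For further study: on Web - slides and articles
- “数式を一切使用しないSVMの話” (SVM Explained Without Any Equations), rti.
- “パーセプトロンアルゴリズム” (The Perceptron Algorithm), Graham Neubig.
- “パーセプトロンで楽しい仲間がぽぽぽぽ～ん” (slides on the perceptron), Yoshihiko Suhara (数原良彦), 2011.
- “統計的機械学習入門 5. サポートベクターマシン” (Introduction to Statistical Machine Learning, 5. Support Vector Machines), Hiroshi Nakagawa (中川裕志).
- “MIRA (Margin Infused Relaxed Algorithm)”, Toshiaki Nakazawa (中澤敏明), 2009.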

Slide 77

Slide 77 text

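For further study: on Web - blog articles
- “テキストマイニングのための機械学習超入門 二夜目 パーセプトロン” (Super-Introduction to Machine Learning for Text Mining, Night 2: Perceptron), あんちべ！
- “機械学習超入門II ～Gmailの優先トレイでも使っているPA法を30分で習得しよう！～” (Super-Introduction to Machine Learning II: Learn in 30 Minutes the PA Method Used in Gmail's Priority Inbox), EchizenBlog-Zwei
- “機械学習超入門III ～機械学習の基礎、パーセプトロンを30分で作って学ぶ～” (Super-Introduction to Machine Learning III: Learn the Perceptron, the Basis of Machine Learning, by Building It in 30 Minutes), EchizenBlog-Zwei
- “機械学習超入門IV ～SVM(サポートベクターマシン)だって30分で作れちゃう☆～” (Super-Introduction to Machine Learning IV: Even an SVM Can Be Built in 30 Minutes), EchizenBlog-Zwei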