
Perceptron, Support Vector Machine, and Passive Aggressive Algorithm.


Original presentation at Computational Linguistics Lab, Nara Institute of Science and Technology.

Slides from a talk given at the programming study group of the Computational Linguistics Laboratory, Nara Institute of Science and Technology (NAIST).

Sorami Hisamoto

May 14, 2013

Transcript

  1. Disclaimer: This material gives a brief impression of how these algorithms work.
     It may contain wrong explanations and errors; please refer to other materials for more detailed and reliable information.
  2. What is "Machine Learning"? "Field of study that gives computers the ability
     to learn without being explicitly programmed." Arthur Samuel, 1959.
  3. Types of Machine Learning Algorithms, by the property of the data:
     - Supervised Learning - Unsupervised Learning - Semi-supervised Learning - Reinforcement Learning - Active Learning - ...
  4. Types of Machine Learning Algorithms, by the property of the problem:
     - Binary Classification - Regression - Multi Class Classification - Sequential Labeling - Learning to Rank - ...
  5. Types of Machine Learning Algorithms, by the parameter optimization strategy:
     - Batch Learning - Online Learning
  6. Linear binary classification: given X, predict Y. The output Y is binary, e.g. +1 or -1
     (for example, X is an email and Y is whether it is spam or not). Prediction works as follows:
     1. Get features from the input. 2. Calculate the inner product of the feature vector and the weights. 3. If the result ≥ 0, output +1; otherwise output -1.
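     A minimal sketch of this prediction step in Python (the bag-of-words feature extraction and all names below are illustrative assumptions, not taken from the original slides):

        # Sketch of linear binary classification: features -> inner product -> sign.
        def extract_features(text):
            """Map an input string to a sparse feature vector {feature_name: value} (toy bag-of-words)."""
            features = {}
            for word in text.lower().split():
                features["word=" + word] = features.get("word=" + word, 0.0) + 1.0
            return features

        def predict(weights, features):
            """Return +1 if the inner product of weights and features is >= 0, else -1."""
            score = sum(weights.get(f, 0.0) * value for f, value in features.items())
            return +1 if score >= 0 else -1

        # Example: a hand-made weight vector that treats "viagra" as a spam signal.
        weights = {"word=viagra": 2.0, "word=meeting": -1.0}
        print(predict(weights, extract_features("Cheap viagra now")))    # +1 (spam)
        print(predict(weights, extract_features("Team meeting today")))  # -1 (not spam)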
  12. Perceptron [Rosenblatt 1957]. For every sample: if the prediction is correct, do nothing;
      if label = +1 and prediction = -1, add the feature vector to the weights; if label = -1 and prediction = +1, subtract the feature vector from the weights.
      These two update cases can be written as one: w ← w + y * x (applied only when the prediction is wrong).
      Loss function (hinge loss): loss(w, x, y) = max(0, -ywx). Optimization: stochastic gradient descent, with ∂loss/∂w = -yx when the loss is positive, 0 otherwise.
      (x: input vector, w: weight vector, y: correct label, +1 or -1.)
  15. Implementing a perceptron: loss(w, x, y) = max(0, -ywx),
      ∂loss/∂w = -yx when the loss is positive, 0 otherwise.
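      A minimal training loop for this update rule, as a sketch (the data format — a list of (features, label) pairs with sparse dict features — and the epoch count are assumptions of mine, not from the original slides):

        def dot(weights, features):
            """Inner product of a sparse weight vector and a sparse feature vector."""
            return sum(weights.get(f, 0.0) * value for f, value in features.items())

        def train_perceptron(data, epochs=10):
            """data: list of (features, label) pairs; features is a dict, label is +1 or -1."""
            weights = {}
            for _ in range(epochs):
                for features, label in data:
                    prediction = +1 if dot(weights, features) >= 0 else -1
                    if prediction != label:
                        # Mistake: w <- w + y * x (adds for label +1, subtracts for label -1).
                        for f, value in features.items():
                            weights[f] = weights.get(f, 0.0) + label * value
            return weights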
  16. SVM (Support Vector Machine) [Cortes & Vapnik 1995]: a perceptron, plus
      margin maximization (and, if needed, kernels).
  17. Which one looks better, and why? All three classify the data correctly,
      but the middle one seems the best.
  19. Margin maximization. "Vapnik–Chervonenkis theory" suggests that the larger the margin,
      the better the classifier will perform on unknown data. The training points closest to the decision boundary are the support vectors, and their distance from the boundary is the margin.
      SVM's loss function: loss(w, x, y) = max(0, λ - ywx) + α * ||w||^2 / 2, with margin(w) = λ / ||w||.
      If the prediction is correct but the score is smaller than λ, we still get a penalty; and as ||w|| becomes smaller, the margin λ / ||w|| becomes bigger.
      (x: input vector, w: weight vector, y: correct label, +1 or -1; λ and α: hyperparameters.)
      For a detailed explanation, refer to other materials, e.g. [Suhara 2011, pp. 34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
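      A tiny numeric sketch of this loss function (the numbers, λ = 1 and α = 0.01, are made up for illustration):

        def svm_loss(w, x, y, lam=1.0, alpha=0.01):
            """Hinge loss with margin lam, plus L2 regularization alpha * ||w||^2 / 2."""
            score = sum(wi * xi for wi, xi in zip(w, x))
            hinge = max(0.0, lam - y * score)             # penalty whenever y * score < lam
            reg = alpha * sum(wi * wi for wi in w) / 2.0  # keeps ||w|| small, i.e. the margin large
            return hinge + reg

        w = [0.3, 0.4]                      # ||w|| = 0.5, so margin(w) = lam / ||w|| = 2.0
        print(svm_loss(w, [1.0, 1.0], +1))  # score 0.7: correctly classified, but below lam, so still penalized
        print(svm_loss(w, [4.0, 2.0], +1))  # score 2.0: outside the margin, only the regularizer remains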
  27. Perceptron and SVM, side by side (x: input vector, w: weight vector, y: correct label, +1 or -1; λ and α: hyperparameters):
      Perceptron: loss(w, x, y) = max(0, -ywx);                      ∂loss/∂w = -yx when the loss is positive, 0 otherwise.
      SVM:        loss(w, x, y) = max(0, λ - ywx) + α * ||w||^2 / 2; ∂loss/∂w = -yx + αw when the hinge term is positive, αw otherwise.
  28. Implementing SVM: loss(w, x, y) = max(0, λ - ywx) + α * ||w||^2 / 2,
      ∂loss/∂w = -yx + αw when the hinge term is positive, αw otherwise.
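      A sketch of training this loss with stochastic gradient descent (the hyperparameter names lam, alpha, eta and the dense-list data format are my assumptions, not from the original slides):

        def train_svm_sgd(data, epochs=10, lam=1.0, alpha=0.01, eta=0.1):
            """data: list of (x, y) pairs; x is a list of floats, y is +1 or -1."""
            dim = len(data[0][0])
            w = [0.0] * dim
            for _ in range(epochs):
                for x, y in data:
                    score = sum(wi * xi for wi, xi in zip(w, x))
                    if y * score < lam:
                        # Inside the margin (or misclassified): subgradient is -y*x + alpha*w.
                        w = [wi - eta * (-y * xi + alpha * wi) for wi, xi in zip(w, x)]
                    else:
                        # Outside the margin: only the regularizer pulls w toward zero.
                        w = [wi - eta * alpha * wi for wi in w]
            return w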
  29. Soft margin. Sometimes the data cannot be separated linearly, so we use a soft margin:
      violations of the margin are permitted, but each violation is penalized, and we minimize the penalty while maximizing the margin.
  30. Passive Aggressive Algorithm (PA) [Crammer+ 2006].
      Passive: if the prediction is correct, do nothing.
      Aggressive: if the prediction is wrong, update the weights by the minimal change that classifies the current example correctly.
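      A minimal sketch of the closed-form PA update (the rule w ← w + τ * y * x with τ = loss / ||x||^2, capped at C for the PA-I variant, is from [Crammer+ 2006]; the function and variable names are mine):

        def pa_update(w, x, y, C=None):
            """One Passive Aggressive step: change w just enough that y * (w . x) >= 1.
            w, x: sparse dicts {feature: value}; y is +1 or -1; C (if given) caps the step as in PA-I."""
            score = sum(w.get(f, 0.0) * value for f, value in x.items())
            loss = max(0.0, 1.0 - y * score)
            sq_norm = sum(value * value for value in x.values())
            if loss == 0.0 or sq_norm == 0.0:
                return w                       # passive: already classified with enough margin
            tau = loss / sq_norm               # smallest step size that makes y * (w . x) reach 1
            if C is not None:
                tau = min(tau, C)              # PA-I: bounded step, more robust to noisy labels
            for f, value in x.items():
                w[f] = w.get(f, 0.0) + tau * y * value
            return w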
  33. PA vs. Perceptron and SVM: after its update, PA always classifies the last-seen example correctly;
      Perceptron and SVM do not, because their update size is constant. PA therefore seems more efficient, but it is more sensitive to noise than Perceptron and SVM.
  34. PA, or MIRA? MIRA (Margin Infused Relaxed Algorithm) [Crammer+ 2003] vs. PA (Passive Aggressive Algorithm) [Crammer+ 2006].
      From [Nakazawa 2009]: "... MIRA only handles linearly separable problems, and PA [2] is an extension of it. Moreover, most studies that claim to use MIRA actually use [2]."
      "... The original MIRA differs in that it minimizes the norm of the parameters after the update, rather than the size of the update itself."
      See also https://twitter.com/taku910/status/243760585030901761
  37. Extensions of PA:
      - PA-I, PA-II.
      - Confidence-Weighted learning (CW) [Dredze+ 2008]: if a feature has appeared frequently in the past, its weight is likely to be more reliable, so it is updated less; fast convergence.
      - Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]: more tolerant to noise than CW.
      - Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012].
      - ...
      Used for Gmail's "Priority Inbox".
  39. ... and which one should we use? (1) [Tokunaga 2012, p.286]
      1. Perceptron: easiest to implement; use it as a baseline for the other algorithms.
      2. SVM with FOBOS optimization: almost the same implementation as the perceptron, but the results should improve.
      3. Logistic regression. 4. If that is not enough... (next slide)
  43. ... and which one should we use? (2) [Tokunaga 2012, p.286]
      - If the learning speed is not enough, try PA, CW, AROW, etc., but be aware that they are sensitive to noise.
      - If the accuracy is not enough, first pinpoint the cause:
        - Large imbalance between the numbers of positive and negative examples: give special treatment to the smaller class (e.g. a larger margin), or rebuild the training data so the class sizes are about the same.
        - Too noisy: reconsider the data and the features.
        - Hard to separate linearly: devise better features, or use a non-linear classifier.
  48. Software packages:
      - OLL https://code.google.com/p/oll/wiki/OllMainJa - Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted.
      - LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/
      - AROW++ https://code.google.com/p/arowpp/
      - ...
  49. For further study: books.
      - "日本語入力を支える技術" (The Technology Behind Japanese Input Methods), Tokunaga, 2012. Chapter 5, Appendices 4 & 5.
      - "言語処理のための機械学習入門" (Introduction to Machine Learning for Natural Language Processing), Takamura, 2010. Chapter 4.
      - "わかりやすいパターン認識" (Easy-to-Understand Pattern Recognition), Ishii et al., 1998. Chapters 2 & 3.
      - "An Introduction to Support Vector Machines", Cristianini & Shawe-Taylor, 2000; and its Japanese translation, "サポートベクターマシン入門".
  50. For further study: slides and articles on the web.
      - "数式を一切使用しないSVMの話" (SVM explained without any equations), rti. http://prezi.com/9cozgxlearff/svmsvm/
      - "パーセプトロンアルゴリズム" (The Perceptron Algorithm), Graham Neubig. http://www.phontron.com/slides/nlp-programming-ja-03-perceptron.pdf
      - "パーセプトロンで楽しい仲間がぽぽぽぽーん" (a perceptron tutorial), Suhara, 2011. http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
      - "統計的機械学習入門 5. サポートベクターマシン" (Introduction to Statistical Machine Learning, 5: Support Vector Machines), Nakagawa. http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/SML1/kernel1.pdf
      - "MIRA (Margin Infused Relaxed Algorithm)", Nakazawa, 2009. http://nlp.ist.i.kyoto-u.ac.jp/member/nakazawa/pubdb/other/MIRA.pdf
  51. For further study: blog articles.
      - "A super-introduction to machine learning for text mining, night 2: the perceptron", AntiBayesian. http://d.hatena.ne.jp/AntiBayesian/20111125/1322202138
      - "Machine learning super-introduction II: master the PA method, also used in Gmail's Priority Inbox, in 30 minutes", EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110120/1295547335
      - "Machine learning super-introduction III: learn the basics of machine learning by building a perceptron in 30 minutes", EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110606/1307378609
      - "Machine learning super-introduction IV: you can build an SVM (support vector machine) in 30 minutes too", EchizenBlog-Zwei. http://d.hatena.ne.jp/echizen_tm/20110627/1309188711