Disclaimer
This material gives a brief impression of how these algorithms work. It may contain wrong explanations and errors. Please refer to other materials for more detailed and reliable information.
What is “Machine Learning”?
“Field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)
Linear binary classification
- Given X, predict Y. The output Y is binary, e.g. +1 or -1.
- Example: X is an email, Y is whether it is spam or not.
1. Get features from the input.
2. Calculate the inner product of the feature vector and the weights.
3. If the result ≥ 0, output +1; otherwise output -1.
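To make these three steps concrete, here is a minimal Python sketch (not from the original slides; the feature values and weights are made up for illustration):

```python
import numpy as np

def predict(w, x):
    """Linear binary classifier: sign of the inner product of weights and features."""
    score = np.dot(w, x)             # step 2: inner product
    return 1 if score >= 0 else -1   # step 3: threshold at zero

# Hypothetical spam example: features could be counts of suspicious words.
x = np.array([3.0, 0.0, 1.0])        # step 1: feature vector extracted from an email
w = np.array([0.8, -0.2, 0.5])       # weights learned beforehand
print(predict(w, x))                 # -> +1 (classified as spam in this toy setup)
```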
Perceptron [Rosenblatt 1957]
- For every sample:
  - If the prediction is correct, do nothing.
  - If label = +1 and prediction = -1, add the feature vector to the weights.
  - If label = -1 and prediction = +1, subtract the feature vector from the weights.
- The two update rules sum up to one: w ← w + y * x.
- Hinge loss: loss(w, x, y) = max(0, -y w·x)
- Stochastic gradient descent: when the loss is positive, ∂loss/∂w = -y x, which gives exactly the update w ← w + y x.
- x: input vector, w: weight vector, y: correct label (+1 or -1).
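A minimal sketch of this update loop, treating it as SGD on the loss above (illustrative code, not from the slides; the toy data is made up):

```python
import numpy as np

def perceptron_train(samples, n_features, epochs=10):
    """Train a perceptron: w <- w + y*x whenever a sample is misclassified."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in samples:              # y is +1 or -1
            if y * np.dot(w, x) <= 0:     # loss max(0, -y*w.x) is active
                w += y * x                # gradient step: -d(loss)/dw = y*x
    return w

# Toy linearly separable data (made up for illustration).
samples = [(np.array([2.0, 1.0]), +1),
           (np.array([-1.0, -2.0]), -1),
           (np.array([1.5, 0.5]), +1)]
w = perceptron_train(samples, n_features=2)
print(w, [int(np.sign(np.dot(w, x))) for x, _ in samples])
```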
Margin maximizing
- Vapnik–Chervonenkis theory suggests that the larger the margin, the better the classifier can be expected to perform on unknown data.
- (Figure) The samples closest to the decision boundary are the support vectors; the distance between them and the boundary is the margin.
- SVM's loss function: max(0, λ - y w·x) + α ‖w‖² / 2
  - If the prediction is correct but the score is below λ, the sample still gets a penalty.
  - The margin is proportional to λ / ‖w‖, so as ‖w‖ becomes smaller the margin becomes bigger; the α ‖w‖² / 2 term pushes ‖w‖ down.
- x: input vector, w: weight vector, y: correct label (+1 or -1), λ and α: hyperparameters.
- For a detailed explanation, refer to other materials, e.g. [ݪ 2011, p.34-39] http://d.hatena.ne.jp/sleepy_yoshi/20110423/p1
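A rough sketch of minimizing this loss with per-sample subgradient steps (a simplified, Pegasos-style scheme; the learning rate, hyperparameters, and data are assumptions for illustration):

```python
import numpy as np

def svm_sgd(samples, n_features, lam=1.0, alpha=0.01, lr=0.1, epochs=100):
    """SGD on max(0, lam - y*w.x) + alpha*||w||^2/2, one subgradient step per sample."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in samples:
            if y * np.dot(w, x) < lam:   # inside the margin or misclassified: hinge term is active
                grad = -y * x + alpha * w
            else:                        # outside the margin: only the regularizer contributes
                grad = alpha * w
            w -= lr * grad
    return w

samples = [(np.array([2.0, 1.0]), +1),
           (np.array([-1.0, -2.0]), -1)]
w = svm_sgd(samples, n_features=2)
print(w, [y * np.dot(w, x) for x, y in samples])  # final scores y*w.x for each sample
```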
Soft-margin
- Sometimes the data cannot be linearly separated.
- → Soft-margin: permit violations of the margin.
- If a sample violates the margin, give it a penalty.
- Minimize the total penalty while maximizing the margin.
PA (Passive Aggressive Algorithm) [Crammer+ 2006]
- Passive: if the prediction is correct, do nothing.
- Aggressive: if the prediction is wrong, update the weights so the sample is classified correctly, while changing them as little as possible.
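A minimal sketch of the basic PA update, the closed-form step without the C parameter of PA-I/PA-II (illustrative code, not from the slides; lam defaults to 1 as in the standard formulation):

```python
import numpy as np

def pa_update(w, x, y, lam=1.0):
    """Passive-Aggressive update: smallest change to w such that y*w.x >= lam."""
    loss = max(0.0, lam - y * np.dot(w, x))
    if loss == 0.0:
        return w                     # passive: already correct with enough margin
    tau = loss / np.dot(x, x)        # aggressive: step size from the closed-form solution
    return w + tau * y * x
```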
PA vs. Perceptron & SVM
- After its update, PA always classifies the most recently seen sample correctly.
- Perceptron and SVM do not, because their update size is fixed.
- → PA seems more efficient, but it is more sensitive to noise than Perceptron and SVM.
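A small numeric check of this claim on one misclassified example (the numbers are made up for illustration):

```python
import numpy as np

# One misclassified example: y = +1 but the current score is strongly negative.
w = np.array([-3.0, 0.0])
x = np.array([1.0, 0.5])
y = +1

# Perceptron update (fixed-size step): may still misclassify the same example.
w_perc = w + y * x
print(y * np.dot(w_perc, x))   # -1.75: still wrong after the update

# PA update (step size chosen so that this example is fixed exactly).
tau = max(0.0, 1.0 - y * np.dot(w, x)) / np.dot(x, x)
w_pa = w + tau * y * x
print(y * np.dot(w_pa, x))     # 1.0: classified correctly with margin 1
```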
Expansion of PA
- PA-I, PA-II
- Confidence-Weighted Algorithm (CW) [Dredze+ 2008]
  - If a feature appeared frequently in the past, it is likely to be more reliable, hence it is updated less.
  - Fast convergence.
- Adaptive Regularization of Weight Vectors (AROW) [Crammer+ 2009]
  - More tolerant to noise than CW.
- Exact Soft Confidence-Weighted Learning (SCW) [Zhao+ 2012]
- ...
- Used for Gmail's “priority inbox”.
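To give a flavor of the confidence-weighted idea, here is a rough sketch of a diagonal AROW-style update; the diagonal covariance and the hyperparameter r = 1 are simplifying assumptions for illustration, not code from the slides or the papers:

```python
import numpy as np

def arow_diag_update(mu, sigma, x, y, r=1.0):
    """One diagonal AROW-style update: confident (low-variance) features move less."""
    margin = y * np.dot(mu, x)
    if margin >= 1.0:
        return mu, sigma                      # confident and correct: no update
    conf = np.dot(sigma * x, x)               # x^T Sigma x with diagonal Sigma
    beta = 1.0 / (conf + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * y * (sigma * x)         # low-variance features get smaller steps
    sigma = sigma - beta * (sigma * x) ** 2   # variance shrinks for the features just used
    return mu, sigma

# Toy usage (made-up data): start with mean 0 and unit variance per feature.
mu, sigma = np.zeros(2), np.ones(2)
mu, sigma = arow_diag_update(mu, sigma, np.array([1.0, 0.5]), +1)
print(mu, sigma)
```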
... and which one should we use? (1) [ಙӬ 2012, p.286]
1. Perceptron
  - Easiest to implement; use it as a baseline for the other algorithms.
2. SVM with FOBOS optimization
  - Almost the same implementation as the perceptron, but the results should improve.
3. Logistic regression
4. If that is not enough... (next slide)
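A rough sketch of what “SVM with FOBOS optimization” can look like: a hinge-loss subgradient step followed by FOBOS's proximal step, here with an L1 regularizer (the choice of L1 and all constants are assumptions for illustration, not from the slides):

```python
import numpy as np

def svm_fobos_step(w, x, y, lr=0.1, l1=0.01):
    """One FOBOS-style step: hinge-loss subgradient, then L1 soft-thresholding."""
    # Forward step: subgradient of the hinge loss max(0, 1 - y*w.x).
    if y * np.dot(w, x) < 1.0:
        w = w + lr * y * x
    # Backward (proximal) step: closed-form soft-thresholding for the L1 regularizer.
    return np.sign(w) * np.maximum(0.0, np.abs(w) - lr * l1)

# Toy usage: repeatedly feed two made-up samples.
w = np.zeros(2)
for x, y in [(np.array([2.0, 1.0]), +1), (np.array([-1.0, -2.0]), -1)] * 50:
    w = svm_fobos_step(w, x, y)
print(w)
```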
... and which one should we use? (2) [ಙӬ 2012, p.286]
- If the learning speed is not enough, try PA, CW, AROW, etc.
  - But be aware that they are sensitive to noise.
- If the accuracy is not enough, first pinpoint the cause.
  - Large imbalance between the numbers of positive and negative examples?
    - Give special treatment to the smaller class, e.g. a larger margin.
    - Or rebuild the training data so the positive and negative classes are about the same size.
  - Too noisy?
    - Reconsider the data and the features.
  - Difficult to classify linearly?
    - Devise better features.
    - Use a non-linear classifier.