University Ecological and Evolutionary Signal Processing & Informatics Lab
Department of Electrical & Computer Engineering
Philadelphia, PA, USA
gregory.ditzler@gmail.com
http://github.com/gditzler/eces436-week1
April 1, 2014
Image Rights: Many of the images used in this presentation are from Christopher M. Bishop's "Pattern Recognition and Machine Learning" (2006) textbook. http://research.microsoft.com/en-us/um/people/cmbishop/prml/index.htm
- What is the next word (i.e., $w(t+1)$)?
- What is the probability distribution over the next word (i.e., $P(w(t+1) \mid w(t), h(t))$)?
- I love --?
- Can you pick up milk at the --?
- feature: a variable that carries information about the task; example, a cholesterol level.
- feature vector: a collection of variables, or features, $\mathbf{x} = [x_1, \ldots, x_M]^{\mathsf{T}}$; example, a collection of medical tests for a patient.
- feature space: the $M$-dimensional vector space where the vectors $\mathbf{x}$ lie; example, $\mathbf{x} \in \mathbb{R}_+^M$.
- class: a category/value assigned to a feature vector; in general we can refer to this as the target variable ($t$); example, $t = \text{cancer}$ or $t = 10.2\,^{\circ}\mathrm{C}$.
- pattern: a collection of features of an object under consideration, along with the correct class information for that object; defined by $\{\mathbf{x}_n, t_n\}$.
- training data: data used during training of a classifier, for which the correct labels are known a priori.
- testing/validation data: data not used during training, but set aside to estimate the true (generalization) performance of a classifier; correct labels are also known a priori. A minimal split is sketched below.
- cost function: a quantitative measure that represents the cost of making an error; a model is produced to minimize this function. Is zero error always a good thing?
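To make the training/testing split concrete, here is a minimal Python sketch; scikit-learn and the synthetic data are my assumptions, not part of the original slides.

```python
# Hypothetical example: splitting patterns {x_n, t_n} into training and
# testing sets. The data are synthetic, chosen only for illustration.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 patterns, M = 5 features
t = (X[:, 0] + X[:, 1] > 0).astype(int)  # assumed target variable

# The held-out set is never touched during training; error on it
# estimates the true (generalization) performance.
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)       # (140, 5) (60, 5)
```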
A learning model adapts its parameters, or weights, to find the mapping from the feature space to the outcome (class) space, $f : \mathcal{X} \to \mathcal{T}$. For example:
- $y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\mathbf{x} + b$
- $y(\mathbf{x}) = \sigma(\mathbf{W}^{\mathsf{T}}\mathbf{x} + \mathbf{b})$, where $\sigma$ is a soft-max
- $y(\mathbf{x}) = \sigma(\mathbf{Q}^{\mathsf{T}}\nu(\mathbf{W}^{\mathsf{T}}\mathbf{x} + \mathbf{b}) + \mathbf{q})$, where $\sigma$ is a soft-max and $\nu$ is a sigmoid
We need to optimize the parameters $\mathbf{Q}$, $\mathbf{W}$, $\mathbf{w}$, $\mathbf{b}$, $\mathbf{q}$, and/or $b$ to minimize a cost; each family is sketched in code below.
model: a simplified mathematical/statistical construct that mimics (acts like) the underlying physical phenomenon that generated the original data.
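As a sketch (not from the slides), the three model families can be written as NumPy forward passes; the sizes $M$, $H$, $K$ and the random weights are placeholders, not trained parameters.

```python
# Illustrative forward passes for the three models above.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M, H, K = 4, 8, 3                  # features, hidden units, classes (assumed)
rng = np.random.default_rng(1)
x = rng.normal(size=M)

# y(x) = w^T x + b
w, b = rng.normal(size=M), 0.1
y_linear = w @ x + b

# y(x) = sigma(W^T x + b), sigma a soft-max
W, b_vec = rng.normal(size=(M, K)), np.zeros(K)
y_softmax = softmax(W.T @ x + b_vec)

# y(x) = sigma(Q^T nu(W^T x + b) + q), nu a sigmoid
W1, b1 = rng.normal(size=(M, H)), np.zeros(H)
Q, q = rng.normal(size=(H, K)), np.zeros(K)
y_net = softmax(Q.T @ sigmoid(W1.T @ x + b1) + q)
print(y_linear, y_softmax, y_net)
```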
No model is immune from overfitting; however, we can try to avoid overfitting by taking certain precautions. Using a Bayesian approach can avoid overfitting even when the number of parameters exceeds the number of data points available for training. Regularization is the most commonly used approach to control overfitting: we add a penalty to the error function that discourages the solution vector from taking on large values. Yup, it's that simple! (for the most part)
- 2-norm penalty: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left(y(\mathbf{x}_n, \mathbf{w}) - t_n\right)^2 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$
- 1-norm penalty: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left(y(\mathbf{x}_n, \mathbf{w}) - t_n\right)^2 + \lambda_1\|\mathbf{w}\|_1$
- 1 & 2-norm penalty: $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left(y(\mathbf{x}_n, \mathbf{w}) - t_n\right)^2 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1$
All three penalties are sketched in code below.
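A hedged sketch of the three penalties using scikit-learn's linear models; the data and the penalty strengths (scikit-learn calls $\lambda$ "alpha") are arbitrary choices for illustration.

```python
# 2-norm (Ridge), 1-norm (Lasso), and combined (ElasticNet) penalties.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 20))      # few samples, many features
t = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

ridge = Ridge(alpha=1.0).fit(X, t)                    # lambda_2 penalty
lasso = Lasso(alpha=0.1).fit(X, t)                    # lambda_1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, t)  # both penalties

# The 1-norm penalty drives many weights exactly to zero.
print("nonzero weights:",
      (ridge.coef_ != 0).sum(), (lasso.coef_ != 0).sum(),
      (enet.coef_ != 0).sum())
```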
[Figure: estimation for 1-norm (left) and 2-norm (right) regularization on $\mathbf{w}$, showing the contours of the error function and the regularization constraints $\|\mathbf{w}\|_1 \le \tau$ and $\|\mathbf{w}\|_2^2 \le \tau^2$.]
[Figure from Bishop (2006): two panels, $N = 15$ and $N = 100$. The green line is the target function, the red curve is the result of a 9th-order polynomial minimizing $E_{\mathrm{RMS}}$, and the blue points are observations sampled from the target function.]
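A small NumPy sketch that reproduces the flavor of this experiment; the target $\sin(2\pi x)$ and the noise level are assumptions consistent with Bishop's example, not values taken from the slide.

```python
# Fit a 9th-order polynomial to N noisy samples of sin(2*pi*x) and measure
# how far the fit strays from the target function.
import numpy as np

def rms_vs_target(N, degree=9, seed=3):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=N)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)  # noisy samples
    coeffs = np.polyfit(x, t, degree)     # least-squares polynomial fit
    grid = np.linspace(0.0, 1.0, 200)
    err = np.polyval(coeffs, grid) - np.sin(2 * np.pi * grid)
    return np.sqrt(np.mean(err ** 2))

# With N = 15 the 9th-order fit oscillates wildly; N = 100 tames it.
print("RMS deviation, N = 15: ", rms_vs_target(15))
print("RMS deviation, N = 100:", rms_vs_target(100))
```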
Probability theory gives us a way to deal with uncertainty, which arises from noise in the data and finite sample sizes. Three things in life are certain: (1) death, (2) taxes, and (3) noise in your data!
Some definitions:
- Evidence: the probability of making such an observation.
- Prior: our degree of belief that the event is plausible in the first place.
- Likelihood: the likelihood of making an observation, under the condition that the event has occurred.
Let us define some notation. Let $X$ and $Y$ be random variables; for example, $X$ is a collection of medical measurements and $Y$ is the healthy/unhealthy status. Recall that there are three axioms of probability that must hold: $P(\mathcal{E}) = 1$ for the sample space $\mathcal{E}$; $P(E) \ge 0$ for all events $E \in \mathcal{E}$; and $P\!\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$ where the events $E_i$ are mutually exclusive (i.e., $E_i \cap E_j = \emptyset$ for all $i \ne j$). Also, if $X$ and $Y$ are independent, then $P(X, Y) = P(X)P(Y)$.
Sum Rule: The marginal distribution of a single random variable can be computed by integrating (or summing) out the other random variables in the joint distribution.
$$P(X) = \sum_{Y \in \mathcal{Y}} P(X, Y)$$
Product Rule: A joint probability can be written as the product of a conditional and a marginal probability.
$$P(X, Y) = P(Y)P(X|Y) = P(X)P(Y|X)$$
Bayes Rule: A simple manipulation of the product rule gives rise to the Bayes rule.
$$P(Y|X) = \frac{P(Y)P(X|Y)}{P(X)} = \frac{P(Y)P(X|Y)}{\sum_{Y' \in \mathcal{Y}} P(X, Y')} = \frac{P(Y)P(X|Y)}{\sum_{Y' \in \mathcal{Y}} P(Y')P(X|Y')}$$
Manipulation of the product and sum rules gave us the Bayes rule.
Posterior, $P(Y|X)$: the probability of $Y$ given that I have observed $X$. Example: the probability that a patient has cancer given that their medical measurements are in $X$.
$$\underbrace{P(Y|X)}_{\text{posterior}} = \frac{\overbrace{P(Y)}^{\text{prior}}\ \overbrace{P(X|Y)}^{\text{likelihood}}}{\underbrace{P(X)}_{\text{evidence}}}$$
Decision Making: choosing the outcome with the highest posterior probability is the decision that results in the smallest probability of error.
$$\omega = \arg\max_{Y \in \mathcal{Y}} \frac{P(Y)P(X|Y)}{P(X)} = \arg\max_{Y \in \mathcal{Y}} P(Y)P(X|Y)$$
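A toy numeric check of the rule above; the prior and likelihood values are made up for a two-class healthy/unhealthy example.

```python
# Hypothetical numbers: P(Y) and P(X|Y) for a positive medical test X.
prior = {"healthy": 0.95, "unhealthy": 0.05}
likelihood = {"healthy": 0.10, "unhealthy": 0.90}

# Evidence via the sum rule: P(X) = sum_Y P(Y) P(X|Y)
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Posterior via the Bayes rule, then the minimum-error decision.
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}
print(posterior)   # ~{'healthy': 0.679, 'unhealthy': 0.321}
print("decision:", max(posterior, key=posterior.get))
```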
Naïve Bayes: Computing the likelihood function, $P(\mathbf{x}|\omega)$, can be an extremely daunting task, and sometimes infeasible if we do not have enough data. One reason it is difficult is that we are computing the joint likelihood function (i.e., $P(x_1, x_2, \ldots, x_M|\omega)$). Solution: assume all feature variables are independent! The posterior becomes:
$$P(\omega|\mathbf{x}) \propto P(\omega) \prod_{i=1}^{M} P(x_i|\omega)$$
- modeling each feature independently is much easier
- is the assumption wrong? probably!
- it works well in practice; see the sketch below
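As a sketch, scikit-learn's GaussianNB implements this idea with the added assumption (mine, not the slide's) that each $P(x_i|\omega)$ is Gaussian; the synthetic data are placeholders.

```python
# Naive Bayes: each feature is modeled independently given the class.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, t = make_classification(n_samples=300, n_features=10, random_state=4)
X_tr, X_te, t_tr, t_te = train_test_split(X, t, random_state=4)

nb = GaussianNB().fit(X_tr, t_tr)   # estimates P(omega) and P(x_i|omega)
print("test accuracy:", nb.score(X_te, t_te))
```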
The Bayes decision rule gives a decision with the smallest probability of error! So does that mean we are done? Some things to think about:
- What if we do not know the form of $P(X|Y)$?
- What if we incorrectly assume the form of the distribution?
- What if we have a small sample size? Can we be confident in $P(Y)$?
- More features generally means we need more data to "accurately" estimate any of the model's parameters.
CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993) are two of the more popular methods for generating decision trees. Decision trees provide a natural setting for handling data containing categorical variables, but they can still use continuous variables.
Pros & Cons
- Con: decision trees are unstable classifiers; a small change in the input can produce a large change in the output.
- Con: prone to overfitting.
- Pro: easy to interpret!
[Figure: illustration of partitioning the feature space $\mathbf{x} = [x_1, x_2]^{\mathsf{T}}$ into five regions A through E (left) using thresholds $\theta_1, \ldots, \theta_4$; the partitioning is performed by the binary tree on the right, whose internal nodes test conditions such as $x_1 > \theta_1$ and $x_2 > \theta_3$.]
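A brief sketch of the same idea with scikit-learn's CART-style trees; the data, depth limit, and feature names are assumptions for illustration. The depth limit is one precaution against the overfitting noted above.

```python
# Learn an axis-aligned partition of a 2-D feature space and print the
# resulting binary tree of threshold tests x_i > theta.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, t = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=5)
tree = DecisionTreeClassifier(max_depth=3, random_state=5).fit(X, t)
print(export_text(tree, feature_names=["x1", "x2"]))
```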
Support vector machines (SVMs) are binary classification models that maximize the margin between the two classes. Let $t_n \in \{\pm 1\}$. The determination of the solution parameters of the SVM is found via a convex optimization problem; for a convex function, a local solution is also a global solution. Kernel methods can be used to solve non-linear problems with the SVM. The output of the SVM is given by $y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}) + b$, where $\phi(\cdot)$ is a non-linear feature transform.
The output of the SVM is given by $y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}) + b$ for some nonlinear function $\phi$. For the moment, assume that the data are linearly separable. The decision function can be interpreted as $y(\mathbf{x}_n) > 0$ if $t_n = 1$ and $y(\mathbf{x}_n) < 0$ if $t_n = -1$ for all of the training instances. For correct predictions we have $t_n y(\mathbf{x}_n) > 0$, and for incorrect predictions $t_n y(\mathbf{x}_n) < 0$. For linearly separable data there exist infinitely many solutions; however, the one we seek maximizes the margin.
[Figure from Bishop (2006): the decision boundary $y = 0$ with the margin boundaries $y = 1$ and $y = -1$.]
The perpendicular distance of a point $\mathbf{x}$ from a hyperplane defined by $y(\mathbf{x}) = 0$ is given by $r = |y(\mathbf{x})|/\|\mathbf{w}\|$. For correctly classified instances we have
$$\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n\left(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n) + b\right)}{\|\mathbf{w}\|} \;\longrightarrow\; r^{*} = \frac{1}{\|\mathbf{w}\|}$$
where the scaling of $\mathbf{w}$ and $b$ is fixed so that the closest point satisfies $t_n(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n) + b) = 1$. The margin is given by the perpendicular distance to the closest point $\mathbf{x}_n$ in the data set, so the margin is $2/\|\mathbf{w}\|$. We want to maximize $1/\|\mathbf{w}\|$ subject to $t_n(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n) + b) \ge 1$. We can cast this maximization problem as a minimization, given by:
$$\min_{\mathbf{w}, b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad t_n\left(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n) + b\right) \ge 1$$
This can be solved using Lagrange multipliers ($\alpha_n \ge 0$):
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} \alpha_n \left( t_n\left(\mathbf{w}^{\mathsf{T}}\phi(\mathbf{x}_n) + b\right) - 1 \right)$$
Setting the derivatives of $L(\mathbf{w}, b, \boldsymbol{\alpha})$ w.r.t. $\mathbf{w}$ and $b$ to zero yields, respectively:
$$\mathbf{w} = \sum_{n=1}^{N} \alpha_n t_n \phi(\mathbf{x}_n), \qquad \sum_{n=1}^{N} \alpha_n t_n = 0$$
Substituting these results into $L(\mathbf{w}, b, \boldsymbol{\alpha})$ yields the dual:
$$\tilde{L}(\boldsymbol{\alpha}) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \alpha_n \alpha_m t_n t_m \phi(\mathbf{x}_n)^{\mathsf{T}}\phi(\mathbf{x}_m)$$
subject to $\alpha_n \ge 0$ and $\sum_{n=1}^{N} \alpha_n t_n = 0$. The optimization problem shown above is a quadratic program and can be solved relatively easily. We can handle the more realistic non-linearly separable case by introducing slack terms $\xi_n$ on the margin. The optimization problem is nearly identical to the one above; however, the constraint on $\alpha_n$ becomes $0 \le \alpha_n \le C$ (for $C > 0$). Any $\mathbf{x}_n$ corresponding to $\alpha_n > 0$ is referred to as a support vector.
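A hedged scikit-learn sketch of the soft-margin, kernelized SVM: the parameter C plays the role of the box constraint $0 \le \alpha_n \le C$, and the RBF kernel stands in for the inner product $\phi(\mathbf{x}_n)^{\mathsf{T}}\phi(\mathbf{x}_m)$. The dataset and parameter values are illustrative assumptions.

```python
# Soft-margin SVM with an RBF kernel on a nonlinearly separable toy set.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, t = make_moons(n_samples=200, noise=0.2, random_state=6)
svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, t)

# The training points x_n with alpha_n > 0 are the support vectors.
print("support vectors:", len(svm.support_))
print("training accuracy:", svm.score(X, t))
```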
AdaBoost achieves its power through the strategic learning and combination of weak classifiers.
- the training error decreases exponentially with the addition of new classifiers
- the classifiers used in AdaBoost are weak and very easy to generate
- AdaBoost is quite possibly the most successful ensemble algorithm
It is relatively easy to implement in practice and quite robust to overfitting. Experiments have shown that AdaBoost generalizes very well, even when trained to the same training-error level as other algorithms.
Input: training data $(\mathbf{x}_1, t_1), \ldots, (\mathbf{x}_N, t_N)$ with $t_j \in \{\pm 1\}$; distribution $D_1(j) = 1/N$ for all $j \in [N]$; iteration variable $m = 1$.
1. Find a weak hypothesis ("rule of thumb") $h_m = \arg\min_{h \in \mathcal{H}} P_{\mathbf{x} \sim D_m}\!\left(h(\mathbf{x}) \ne t\right)$, where $\epsilon_m$ is $h_m$'s error on the distribution $D_m$.
2. Choose $\alpha_m = \frac{1}{2}\log\frac{1 - \epsilon_m}{\epsilon_m}$.
3. Update the distribution: $D_{m+1}(j) = \frac{D_m(j)}{Z_m}\exp\left(-\alpha_m t_j h_m(\mathbf{x}_j)\right)$, where $Z_m$ is a normalization constant.
4. Set $m = m + 1$ and repeat from step 1 until $m > T$.
Output a final hypothesis: $H_{\text{final}}(\mathbf{x}) = \mathrm{sign}\left(\sum_{m=1}^{T} \alpha_m h_m(\mathbf{x})\right)$.
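The loop above translates almost line for line into NumPy. This sketch uses one-node decision stumps as the weak hypothesis class $\mathcal{H}$ (my choice; the slide leaves $\mathcal{H}$ abstract), with a brute-force stump search for clarity rather than speed.

```python
import numpy as np

def best_stump(X, t, D):
    """Step 1: pick the threshold test minimizing weighted error under D."""
    best, best_err = None, np.inf
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i]):
            for sign in (1, -1):
                h = sign * np.where(X[:, i] > theta, 1, -1)
                err = D[h != t].sum()
                if err < best_err:
                    best, best_err = (i, theta, sign), err
    return best, best_err

def adaboost(X, t, T=20):
    N = len(t)
    D = np.full(N, 1.0 / N)                       # D_1(j) = 1/N
    stumps, alphas = [], []
    for _ in range(T):
        (i, theta, sign), eps = best_stump(X, t, D)
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 2
        h = sign * np.where(X[:, i] > theta, 1, -1)
        D = D * np.exp(-alpha * t * h)            # step 3 ...
        D /= D.sum()                              # ... Z_m normalization
        stumps.append((i, theta, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """H_final(x) = sign(sum_m alpha_m h_m(x))."""
    score = sum(a * s * np.where(X[:, i] > th, 1, -1)
                for (i, th, s), a in zip(stumps, alphas))
    return np.sign(score)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # XOR-like toy problem
stumps, alphas = adaboost(X, t)
print("training accuracy:", (predict(X, stumps, alphas) == t).mean())
```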