
# Topics-ML

April 01, 2014

## Transcript

1. ### Introduction to Machine Learning – ECE-S436 – Gregory Ditzler

Drexel University, Ecological and Evolutionary Signal Processing & Informatics Lab, Department of Electrical & Computer Engineering, Philadelphia, PA, USA. gregory.ditzler@gmail.com, http://github.com/gditzler/eces436-week1. April 1, 2014.
2. ### Overview

Topics of Discussion: Examples, Classification, Regression, Types of Learning.

Image Rights: Many of the images used in this presentation are from Christopher M. Bishop's "Pattern Recognition and Machine Learning" (2006) textbook. http://research.microsoft.com/en-us/um/people/cmbishop/prml/index.htm
3. ### Text Prediction

Given a word w(t) and some history h(t): what is the next word (i.e., w(t + 1))? What is the probability distribution over the next word (i.e., P(w(t + 1) | w(t), h(t)))? Examples: "I love --?" "Can you pick up milk at the --?"
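
A next-word distribution like P(w(t + 1) | w(t)) can be estimated from counts; below is a minimal bigram sketch (the toy corpus and function name are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus; any tokenized text would do.
corpus = "i love pizza i love coffee can you pick up milk at the store".split()

# Count bigram transitions: counts[w][w_next]
counts = defaultdict(Counter)
for w, w_next in zip(corpus, corpus[1:]):
    counts[w][w_next] += 1

def next_word_distribution(w):
    """Estimate P(w(t+1) | w(t)) from bigram counts."""
    total = sum(counts[w].values())
    return {v: c / total for v, c in counts[w].items()}

print(next_word_distribution("love"))  # {'pizza': 0.5, 'coffee': 0.5}
```

A real language model would also condition on the longer history h(t), e.g. with n-grams or a neural model, but the counting idea is the same.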

5. ### Prediction of low/high risk loans

Given the features savings and income, a loan can be classified as high-risk or low-risk with a simple threshold rule on the partitioned feature space:

if (income > θ1 AND savings > θ2) then {low-risk} else {high-risk}
6. ### Terminology I

- feature: a variable, x, believed to carry information about the task. Example: cholesterol level.
- feature vector: a collection of variables, or features, x = [x1, ..., xM]^T. Example: a collection of medical tests for a patient.
- feature space: the M-dimensional vector space where the vectors x lie. Example: x ∈ R^M_+.
- class: a category/value assigned to a feature vector. In general we can refer to this as the target variable (t). Example: t = cancer or t = 10.2 °C.
- pattern: a collection of features of an object under consideration, along with the correct class information of that object. Defined by {xn, tn}.
- training data: data used during training of a classifier for which the correct labels are a priori known.
- testing/validation data: data not used during training, but rather set aside to estimate the true (generalization) performance of a classifier, for which correct labels are also a priori known.
- cost function: a quantitative measure that represents the cost of making an error. A model is produced to minimize this function. Is zero error always a good thing?
7. ### Terminology II

- classifier: a parametric or nonparametric model which adjusts its parameters or weights to find the mapping from the feature space to the outcome (class) space, f : X → T. Examples:
  - y(x) = w^T x + b
  - y(x) = σ(W^T x + b), where σ is a soft-max
  - y(x) = σ(Q^T ν(W^T x + b) + q), where σ is a soft-max and ν is a sigmoid

  We need to optimize the parameters Q, W, w, q, and/or b to minimize a cost.
- model: a simplified mathematical/statistical construct that mimics (acts like) the underlying physical phenomenon that generated the original data.
8. ### Measuring Error

Bishop (2006) illustrates the error at a training point xn as the vertical displacement between the prediction y(xn, w) and the target tn. The sum-of-squares error over the training set is

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2
9. ### Overfitting

Consider a polynomial model:

y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = Σ_{j=0}^{M} wj x^j

Bishop (2006) shows fits for M = 0, 1, 3, and 9 on the same data: the low-order polynomials underfit, while M = 9 passes through every training point but oscillates wildly between them (overfitting). The errors are

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2,  ERMS = sqrt(2 E(w*) / N)
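
The polynomial fit and its ERMS can be reproduced in a few lines with NumPy's least-squares `polyfit`; the sample size and noise level below are assumptions echoing Bishop's sin(2πx) setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from the target function sin(2*pi*x).
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def rms_error(degree, x, t):
    """Fit a degree-M polynomial by least squares and return E_RMS on (x, t)."""
    w = np.polyfit(x, t, degree)
    residuals = np.polyval(w, x) - t
    return np.sqrt(np.mean(residuals ** 2))  # sqrt(2 E(w*) / N)

for M in (0, 1, 3, 9):
    print(M, rms_error(M, x, t))
```

With N = 10 points, the M = 9 polynomial interpolates the training data exactly (training ERMS near zero), which is precisely the overfitting behavior in the figure; its error on fresh test samples would be far worse.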
10. ### Overfitting

Bishop (2006) plots ERMS against the polynomial order M (0 through 9) for both the training and test sets: the training error keeps decreasing as M grows, but the test error eventually increases, which is the signature of overfitting.

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2,  ERMS = sqrt(2 E(w*) / N)
11. ### Keeping overfitting under control

Many models and prediction algorithms suffer from overfitting; however, we can try to avoid it by taking certain precautions. Using a Bayesian approach can avoid overfitting even when the number of parameters exceeds the number of training data points. Regularization is the most commonly used approach to control overfitting: we add a penalty to the error function that discourages the solution vector from taking on large values. Yup, it's that simple! (for the most part)

- 2-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + (λ2/2) ||w||_2^2
- 1-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + λ1 ||w||_1
- 1- & 2-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + (λ2/2) ||w||_2^2 + λ1 ||w||_1
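
The 2-norm (ridge) penalty has a closed-form solution via the modified normal equations, w = (X^T X + λI)^{-1} X^T t. A minimal sketch, assuming a degree-9 polynomial design matrix and made-up data:

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize (1/2)||X w - t||^2 + (lam/2)||w||^2 via the normal equations."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
X = np.vander(x, 10)  # degree-9 polynomial features

w_unreg = ridge_fit(X, t, 0.0)   # unpenalized: weights can be huge
w_ridge = ridge_fit(X, t, 1e-3)  # penalty shrinks the weight vector

print(np.linalg.norm(w_unreg), np.linalg.norm(w_ridge))
```

The 1-norm penalty has no closed form (it needs an iterative solver such as coordinate descent), but the idea is the same: trade a little training error for smaller, better-behaved weights.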
12. ### 1- and 2-norm regularization

Bishop (2006): estimation for 1-norm (left) and 2-norm (right) regularization on w, showing the contours of the error function together with the regularization constraints ||w||_1 ≤ τ and ||w||_2^2 ≤ τ^2.
13. ### How much data do I need for a good fit?

Bishop (2006) shows fits with N = 15 and N = 100 samples. The green line is the target function, the red function is the result of a 9th-order polynomial minimizing ERMS, and the blue points are observations sampled from the target function. With more data, the high-order polynomial no longer overfits.
14. ### Bayes Decision Theory

Probability Theory: pattern recognition requires that we have a way to deal with uncertainty, which arises from noise in the data and from finite sample sizes. Three things in life are certain: (1) death, (2) taxes, and (3) noise in your data!

Some definitions:

- Evidence: the probability of making such an observation.
- Prior: our degree of belief that the event is plausible in the first place.
- Likelihood: the probability of making an observation, under the condition that the event has occurred.

Let us define some notation. Let X and Y be random variables; for example, X is a collection of medical measurements and Y is the healthy/unhealthy label. Recall the axioms of probability that must hold: P(Ω) = 1 for the sample space Ω; P(E) ≥ 0 for every event E; and P(∪_{i=1}^{n} Ei) = Σ_{i=1}^{n} P(Ei) when the events Ei are mutually exclusive (i.e., Ei ∩ Ej = ∅ for all i ≠ j). Also, if X and Y are independent then P(X, Y) = P(X)P(Y).
15. ### Sum, Product and Bayes Rule

Sum Rule: the marginal probability of a single random variable can be computed by integrating (or summing) out the other random variables in the joint distribution.

P(X) = Σ_{Y ∈ Y} P(X, Y)

Product Rule: a joint probability can be written as the product of a conditional and a marginal probability.

P(X, Y) = P(Y)P(X|Y) = P(X)P(Y|X)

Bayes Rule: a simple manipulation of the product rule gives rise to the Bayes rule.

P(Y|X) = P(Y)P(X|Y) / P(X) = P(Y)P(X|Y) / Σ_{Y' ∈ Y} P(X, Y') = P(Y)P(X|Y) / Σ_{Y' ∈ Y} P(Y')P(X|Y')
16. ### Bayes Rule & Decision Making

Bayes Rule: a simple manipulation of the product and sum rules gave us the Bayes rule,

P(Y|X) [posterior] = P(Y) [prior] × P(X|Y) [likelihood] / P(X) [evidence]

Posterior, P(Y|X): the probability of Y given that I have observed X. Example: the probability that a patient has cancer given that their medical measurements are in X.

Decision Making: choosing the outcome with the highest posterior probability is the decision that results in the smallest probability of error.

ω = arg max_{Y ∈ Y} P(Y)P(X|Y) / P(X) = arg max_{Y ∈ Y} P(Y)P(X|Y)
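
As a worked example, Bayes rule with hypothetical numbers for a two-class diagnostic problem (the probabilities below are invented for illustration, not taken from the lecture):

```python
# P(Y): prior over the two classes
prior = {"cancer": 0.01, "healthy": 0.99}
# P(X|Y): likelihood of observing a positive test under each class
likelihood = {"cancer": 0.9, "healthy": 0.05}

# Evidence via the sum rule: P(X) = sum_Y P(Y) P(X|Y)
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Posterior via Bayes rule: P(Y|X) = P(Y) P(X|Y) / P(X)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}

# Decide by maximum posterior (equivalently, maximum prior * likelihood,
# since the evidence is the same for every class)
decision = max(posterior, key=posterior.get)
print(posterior, decision)
```

Note how the small prior drags the posterior for "cancer" down to about 0.15 despite the strong likelihood; the minimum-error decision is still "healthy".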
17. ### Naïve Bayes

The Idiot Bayes Rule, a.k.a. naïve Bayes. Computing the likelihood function P(x|ω) can be an extremely daunting task, and sometimes infeasible if we do not have enough data. One reason it is difficult is that we are computing the joint likelihood function (i.e., P(x1, x2, ..., xM | ω)). Solution: assume all feature variables are independent! The posterior becomes:

P(ω|x) ∝ P(ω) Π_{i=1}^{M} P(xi | ω)

Modeling each feature independently is much easier. Is the assumption wrong? Probably! Yet it works well in practice.
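
A minimal naïve Bayes sketch for categorical features, with a made-up toy data set; the counting scheme estimates each P(xi | ω) independently, exactly as the independence assumption prescribes:

```python
from collections import Counter, defaultdict

# Toy categorical data set (invented for illustration): each row is
# ((outlook, windy), class).
data = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "play"),
    (("rain", "yes"), "stay"),
    (("rain", "no"), "play"),
    (("rain", "yes"), "stay"),
]

class_counts = Counter(label for _, label in data)
# feature_counts[class][feature index][value]
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, v in enumerate(features):
        feature_counts[label][i][v] += 1

def posterior_scores(x):
    """Unnormalized P(class) * prod_i P(x_i | class)."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                     # prior P(class)
        for i, v in enumerate(x):
            p *= feature_counts[c][i][v] / n_c  # per-feature likelihood
        scores[c] = p
    return scores

scores = posterior_scores(("rain", "yes"))
print(max(scores, key=scores.get))  # prints "stay"
```

In practice one would also smooth the counts (e.g. Laplace smoothing) so that an unseen feature value does not zero out the whole product.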
18. ### Other classifiers

Should we stop here? The Bayes classifier produces a decision with the smallest probability of error! So does that mean we are done? Some things to think about: What if we do not know the form of P(X|Y)? What if we incorrectly assume the form of the distribution? What if we have a small sample size? Can we be confident in P(Y)? More features generally means we need more data to "accurately" estimate any of the model's parameters.
19. ### Decision Trees

Tree-based Models: classification and regression trees, or CART (Breiman et al., 1984), and C4.5 (Quinlan, 1993) are two of the more popular methods for generating decision trees. Decision trees provide a natural setting for handling data containing categorical variables, but can still use continuous variables.

Pros & Cons:

- Decision trees are unstable classifiers: a small change in the input can produce a large change in the output. (con)
- Prone to overfitting. (con)
- Easy to interpret! (pro)
20. ### A binary-split decision tree and feature space partition

Bishop (2006): illustration of the partitioning of the feature space x = [x1, x2]^T into five regions, A through E, using axis-aligned thresholds θ1, ..., θ4 (left). The partitioning is carried out by the binary tree on the right, which first tests x1 > θ1 and then splits each branch on x2 (and, where needed, again on x1).
21. ### Support Vector Machines

Overview: support vector machines (SVMs) are binary classification models that maximize the margin between the two classes. Let tn ∈ {±1}. The solution parameters of an SVM are found via a convex optimization problem; for a convex function, a local solution is also a global solution. Kernel methods can be used to solve non-linear problems with the SVM. The output of the SVM is given by y(x) = w^T φ(x) + b, where φ(·) is a non-linear feature transform.
22. ### Support Vector Machines I

The decision of the SVM is given by y(x) = w^T φ(x) + b for some nonlinear function φ. For the moment, assume that the data are linearly separable. The decision function can be interpreted as y(xn) > 0 if tn = 1 and y(xn) < 0 if tn = −1 for all of the training instances. For correct predictions we have tn y(xn) > 0, and for incorrect predictions tn y(xn) < 0. For linearly separable data there exist infinitely many solutions; however, the one we seek maximizes the margin. Bishop (2006) illustrates this with the decision boundary y = 0 and the margin boundaries y = 1 and y = −1.
23. ### Support Vector Machines II

The perpendicular distance of a point x from the hyperplane defined by y(x) = 0 is given by r = |y(x)| / ||w||. For correctly classified instances,

tn y(xn) / ||w|| = tn (w^T φ(xn) + b) / ||w||

The margin is the perpendicular distance to the closest point xn in the data set. By rescaling w and b we can set tn (w^T φ(xn) + b) = 1 for that closest point, so the margin becomes 2/||w||, and we want to maximize 1/||w|| subject to tn (w^T φ(xn) + b) ≥ 1. We can cast this maximization problem as a minimization:

min (1/2) ||w||^2  subject to  tn (w^T φ(xn) + b) ≥ 1

This can be solved using Lagrange multipliers αn ≥ 0:

L(w, b, α) = (1/2) ||w||^2 − Σ_{n=1}^{N} αn [tn (w^T φ(xn) + b) − 1]
24. ### Support Vector Machines III

Setting the derivatives of L(w, b, α) w.r.t. w and b to zero yields, respectively:

w = Σ_{n=1}^{N} αn tn φ(xn),   Σ_{n=1}^{N} αn tn = 0

Substituting these results into L(w, b, α) yields the dual:

L(α) = Σ_{n=1}^{N} αn − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} αn αm tn tm φ(xn)^T φ(xm)

subject to αn ≥ 0 and Σ_{n=1}^{N} αn tn = 0. The optimization problem shown above is a quadratic program and can be solved relatively easily. We can handle the more realistic non-linearly separable case by introducing slack terms ξn on the margin. The optimization problem is nearly identical to the one above; however, the constraint on αn becomes 0 ≤ αn ≤ C (for C > 0). Any xn corresponding to αn > 0 is referred to as a support vector.
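
Rather than the dual QP above, a quick way to see the soft-margin SVM in action is sub-gradient descent on the equivalent primal hinge-loss objective, (1/2)||w||^2 + C Σn max(0, 1 − tn(w·xn + b)). This is a different solver than the slides describe, and the Gaussian toy data, step size, and iteration count below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, labels in {-1, +1}.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
t = np.array([-1] * 20 + [1] * 20)

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = t * (X @ w + b)
    viol = margins < 1  # points inside the margin contribute to the hinge loss
    # Sub-gradient step on (1/2)||w||^2 + C * sum of hinge violations
    w -= lr * (w - C * (t[viol, None] * X[viol]).sum(axis=0))
    b -= lr * (-C * t[viol].sum())

accuracy = np.mean(np.sign(X @ w + b) == t)
print(accuracy)
```

In the dual picture, the points with `viol` true play the role of the support vectors: only they influence the solution.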
25. ### SVM example using φ(x)^T φ(z) = exp(−||x − z||_2^2 / σ)

Bishop (2006): SVM decision boundaries with a Gaussian kernel on (left) linearly separable and (right) non-linearly separable data. The green circles represent the support vectors.

26. ### Boosting

Adaboost achieves its power through the strategic learning and combination of weak classifiers:

- the training error decreases exponentially with the addition of new classifiers
- the classifiers used in Adaboost are weak and very easy to generate
- Adaboost is quite possibly the most successful ensemble algorithm
- relatively easy to implement in practice, and quite robust to overfitting; experiments have shown that Adaboost generalizes very well, even when trained to the same level as other algorithms
27. ### Boosting Adaboost

Input: training set (x1, t1), ..., (xN, tN) with tj ∈ {±1}; distribution D1(j) = 1/N for all j ∈ [N]; iteration variable m = 1.

1. Find a weak hypothesis ("rule of thumb") hm = arg min_{h ∈ H} P_{x∼Dm}(h(x) ≠ t), and let εm be hm's error on the distribution Dm.
2. Choose αm = (1/2) log((1 − εm)/εm).
3. Update the distribution: Dm+1(j) = (Dm(j)/Zm) exp(−αm tj hm(xj)), where Zm is a normalizer.
4. Set m = m + 1 and repeat.

Output a final hypothesis: Hfinal(x) = sign(Σ_{m=1}^{T} αm hm(x))
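
The four steps above can be sketched with threshold stumps as the weak hypotheses (the 1-D toy data and the stump family H are made up for illustration):

```python
import numpy as np

# 1-D toy problem: the label pattern cannot be matched by any single stump.
x = np.arange(10.0)
t = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def stump(theta, s):
    """Weak hypothesis: h(x) = s if x > theta else -s."""
    return lambda x: np.where(x > theta, s, -s)

# Candidate weak hypotheses H: every threshold between points, both signs.
H = [stump(th, s) for th in np.arange(-0.5, 10) for s in (1, -1)]

D = np.ones(len(x)) / len(x)  # D_1(j) = 1/N
alphas, hyps = [], []
for m in range(10):
    # Step 1: weak hypothesis with the smallest weighted error on D_m
    errs = [np.sum(D[h(x) != t]) for h in H]
    h, eps = H[int(np.argmin(errs))], min(errs)
    # Step 2: classifier weight
    alpha = 0.5 * np.log((1 - eps) / eps)
    # Step 3: reweight the distribution; dividing by the sum is Z_m
    D = D * np.exp(-alpha * t * h(x))
    D /= D.sum()
    alphas.append(alpha)
    hyps.append(h)

# Final hypothesis: sign of the weighted vote
H_final = lambda z: np.sign(sum(a * h(z) for a, h in zip(alphas, hyps)))
print(np.mean(H_final(x) == t))  # ensemble accuracy on the training set
```

Each round, the misclassified points gain weight, so the next stump is forced to concentrate on them; the weighted vote of several stumps fits the interval-shaped labels that no single stump can.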
28. ### Boosting Example

Bishop (2006): snapshots of the Adaboost decision boundary on a 2-D data set after m = 1, 2, 3, 6, 10, and 150 rounds of boosting.
29. ### Adaboost on the Ionosphere data set

The training error of the final hypothesis is bounded by

err(H) ≤ Π_{m=1}^{T} 2 sqrt(εm (1 − εm)) ≤ exp(−2 Σ_m γm^2)

where γm = 1/2 − εm is the edge of the mth weak learner. The accompanying plot shows the error on the Ionosphere data set decreasing as the size of the ensemble grows.
30. ### Cross Validation

Divide the data into k disjoint sets (blocks). Then train on k − 1 of the blocks and test on the kth; repeating this with each block held out in turn produces k error estimates. The k-fold CV error is their average:

err = (1/k) Σ_{i=1}^{k} err_i

Refer to Demšar's JMLR (2006) article on comparing multiple classifiers for statistical significance.
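
A minimal sketch of how the k disjoint blocks are generated (the function name is illustrative):

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Distribute n samples into k blocks as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Each sample appears in exactly one test block, so averaging the k per-fold errors uses every sample for testing exactly once. In practice the data are shuffled before splitting.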
31. ### Figures of Merit

Confusion matrix:

|             | true + | true − |
|-------------|--------|--------|
| predicted + | TP     | FP     |
| predicted − | FN     | TN     |

Some commonly used figures of merit:

recall = TP / (TP + FN),  precision = TP / (TP + FP),  f-measure = (2 × precision × recall) / (precision + recall)

Receiver operating characteristic (ROC) curves are used to show the trade-off between the true positive and false positive rates.
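
The figures of merit follow directly from the four confusion-matrix counts; a small sketch with made-up counts:

```python
def metrics(tp, fp, fn, tn):
    """Recall, precision, and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn)     # fraction of actual positives recovered
    precision = tp / (tp + fp)  # fraction of positive predictions correct
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Example counts (invented): recall ~ 0.667, precision = 0.8, F ~ 0.727
print(metrics(tp=8, fp=2, fn=4, tn=6))
```

Note that neither metric uses TN, which is why precision/recall (unlike raw accuracy) remain informative on heavily imbalanced data.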