
# Topics-ML

April 01, 2014

## Transcript

1. ### Introduction to Machine Learning – ECE-S436 – Gregory Ditzler

Drexel University, Ecological and Evolutionary Signal Processing & Informatics Lab, Department of Electrical & Computer Engineering, Philadelphia, PA, USA. gregory.ditzler@gmail.com, http://github.com/gditzler/eces436-week1. April 1, 2014.
2. ### Overview

Topics of Discussion: Examples, Classification, Regression, Types of Learning.

Image Rights: Many of the images used in this presentation are from Christopher M. Bishop's "Pattern Recognition and Machine Learning" (2006) textbook. http://research.microsoft.com/en-us/um/people/cmbishop/prml/index.htm
3. ### Text Prediction

Given a word w(t) and some history h(t): what is the next word (i.e., w(t + 1))? What is the probability distribution over the next word (i.e., P(w(t + 1) | w(t), h(t)))? Examples: "I love --?" "Can you pick up milk at the --?"
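
A next-word distribution like P(w(t + 1) | w(t)) can be estimated from counts; below is a minimal bigram sketch (the toy corpus and function name are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus; any tokenized text would do.
corpus = "i love pizza i love coffee can you pick up milk at the store".split()

# Count bigram transitions: counts[w][w_next]
counts = defaultdict(Counter)
for w, w_next in zip(corpus, corpus[1:]):
    counts[w][w_next] += 1

def next_word_distribution(w):
    """Estimate P(w(t+1) | w(t)) from bigram counts."""
    total = sum(counts[w].values())
    return {v: c / total for v, c in counts[w].items()}

print(next_word_distribution("love"))  # {'pizza': 0.5, 'coffee': 0.5}
```

A real language model would also condition on the longer history h(t), e.g. with n-grams or a neural model, but the counting idea is the same.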

5. ### Prediction of low/high risk loans

Given the features savings and income, a loan can be classified as high-risk or low-risk with a simple threshold rule on the partitioned feature space:

if (income > θ1 AND savings > θ2) then {low-risk} else {high-risk}
6. ### Terminology I

- feature: a variable, x, believed to carry information about the task. Example: cholesterol level.
- feature vector: a collection of variables, or features, x = [x1, ..., xM]^T. Example: a collection of medical tests for a patient.
- feature space: the M-dimensional vector space where the vectors x lie. Example: x ∈ R^M_+.
- class: a category/value assigned to a feature vector. In general we can refer to this as the target variable (t). Example: t = cancer or t = 10.2 °C.
- pattern: a collection of features of an object under consideration, along with the correct class information of that object. Defined by {xn, tn}.
- training data: data used during training of a classifier for which the correct labels are a priori known.
- testing/validation data: data not used during training, but rather set aside to estimate the true (generalization) performance of a classifier, for which correct labels are also a priori known.
- cost function: a quantitative measure that represents the cost of making an error. A model is produced to minimize this function. Is zero error always a good thing?
7. ### Terminology II

- classifier: a parametric or nonparametric model which adjusts its parameters or weights to find the mapping from the feature space to the outcome (class) space, f : X → T. Examples:
  - y(x) = w^T x + b
  - y(x) = σ(W^T x + b), where σ is a soft-max
  - y(x) = σ(Q^T ν(W^T x + b) + q), where σ is a soft-max and ν is a sigmoid

  We need to optimize the parameters Q, W, w, q, and/or b to minimize a cost.
- model: a simplified mathematical/statistical construct that mimics (acts like) the underlying physical phenomenon that generated the original data.
8. ### Measuring Error

Bishop (2006) illustrates the error at a training point xn as the vertical displacement between the prediction y(xn, w) and the target tn. The sum-of-squares error over the training set is

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2
9. ### Overfitting

Consider a polynomial model:

y(x, w) = w0 + w1 x + w2 x^2 + ... + wM x^M = Σ_{j=0}^{M} wj x^j

Bishop (2006) shows fits for M = 0, 1, 3, and 9 on the same data: the low-order polynomials underfit, while M = 9 passes through every training point but oscillates wildly between them (overfitting). The errors are

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2,  ERMS = sqrt(2 E(w*) / N)
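
The polynomial fit and its ERMS can be reproduced in a few lines with NumPy's least-squares `polyfit`; the sample size and noise level below are assumptions echoing Bishop's sin(2πx) setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from the target function sin(2*pi*x).
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def rms_error(degree, x, t):
    """Fit a degree-M polynomial by least squares and return E_RMS on (x, t)."""
    w = np.polyfit(x, t, degree)
    residuals = np.polyval(w, x) - t
    return np.sqrt(np.mean(residuals ** 2))  # sqrt(2 E(w*) / N)

for M in (0, 1, 3, 9):
    print(M, rms_error(M, x, t))
```

With N = 10 points, the M = 9 polynomial interpolates the training data exactly (training ERMS near zero), which is precisely the overfitting behavior in the figure; its error on fresh test samples would be far worse.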
10. ### Overfitting

Bishop (2006) plots ERMS against the polynomial order M (0 through 9) for both the training and test sets: the training error keeps decreasing as M grows, but the test error eventually increases, which is the signature of overfitting.

E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2,  ERMS = sqrt(2 E(w*) / N)
11. ### Keeping overfitting under control

Many models and prediction algorithms suffer from overfitting; however, we can try to avoid it by taking certain precautions. Using a Bayesian approach can avoid overfitting even when the number of parameters exceeds the number of training data points. Regularization is the most commonly used approach to control overfitting: we add a penalty to the error function that discourages the solution vector from taking on large values. Yup, it's that simple! (for the most part)

- 2-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + (λ2/2) ||w||_2^2
- 1-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + λ1 ||w||_1
- 1- & 2-norm penalty: E(w) = (1/2) Σ_{n=1}^{N} (y(xn, w) − tn)^2 + (λ2/2) ||w||_2^2 + λ1 ||w||_1
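
The 2-norm (ridge) penalty has a closed-form solution via the modified normal equations, w = (X^T X + λI)^{-1} X^T t. A minimal sketch, assuming a degree-9 polynomial design matrix and made-up data:

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Minimize (1/2)||X w - t||^2 + (lam/2)||w||^2 via the normal equations."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
X = np.vander(x, 10)  # degree-9 polynomial features

w_unreg = ridge_fit(X, t, 0.0)   # unpenalized: weights can be huge
w_ridge = ridge_fit(X, t, 1e-3)  # penalty shrinks the weight vector

print(np.linalg.norm(w_unreg), np.linalg.norm(w_ridge))
```

The 1-norm penalty has no closed form (it needs an iterative solver such as coordinate descent), but the idea is the same: trade a little training error for smaller, better-behaved weights.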
12. ### 1- and 2-norm regularization

Bishop (2006): estimation for 1-norm (left) and 2-norm (right) regularization on w, showing the contours of the error function together with the regularization constraints ||w||_1 ≤ τ and ||w||_2^2 ≤ τ^2.
13. ### How much data do I need for a good fit?

Bishop (2006) shows fits with N = 15 and N = 100 samples. The green line is the target function, the red function is the result of a 9th-order polynomial minimizing ERMS, and the blue points are observations sampled from the target function. With more data, the high-order polynomial no longer overfits.
14. ### Bayes Decision Theory

Probability Theory: pattern recognition requires that we have a way to deal with uncertainty, which arises from noise in the data and from finite sample sizes. Three things in life are certain: (1) death, (2) taxes, and (3) noise in your data!

Some definitions:

- Evidence: the probability of making such an observation.
- Prior: our degree of belief that the event is plausible in the first place.
- Likelihood: the probability of making an observation, under the condition that the event has occurred.

Let us define some notation. Let X and Y be random variables; for example, X is a collection of medical measurements and Y is the healthy/unhealthy label. Recall the axioms of probability that must hold: P(Ω) = 1 for the sample space Ω; P(E) ≥ 0 for every event E; and P(∪_{i=1}^{n} Ei) = Σ_{i=1}^{n} P(Ei) when the events Ei are mutually exclusive (i.e., Ei ∩ Ej = ∅ for all i ≠ j). Also, if X and Y are independent then P(X, Y) = P(X)P(Y).
15. ### Sum, Product and Bayes Rule

Sum Rule: the marginal probability of a single random variable can be computed by integrating (or summing) out the other random variables in the joint distribution.

P(X) = Σ_{Y ∈ Y} P(X, Y)

Product Rule: a joint probability can be written as the product of a conditional and a marginal probability.

P(X, Y) = P(Y)P(X|Y) = P(X)P(Y|X)

Bayes Rule: a simple manipulation of the product rule gives rise to the Bayes rule.

P(Y|X) = P(Y)P(X|Y) / P(X) = P(Y)P(X|Y) / Σ_{Y' ∈ Y} P(X, Y') = P(Y)P(X|Y) / Σ_{Y' ∈ Y} P(Y')P(X|Y')
16. ### Bayes Rule & Decision Making

Bayes Rule: a simple manipulation of the product and sum rules gave us the Bayes rule,

P(Y|X) [posterior] = P(Y) [prior] × P(X|Y) [likelihood] / P(X) [evidence]

Posterior, P(Y|X): the probability of Y given that I have observed X. Example: the probability that a patient has cancer given that their medical measurements are in X.

Decision Making: choosing the outcome with the highest posterior probability is the decision that results in the smallest probability of error.

ω = arg max_{Y ∈ Y} P(Y)P(X|Y) / P(X) = arg max_{Y ∈ Y} P(Y)P(X|Y)
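
As a worked example, Bayes rule with hypothetical numbers for a two-class diagnostic problem (the probabilities below are invented for illustration, not taken from the lecture):

```python
# P(Y): prior over the two classes
prior = {"cancer": 0.01, "healthy": 0.99}
# P(X|Y): likelihood of observing a positive test under each class
likelihood = {"cancer": 0.9, "healthy": 0.05}

# Evidence via the sum rule: P(X) = sum_Y P(Y) P(X|Y)
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Posterior via Bayes rule: P(Y|X) = P(Y) P(X|Y) / P(X)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}

# Decide by maximum posterior (equivalently, maximum prior * likelihood,
# since the evidence is the same for every class)
decision = max(posterior, key=posterior.get)
print(posterior, decision)
```

Note how the small prior drags the posterior for "cancer" down to about 0.15 despite the strong likelihood; the minimum-error decision is still "healthy".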
17. ### Naïve Bayes

The Idiot Bayes Rule, a.k.a. naïve Bayes. Computing the likelihood function P(x|ω) can be an extremely daunting task, and sometimes infeasible if we do not have enough data. One reason it is difficult is that we are computing the joint likelihood function (i.e., P(x1, x2, ..., xM | ω)). Solution: assume all feature variables are independent! The posterior becomes:

P(ω|x) ∝ P(ω) Π_{i=1}^{M} P(xi | ω)

Modeling each feature independently is much easier. Is the assumption wrong? Probably! Yet it works well in practice.
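
A minimal naïve Bayes sketch for categorical features, with a made-up toy data set; the counting scheme estimates each P(xi | ω) independently, exactly as the independence assumption prescribes:

```python
from collections import Counter, defaultdict

# Toy categorical data set (invented for illustration): each row is
# ((outlook, windy), class).
data = [
    (("sunny", "no"), "play"),
    (("sunny", "yes"), "play"),
    (("rain", "yes"), "stay"),
    (("rain", "no"), "play"),
    (("rain", "yes"), "stay"),
]

class_counts = Counter(label for _, label in data)
# feature_counts[class][feature index][value]
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, v in enumerate(features):
        feature_counts[label][i][v] += 1

def posterior_scores(x):
    """Unnormalized P(class) * prod_i P(x_i | class)."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(data)                     # prior P(class)
        for i, v in enumerate(x):
            p *= feature_counts[c][i][v] / n_c  # per-feature likelihood
        scores[c] = p
    return scores

scores = posterior_scores(("rain", "yes"))
print(max(scores, key=scores.get))  # prints "stay"
```

In practice one would also smooth the counts (e.g. Laplace smoothing) so that an unseen feature value does not zero out the whole product.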
18. ### Other classifiers

Should we stop here? The Bayes classifier produces a decision with the smallest probability of error! So does that mean we are done? Some things to think about: What if we do not know the form of P(X|Y)? What if we incorrectly assume the form of the distribution? What if we have a small sample size? Can we be confident in P(Y)? More features generally means we need more data to "accurately" estimate any of the model's parameters.
19. ### Decision Trees

Tree-based Models: classification and regression trees, or CART (Breiman et al., 1984), and C4.5 (Quinlan, 1993) are two of the more popular methods for generating decision trees. Decision trees provide a natural setting for handling data containing categorical variables, but can still use continuous variables.

Pros & Cons:

- Decision trees are unstable classifiers: a small change in the input can produce a large change in the output. (con)
- Prone to overfitting. (con)
- Easy to interpret! (pro)
20. ### A binary-split decision tree and feature space partition

Bishop (2006): illustration of the partitioning of the feature space x = [x1, x2]^T into five regions, A through E, using axis-aligned thresholds θ1, ..., θ4 (left). The partitioning is carried out by the binary tree on the right, which first tests x1 > θ1 and then splits each branch on x2 (and, where needed, again on x1).
21. ### Support Vector Machines

Overview: support vector machines (SVMs) are binary classification models that maximize the margin between the two classes. Let tn ∈ {±1}. The solution parameters of an SVM are found via a convex optimization problem; for a convex function, a local solution is also a global solution. Kernel methods can be used to solve non-linear problems with the SVM. The output of the SVM is given by y(x) = w^T φ(x) + b, where φ(·) is a non-linear feature transform.
22. ### Support Vector Machines I

The decision of the SVM is given by y(x) = w^T φ(x) + b for some nonlinear function φ. For the moment, assume that the data are linearly separable. The decision function can be interpreted as y(xn) > 0 if tn = 1 and y(xn) < 0 if tn = −1 for all of the training instances. For correct predictions we have tn y(xn) > 0, and for incorrect predictions tn y(xn) < 0. For linearly separable data there exist infinitely many solutions; however, the one we seek maximizes the margin. Bishop (2006) illustrates this with the decision boundary y = 0 and the margin boundaries y = 1 and y = −1.
23. ### Support Vector Machines II

The perpendicular distance of a point x from the hyperplane defined by y(x) = 0 is given by r = |y(x)| / ||w||. For correctly classified instances,

tn y(xn) / ||w|| = tn (w^T φ(xn) + b) / ||w||

The margin is the perpendicular distance to the closest point xn in the data set. By rescaling w and b we can set tn (w^T φ(xn) + b) = 1 for that closest point, so the margin becomes 2/||w||, and we want to maximize 1/||w|| subject to tn (w^T φ(xn) + b) ≥ 1. We can cast this maximization problem as a minimization:

min (1/2) ||w||^2  subject to  tn (w^T φ(xn) + b) ≥ 1

This can be solved using Lagrange multipliers αn ≥ 0:

L(w, b, α) = (1/2) ||w||^2 − Σ_{n=1}^{N} αn [tn (w^T φ(xn) + b) − 1]
24. ### Support Vector Machines III

Setting the derivatives of L(w, b, α) w.r.t. w and b to zero yields, respectively:

w = Σ_{n=1}^{N} αn tn φ(xn),   Σ_{n=1}^{N} αn tn = 0

Substituting these results into L(w, b, α) yields the dual:

L(α) = Σ_{n=1}^{N} αn − (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} αn αm tn tm φ(xn)^T φ(xm)

subject to αn ≥ 0 and Σ_{n=1}^{N} αn tn = 0. The optimization problem shown above is a quadratic program and can be solved relatively easily. We can handle the more realistic non-linearly separable case by introducing slack terms ξn on the margin. The optimization problem is nearly identical to the one above; however, the constraint on αn becomes 0 ≤ αn ≤ C (for C > 0). Any xn corresponding to αn > 0 is referred to as a support vector.
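
Rather than the dual QP above, a quick way to see the soft-margin SVM in action is sub-gradient descent on the equivalent primal hinge-loss objective, (1/2)||w||^2 + C Σn max(0, 1 − tn(w·xn + b)). This is a different solver than the slides describe, and the Gaussian toy data, step size, and iteration count below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, labels in {-1, +1}.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
t = np.array([-1] * 20 + [1] * 20)

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = t * (X @ w + b)
    viol = margins < 1  # points inside the margin contribute to the hinge loss
    # Sub-gradient step on (1/2)||w||^2 + C * sum of hinge violations
    w -= lr * (w - C * (t[viol, None] * X[viol]).sum(axis=0))
    b -= lr * (-C * t[viol].sum())

accuracy = np.mean(np.sign(X @ w + b) == t)
print(accuracy)
```

In the dual picture, the points with `viol` true play the role of the support vectors: only they influence the solution.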
25. ### SVM example using φ(x)^T φ(z) = exp(−||x − z||_2^2 / σ)

Bishop (2006): SVM decision boundaries with a Gaussian kernel on (left) linearly separable and (right) non-linearly separable data. The green circles represent the support vectors.

26. ### Boosting

Adaboost achieves its power through the strategic learning and combination of weak classifiers:

- the training error decreases exponentially with the addition of new classifiers
- the classifiers used in Adaboost are weak and very easy to generate
- Adaboost is quite possibly the most successful ensemble algorithm
- relatively easy to implement in practice, and quite robust to overfitting; experiments have shown that Adaboost generalizes very well, even when trained to the same level as other algorithms
27. ### Boosting Adaboost

Input: training set (x1, t1), ..., (xN, tN) with tj ∈ {±1}; distribution D1(j) = 1/N for all j ∈ [N]; iteration variable m = 1.

1. Find a weak hypothesis ("rule of thumb") hm = arg min_{h ∈ H} P_{x∼Dm}(h(x) ≠ t), and let εm be hm's error on the distribution Dm.
2. Choose αm = (1/2) log((1 − εm)/εm).
3. Update the distribution: Dm+1(j) = (Dm(j)/Zm) exp(−αm tj hm(xj)), where Zm is a normalizer.
4. Set m = m + 1 and repeat.

Output a final hypothesis: Hfinal(x) = sign(Σ_{m=1}^{T} αm hm(x))
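
The four steps above can be sketched with threshold stumps as the weak hypotheses (the 1-D toy data and the stump family H are made up for illustration):

```python
import numpy as np

# 1-D toy problem: the label pattern cannot be matched by any single stump.
x = np.arange(10.0)
t = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def stump(theta, s):
    """Weak hypothesis: h(x) = s if x > theta else -s."""
    return lambda x: np.where(x > theta, s, -s)

# Candidate weak hypotheses H: every threshold between points, both signs.
H = [stump(th, s) for th in np.arange(-0.5, 10) for s in (1, -1)]

D = np.ones(len(x)) / len(x)  # D_1(j) = 1/N
alphas, hyps = [], []
for m in range(10):
    # Step 1: weak hypothesis with the smallest weighted error on D_m
    errs = [np.sum(D[h(x) != t]) for h in H]
    h, eps = H[int(np.argmin(errs))], min(errs)
    # Step 2: classifier weight
    alpha = 0.5 * np.log((1 - eps) / eps)
    # Step 3: reweight the distribution; dividing by the sum is Z_m
    D = D * np.exp(-alpha * t * h(x))
    D /= D.sum()
    alphas.append(alpha)
    hyps.append(h)

# Final hypothesis: sign of the weighted vote
H_final = lambda z: np.sign(sum(a * h(z) for a, h in zip(alphas, hyps)))
print(np.mean(H_final(x) == t))  # ensemble accuracy on the training set
```

Each round, the misclassified points gain weight, so the next stump is forced to concentrate on them; the weighted vote of several stumps fits the interval-shaped labels that no single stump can.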
28. ### Boosting Example

Bishop (2006): snapshots of the Adaboost decision boundary on a 2-D data set after m = 1, 2, 3, 6, 10, and 150 rounds of boosting.
29. ### Adaboost on the Ionosphere data set

The training error of the final hypothesis is bounded by

err(H) ≤ Π_{m=1}^{T} 2 sqrt(εm (1 − εm)) ≤ exp(−2 Σ_m γm^2)

where γm = 1/2 − εm is the edge of the mth weak learner. The accompanying plot shows the error on the Ionosphere data set decreasing as the size of the ensemble grows.
30. ### Cross Validation

Divide the data into k disjoint sets (blocks). Then train on k − 1 of the blocks and test on the kth; repeating this with each block held out in turn produces k error estimates. The k-fold CV error is their average:

err = (1/k) Σ_{i=1}^{k} err_i

Refer to Demšar's JMLR (2006) article on comparing multiple classifiers for statistical significance.
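
A minimal sketch of how the k disjoint blocks are generated (the function name is illustrative):

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Distribute n samples into k blocks as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Each sample appears in exactly one test block, so averaging the k per-fold errors uses every sample for testing exactly once. In practice the data are shuffled before splitting.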
31. ### Figures of Merit

Confusion matrix:

|             | true + | true − |
|-------------|--------|--------|
| predicted + | TP     | FP     |
| predicted − | FN     | TN     |

Some commonly used figures of merit:

recall = TP / (TP + FN),  precision = TP / (TP + FP),  f-measure = (2 × precision × recall) / (precision + recall)

Receiver operating characteristic (ROC) curves are used to show the trade-off between the true positive and false positive rates.
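
The figures of merit follow directly from the four confusion-matrix counts; a small sketch with made-up counts:

```python
def metrics(tp, fp, fn, tn):
    """Recall, precision, and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn)     # fraction of actual positives recovered
    precision = tp / (tp + fp)  # fraction of positive predictions correct
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Example counts (invented): recall ~ 0.667, precision = 0.8, F ~ 0.727
print(metrics(tp=8, fp=2, fn=4, tn=6))
```

Note that neither metric uses TN, which is why precision/recall (unlike raw accuracy) remain informative on heavily imbalanced data.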