Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Machine Learning Lectures - Unsupervised Learning

Gregory Ditzler
February 24, 2024
7

Machine Learning Lectures - Unsupervised Learning

Gregory Ditzler

February 24, 2024
Tweet

Transcript

  1. Overview 1. Motivation 2. Parametric Estimation 3. Nonparametric Methods 4.

    Parzen Windows 5. k-NN Estimation 6. K-means Clustering 7. Expectation Maximization 2 / 69
  2. Topics for Today Reading Assignments: Duda et al.’s Pattern Classification,

    Chapters 3 & 4. Many of the topics discussed today fall under the umbrella of unsupervised learning. That is learning from unlabeled data. Topics • Parametric: maximum likelihood estimator, Bayesian estimation • Non–parametric: kernel density estimation, mixture models, generalized view of EM • Clustering: k–means, expectation maximization About the figures: many of the figures were collected from the Pattern Classification & PRML text, along with some figures generated by custom Matlab and Python scripts. Also, much of the content of the lecture is derived from previous years lectures of this course. 4 / 69
  3. What are we working on? • Many approaches make assumptions

    on the probability distribution over the data (training and/or testing). Unfortunately, we rarely have access to the probability distribution. • What did you assume about p(x|ω) for the na¨ ıve Bayes classifier? Why didn’t you assume a multinomial model or some other model? • On top of making assumptions on the probability distribution, what about the estimation of the parameters? • Are there sufficient data to estimate the parameters? • The curse of dimensionality is still a problem! x1 D = 1 x1 x2 D = 2 x1 x2 x3 D = 3 • What if we cannot assume the form of a probability distribution? 5 / 69
  4. Parameter Estimation What are parameters? Parameters are terms involved in

    the properties and computation of functions such as p(x) that govern the behavior of a probability distribution. For example, if X is distributed as a Gaussian random variable, then the probability distribution for X has parameters µ and σ2. The distribution is of the form, p(X = x) = 1 √ 2πσ e− (x−µ)2 2σ2 • The Bayes classifier required that you know P(ω), p(x|ω), and p(x) • p(x) can be computed using the total probability theorem • What if the distribution on ω and/or the x’s conditional distribution on ω is unknown? • We can assume the form of the distribution, but how do we know if it is correct or approximately correct? • One option is to use the Kolmorogrov-Smirnov test to test against a distribution • How much data are sufficient to estimate the parameters? • If we can (cannot) assume the form of the distribution, we use parametric (nonparametric) techniques 7 / 69
  5. Maximum-Likelihood Estimation Formulating the MLE Assume we have c data

    sets Dj for j = {1, . . . , c} (i.e., one for each class – p(x|ωj )). Dj is sampled from a probability distribution with parameters θj (e.g., for a Gaussian distribution θj = {µj , Σj }). Suppose a data set D has n samples drawn iid. The likelihood of the data for a set of parameters is, p(D|θ) = n k=1 p(xk |θ) ⇒ log p(D|θ) = n k=1 log p(xk |θ) • p(D|θ) is the likelihood of D given θ • The maximum-likelihood estimate of θ is given by the ˆ θ that satisfies, ˆ θ = arg max θ∈Θ {p(D|θ)} = arg max θ∈Θ {log p(D|θ)} = arg max θ∈Θ {l(θ)} • ˆ θ best supports the observed data 8 / 69
  6. Maximum-Likelihood Estimation Maximizing the log-likelihood, l(θ), provides equivalent results with

    less work. Finding the MLE, ˆ θ, is found by setting the gradient of l(θ) equal to zero. That is, ∇θ l(θ) = n k=1 ∇θ log p(xk |θ) Solving for ˆ θ A solution to ˆ θ is found by setting the gradient equal to zero ∇θ l(θ) = 0 • The solution to ˆ θ could be a global max, local min or max or (rarely) an inflection point, so you must check. ˆ θ is only an estimate! 9 / 69
  7. Gaussian Examples Example 1. Consider a Gaussian distribution with an

    unknown θ = µ parameter and known Σ. log p(xk |θ) = − 1 2 log (2π)d|Σ| − 1 2 (xk − θ)T Σ−1 (xk − θ) Then ˆ µ = ˆ θ = 1 n n k=1 xn (Refer to lecture notes for a proof) 11 / 69
  8. Gaussian Examples Example 2. Consider a Gaussian distribution with an

    unknown mean and covariance, θ = {µ, σ2} = {θ1 , θ2 }. For now we consider only a 1D variable. log p(xk |θ) = − 1 2 log {2πθ2 } − 1 2θ2 (xk − θ1 )2 Then θ1 = 1 n n k=1 xk , θ2 = 1 n n k=1 (xk − ¯ µ)2 (Refer to lecture notes for a proof) 12 / 69
  9. Bias & Variance of the Estimator • How good and

    reliable are these estimates? We assess the goodness and reliability through • Bias: How close is the estimate to the true value? E[ˆ θ] = θ? • Variance: How much would this estimate change, had we tried this again with a different dataset also drawn from the same distribution? E[σ2] = E 1 n n k=1 (xk − ¯ x)2 = E 1 n n k=1 ((xk − µ) − (¯ x − µ))2 = E 1 n n k=1 (x2 k − µ)2 − 2(¯ x − µ) 1 n n k=1 (xk − µ) + (¯ x − µ)2 = E 1 n n k=1 (xk − µ)2 − (¯ x − µ)2 = σ2 − E[(¯ x − µ)2] < σ2 13 / 69
  10. Bias & Variance of the Estimator • The covariance estimate

    is biased; however we can make it unbiased by using σ2 ∗ = n n−1 ¯ σ2. The sample mean is an unbiased estimator for µ. • As n → ∞, the estimator’s bias is negligible; however, it will always be biased. • One of the very important properties of the maximum likelihood estimate is that it is invariant to non-linear transformations • Other estimators exist for parameter estimation such as the minimum variance unbiased estimator (MVUE) 14 / 69
  11. Bayesian Estimation • θ for the MLE are considered to

    be fixed, but unknown. Bayesian Estimation (BE) assumes θ are random variables and instead of finding ˆ θ we find p(θ|D). • BE generally provides us with more information; however, computing θBE may be more involved than θMLE. • Bayesian estimation is not estimating θ, rather we are finding the distribution p(x|D) based on the observation of D • For most practical applications, if the assumptions are correct, and there is sufficient data MLE gives good results • MLE: Frequentist approach • BE: Bayesian approach • The MLE method found the estimator that maximized the log-likelihood function, p(D|θ). Why didn’t we use the a priori density p(θ|D)? Such methods that use p(θ|D) are referred to as Bayesian estimation techniques. 15 / 69
  12. Bayesian Estimation Bayesian estimation and the assumptions 1. The form

    of the density p(x|θ) is assumed to be known, but the value of the parameter vector θ is not known exactly. 2. Our initial knowledge about θ is contained in a known a priori density p(θ). 3. The rest of our knowledge about θ is contained in a set D of n samples x1 , . . . , xn drawn independently according to the unknown probability density p(x). According to Bayes theorem and independence of x we have, p(θ|D) = p(θ)p(D|θ) p(θ)p(D|θ)dθ , p(D|θ) = N k=1 p(xk |θ) Suppose that p(D|θ) reaches a sharp peak at θ = ˆ θ. If the prior density p(θ) is not zero at θ = ˆ θ and does not change much in the surrounding neighborhood, then p(θ|D) also peaks at that point. 16 / 69
  13. Are you a Bayesian or a Frequentist? The Frequentist (MLE)

    Mentality Choose the parameters that maximize the likelihood of the data being observed, θMLE = arg max θ∈Θ p(D|θ) The Frequentist (MLE) Mentality Choose the parameters that maximize the posterior probability of θ given the observed data θMAP = arg max θ∈Θ p(θ|D) 17 / 69
  14. Nonparameteric Methods • MLE and BE assumed that the data

    are sampled from a probability distribution with a known form • each probability distribution has a set of parameters to be estimated from the data • probability distributions: Gaussian, Binomial, geometric, exponential, Gibbs, Poisson, Rademacher, χ2, noncentral χ2, Laplace, etc. Problems with parametric techniques • Assume forms of distributions functions are known • Most known forms of distributions are unimodal • In most cases individual features are assumed to be independent Nonparametric methods Nonparametric methods for density estimation do not assume the data fit a parametric form (i.e., we can deal with arbitrary probability distributions) 19 / 69
  15. A Simple Density Estimator – the histogram! Partition the data

    into distinct bins with width ∆i. Count number of elements in the bins and the probability of observing a sample x in the ith bin is, pi = ni N∆i ∆ = 0.04 0 0.5 1 0 5 ∆ = 0.08 0 0.5 1 0 5 ∆ = 0.25 0 0.5 1 0 5 20 / 69
  16. Density Estimation Motivating Density Estimation • The basic ideas behind

    many density estimation methods are very simple. The most fundamental techniques rely on the fact that the probability P that a vector x will fall in a region R is given by P = R p(x′)dx′ • P is a smoothed or averaged version of the density function p(x) • we can estimate this smoothed value of p by estimating the probability P 21 / 69
  17. Density Estimation 6 4 2 0 2 4 6 6

    4 2 0 2 4 6 0 50 100 0 50 100 • By considering the data from each class separately we can compute the conditional probabilities, p(x|ω). • If we employ the assumption that all features in x are independent we can compute the conditional probabilities for each feature separately rather than trying to estimate the joint probability. That is, p(x|ω) = d k=1 p(x(k)|ω) 22 / 69
  18. Density Estimation • Assume that we are provided n examples

    and let k be the number of examples that fall in a region R. Then, R p(x′)dx′ ≃ p(x) ∗ V • where x is a point within R and V is the volume enclosed by R • The random variable defined by whether or not x falls in R follows a Binomial distribution. Then E[k] = nP. p(x) can be solved for yielding, p(x) = k/n V Questions about our density estimation methods Are there any problems with this formulation? If so, what are they? What happens as n → ∞? What happens if V → 0? Are these two results the same? 23 / 69
  19. Density Estimation • We can examine this problem of density

    estimation in one of two ways. First, we could fix V and let n grow to infinity, or fix n and let V approach zero. Lets examine what happens. • If we fix the volume V and take more and more training samples, the ratio k/n will converge (in probability), but we have only obtained an estimate of the space-averaged value of p(x). • If we want to obtain p(x) rather than just an averaged version of it, we must be prepared to let V approach zero. Thus we must fix n and let V approach zero. However, eventually p(x) ≃ 0 since V → 0. Thus, letting V → 0 is a useless result. • Since V → 0 is not an option, we must live with a finite sample estimation of p(x) that will contain some variance in the estimation. How can we choose V ? • It should be large enough to contain plenty of samples in R • It should be small enough to justify the assumption of p(x) be constant within the chosen V / R. 24 / 69
  20. Density Estimation Let pn (x) = (kn /n)/Vn. If pn

    (x) is to converge to p(x), 3 conditions are required: Case 1 This condition assures us that the space averaged P/V will converge to p(x), provided that the regions shrink uniformly. lim n→∞ Vn = 0 Case 2 This condition, which only makes sense if p(x) ̸= 0, assures us that the frequency ratio will converge (in probability) to the probability P. lim n→∞ kn = ∞ Case 3 This condition is clearly necessary if pn (x) is to converge at all. lim n→∞ kn n = 0 • There are two common ways of obtaining sequences of regions that satisfy these conditions 1. Shrink an initial region by specifying the volume Vn as some function of n (Parzen Window) 2. Specify kn as some function of n (Nearest Neighbor) 25 / 69
  21. Window vs. Neighbor Methods Determining the density of the square

    by shrinking Vn or growing kn Vn = 1 p n kn = p n n = 1 n = 4 n = 9 n = 100 V1 = 1 V4 = 1 2 V9 = 1 3 V100 = 1 10 k100 = 10 k9 = 3 k4 = 2 k1 = 1 26 / 69
  22. Parzen Windows What is a Parzen window? The Parzen-window approach

    to estimating densities can be introduced by temporarily assuming that the region Rn is a d-dimensional hypercube. The number of samples falling into Rn is obtained by a windowing function, hence the name Parzen windows. • Rn has volume hd n and ϕ(·) is a kernel function that counts the number of samples that fall in Rn • The graphic below shows kn = 3 (u) = ⇢ 1 |uj |  1 2 for j 2 [d] 0 otherwise (u) 1 2 1 2 1 2 (x) h 2 h 2 h 2 ✓ x xi h ◆ = 1 | {z } xi is in the hypercube kn = n X i=1 ✓ x xi hn ◆ , 28 / 69
  23. Parzen Windows Recall that, p(x) = k/n V , kn

    = n i=1 ϕ x − xi hn Combining these results, we have ˆ p(x) = 1 nV n i=1 ϕ x − xi hn • n: cardinality of the data • hn: width of the window function • x: point where we are to compute p(x) • V : volume of the hypercube • ϕ(·): kernel function indicating if an instance is in the hypercube • ˆ p(x): estimation of p(x) at x using Parzen window density estimators 29 / 69
  24. Changing the Kernel of the Density Estimator Current form of

    the density estimator ˆ p(x) = 1 nhd n n i=1 ϕ x − xi hn • Consider ϕ(·) as a general smoothing function, such that ϕ(x) ≥ 0 (nonnegativity), and ϕ(x)dx = 1 (normalization). The general expression for ˆ p(x) remains unchanged with these constraints. • ˆ p(x) can be viewed as a superposition of the ϕ(·)’s, which is a measurement of how far x is from xi, around the point x which we are trying to estimate. • xi ’s are the training data in D and ˆ p(x) is an interpolation of the contribution of each sample. The kernel determines the value of xi ’s “relatedness” to x. • Given the constraints above ϕ is a distribution function, ˆ p(x) converges in probability with n → ∞. 30 / 69
  25. Changing the Kernel of the Density Estimator What is a

    popular choice for ϕ(·)? ϕ(x) = 1 √ 2πσ e−(x−µ)2 2σ2 31 / 69
  26. Parzen Example: growing n −10 −8 −6 −4 −2 0

    2 4 6 8 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 −10 −8 −6 −4 −2 0 2 4 6 8 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 −10 −8 −6 −4 −2 0 2 4 6 8 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 −10 −8 −6 −4 −2 0 2 4 6 8 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 32 / 69
  27. Selecting the spread parameter (h) The Radial Basis Function Kernel

    (Gaussian Kernel) ϕ(x) = 1 √ 2πh e−(x−xi)2 2h2 • The spread, or kernel bandwidth, parameter has a large effect on the accuracy of the Parzen window density estimator. Consider the two extreme cases: • h is too small: If h is too small, the function ϕ becomes extremely focused at the point xi. In such situations the function ˆ p(x) may change erratically. In some cases, similar inputs do not give similar outputs. A small h is susceptible to noise in the data. • h is too large: If h is too large the we lose the finer details of ˆ p(x). For example, if ˆ p(·) is multi-modal then it is quite possible that one of the modes in the density function is “lost” in the estimation. 33 / 69
  28. Parzen Example: varying σ −10 −5 0 5 10 0

    0.2 0.4 0.6 0.8 σ = 0.01 −10 −5 0 5 10 0 0.1 0.2 0.3 0.4 0.5 σ = 0.1 −10 −5 0 5 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 σ = 0.5 −10 −5 0 5 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 σ = 1 −10 −5 0 5 10 0 0.05 0.1 0.15 0.2 0.25 σ = 2 −10 −5 0 5 10 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 σ = 10 35 / 69
  29. True distribution vs. a Parzen windows estimation h = 0.005

    0 0.5 1 0 5 h = 0.07 0 0.5 1 0 5 h = 0.2 0 0.5 1 0 5 36 / 69
  30. Parzen Example Note that regardless of the h parameter, as

    n approaches ∞ the results are the same. 37 / 69
  31. Product Kernel One simplification to estimate densities is by multiplying

    1D kernels (product kernels). ˆ pp (x) = 1 n n i=1 ϕ(x, xi , h1 , . . . , hd ) where ϕ(x, xi , h1 , . . . , hd ) = 1 h1 h2 · · · hd d j=1 ϕj x(j) − xi (j) hj Kernel Independence Kernel independence is assumed above – which does NOT imply feature independence, ˆ pp (x) = d j=1 1 nhj n i=1 ϕj x(j) − xi (j) hj 38 / 69
  32. Classification & Parzen Windows The Bayes Classifier The Bayes decision

    rule is implemented as follows, ω∗ = arg max ω∈Ω p(x|ω)P(ω) p(x) = arg max ω∈Ω p(x|ω)P(ω) • Use the data in D to estimate p(x|ω) for ω ∈ Ω using the Parzen window density estimator. Define a Bernoulli random variable zi,c that takes value 1 if xi belongs to ωc and 0 if it does not. The MLE for the prior probabilities (yes, you can use the MLE on probabilities – see DHS Ch. 2, exercise 3) is given by, P(ωc ) = 1 n n i=1 zi,c. 39 / 69
  33. Classification & Parzen Windows Pros & Cons • Advantages: Does

    not assume prior knowledge on p(x|ω) • Disadvantages: Need (lots of)n data to make sure that the estimate converges to the true distribution • More data are required to accurately estimate p(x|ω) as the dimensionality increases ⇒ Curse of Dimensionality! • Overfitting is an issue if the kernel bandwidth parameter is made too small! Training error may be low, but generalization error is poor! 40 / 69
  34. Parzen Classification A small h leads to boundaries that are

    more complicated than for large h on same data set 41 / 69
  35. k-Nearest Neighbor Estimation • What is the “best” windowing function

    to use and what should the parameters of the function be set to (e.g., σ, h) • A remedy to this problem is to let the volume be a function of the training data. That is fix the value of the k-nearest neighbors and determine the minimum volume that encloses the k samples. • If the density near x is high them volume will be small (small value for the kernel), and if the density near x is low, the volume will be large (large value for the kernel). Thus, k-NN is “automatically” determining the window size for the training data. k-Nearest Neighbor Estimation Algorithm 1. Select an initial volume around x to estimate p(x). 2. Grow the window until k samples fall in the region R. The samples in R are the nearest neighbors of x. 3. Estimate the density based on: ˆ p(x) = k/n V 43 / 69
  36. How to choose k? As k ∝ √ n the

    difference between the estimate and true distribution becomes arbitrarily small. Typically, for classification problems, we adjust k (or h for Parzen windows), until the classifier has the lowest error on a validation dataset 45 / 69
  37. Estimation of Posterior Probabilities The Bayes Classifier ω∗ = arg

    max ωi∈Ω p(x, ωi ) c j=1 p(x, ωj ) • The joint probability, p(x, ωi ), is estimated by placing a volume, V , around point x until k samples fall in the region. Of the k samples ki of them belong to class ωi. The obvious estimate for the joint probability is, pn (x, ωi ) = ki /n V , P(ωi |x) = p(x, ωi ) c j=1 p(x, ωj ) = ki k On your own: Derive the MLE for P(ωi ) and the k-nearest neighbors estimate of p(x) & p(x|ωi ) to show that P(ωi |x) = ki /k. 46 / 69
  38. Selecting nearest neighbors x6 x7 K = 1 0 1

    2 0 1 2 x6 x7 K = 3 0 1 2 0 1 2 x6 x7 K = 31 0 1 2 0 1 2 48 / 69
  39. Fisher’s Iris Data 3-NN Example 4 5 6 7 8

    9 1 2 3 4 5 3-Class classification 4 5 6 7 8 9 1 2 3 4 5 3-Class classification 3-NN implemented in the Python Scikit Learn Machine Learning package. The data set is Fisher’s Iris measurements. (left) distance weight nearest neighbor algorithm, (right) standard k-NN algorithm 49 / 69
  40. k-NN Summary The Lazy Learner • The k-NN is a

    known as a lazy learning because there is no “learning” implemented after the training data set. There are no processing steps required before making a prediction other than receiving the training data. • Since there is no learning implemented in the k-NN and it is based solely on the training data, k-NN is also called memory based learning, or instance based learning. • Computational cost arises from the testing phase and algorithms require larger memory requirements. 51 / 69
  41. K-means Clustering A Motivation for Unsupervised Learning Throughout most of

    the semester nearly all the algorithms studied assumed that data are labeled. However, supposed we only have access to D = {x1 , . . . , xN } and we need to group them into K sets, or clusters. Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster. • Clusters, k = 1, . . . , K are associated with a prototype vector, µk . • Our goal is to find the assignment of data samples to clusters such that sum of the squares of the distances of each data point to its closest vector µk , is a minimum. Distortion Minimization min J = min N n=1 K k=1 rnk ∥xn − µk ∥2 • rnk ∈ {0, 1} are binary variables corresponding to whether or not sample n is in cluster k. Another word to describe this variable is a latent variable. • Goal: find rnk and µk so as to minimize the distortion. 53 / 69
  42. Two-Step Iterative Algorithm to Find rnk and µk General Algorithm

    for Finding rnk and µk Initialize µk (e.g., select K instances at random) 1. Minimize J with respect to the rnk, keeping the µk fixed 2. Minimize J with respect to the µk , keeping rnk fixed (repeat until convergence) • These two steps updating rnk and updating µk correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm. 54 / 69
  43. K-means derivation Step #1 J is a linear function of

    rnk, this optimization can be performed easily get a closed form solution. The terms involving different n are independent and so we can optimize for each n separately by choosing rnk to be 1 for whichever value of k gives the minimum value of ∥xn − µj ∥2. That is, rnk = 1 if arg minj ∥xn − µj ∥2 0 otherwise. 55 / 69
  44. K-means derivation Step #2 Now optimize J w.r.t. µk with

    the rnk held fixed. The objective function J is a quadratic function of µk , and it can be minimized by setting its derivative with respect to µk to zero giving, 2 N n=1 rnk (xn − µk ) = 0 ⇒ µk = N n=1 rnk xn N n=1 rnk 56 / 69
  45. K-means applied to the Old Faithful data set (a) −2

    0 2 −2 0 2 (b) −2 0 2 −2 0 2 (c) −2 0 2 −2 0 2 (d) −2 0 2 −2 0 2 (e) −2 0 2 −2 0 2 (f) −2 0 2 −2 0 2 (g) −2 0 2 −2 0 2 (h) −2 0 2 −2 0 2 (i) −2 0 2 −2 0 2 57 / 69
  46. K-means image segmentation K = 2 K = 3 K

    = 10 Original image 59 / 69
  47. Mixtures of Gaussians (GMMs) • Many times distributions contain several

    modes and we cannot identify a parametric distribution to associate them with. • GMMs are a superposition of Gaussians with mixing coefficients that control the “contribution” of each distribution. x p(x) Gaussian Mixture Distribution p(x) = K k=1 πk N(x|µk , Σk ) where N(·|µk , Σk ) is a multivariate Gaussian density function with mean µk and covariance Σk, and πk are the mixing coefficients. {πk } form a probability mass function. 61 / 69
  48. Latent Variables & EM Consider a K-dimensional binary vector, z

    ∈ {0, 1}K , which a particular element zk is equal to 1 and all other elements equal to 0. There are K possible outcomes for this binary vector. Let πk = p(zk = 1) (i.e., the mixing coefficient), and recall that πk is a pmf (i.e., πk ∈ [0, 1] and πk = 1). The probability of z is, p(z) = p(z1 , . . . , zk ) = K k=1 πzk k Similarly, the conditional distribution of x given a particular value for z is a Gaussian – p(x|zk = 1) = N(x|µk , Σk ). Then, p(x|z) = p(x|z1 , . . . , zk ) = K k=1 N(x|µk , Σk )zk Marginalizing p(x) yeilds p(x) = z∈Z p(x|z)p(z) = K k=1 πk N(x|µk , Σk ) Well isn’t that interesting! The marginal on x can be determined using a set of latent variables. Generally it is easier to work with p(x, z) over p(x). 62 / 69
  49. Latent Variables & EM Another quantity that will play an

    important role is the conditional probability of z given x. We shall use γ(zk ) to denote p(zk = 1|x), whose value can be found using Bayes’ theorem, γ(zk ) ≡ p(zk = 1|x) = p(zk = 1)p(x|zk = 1) p(x) = πk N(x|µk , Σk ) K k=1 πk N(x|µj , Σj ) • γ(zk ) is a measurement of the responsibility that mixture k claims for the observation x. Why did we just introduce this confusing notation? The latent variables will make our lives easier; however, at the moment life does not seem any easier. Next, let maximize the log-likelihood function for a data set. log p(D|{πk }, {µk }, {Σk }) = N n=1 log K k=1 πk N(xn |µk , Σk ) 63 / 69
  50. Latent Variables & EM Maximize w.r.t. µk Computing the derivative

    with respect to µk and setting the derivative equal to zero yields, 0 = − N n=1 πk N(xn |µk , Σk ) K j=1 πj N(xn |µj , Σj ) γ(znk) Σk (xn − µk ) Assumption Σ−1 k is nonsingular! (Work this out own your own, or ask me after class for a full derivation) Solving for µk µk = 1 Nk N n=1 γ(znk )xn , Nk = N n=1 γ(znk ) 64 / 69
  51. Latent Variables & EM We still need to maximize the

    log-likelihood function w.r.t. Σk and πk. Maximizing w.r.t. Σk is straight forward; however, we must be careful with the maximization w.r.t. πk because the solution set is constrained. We must use Lagrange multipliers to find the solution. After all this we find that, Σk = 1 Nk N n=1 γ(znk ) (xn − µk ) (xn − µk )T and πk = Nk N 65 / 69
  52. EM for Gaussian Mixtures • Initialize the means µk ,

    covariances Σk and mixing coefficients πk and evaluate the initial log likelihood function. E Step Evaluate the responsibilities using the current parameter values γ(znk ) = πk N(xn |µk , Σk ) K j=1 πj N(xn |µj , Σj ) M Step Re-estimate the parameters using the current responsibilities µnew k = 1 Nk N n=1 γ(znk )xn , Σnew k = 1 Nk N n=1 γ(znk ) (xn − µnew k ) (xn − µnew k )T and πnew k = Nk /N where Nk = N n=1 γ(znk ). Repeat until convergence in the log likelihood. 66 / 69
  53. EM applied to the Old Faithful data set (a) −2

    0 2 −2 0 2 (b) −2 0 2 −2 0 2 (c) L = 1 −2 0 2 −2 0 2 (d) L = 2 −2 0 2 −2 0 2 (e) L = 5 −2 0 2 −2 0 2 (f) L = 20 −2 0 2 −2 0 2 67 / 69