Gregory Ditzler
July 12, 2013
# Unsupervised Learning Lecture Notes

Slides from a pattern recognition lecture I gave in the fall of 2012.

## Transcript

1. ### Density Estimation, K-means and Expectation Maximization

Gregory Ditzler, Dept. of Electrical & Computer Engineering, Drexel University, Philadelphia, PA 19104. gregory.ditzler@gmail.com. March 14, 2013.

Gregory Ditzler Pattern Recognition & Machine Learning
2. ### Topics for Today

Reading assignment: Duda et al.'s Pattern Classification, Chapters 3 & 4.

Many of the topics discussed today fall under the umbrella of unsupervised learning, that is, learning from unlabeled data.

Topics:

- Parametric: maximum likelihood estimation, Bayesian estimation
- Non-parametric: kernel density estimation, mixture models, generalized view of EM
- Clustering: k-means, expectation maximization

About the figures: many of the figures were collected from the Pattern Classification and PRML texts, along with some figures generated by custom Matlab and Python scripts. Much of the content of the lecture is also derived from previous years' lectures of this course.
3. ### What are we working on?

Many approaches make assumptions about the probability distribution over the data (training and/or testing). Unfortunately, we rarely have access to that distribution.

- What did you assume about $p(x|\omega)$ for the naïve Bayes classifier? Why didn't you assume a multinomial model or some other model?
- On top of the assumptions about the form of the distribution, what about the estimation of its parameters? Are there sufficient data to estimate them?
- The curse of dimensionality is still a problem!

[Figure: panels for D = 1 (axis x1), D = 2 (axes x1, x2), and D = 3 (axes x1, x2, x3).]

What if we cannot assume the form of a probability distribution?
4. ### Parameter Estimation

What are parameters? Parameters are the terms in a function such as $p(x)$ that govern the behavior of a probability distribution. For example, if $X$ is a Gaussian random variable, its distribution has parameters $\mu$ and $\sigma^2$ and is of the form

$$p(X = x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

- The Bayes classifier requires that you know $P(\omega)$, $p(x|\omega)$, and $p(x)$; $p(x)$ can be computed using the total probability theorem.
- What if the distribution on $\omega$ and/or the conditional distribution of $x$ given $\omega$ is unknown?
- We can assume the form of the distribution, but how do we know if it is correct or approximately correct? One option is to use the Kolmogorov-Smirnov test to test against a distribution.
- How much data are sufficient to estimate the parameters?
- If we can assume the form of the distribution, we use parametric techniques; if we cannot, we use nonparametric techniques.
5. ### Maximum-Likelihood Estimation

Formulating the MLE

Assume we have $c$ data sets $D_j$ for $j = 1, \ldots, c$ (i.e., one for each class, corresponding to $p(x|\omega_j)$). $D_j$ is sampled from a probability distribution with parameters $\theta_j$ (e.g., for a Gaussian distribution $\theta_j = \{\mu_j, \Sigma_j\}$).

Suppose a data set $D$ has $n$ samples drawn iid. The likelihood of the data for a set of parameters is

$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta) \;\Rightarrow\; \log p(D|\theta) = \sum_{k=1}^{n} \log p(x_k|\theta)$$

where $p(D|\theta)$ is the likelihood of $D$ given $\theta$. The maximum-likelihood estimate of $\theta$ is the $\hat{\theta}$ that satisfies

$$\hat{\theta} = \arg\max_{\theta\in\Theta} p(D|\theta) = \arg\max_{\theta\in\Theta} \log p(D|\theta) = \arg\max_{\theta\in\Theta} l(\theta)$$

$\hat{\theta}$ is the parameter value that best supports the observed data.
6. ### Maximum-Likelihood Estimation

Maximizing the log-likelihood, $l(\theta)$, provides equivalent results with less work. The MLE $\hat{\theta}$ is found from the gradient of $l(\theta)$,

$$\nabla_\theta l(\theta) = \sum_{k=1}^{n} \nabla_\theta \log p(x_k|\theta)$$

Solving for $\hat{\theta}$

A candidate for $\hat{\theta}$ is found by setting the gradient equal to zero,

$$\nabla_\theta l(\theta) = 0$$

The solution could be a global maximum, a local minimum or maximum, or (rarely) an inflection point, so you must check. $\hat{\theta}$ is only an estimate!
8. ### Gaussian Examples 1

Consider a Gaussian distribution with an unknown mean, $\theta = \mu$, and a known covariance $\Sigma$.

$$\log p(x_k|\theta) = -\frac{1}{2} \log\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2} (x_k - \theta)^T \Sigma^{-1} (x_k - \theta)$$

Then

$$\hat{\mu} = \hat{\theta} = \frac{1}{n} \sum_{k=1}^{n} x_k$$

(Refer to lecture notes for a proof)
9. ### Gaussian Examples 2

Consider a Gaussian distribution with an unknown mean and variance, $\theta = \{\mu, \sigma^2\} = \{\theta_1, \theta_2\}$. For now we consider only a 1D variable.

$$\log p(x_k|\theta) = -\frac{1}{2} \log(2\pi\theta_2) - \frac{1}{2\theta_2} (x_k - \theta_1)^2$$

Then

$$\hat{\theta}_1 = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\theta}_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\theta}_1)^2$$

(Refer to lecture notes for a proof)
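The two closed-form estimates above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `gaussian_mle` and the synthetic data (true mean 2.0, true standard deviation 1.5) are illustrative assumptions, not part of the lecture.

```python
import random

def gaussian_mle(samples):
    """Return the ML estimates (mean, variance) for a 1-D Gaussian."""
    n = len(samples)
    mu_hat = sum(samples) / n
    # The ML variance divides by n (not n - 1).
    var_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, var_hat

random.seed(0)
# Hypothetical data: 5000 draws from a Gaussian with mean 2.0 and std 1.5.
data = [random.gauss(2.0, 1.5) for _ in range(5000)]
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)  # should be close to 2.0 and 1.5**2 = 2.25
```

With a few thousand samples both estimates land close to the values used to generate the data, which is all this sketch is meant to show.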
10. ### Bias & Variance of the Estimator

How good and reliable are these estimates? We assess goodness and reliability through:

- Bias: How close is the estimate to the true value? Is $E[\hat{\theta}] = \theta$?
- Variance: How much would this estimate change had we tried this again with a different data set drawn from the same distribution?
11. ### Bias & Variance of the Estimator

- The sample mean is an unbiased estimator for $\mu$.
- The ML variance (covariance) estimate $\hat{\sigma}^2$ is biased; however, we can make it unbiased by using $\sigma^2_* = \frac{n}{n-1} \hat{\sigma}^2$.
- As $n \to \infty$ the bias of the ML variance estimate becomes negligible; for any finite $n$, however, it is still biased.
- One very important property of the maximum likelihood estimate is that it is invariant under transformations of the parameter, including non-linear ones: if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$.
- Other estimators exist for parameter estimation, such as the minimum variance unbiased estimator (MVUE).
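The bias of the ML variance estimate and the effect of the $n/(n-1)$ correction can be verified exactly on a tiny example. The sketch below assumes a hypothetical population of two equally likely values, 0 and 1 (true variance 0.25), and enumerates every possible sample of size 3 to compute the expectation of the estimator; the population and the helper name `ml_variance` are chosen only for illustration.

```python
from itertools import product

def ml_variance(sample):
    """ML variance estimate: average squared deviation from the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

population = [0.0, 1.0]  # assumed fair-coin population, true variance 0.25
n = 3
# All 2**n samples of size n are equally likely, so a plain average
# over them is the exact expectation of the estimator.
samples = list(product(population, repeat=n))
expected_ml = sum(ml_variance(s) for s in samples) / len(samples)
expected_corrected = n / (n - 1) * expected_ml
print(expected_ml, expected_corrected)
```

The expectation of the ML estimate comes out as $\frac{n-1}{n}\sigma^2 = 1/6$, and multiplying by $n/(n-1)$ recovers the true variance of 0.25.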
12. ### Bayesian Estimation

- The parameters $\theta$ for the MLE are considered to be fixed, but unknown.
- Bayesian Estimation (BE) assumes $\theta$ are random variables, and instead of finding $\hat{\theta}$ we find $p(\theta|D)$.
- BE generally provides us with more information; however, computing $\theta_{BE}$ may be more involved than $\theta_{MLE}$.
- Bayesian estimation is not estimating $\theta$; rather, we are finding the distribution $p(x|D)$ based on the observation of $D$.
- For most practical applications, if the assumptions are correct and there is sufficient data, MLE gives good results.
- MLE: Frequentist approach. BE: Bayesian approach.

The MLE method found the estimator that maximized the likelihood function $p(D|\theta)$ (equivalently, its logarithm). Why didn't we use the a posteriori density $p(\theta|D)$? Methods that use $p(\theta|D)$ are referred to as Bayesian estimation techniques.
13. ### Bayesian Estimation

Bayesian estimation and the assumptions

1. The form of the density $p(x|\theta)$ is assumed to be known, but the value of the parameter vector $\theta$ is not known exactly.
2. Our initial knowledge about $\theta$ is assumed to be contained in a known a priori density $p(\theta)$.
3. The rest of our knowledge about $\theta$ is contained in a set $D$ of $n$ samples $x_1, \ldots, x_n$ drawn independently according to the unknown probability density $p(x)$.

According to Bayes' theorem and the independence of the samples we have

$$p(\theta|D) = \frac{p(\theta)\,p(D|\theta)}{\int p(\theta)\,p(D|\theta)\,d\theta}, \qquad p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$

Suppose that $p(D|\theta)$ reaches a sharp peak at $\theta = \hat{\theta}$. If the prior density $p(\theta)$ is not zero at $\theta = \hat{\theta}$ and does not change much in the surrounding neighborhood, then $p(\theta|D)$ also peaks at that point.
14. ### Are you a Bayesian or a Frequentist?

The Frequentist (MLE) Mentality

Choose the parameters that maximize the likelihood of the data being observed,

$$\theta_{MLE} = \arg\max_{\theta\in\Theta} p(D|\theta)$$

The Bayesian (MAP) Mentality

Choose the parameters that maximize the posterior probability of $\theta$ given the observed data,

$$\theta_{MAP} = \arg\max_{\theta\in\Theta} p(\theta|D)$$
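To make the contrast concrete, here is a small sketch comparing the two estimates for the mean of a 1-D Gaussian. It assumes a known variance `sigma2` and a Gaussian prior with mean `mu0` and variance `sigma0_2` on the unknown mean; these modelling choices and the function names are assumptions made for the example, not something specified in the slides. Under those assumptions the posterior is Gaussian, so its maximum has the closed form used below.

```python
def mle_mean(samples):
    """ML estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)

def map_mean(samples, sigma2, mu0, sigma0_2):
    """MAP estimate of a Gaussian mean with known variance sigma2,
    assuming a Gaussian prior N(mu0, sigma0_2) on the mean."""
    n = len(samples)
    return (sigma0_2 * sum(samples) + sigma2 * mu0) / (n * sigma0_2 + sigma2)

data = [1.0, 2.0, 3.0]
theta_mle = mle_mean(data)
theta_map = map_mean(data, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
print(theta_mle, theta_map)
```

With only three samples the prior pulls the MAP estimate toward `mu0`; as the number of samples grows, the two estimates approach each other.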
15. ### Nonparametric Methods

MLE and BE assumed that:

- the data are sampled from a probability distribution with a known form;
- each probability distribution has a set of parameters to be estimated from the data;
- probability distributions: Gaussian, Binomial, geometric, exponential, Gibbs, Poisson, Rademacher, $\chi^2$, noncentral $\chi^2$, Laplace, etc.

Problems with parametric techniques

- They assume the forms of the distribution functions are known.
- Most known forms of distributions are unimodal.
- In most cases individual features are assumed to be independent.

Nonparametric methods

Nonparametric methods for density estimation do not assume the data fit a parametric form (i.e., we can deal with arbitrary probability distributions).
16. ### A Simple Density Estimator: the histogram!

Partition the data into distinct bins with width $\Delta_i$. Count the number of elements $n_i$ in each bin; with $N$ total samples, the density estimate for a sample $x$ falling in the $i$th bin is

$$p_i = \frac{n_i}{N \Delta_i}$$

[Figure: histogram density estimates for bin widths Δ = 0.04, 0.08, and 0.25.]
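The histogram estimator is short enough to write out directly. The sketch below assumes equal-width bins on an interval `[lo, hi]` that contains all of the samples; the function name and the toy data are hypothetical.

```python
def histogram_density(samples, lo, hi, num_bins):
    """Return p_i = n_i / (N * width) for equal-width bins on [lo, hi].

    Assumes every sample lies inside [lo, hi].
    """
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in samples:
        # Clamp so that x == hi falls in the last bin.
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    n = len(samples)
    return [c / (n * width) for c in counts]

samples = [0.1, 0.2, 0.6, 0.9]
density = histogram_density(samples, lo=0.0, hi=1.0, num_bins=2)
print(density)
```

Because each $p_i$ is divided by the bin width, the estimate integrates to one: summing `p_i * width` over the bins gives 1.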
17. ### Density Estimation

Motivating Density Estimation

The basic ideas behind many density estimation methods are very simple. The most fundamental techniques rely on the fact that the probability $P$ that a vector $x$ will fall in a region $R$ is given by

$$P = \int_R p(x')\,dx'$$

$P$ is a smoothed or averaged version of the density function $p(x)$, and we can estimate this smoothed value of $p$ by estimating the probability $P$.

By considering the data from each class separately we can compute the conditional probabilities $p(x|\omega)$. If we employ the assumption that all features in $x$ are independent, we can compute the conditional probabilities for each feature separately rather than trying to estimate the joint probability. That is,

$$p(x|\omega) = \prod_{k=1}^{d} p(x(k)|\omega)$$
18. ### Density Estimation

Assume that we are provided $n$ examples and let $k$ be the number of examples that fall in a region $R$. If $p$ is approximately constant over $R$, then

$$\int_R p(x')\,dx' \approx p(x)\,V$$

where $x$ is a point within $R$ and $V$ is the volume enclosed by $R$. Whether or not each sample falls in $R$ is a Bernoulli trial, so $k$ follows a Binomial distribution and $E[k] = nP$. Solving for $p(x)$ yields

$$p(x) \approx \frac{k/n}{V}$$

Questions about our density estimation methods

- Are there any problems with this formulation? If so, what are they?
- What happens as $n \to \infty$? What happens if $V \to 0$? Are these two results the same?
19. ### Density Estimation

We can examine this problem of density estimation in one of two ways: fix $V$ and let $n$ grow to infinity, or fix $n$ and let $V$ approach zero. Let's examine what happens.

- If we fix the volume $V$ and take more and more training samples, the ratio $k/n$ will converge (in probability), but we have only obtained an estimate of the space-averaged value of $p(x)$.
- If we want to obtain $p(x)$ rather than just an averaged version of it, we must be prepared to let $V$ approach zero, with $n$ fixed. However, the region eventually contains no samples, so the estimate becomes $p(x) \approx 0$. Letting $V \to 0$ therefore gives a useless result.

Since $V \to 0$ is not an option, we must live with a finite-sample estimate of $p(x)$ that will contain some variance.

How can we choose $V$?

- It should be large enough to contain plenty of samples in $R$.
- It should be small enough to justify the assumption that $p(x)$ is constant within the chosen region $R$ of volume $V$.
20. ### Density Estimation

Let $p_n(x) = (k_n/n)/V_n$. If $p_n(x)$ is to converge to $p(x)$, three conditions are required:

- Condition 1: $\lim_{n\to\infty} V_n = 0$. This condition assures us that the space-averaged $P/V$ will converge to $p(x)$, provided that the regions shrink uniformly.
- Condition 2: $\lim_{n\to\infty} k_n = \infty$. This condition, which only makes sense if $p(x) \neq 0$, assures us that the frequency ratio will converge (in probability) to the probability $P$.
- Condition 3: $\lim_{n\to\infty} k_n/n = 0$. This condition is clearly necessary if $p_n(x)$ is to converge at all.

There are two common ways of obtaining sequences of regions that satisfy these conditions:

1. Shrink an initial region by specifying the volume $V_n$ as some function of $n$ (Parzen Window).
2. Specify $k_n$ as some function of $n$ (Nearest Neighbor).
21. ### Window vs. Neighbor Methods

Determining the density at the square by shrinking $V_n$ or growing $k_n$:

| $n$ | 1 | 4 | 9 | 100 |
|---|---|---|---|---|
| $V_n = 1/\sqrt{n}$ | 1 | 1/2 | 1/3 | 1/10 |
| $k_n = \sqrt{n}$ | 1 | 2 | 3 | 10 |
22. ### Parzen Windows

What is a Parzen window?

The Parzen-window approach to estimating densities can be introduced by temporarily assuming that the region $R_n$ is a $d$-dimensional hypercube. The number of samples falling into $R_n$ is obtained by a windowing function, hence the name Parzen windows. $R_n$ has volume $h_n^d$, and $\phi(\cdot)$ is a kernel function used to count the number of samples that fall in $R_n$:

$$\phi(u) = \begin{cases} 1 & |u_j| \leq \frac{1}{2} \text{ for } j \in [d] \\ 0 & \text{otherwise} \end{cases}$$

so that $\phi\left(\frac{x - x_i}{h}\right) = 1$ exactly when $x_i$ is in the hypercube of side $h$ centered at $x$, and

$$k_n = \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$

[Figure: the unit window $\phi(u)$ with half-width 1/2 and the scaled window with half-width h/2 centered at x; the example shows $k_n = 3$.]
23. ### Parzen Windows

Recall that

$$p(x) = \frac{k/n}{V}, \qquad k_n = \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$

Combining these results, we have

$$\hat{p}(x) = \frac{1}{nV} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$

- $n$: cardinality of the data
- $h_n$: width of the window function
- $x$: point where we are to compute $p(x)$
- $V$: volume of the hypercube
- $\phi(\cdot)$: kernel function indicating if an instance is in the hypercube
- $\hat{p}(x)$: estimate of $p(x)$ at $x$ using the Parzen window density estimator
24. ### Changing the Kernel of the Density Estimator

Current form of the density estimator:

$$\hat{p}(x) = \frac{1}{n h_n^d} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h_n}\right)$$

- Consider $\phi(\cdot)$ as a general smoothing function such that $\phi(x) \geq 0$ (nonnegativity) and $\int \phi(x)\,dx = 1$ (normalization). The general expression for $\hat{p}(x)$ remains unchanged with these constraints.
- $\hat{p}(x)$ can be viewed as a superposition of kernels $\phi(\cdot)$, each measuring how far a sample $x_i$ is from the point $x$ at which we are estimating the density.
- The $x_i$ are the training data in $D$, and $\hat{p}(x)$ is an interpolation of the contribution of each sample. The kernel determines the "relatedness" of $x_i$ to $x$.
- Given the constraints above, $\phi$ is itself a density function, and $\hat{p}(x)$ converges in probability as $n \to \infty$.

What is a popular choice for $\phi(\cdot)$? The Gaussian:

$$\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
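The Parzen window estimator can be sketched in a few lines for the one-dimensional case ($d = 1$, so $V = h$). The code below implements both the hypercube kernel and a standard Gaussian kernel; the function names and the three-point data set are hypothetical and exist only to illustrate the formula.

```python
import math

def hypercube_kernel(u):
    """Unit window: 1 if |u| <= 1/2, else 0 (one-dimensional case)."""
    return 1.0 if abs(u) <= 0.5 else 0.0

def gaussian_kernel(u):
    """Standard Gaussian density, a smooth alternative to the unit window."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_estimate(x, samples, h, kernel):
    """p_hat(x) = 1 / (n * h) * sum_i kernel((x - x_i) / h), for d = 1."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

samples = [0.0, 1.0, 2.0]
p_box = parzen_estimate(1.0, samples, h=1.0, kernel=hypercube_kernel)
p_gauss = parzen_estimate(1.0, samples, h=1.0, kernel=gaussian_kernel)
print(p_box, p_gauss)
```

With the unit window only the sample at 1.0 lies inside the window around `x = 1.0`, so the estimate is $1/(3 \cdot 1)$; the Gaussian kernel instead lets every sample contribute with a weight that decays with distance.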
25. ### Parzen Example: growing n

[Figure: four panels of Parzen window density estimates as n grows; horizontal axis from −10 to 10, vertical axis from 0 to 0.35.]
26. ### Selecting the spread parameter (h)

The Radial Basis Function Kernel (Gaussian Kernel)

$$\phi(x) = \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$

The spread, or kernel bandwidth, parameter has a large effect on the accuracy of the Parzen window density estimator. Consider the two extreme cases:

- $h$ is too small: the function $\phi$ becomes extremely focused at the point $x_i$. In such situations $\hat{p}(x)$ may change erratically, and in some cases similar inputs do not give similar outputs. A small $h$ is susceptible to noise in the data.
- $h$ is too large: we lose the finer details of $\hat{p}(x)$. For example, if $\hat{p}(\cdot)$ is multi-modal then it is quite possible that one of the modes in the density function is "lost" in the estimation.
27. ### The Gaussian Kernel (Duda, Hart, Stork)

[Figure]
28. ### Parzen Example: varying σ

[Figure: six panels of Parzen window density estimates for σ = 0.01, 0.1, 0.5, 1, 2, and 10; horizontal axis from −10 to 10.]
29. ### True distribution vs. a Parzen windows estimation (Bishop, PRML)

[Figure: three panels comparing the true density with Parzen window estimates for h = 0.005, 0.07, and 0.2.]
30. ### Parzen Example (Duda, Hart, Stork)

Note that regardless of the $h$ parameter, as $n$ approaches $\infty$ the results are the same.
31. ### How do we choose h?

Dr. P's suggestions from 2009:

- Trial and error: try a number of different $h$ values and decide on the one that appears to give the best looking density. This is obviously very subjective and impractical in high-dimensional problems.
- For a known Gaussian distribution, $h_{opt}$ can be obtained as the value of $h$ that minimizes the error between the actual density and the estimated density. This gives
$$h_{opt} = 1.06\,\sigma\,n^{-1/5}$$
where $\sigma$ is the sample standard deviation and $n$ is the size of the sample.
- If the distribution is in fact non-Gaussian but still unimodal, this estimate may be good enough (Silverman '86):
$$h_{opt} = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}} \sigma\, n^{-\frac{1}{d+4}}$$
- For multi-modal densities in one dimension, the above estimate is empirically updated to give a decent estimate:
$$h_{opt} = 0.9\, n^{-1/5} \cdot \min\left(\sigma, \frac{IQR}{1.34}\right)$$
where IQR is the interquartile range (fourth spread, the difference between the 25th and 75th percentiles of the data); IQR/1.34 is a robust estimate of the standard deviation.
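The one-dimensional rules of thumb above translate directly into code. This is a minimal sketch using Python's `statistics` module (`statistics.quantiles` requires Python 3.8 or later); the function names are hypothetical, and the quartiles use the module's default interpolation method, which is one of several conventions for computing percentiles.

```python
import statistics

def gaussian_reference_bandwidth(samples):
    """h = 1.06 * sigma * n^(-1/5), with sigma the sample standard deviation."""
    n = len(samples)
    return 1.06 * statistics.stdev(samples) * n ** (-1 / 5)

def robust_bandwidth(samples):
    """h = 0.9 * min(sigma, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    sigma = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartile cut points
    iqr = q3 - q1
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

# Hypothetical data: the integers 0..31 as floats.
data = [float(i) for i in range(32)]
h_gauss = gaussian_reference_bandwidth(data)
h_robust = robust_bandwidth(data)
print(h_gauss, h_robust)
```

The robust rule can never exceed the Gaussian-reference rule, since it uses a smaller constant and the smaller of the two spread estimates.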
32. ### Product Kernel

One simplification in estimating multi-dimensional densities is to multiply one-dimensional kernels, called product kernels:

$$\hat{p}_p(x) = \frac{1}{n} \sum_{i=1}^{n} \phi(x, x_i, h_1, \ldots, h_d)$$

where

$$\phi(x, x_i, h_1, \ldots, h_d) = \frac{1}{h_1 h_2 \cdots h_d} \prod_{j=1}^{d} \phi_j\left(\frac{x(j) - x_i(j)}{h_j}\right)$$

Kernel Independence

Kernel independence is assumed above, which does NOT imply feature independence. Feature independence would instead give

$$\hat{p}_p(x) = \prod_{j=1}^{d} \left[\frac{1}{n h_j} \sum_{i=1}^{n} \phi_j\left(\frac{x(j) - x_i(j)}{h_j}\right)\right]$$
33. ### Classification & Parzen Windows

The Bayes Classifier

The Bayes decision rule is implemented as follows:

$$\omega^* = \arg\max_{\omega\in\Omega} \frac{p(x|\omega)P(\omega)}{p(x)} = \arg\max_{\omega\in\Omega} p(x|\omega)P(\omega)$$

Use the data in $D$ to estimate $p(x|\omega)$ for $\omega \in \Omega$ using the Parzen window density estimator. Define a Bernoulli random variable $z_{i,c}$ that takes value 1 if $x_i$ belongs to $\omega_c$ and 0 if it does not. The MLE for the prior probabilities (yes, you can use the MLE on probabilities; see DHS Ch. 2, exercise 3) is given by

$$P(\omega_c) = \frac{1}{n} \sum_{i=1}^{n} z_{i,c}$$

Pros & Cons

- Advantages: does not assume prior knowledge of $p(x|\omega)$.
- Disadvantages:
  - Lots of data are needed to make sure that the estimate converges to the true distribution.
  - More data are required to accurately estimate $p(x|\omega)$ as the dimensionality increases: the curse of dimensionality!
  - Overfitting is an issue if the kernel bandwidth parameter is made too small! Training error may be low, but generalization error is poor!
34. ### Parzen Classification (Duda, Hart, Stork)

A small $h$ leads to boundaries that are more complicated than for a large $h$ on the same data set.
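35. ### Probabilistic Neural Network (PNN)

PNN Training Algorithm

1. Initialize: $j = 0$, $n$ = number of patterns.
2. Repeat until $j = n$:
   - $j \leftarrow j + 1$
   - normalize: $x_{jk} \leftarrow x_{jk} \Big/ \left(\sum_{i=1}^{d} x_{ji}^2\right)^{1/2}$
   - train: $w_{jk} = x_{jk}$, $k \in [d]$
   - if $x \in \omega_i$ then $a_{ji} \leftarrow 1$

PNN Testing Algorithm

1. Initialize: $j = 0$, $x$ = test pattern.
2. Repeat until $j = n$:
   - $j \leftarrow j + 1$
   - $z_j \leftarrow w_j^T x$
   - if $a_{ji} = 1$ then $g_i \leftarrow g_i + \exp\left(\frac{z_j - 1}{\sigma^2}\right)$

PNN Decision Rule

$$\omega^* = \arg\max_{i\in[c]} g_i(x)$$

[Figure: PNN architecture. The inputs $x_1, \ldots, x_d$ are fully connected (weights $w_{jk}$) to the pattern units, which are sparsely connected ($a_{ji}$) to the category units $\omega_1, \ldots, \omega_c$.]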
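The training and testing procedures above can be sketched in plain Python. This is an illustrative reading of the pseudo-code rather than a reference implementation: class labels stand in for the connections $a_{ji}$, the test pattern is normalized to unit length as well (an assumption, so that $z_j \leq 1$), and all function names and the toy data are hypothetical.

```python
import math

def normalize(x):
    """Scale a pattern to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def pnn_train(patterns, labels):
    """One pass over the data: the weights are the normalized patterns."""
    weights = [normalize(x) for x in patterns]
    return weights, list(labels)  # labels[j] plays the role of a_ji

def pnn_classify(x, weights, labels, sigma):
    """Accumulate g_i += exp((z_j - 1) / sigma^2) and return the arg max."""
    x = normalize(x)  # assumed: test patterns are normalized like the training set
    g = {}
    for w, label in zip(weights, labels):
        z = sum(wi * xi for wi, xi in zip(w, x))  # z_j = w_j^T x
        g[label] = g.get(label, 0.0) + math.exp((z - 1.0) / sigma ** 2)
    return max(g, key=g.get)

patterns = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
weights, stored_labels = pnn_train(patterns, labels)
print(pnn_classify([0.8, 0.2], weights, stored_labels, sigma=0.5))
print(pnn_classify([0.2, 0.8], weights, stored_labels, sigma=0.5))
```

Training only copies (normalized) patterns into the weights, which is why a single pass over the data suffices; all of the computation happens at classification time.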
36. ### Probabilistic Neural Network (PNN)

- The learning algorithm for the PNN is very fast since the weight vectors are determined by $w_j = x_j$. Only a single pass over each example in $D$ is required for learning, which means that the PNN can be used as an online classifier.
- The space complexity (i.e., how many connections between nodes) is $O((n + 1)d)$. Clearly the storage complexity can be very large.
- Again, we need to find a reasonable selection of $\sigma$ (i.e., kernel bandwidth) to avoid overfitting or underfitting the data.
- The kernel used in the PNN is the radial basis function (RBF).
- Matlab has a PNN (refer to NEWPNN.m).
37. ### k-Nearest Neighbor Estimation

What is the "best" windowing function to use, and what should the parameters of the function be set to (e.g., $\sigma$, $h$)?

A remedy to this problem is to let the volume be a function of the training data. That is, fix the value of $k$ and determine the minimum volume that encloses the $k$ nearest samples. If the density near $x$ is high, the volume will be small (a small kernel width), and if the density near $x$ is low, the volume will be large (a large kernel width). Thus, k-NN is "automatically" determining the window size from the training data.

k-Nearest Neighbor Estimation Algorithm

1. Select an initial volume around $x$ to estimate $p(x)$.
2. Grow the window until $k$ samples fall in the region $R$. The samples in $R$ are the nearest neighbors of $x$.
3. Estimate the density based on
$$\hat{p}(x) = \frac{k/n}{V}$$
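For one-dimensional data the k-NN density estimate reduces to a few lines, because the "volume" that just encloses the $k$ nearest samples is an interval of length $2r$, where $r$ is the distance to the $k$th nearest neighbor. The sketch below makes that one-dimensional assumption; the function name and data are hypothetical.

```python
def knn_density(x, samples, k):
    """k-NN density estimate p_hat(x) = (k / n) / V for 1-D data.

    V is the length of the smallest interval centered at x that
    contains the k nearest samples. Assumes the k-th neighbor is at
    a nonzero distance (otherwise V would be zero).
    """
    n = len(samples)
    distances = sorted(abs(x - xi) for xi in samples)
    radius = distances[k - 1]   # distance to the k-th nearest neighbor
    volume = 2.0 * radius       # interval [x - radius, x + radius]
    return (k / n) / volume

samples = [0.0, 1.0, 2.0, 3.0, 4.0]
p_hat = knn_density(2.5, samples, k=2)
print(p_hat)
```

Here the two nearest samples to 2.5 are both 0.5 away, so $V = 1$ and the estimate is $(2/5)/1 = 0.4$.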
38. ### k-NN Example (Duda, Hart, Stork)

[Figure]
39. ### How to choose k? (Duda, Hart, Stork)

With $k \propto \sqrt{n}$, the difference between the estimate and the true distribution becomes arbitrarily small as $n$ grows. Typically, for classification problems, we adjust $k$ (or $h$ for Parzen windows) until the classifier has the lowest error on a validation data set.
40. ### Estimation of Posterior Probabilities

The Bayes Classifier

$$\omega^* = \arg\max_{\omega_i\in\Omega} \frac{p(x, \omega_i)}{\sum_{j=1}^{c} p(x, \omega_j)}$$

The joint probability, $p(x, \omega_i)$, is estimated by growing a volume $V$ around the point $x$ until $k$ samples fall in the region. Of the $k$ samples, $k_i$ belong to class $\omega_i$. The obvious estimate for the joint probability is

$$p_n(x, \omega_i) = \frac{k_i/n}{V}$$

thus the estimate of the posterior becomes

$$P(\omega_i|x) = \frac{p(x, \omega_i)}{\sum_{j=1}^{c} p(x, \omega_j)} = \frac{k_i}{k}$$

On your own: derive the MLE for $P(\omega_i)$ and the k-nearest neighbors estimates of $p(x)$ and $p(x|\omega_i)$ to show that $P(\omega_i|x) = k_i/k$.
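The posterior estimate $k_i/k$ is simply the fraction of the $k$ nearest neighbors that carry each label. A minimal sketch for one-dimensional data follows; the function name, the labels, and the sample values are hypothetical.

```python
from collections import Counter

def knn_posteriors(x, samples, labels, k):
    """Estimate P(class | x) as k_i / k from the k nearest neighbors of x."""
    # Indices of the training samples, ordered by distance to x.
    order = sorted(range(len(samples)), key=lambda i: abs(x - samples[i]))
    counts = Counter(labels[i] for i in order[:k])
    return {label: count / k for label, count in counts.items()}

samples = [0.0, 0.2, 0.4, 1.0, 1.2]
labels = ["a", "a", "a", "b", "b"]
posteriors = knn_posteriors(0.9, samples, labels, k=3)
print(posteriors)
print(max(posteriors, key=posteriors.get))  # Bayes decision under this estimate
```

For the query point 0.9 the three nearest samples are 1.0, 1.2, and 0.4, so the estimated posteriors are 2/3 for class "b" and 1/3 for class "a".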
41. ### Selecting nearest neighbors (Bishop, PRML)

[Figure: (a) 3-NN and (b) 1-NN in the (x1, x2) plane.]
42. ### Selecting nearest neighbors (Bishop, PRML)

[Figure: three panels for K = 1, K = 3, and K = 31; axes x6 and x7.]
43. ### Fisher's Iris Data 3-NN Example

3-NN implemented in the Python Scikit Learn Machine Learning package. The data set is Fisher's Iris measurements. (left) distance-weighted nearest neighbor algorithm, (right) standard k-NN algorithm.
44. ### Nearest Neighbor Regression

Nearest neighbor regression implemented in the Python Scikit Learn Machine Learning package.
45. ### k-NN Summary

The Lazy Learner

- k-NN is known as a lazy learner because there is no "learning" performed on the training data set. No processing steps are required before making a prediction other than receiving the training data.
- Since there is no learning implemented in k-NN and it is based solely on the training data, k-NN is also called memory-based learning or instance-based learning (see Weka's implementation).
- The computational cost arises in the testing phase, and the algorithm has large memory requirements.
46. ### K-means Clustering

A Motivation for Unsupervised Learning

Throughout most of the semester, nearly all the algorithms studied assumed that data are labeled. However, suppose we only have access to $D = \{x_1, \ldots, x_N\}$ and we need to group the samples into $K$ sets, or clusters.

- Intuitively, we might think of a cluster as comprising a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster.
- Clusters $k = 1, \ldots, K$ are each associated with a prototype vector $\mu_k$.
- Our goal is to find the assignment of data samples to clusters such that the sum of the squares of the distances of each data point to its closest vector $\mu_k$ is a minimum.

Distortion Minimization

$$\min J = \min \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2$$

$r_{nk} \in \{0, 1\}$ are binary variables indicating whether or not sample $n$ is in cluster $k$. Another word to describe this kind of variable is a latent variable.

Goal: find $r_{nk}$ and $\mu_k$ so as to minimize the distortion.
47. ### A Two-Step Iterative Algorithm for Finding $r_{nk}$ and $\mu_k$

General Algorithm for Finding $r_{nk}$ and $\mu_k$

Initialize $\mu_k$ (e.g., select $K$ instances at random), then repeat until convergence:

1. Minimize $J$ with respect to the $r_{nk}$, keeping the $\mu_k$ fixed.
2. Minimize $J$ with respect to the $\mu_k$, keeping the $r_{nk}$ fixed.

These two steps, updating $r_{nk}$ and updating $\mu_k$, correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm.
48. ### K-means derivation

Step #1

$J$ is a linear function of $r_{nk}$, so this optimization can be performed easily to give a closed form solution. The terms involving different $n$ are independent, so we can optimize for each $n$ separately by choosing $r_{nk}$ to be 1 for whichever value of $k$ gives the minimum value of $\|x_n - \mu_k\|^2$. That is,

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Step #2

Now optimize $J$ w.r.t. $\mu_k$ with the $r_{nk}$ held fixed. The objective function $J$ is a quadratic function of $\mu_k$, and it can be minimized by setting its derivative with respect to $\mu_k$ to zero, giving

$$2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0 \;\Rightarrow\; \mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}$$
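The two steps can be sketched as a short standard-library implementation. The names `kmeans`, `sq_dist`, and `distortion`, the toy points, and the choice to leave the prototype of an empty cluster unchanged are all assumptions made for this example; the lecture does not specify how empty clusters are handled.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Alternate the assignment step and the mean-update step."""
    centers = [tuple(float(v) for v in c) for c in centers]
    assign = []
    for _ in range(max_iter):
        # Step 1: r_nk = 1 for the closest prototype, 0 otherwise.
        assign = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Step 2: mu_k = mean of the points assigned to cluster k.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # assumed: keep an empty cluster's prototype
        if new_centers == centers:
            break  # no prototype moved, so the assignments are stable
        centers = new_centers
    return centers, assign

def distortion(points, centers, assign):
    """J = sum over n of ||x_n - mu_k||^2 for the assigned cluster k."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assign = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assign, distortion(points, centers, assign))
```

Each step can only decrease $J$ or leave it unchanged, which is why the loop stops once the prototypes no longer move. The result depends on the initial prototypes, so in practice several initializations are often tried.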
49. ### K-means applied to the Old Faithful data set (Bishop, PRML)

[Figure: panels (a) through (i) showing K-means on the Old Faithful data; both axes from −2 to 2.]
50. ### K-means applied to the Old Faithful data set (Bishop, PRML)

[Figure: the distortion J plotted over steps 1 to 4; vertical axis from 0 to 1000.]
51. ### K-means image segmentation (Bishop, PRML)

[Figure: segmentation results for K = 2, K = 3, and K = 10, alongside the original image.]
52. ### Mixtures of Gaussians

Many times probability distributions contain several modes and we cannot precisely identify a parametric distribution to associate them with. This was a major motivating factor for the nonparametric methods, because they do not make assumptions about the data. Gaussian mixture models (GMMs) are a superposition of Gaussians with mixing coefficients that control the "contribution" of each distribution.

[Figure: a Gaussian mixture density p(x) plotted against x.]

Gaussian Mixture Distribution

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$$

where $\mathcal{N}(\cdot|\mu_k, \Sigma_k)$ is a multivariate Gaussian density function with mean $\mu_k$ and covariance $\Sigma_k$, and $\pi_k$ are the mixing coefficients. $\{\pi_k\}$ form a probability mass function.
53. ### Latent Variables & EM

Consider a $K$-dimensional binary vector, $z \in \{0, 1\}^K$, in which a particular element $z_k$ is equal to 1 and all other elements are equal to 0. There are $K$ possible outcomes for this binary vector. Let $\pi_k = p(z_k = 1)$ (i.e., the mixing coefficient), and recall that $\{\pi_k\}$ is a pmf (i.e., $\pi_k \in [0, 1]$ and $\sum_k \pi_k = 1$). The probability of $z$ is

$$p(z) = p(z_1, \ldots, z_K) = \prod_{k=1}^{K} \pi_k^{z_k}$$

Similarly, the conditional distribution of $x$ given a particular value for $z$ is a Gaussian, $p(x|z_k = 1) = \mathcal{N}(x|\mu_k, \Sigma_k)$. Then,

$$p(x|z) = p(x|z_1, \ldots, z_K) = \prod_{k=1}^{K} \mathcal{N}(x|\mu_k, \Sigma_k)^{z_k}$$

Marginalizing over $z$ yields

$$p(x) = \sum_{z\in Z} p(z)\,p(x|z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)$$

Well, isn't that interesting! The marginal on $x$ can be determined using a set of latent variables. Generally it is easier to work with $p(x, z)$ than with $p(x)$.
54. ### Latent Variables & EM

Another quantity that will play an important role is the conditional probability of $z$ given $x$. We shall use $\gamma(z_k)$ to denote $p(z_k = 1|x)$, whose value can be found using Bayes' theorem:

$$\gamma(z_k) \equiv p(z_k = 1|x) = \frac{p(z_k = 1)\,p(x|z_k = 1)}{p(x)} = \frac{\pi_k\, \mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x|\mu_j, \Sigma_j)}$$

$\gamma(z_k)$ is a measurement of the responsibility that mixture component $k$ claims for the observation $x$.

Why did we just introduce this confusing notation? The latent variables will make our lives easier; however, at the moment life does not seem any easier. Next, let us maximize the log-likelihood function for a data set:

$$\log p(D|\{\pi_k\}, \{\mu_k\}, \{\Sigma_k\}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)$$
55. ### Latent Variables & EM

Maximize w.r.t. $\mu_k$

Computing the derivative with respect to $\mu_k$ and setting the derivative equal to zero yields

$$0 = -\sum_{n=1}^{N} \underbrace{\frac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}}_{\gamma(z_{nk})} \Sigma_k^{-1} (x_n - \mu_k)$$

Assumption: $\Sigma_k$ is nonsingular! (Work this out on your own, or ask me after class for a full derivation.)

Solving for $\mu_k$

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$
56. ### Latent Variables & EM

We still need to maximize the log-likelihood function w.r.t. $\Sigma_k$ and $\pi_k$. Maximizing w.r.t. $\Sigma_k$ is straightforward; however, we must be careful with the maximization w.r.t. $\pi_k$ because the solution set is constrained. We must use Lagrange multipliers to find the solution. After all this we find that

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T \quad \text{and} \quad \pi_k = \frac{N_k}{N}$$
57. ### EM for Gaussian Mixtures

Initialize the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial log-likelihood function.

E Step

Evaluate the responsibilities using the current parameter values:

$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n|\mu_j, \Sigma_j)}$$

M Step

Re-estimate the parameters using the current responsibilities:

$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$

and $\pi_k^{new} = N_k/N$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$.

Repeat until convergence in the log-likelihood.
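The E and M steps can be sketched for a one-dimensional mixture, where each covariance reduces to a scalar variance. This is a simplified illustration under stated assumptions: it runs a fixed number of iterations instead of monitoring the log-likelihood, it uses synthetic data drawn from two well-separated Gaussians, and the function names and initial values are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, mus, variances, pis, num_iter=50):
    """Run EM for a 1-D Gaussian mixture for a fixed number of iterations."""
    n, K = len(data), len(mus)
    for _ in range(num_iter):
        # E step: responsibilities gamma[n][k].
        gamma = []
        for x in data:
            weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(weighted)
            gamma.append([w / total for w in weighted])
        # M step: re-estimate the parameters from the responsibilities.
        Nk = [sum(g[k] for g in gamma) for k in range(K)]
        mus = [sum(g[k] * x for g, x in zip(gamma, data)) / Nk[k] for k in range(K)]
        variances = [sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, data)) / Nk[k]
                     for k in range(K)]
        pis = [Nk[k] / n for k in range(K)]
    return mus, variances, pis

random.seed(1)
# Hypothetical data: two components centered at -3 and +3 with unit variance.
data = ([random.gauss(-3.0, 1.0) for _ in range(200)]
        + [random.gauss(3.0, 1.0) for _ in range(200)])
mus, variances, pis = em_gmm_1d(data, mus=[-1.0, 1.0], variances=[1.0, 1.0], pis=[0.5, 0.5])
print(mus, variances, pis)
```

Replacing the soft responsibilities with hard 0/1 assignments and dropping the variances and mixing coefficients would turn this loop into the K-means iteration. A practical implementation would also guard against a component's variance collapsing toward zero.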
58. ### EM applied to the Old Faithful data set (Bishop, PRML)

[Figure: panels (a) through (f) showing EM on the Old Faithful data, with (c) L = 1, (d) L = 2, (e) L = 5, and (f) L = 20; both axes from −2 to 2.]
59. ### GMMs in nonstationary environments: Weight Estimation Algorithm (WEA)

(Simplified pseudo-code from Ditzler & Polikar, IJCNN (2011))

Input: training data set $D^t$ and unlabeled (field) data $B^t$. Define $K_c$ as the number of components for the GMM of the $c$th class.

For $t = 1, 2, \ldots$ do:

1. Generate classifier $h_t$ on $D^t$ and GMMs $M_c^t$ for each class using $D^t$.
2. Generate a GMM, $N^t$, with $K = K_c$ centers using $B^t$.
3. Compute the Bhattacharyya distance between the components in $N^t$ and the components in $M_c^t$. Assign each component in $N^t$ the label of the closest component in $M_c^t$. Refer to this mixture as $N_c^t$.
4. Estimate the loss of $h_k$ ($k \in [t]$) on data from $N_c^t$; call this loss $\hat{\epsilon}_k^t$.
5. Compute voting weights, $W_k^t \propto \log \frac{1 - \hat{\epsilon}_k^t}{\hat{\epsilon}_k^t}$.
6. Classify field data using a weighted majority vote.
60. ### WEA tested on a synthetic Gaussian data set

G. Ditzler and R. Polikar, "Semi-supervised learning in nonstationary environments," in International Joint Conference on Neural Networks, 2011, San Jose, CA, pp. 2471–2478.
61. ### Software for implementing some of the methods in today's lecture

- Matlab's Statistics Toolbox: http://www.mathworks.com/products/statistics/
- Netlab Neural Network Software (Matlab): http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
- Shogun: Large Scale Machine Learning (Python, Java, Matlab, C#, ...): http://www.shogun-toolbox.org/
- Weka 3 (Java): http://www.cs.waikato.ac.nz/ml/weka/
- Scikit Learn (Python), my favorite machine learning suite! http://scikit-learn.org/stable/
62. ### Statistical Comparisons of Multiple Classifiers

Refer to your lecture notes on Demšar's Journal of Machine Learning Research article on hypothesis testing with multiple classifiers over multiple data sets. http://jmlr.csail.mit.edu/papers/volume7/demsar06a/demsar06a.pdf