distribution p(x) of a random variable x, given a finite set of observations.
• Parametric and non-parametric models
• Frequentist and Bayesian treatments
  • Frequentist: choose specific values of the parameters by optimizing some criterion (e.g. the likelihood).
  • Bayesian: introduce a prior over the parameters and compute the corresponding posterior.
(unknown)
• y: value of a sensor (known)
[Figure: depth sensor and target — Prior p(x) → Observation p(y|x) → Posterior p(x|y) → Estimate]
• Goal: Estimate the location using the observations.
• All computations are analytically tractable for Gaussians.
• §2.3.1 Conditional Gaussian Distribution
• §2.3.2 Marginal Gaussian Distribution
• §2.3.3 Bayes’ theorem for Gaussian variables
• Parameter estimation for the Gaussian
  • §2.3.4 Maximum likelihood for the Gaussian
  • §2.3.5 Sequential estimation
  • §2.3.6 Bayesian inference for the Gaussian
• Student’s t-distribution
  • §2.3.7 Student’s t-distribution
convenient in many situations:
  Λ ≡ Σ⁻¹   (2.68)
  Λ = ( Λaa  Λab
        Λba  Λbb )   (2.69)
• Because the inverse of a symmetric matrix is also symmetric (proof: ex. 2.22),
  (Λaa)⊤ = Λaa,  (Λbb)⊤ = Λbb  (i.e. the diagonal blocks are symmetric),  (Λba)⊤ = Λab
• NOTE: In general, Λaa ≠ (Σaa)⁻¹
(A = A⊤). The inverse matrix A⁻¹ satisfies AA⁻¹ = I. Taking the transpose of both sides of this equation, we obtain (A⁻¹)⊤A⊤ = I. Since A = A⊤, this gives (A⁻¹)⊤A = I, and by the uniqueness of the inverse, (A⁻¹)⊤ = A⁻¹. Therefore, A⁻¹ is also a symmetric matrix.
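A quick numerical sanity check of this fact (a minimal NumPy sketch, not from the original slides; the matrix below is an arbitrary example):

import numpy as np

# Build an arbitrary symmetric positive-definite matrix A.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)          # symmetric by construction

A_inv = np.linalg.inv(A)
print(np.allclose(A, A.T))           # True: A is symmetric
print(np.allclose(A_inv, A_inv.T))   # True: A^{-1} is symmetric too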
Recall:
  p(xa|xb) = N(xa | µa|b, Σa|b)
  Σa|b = Λaa⁻¹   (2.73)
  µa|b = µa − Λaa⁻¹ Λab (xb − µb)   (2.75)
• The results (2.73) and (2.75) are expressed in terms of the partitioned precision matrix.
• We can also express these results in terms of the corresponding partitioned covariance matrix, using
  ( Σaa  Σab        ( Λaa  Λab
    Σba  Σbb )⁻¹  =   Λba  Λbb )   (2.78)
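A minimal NumPy sketch of (2.73) and (2.75), not part of the original slides; the 2+1 partition and all numbers are arbitrary. It also checks the equivalent covariance-based (Schur-complement) expression Σa|b = Σaa − Σab Σbb⁻¹ Σba:

import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + np.eye(3)            # covariance of (x_a, x_b); x_a: dims 0-1, x_b: dim 2
mu = np.array([0.5, -1.0, 2.0])
a, b = slice(0, 2), slice(2, 3)

Lam = np.linalg.inv(Sigma)             # precision matrix (2.68)
Lam_aa, Lam_ab = Lam[a, a], Lam[a, b]

x_b = np.array([1.0])
Sigma_a_given_b = np.linalg.inv(Lam_aa)                                   # (2.73)
mu_a_given_b = mu[a] - np.linalg.inv(Lam_aa) @ Lam_ab @ (x_b - mu[b])     # (2.75)

# Equivalent expression in terms of the partitioned covariance matrix
Sigma_schur = Sigma[a, a] - Sigma[a, b] @ np.linalg.inv(Sigma[b, b]) @ Sigma[b, a]
print(np.allclose(Sigma_a_given_b, Sigma_schur))   # True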
target (unknown)
• y: value of a sensor (known)
[Figure: depth sensor and target — Prior p(x) → Observation p(y|x) → Posterior p(x|y) → Estimate]
• Goal: Estimate the location using the observations.
• All computations are analytically tractable for Gaussians.
we shall suppose that we are given:
  p(x) = N(x | µ, Λ⁻¹)   (2.99)
  p(y|x) = N(y | Ax + b, L⁻¹)   (2.100)
  z = ( x
        y )   (2.101)
• Variables: x ∈ R^M and y ∈ R^D
• Parameters governing the means: µ ∈ R^M, A ∈ R^{D×M} and b ∈ R^D
• Precision matrices: Λ ∈ R^{M×M} and L ∈ R^{D×D}
• We wish to find an expression for
  • the joint distribution p(x, y) = p(z)
  • the marginal distribution p(y)
  • the conditional distribution p(x|y)
Recall:
  p(x) = N(x | µ, Λ⁻¹)   (2.99)
  p(y|x) = N(y | Ax + b, L⁻¹)   (2.100)
• We can interpret
  • p(x) as a prior over x
  • p(y|x) as the likelihood
• If y is observed, then p(x|y) represents the corresponding posterior over x, given by Bayes’ theorem:
  p(x|y) = p(y|x) p(x) / p(y)
probability,
  p(z) = p(x, y) [unknown] = p(y|x) [known] × p(x) [known]
Take the logarithm of the joint distribution:
  ln p(z) = ln p(x) + ln p(y|x)
          = −(1/2)(x − µ)⊤Λ(x − µ) − (1/2)(y − Ax − b)⊤L(y − Ax − b) + const.   (2.102)
(2.102) is a quadratic function of the components of z = (x, y).
  ⇒ p(z) is a Gaussian distribution.
  ⇒ Complete the square!
conditional distribution,
  p(x|y) [unknown] = p(z) [known] / p(y)
• Recall the conditional distribution of a partitioned Gaussian:
  p(xa|xb) = N(xa | µa|b, Λaa⁻¹)   (2.96)
  µa|b = µa − Λaa⁻¹ Λab (xb − µb)   (2.97)
• The mean and precision of p(z):
  E[z] = ( µ
           Aµ + b )   (2.108)
  R = ( Λ + A⊤LA   −A⊤L
        −LA           L  )   (2.104) (2.112)
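A minimal NumPy sketch (not from the original slides; a 1-D toy model with arbitrary parameter values) that assembles E[z] and R as in (2.108) and (2.112), then reads off p(x|y) with the partitioned-Gaussian formulas (2.96)-(2.97):

import numpy as np

# Model: p(x) = N(x | mu, Lam^{-1}),  p(y|x) = N(y | A x + b, L^{-1})
mu, Lam = np.array([0.0]), np.array([[1.0]])
A, b, L = np.array([[2.0]]), np.array([0.5]), np.array([[4.0]])

# Joint precision R (2.112) and mean E[z] (2.108)
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,             L      ]])
Ez = np.concatenate([mu, A @ mu + b])

# Condition on an observed y using (2.96)-(2.97), with x playing the role of x_a
y = np.array([3.0])
Lam_aa, Lam_ab = Lam + A.T @ L @ A, -A.T @ L
Sigma_x_given_y = np.linalg.inv(Lam_aa)
mu_x_given_y = mu - Sigma_x_given_y @ Lam_ab @ (y - (A @ mu + b))
print(mu_x_given_y, Sigma_x_given_y)   # posterior mean and covariance of p(x|y)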
• §2.3.1 Conditional Gaussian Distribution
• §2.3.2 Marginal Gaussian Distribution
• §2.3.3 Bayes’ theorem for Gaussian variables
• Parameter estimation for the Gaussian
  • §2.3.4 Maximum likelihood for the Gaussian
  • §2.3.5 Sequential estimation
  • §2.3.6 Bayesian inference for the Gaussian
• Student’s t-distribution
  • §2.3.7 Student’s t-distribution
dist of a random variable using a parametric dist, given observations:
• Frequentist: choose specific values of the parameters by optimizing some criterion (e.g. the likelihood).
• Bayesian: introduce a prior over the parameters and compute the corresponding posterior.
regard to µ and Σ⁻¹ (i.e. the precision Λ):
  p(X|µ, Σ) = ∏_{n=1}^{N} (2π)^{−D/2} |Σ|^{−1/2} exp( −(1/2)(xn − µ)⊤Σ⁻¹(xn − µ) )
By taking the logarithm, we obtain:
  ln p(X|µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑_{n=1}^{N} (xn − µ)⊤Σ⁻¹(xn − µ)   (2.118)
We see that the log-likelihood depends on the data set only through the quantities (proof on the next slide):
  ∑_{n=1}^{N} xn,   ∑_{n=1}^{N} xn xn⊤   (2.119)
These are known as the sufficient statistics for the Gaussian.
prev slide):
  ln p(X|µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) tr(Σ⁻¹⟨xx⊤⟩) + ⟨x⟩⊤Σ⁻¹µ − (N/2) µ⊤Σ⁻¹µ   (2.118')
  where ⟨x⟩ = ∑_{n=1}^{N} xn and ⟨xx⊤⟩ = ∑_{n=1}^{N} xn xn⊤.
Setting the derivatives of the log-likelihood with respect to µ and Σ⁻¹ to zero,
  ∂/∂µ ln p(X|µ, Σ) = 0,   ∂/∂Σ⁻¹ ln p(X|µ, Σ) = O,
we obtain the ML solution:
  µML = (1/N)⟨x⟩ = (1/N) ∑_{n=1}^{N} xn   (2.121)
  ΣML = (1/N)⟨xx⊤⟩ − µML µML⊤ = (1/N) ∑_{n=1}^{N} (xn − µML)(xn − µML)⊤   (2.122)
∂/∂Σ⁻¹ ln p(X|µ, Σ) to zero and solving in terms of Σ, we have
  ΣML = (1/N)⟨xx⊤⟩ − µML µML⊤
      = (1/N) ∑_{n=1}^{N} (xn − µML)(xn − µML)⊤   (2.122)
  (using µML = ⟨x⟩/N)
distribution, we obtain:
  E[µML] = µ   (2.123)
  E[ΣML] = ((N−1)/N) Σ   (2.124)
We can correct the bias of ΣML by defining
  Σ̃ = (1/(N−1)) ∑_{n=1}^{N} (xn − µML)(xn − µML)⊤   (2.125)
  ⇒ E[Σ̃] = Σ
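A minimal NumPy sketch (not part of the original slides; the true mean, covariance and sample size are arbitrary) of the ML estimators (2.121), (2.122) and the bias-corrected estimator (2.125):

import numpy as np

rng = np.random.default_rng(2)
N = 500
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)

mu_ml = X.mean(axis=0)                      # (2.121)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N                # (2.122); biased: E[Sigma_ml] = (N-1)/N * Sigma
Sigma_tilde = diff.T @ diff / (N - 1)       # (2.125); unbiased
print(mu_ml, Sigma_ml, Sigma_tilde)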
(2.59): See the text.
• Derivation of (2.291):
  E[xn xm⊤] = µµ⊤ + I_nm Σ   (2.291)
  (i) In the case n = m, from (2.62) we have E[xn xm⊤] = µµ⊤ + Σ.
  (ii) In the case n ≠ m,
    E[xn xm⊤] = ∫∫ p(xn, xm) xn xm⊤ dxn dxm
              = ( ∫ p(xn) xn dxn ) ( ∫ p(xm) xm⊤ dxm )   (∵ i.i.d.)
              = E[xn] E[xm]⊤ = µµ⊤
• Derivation of (2.124): next slide
data points x1, ..., xN are used simultaneously for parameter estimation.
• Sequential methods allow data points to be processed one at a time and then discarded.
• Sequential methods are important for
  • on-line applications
  • large data sets, where batch processing of all the data points at once is infeasible
ML estimator of the mean based on N data points:
  µML^(N) = (1/N) ∑_{n=1}^{N} xn
          = (1/N) xN + (1/N) ∑_{n=1}^{N−1} xn
          = (1/N) xN + ((N−1)/N) · (1/(N−1)) ∑_{n=1}^{N−1} xn   ← the last factor is µML^(N−1)
          = µML^(N−1) + (1/N) (xN − µML^(N−1))   (2.126)
            [old estimator]  [factor inversely proportional to N]  [contribution of the N-th data point]
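A minimal sketch (not from the original slides; synthetic data with an arbitrary true mean) verifying that the recursive update (2.126) reproduces the batch ML mean:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

mu = 0.0
for n, x_n in enumerate(x, start=1):
    mu = mu + (x_n - mu) / n          # (2.126): old estimate + (1/N) * (x_N - old estimate)
print(mu, x.mean())                   # identical to the batch ML estimate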
random variables θ and z governed by a joint distribution p(z, θ).
• The regression function is given by
  f(θ) ≡ E[z|θ] = ∫ z p(z|θ) dz   (2.127)
  [Figure 2.10: the regression function f(θ) = E[z|θ]]
• We want to find the root θ* at which f(θ*) = 0, without modeling f(θ).
• A general procedure for solving such problems was given by Robbins and Monro (1951).
E[(z − f)²|θ] < ∞   (2.128)
• f(θ) > 0 for θ > θ* and f(θ) < 0 for θ < θ*
• A unique root θ* actually exists.
The procedure for estimating the root θ* is given by
  θ^(N) = θ^(N−1) − a_{N−1} z(θ^(N−1))   (2.129)
where z(θ^(N−1)) is an observed value of z when θ takes the value θ^(N−1).
θ*:
  θ^(N) = θ^(N−1) − a_{N−1} z(θ^(N−1))   (2.129)
where the sequence of positive numbers {a_N} satisfies
  lim_{N→∞} a_N = 0,   ∑_{N=1}^{∞} a_N = ∞,   ∑_{N=1}^{∞} a_N² < ∞   (2.130-2.132)
These conditions ensure that the sequence of estimates converges to the root with probability one (Robbins and Monro, 1951; Fukunaga, 1990).
By definition, the ML solution θML is a stationary point of the log-likelihood and hence satisfies
  ∂/∂θ { −(1/N) ∑_{n=1}^{N} ln p(xn|θ) } |_{θML} = 0   (2.133)
Exchanging the derivative and the sum, and taking the limit N → ∞, we have
  − lim_{N→∞} (1/N) ∑_{n=1}^{N} ∂/∂θ ln p(xn|θ) = E_x[ −∂/∂θ ln p(x|θ) ]   ← the regression function   (2.134)
so finding the ML solution corresponds to finding the root of a regression function. We can therefore apply the Robbins-Monro procedure!
By applying the Robbins-Monro procedure, which now takes the form
  θ^(N) = θ^(N−1) − a_{N−1} z(θ^(N−1)),   with z(θ^(N−1)) = −∂/∂θ^(N−1) ln p(xN | θ^(N−1))   (2.135)
Example: estimating µML^(N) of a Gaussian:
  z = −∂/∂µML ln p(x|µML, σ²) = −(1/σ²)(x − µML)   (2.136)
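A minimal sketch (not part of the original slides; the true mean and the particular schedule a_N = σ²/N are illustrative choices) of the Robbins-Monro update (2.135) with z from (2.136):

import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=5000)

mu = 0.0
for n, x_n in enumerate(x, start=1):
    a_n = sigma2 / n                       # step sizes satisfying (2.130)-(2.132)
    z = -(x_n - mu) / sigma2               # (2.136): z = -d/dmu ln p(x | mu, sigma^2)
    mu = mu - a_n * z                      # Robbins-Monro update (2.135)
print(mu)                                  # converges toward the true mean (here 2.0)

Note that with a_N = σ²/N this update reduces exactly to the sequential ML estimate (2.126).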
dist of a random variable using a parametric dist, given observations:
• Frequentist: choose specific values of the parameters by optimizing some criterion (e.g. the likelihood).
• Bayesian: introduce a prior over the parameters and compute the corresponding posterior.
binomial distribution:
  Bin(m|N, µ) = (N choose m) µ^m (1 − µ)^{N−m}   (2.9)
In order to develop a Bayesian treatment, we introduce a prior p(µ):
  p(µ|N, m) ∝ Bin(m|N, µ) p(µ)
If we choose a prior of the form c µ^β (1 − µ)^γ, the corresponding posterior p(µ|N, m) will have the same functional form as the prior (conjugacy). The beta distribution satisfies this conjugacy:
  Beta(µ|a, b) = (Γ(a + b)/(Γ(a)Γ(b))) µ^{a−1} (1 − µ)^{b−1}   (2.13)
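As a reminder of how conjugacy is used in practice, a minimal SciPy sketch (not from the original slides; the prior hyperparameters and the observed counts are arbitrary) of the beta-binomial update:

from scipy.stats import beta

# Prior Beta(mu | a0, b0); observe m successes in N Bernoulli trials.
a0, b0 = 2.0, 2.0
N, m = 10, 7

# Conjugacy: the posterior is again a beta distribution.
a_post, b_post = a0 + m, b0 + (N - m)
posterior = beta(a_post, b_post)
print(posterior.mean())        # posterior mean (a0 + m) / (a0 + b0 + N)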
is known)
• A set of N observations: X = {x1, ..., xN}
• The likelihood function (the probability of the observed data given µ):
  p(X|µ) = ∏_{n=1}^{N} p(xn|µ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) ∑_{n=1}^{N} (xn − µ)² }   (2.137)
• NOTE: p(X|µ) is not a probability distribution over µ and is not normalized with respect to µ.
• By introducing a prior p(µ), the posterior is given by
  p(µ|X) ∝ p(X|µ) p(µ)   (2.139)
• What prior p(µ) should we choose?
is known)
• Recall:
  p(X|µ) = ∏_{n=1}^{N} p(xn|µ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) ∑_{n=1}^{N} (xn − µ)² }   (2.137)
• The likelihood takes the form of the exponential of a quadratic form in µ.
• Thus, if we choose a Gaussian prior, the posterior will also be Gaussian.
• We therefore take our prior to be p(µ) = N(µ | µ0, σ0²).   (2.138)
p(µ|X) = N(µ | µN, σN²)   (2.140)
  µN = (σ²/(Nσ0² + σ²)) µ0 + (Nσ0²/(Nσ0² + σ²)) µML   ← µN lies between µ0 and µML   (2.141)
  1/σN² = 1/σ0² + N/σ²   ← a monotone function of N   (2.142)
[Figure 2.12: the posterior over µ for N = 0, 1, 2, 10]
• If we have no observations (N = 0), the posterior mean is µN = µ0.
• N → ∞ ⇒ µN → µML and σN² → 0.
• σ0² → ∞ (i.e. a flat prior) ⇒ µN → µML and σN² → σ²/N.
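A minimal NumPy sketch (not part of the original slides; the true mean, known variance and prior hyperparameters are arbitrary) of the posterior parameters (2.141) and (2.142):

import numpy as np

rng = np.random.default_rng(5)
sigma2 = 1.0                          # known variance
mu0, sigma02 = 0.0, 10.0              # prior p(mu) = N(mu | mu0, sigma0^2)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)
N, mu_ml = len(x), x.mean()

mu_N = (sigma2 / (N * sigma02 + sigma2)) * mu0 \
     + (N * sigma02 / (N * sigma02 + sigma2)) * mu_ml     # (2.141)
sigma2_N = 1.0 / (1.0 / sigma02 + N / sigma2)             # (2.142)
print(mu_N, sigma2_N)   # mu_N lies between mu0 and mu_ml; sigma2_N shrinks as N grows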
the inference problem.
  p(µ|X) ∝ p(µ) ∏_{n=1}^{N} p(xn|µ) = [ p(µ) ∏_{n=1}^{N−1} p(xn|µ) ] × p(xN|µ)   (2.144)
  where the bracketed factor is the posterior after observing N−1 data points, and p(xN|µ) is the likelihood of the N-th data point.
• The posterior after observing N−1 data points can be viewed as the prior used to arrive at the posterior after observing the N-th data point.
• This sequential view is very general and applies to any problem in which the observed data are assumed to be i.i.d.
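A minimal sketch (not from the original slides; same arbitrary settings as above) of this sequential view: each data point is absorbed with the one-point versions of (2.141) and (2.142), using the previous posterior as the prior:

import numpy as np

rng = np.random.default_rng(6)
sigma2, mu0, sigma02 = 1.0, 0.0, 10.0
x = rng.normal(loc=2.0, scale=1.0, size=50)

# Treat the posterior after n-1 points as the prior for the n-th point.
mu_n, sigma2_n = mu0, sigma02
for x_n in x:
    prec = 1.0 / sigma2_n + 1.0 / sigma2              # (2.142) with N = 1
    mu_n = (mu_n / sigma2_n + x_n / sigma2) / prec    # (2.141) with N = 1
    sigma2_n = 1.0 / prec
print(mu_n, sigma2_n)

The final numbers coincide with the batch posterior obtained by applying (2.141) and (2.142) to all 50 points at once.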
mean µ is known)
• A set of N observations: X = {x1, ..., xN}
• The likelihood function (the probability of the observed data given λ):
  p(X|λ) = ∏_{n=1}^{N} p(xn|λ) ∝ λ^{N/2} exp{ −(λ/2) ∑_{n=1}^{N} (xn − µ)² }   (2.145)
  i.e. a power of λ times the exponential of a linear function of λ.
• The corresponding conjugate prior p(λ) should have the same functional dependence on λ as (2.145).
• We therefore take our prior to be the gamma distribution:
  Gam(λ|a, b) = (1/Γ(a)) b^a λ^{a−1} exp(−bλ)   (2.146)
(2.146)
  E[λ] = a/b   (2.147)
  var[λ] = a/b²   (2.148)
[Plots of Gam(λ|a, b) for (a, b) = (0.1, 0.1), (1, 1), (4, 6)]
• (2.146) is correctly normalized, thanks to the Γ(a) factor (ex. 2.41).
• If a > 0, the distribution has a finite integral.
• If a ⩾ 1, the distribution itself is finite.
• Derivation of (2.147) and (2.148): ex. 2.42
mean µ is known)
• Therefore, the posterior is
  p(λ|X) = Gam(λ|aN, bN)
  aN = a0 + N/2   (2.150)
  bN = b0 + (1/2) ∑_{n=1}^{N} (xn − µ)² = b0 + (N/2) σML²   (2.151)
• Interpretation of the posterior’s parameters:
  • aN and bN can be interpreted in terms of an effective number of observations (see §2.2).
  • From (2.150) and (2.151), introducing the prior Gam(λ|a0, b0) corresponds to having 2a0 effective prior observations with variance b0/a0.
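A minimal NumPy/SciPy sketch (not from the original slides; the true precision and prior hyperparameters are arbitrary) of the posterior update (2.150)-(2.151):

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(7)
mu = 0.0                                   # known mean
true_lambda = 4.0
x = rng.normal(loc=mu, scale=1.0 / np.sqrt(true_lambda), size=200)

a0, b0 = 1.0, 1.0                          # prior Gam(lambda | a0, b0)
aN = a0 + len(x) / 2                       # (2.150)
bN = b0 + 0.5 * np.sum((x - mu) ** 2)      # (2.151)

posterior = gamma(a=aN, scale=1.0 / bN)    # SciPy's gamma uses scale = 1 / rate
print(posterior.mean())                    # aN / bN, close to the true precision 4.0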
precision are unknown
To find a conjugate prior p(µ, λ), we consider the dependence of the likelihood on µ and λ:
  p(X|µ, λ) = ∏_{n=1}^{N} (λ/2π)^{1/2} exp{ −(λ/2)(xn − µ)² }
            ∝ [ λ^{1/2} exp(−λµ²/2) ]^N exp{ λµ ∑_{n=1}^{N} xn − (λ/2) ∑_{n=1}^{N} xn² }   (2.152)
We seek a prior p(µ, λ) that has the same functional dependence on µ and λ as the likelihood; it should therefore take the form:
  p(µ, λ) ∝ [ λ^{1/2} exp(−λµ²/2) ]^β exp{ cλµ − dλ }
          = exp{ −(βλ/2)(µ − c/β)² } λ^{β/2} exp{ −(d − c²/(2β)) λ }   (2.153)
precision are unknown
• Recall that we can always write p(µ, λ) = p(µ|λ) p(λ).
  p(µ, λ) ∝ exp{ −(βλ/2)(µ − c/β)² } [Gaussian in µ] × λ^{β/2} exp{ −(d − c²/(2β)) λ } [gamma in λ]   (2.153)
• By defining new constants µ0 = c/β, a = (1 + β)/2 and b = d − c²/(2β), and normalizing (2.153), we obtain the normal-gamma distribution:
  p(µ, λ) = N(µ | µ0, (βλ)⁻¹) Gam(λ | a, b)   (2.154)
• NOTE: This distribution is not simply the product of an independent Gaussian prior over µ and a gamma prior over λ, because the variance of µ depends on λ.
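A minimal NumPy sketch (not from the original slides; the hyperparameter values are arbitrary) of drawing from the normal-gamma prior (2.154), which makes the coupling between µ and λ visible:

import numpy as np

# Draw (mu, lambda) pairs from (2.154):
#   lambda ~ Gam(a, b),  mu | lambda ~ N(mu0, (beta * lambda)^{-1})
mu0, beta, a, b = 0.0, 2.0, 3.0, 2.0
rng = np.random.default_rng(9)
lam = rng.gamma(shape=a, scale=1.0 / b, size=5)            # Gam(a, b): rate b = scale 1/b
mu = rng.normal(loc=mu0, scale=1.0 / np.sqrt(beta * lam))  # spread of mu depends on lambda
print(np.c_[mu, lam])   # the two components are not independent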
• For unknown mean µ and known precision, the conjugate prior is a Gaussian: p(µ) = N(µ | µ0, λ0⁻¹)
• For known mean µ and unknown precision λ, the conjugate prior is the gamma distribution: p(λ) = Gam(λ | a0, b0)
• When both the mean and the precision are unknown, the conjugate prior is the normal-gamma distribution:
  p(µ, λ) = N(µ | µ0, (βλ)⁻¹) Gam(λ | a, b)   (2.154)
• For unknown mean µ and known precision Λ, the conjugate prior is a Gaussian: p(µ) = N(µ | µ0, Λ0⁻¹)
• For known mean µ and unknown precision Λ, the conjugate prior is the Wishart distribution: p(Λ) = W(Λ | W, ν)
• When both the mean and the precision are unknown, the conjugate prior is the normal-Wishart distribution:
  p(µ, Λ | µ0, β, W, ν) = N(µ | µ0, (βΛ)⁻¹) W(Λ | W, ν)   (2.157)
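A minimal sketch (not part of the original slides; dimension and hyperparameters are arbitrary) of drawing a (µ, Λ) pair from the normal-Wishart prior (2.157), using SciPy's Wishart distribution:

import numpy as np
from scipy.stats import wishart

# Lambda ~ W(W, nu), then mu | Lambda ~ N(mu0, (beta * Lambda)^{-1})
D = 2
mu0, beta = np.zeros(D), 2.0
W, nu = np.eye(D), D + 2                  # nu must exceed D - 1 for a proper Wishart

Lam = wishart(df=nu, scale=W).rvs(random_state=0)
rng = np.random.default_rng(10)
mu = rng.multivariate_normal(mu0, np.linalg.inv(beta * Lam))
print(mu, Lam)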
Instead of working with the precision, we can consider the variance (covariance) itself. The corresponding conjugate priors are called:
• the inverse gamma distribution (univariate Gaussian case)
• the inverse Wishart distribution (multivariate Gaussian case)
We shall not discuss these further, because we will find it more convenient to work with the precision.
seen that the conjugate prior for the precision of a Gaussian is given by a gamma distribution.
• If we have N(x|µ, τ⁻¹) together with a gamma prior Gam(τ|a, b) and integrate out the precision, we obtain the marginal distribution of x (derivation: ex. 2.46):
  p(x|µ, a, b) = ∫ p(x, τ|µ, a, b) dτ = ∫₀^∞ N(x|µ, τ⁻¹) Gam(τ|a, b) dτ   (2.158)
               = (b^a/Γ(a)) (1/(2π))^{1/2} [ b + (x − µ)²/2 ]^{−a−1/2} Γ(a + 1/2)
• By convention we define ν = 2a and λ = a/b, and obtain the Student’s t-distribution:
  St(x|µ, λ, ν) = (Γ(ν/2 + 1/2)/Γ(ν/2)) (λ/(πν))^{1/2} [ 1 + λ(x − µ)²/ν ]^{−ν/2−1/2}   (2.159)
St(x|µ, λ, ν) = (Γ(ν/2 + 1/2)/Γ(ν/2)) (λ/(πν))^{1/2} [ 1 + λ(x − µ)²/ν ]^{−ν/2−1/2}   (2.159)
• λ: the precision of the t-distribution (NOTE: it is not in general equal to the inverse of the variance)
• ν: the degrees of freedom
• When ν = 1, it reduces to the Cauchy distribution.
• When ν → ∞, it becomes the Gaussian N(x|µ, λ⁻¹) (ex. 2.47).
[Plot of St(x|µ, λ, ν) for ν = 0.1, 1.0 and ν → ∞]
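A minimal SciPy sketch (not from the original slides; µ, a, b and the evaluation point are arbitrary) checking numerically that integrating out the precision as in (2.158) reproduces the Student's t density (2.159), using the reparameterization ν = 2a, λ = a/b:

import numpy as np
from scipy.stats import norm, gamma, t
from scipy.integrate import quad

mu, a, b = 0.0, 3.0, 2.0
nu, lam = 2 * a, a / b

def marginal(x):
    # Numerically integrate out the precision tau in (2.158)
    integrand = lambda tau: norm.pdf(x, loc=mu, scale=1.0 / np.sqrt(tau)) * gamma.pdf(tau, a=a, scale=1.0 / b)
    return quad(integrand, 0, np.inf)[0]

x = 1.3
print(marginal(x))
print(t.pdf(x, df=nu, loc=mu, scale=1.0 / np.sqrt(lam)))   # St(x | mu, lambda, nu) of (2.159)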
• §2.3.1 Conditional Gaussian Distribution
• §2.3.2 Marginal Gaussian Distribution
• §2.3.3 Bayes’ theorem for Gaussian variables
• Parameter estimation for the Gaussian
  • §2.3.4 Maximum likelihood for the Gaussian
  • §2.3.5 Sequential estimation
  • §2.3.6 Bayesian inference for the Gaussian
• Student’s t-distribution
  • §2.3.7 Student’s t-distribution