Probability Distributions (PRML 2.3.1-2.3.7)

eqs
June 03, 2019

PRML Reading Club @ Mathematical Informatics Lab, NAIST

Transcript

  1. Probability Distributions (PRML §2.3.1-2.3.7) Satoshi Murashige PRML Reading Club (June

    3, 2019) Mathematical Informatics Lab., NAIST
  2. Motivation

    • Purpose: density estimation
      • To model the probability distribution $p(x)$ of a random variable $x$, given a finite set of observations.
    • Parametric and non-parametric models
    • Frequentist and Bayesian treatments
      • Frequentist: choose specific values of the parameters by optimizing some criterion (e.g. the likelihood).
      • Bayesian: introduce a prior over the parameters and compute the corresponding posterior.
  3. Image for Probabilistic Estimation

    • x: location of a target (unknown)
    • y: value of a sensor (known)
    • [Figure: a depth sensor observes a target. Labels: prior $p(x)$, observation $p(y \mid x)$, posterior $p(x \mid y)$, estimate.]
    • Goal: estimate the location using observations.
    • All computations are analytically tractable for Gaussians.
  4. Table of Contents

    • Important properties of Gaussian distributions
      • §2.3.1 Conditional Gaussian Distribution
      • §2.3.2 Marginal Gaussian Distribution
      • §2.3.3 Bayes’ theorem for Gaussian variables
    • Parameter estimation for the Gaussian
      • §2.3.4 Maximum likelihood for the Gaussian
      • §2.3.5 Sequential estimation
      • §2.3.6 Bayesian inference for the Gaussian
    • Student’s t-distribution
      • §2.3.7 Student’s t-distribution
  5. https://twitter.com/nara_genkai/status/1059076997060583426

  6. §2.3.1 Conditional Gaussian Distribution

  13. Completing the square (for univariate case)

    • Basic idea:
      $$x^2 - 6x + 2 = x^2 - 6x + 9 - 7 = (x - 3)^2 - 7$$
    • Example:
      $$\exp\left(-\tfrac{1}{2}x^2 + 2x + \text{const.}\right) = \exp\left\{-\tfrac{1}{2}(x^2 - 4x + 4) + \text{const.}\right\} = \underbrace{\exp\left\{-\frac{(x-2)^2}{2}\right\}}_{\text{unnormalized Gaussian}} \cdot \exp(\text{const.})$$
  14. Important properties of the Gaussian distribution

    We now derive the following properties:
    $$p(x_a, x_b) = p(x) = \mathcal{N}(x \mid \mu, \Sigma) \;\Rightarrow\; p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b}), \qquad p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$$
    [Figure 2.9: contours of $p(x_a, x_b)$ with the line $x_b = 0.7$; the marginal $p(x_a)$ and the conditional $p(x_a \mid x_b = 0.7)$.]
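  15. Definition of Notation (1/2)

    Consider a joint distribution $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$.
    • Partition a D-dimensional vector $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ into $x_a \in \mathbb{R}^{M}$ and $x_b \in \mathbb{R}^{D-M}$:
      $$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix} \tag{2.65}$$
      $$\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix} \tag{2.66}$$
      $$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \tag{2.67}$$
    • The symmetry $\Sigma^\top = \Sigma$ of the covariance matrix implies that the diagonal blocks are symmetric, $\Sigma_{aa}^\top = \Sigma_{aa}$ and $\Sigma_{bb}^\top = \Sigma_{bb}$, and that $\Sigma_{ba}^\top = \Sigma_{ab}$.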
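  16. Definition of Notation (2/2)

    • The precision matrix $\Lambda$ is convenient in many situations:
      $$\Lambda \equiv \Sigma^{-1} \tag{2.68}$$
      $$\Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \tag{2.69}$$
    • Because the inverse of a symmetric matrix is also symmetric (proof: Exercise 2.22), $\Lambda_{aa}^\top = \Lambda_{aa}$, $\Lambda_{bb}^\top = \Lambda_{bb}$ and $\Lambda_{ba}^\top = \Lambda_{ab}$.
    • NOTE: in general the blocks are not inverses of each other; for instance, $\Lambda_{aa} \neq \Sigma_{aa}^{-1}$.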
  17. Solution of Exercise 2.22

    Let $A$ be a symmetric matrix ($A = A^\top$). The inverse matrix $A^{-1}$ satisfies $AA^{-1} = I$. Taking the transpose of both sides gives $(A^{-1})^\top A^\top = I$, and since $A^\top = A$ this reads $(A^{-1})^\top A = I$. By the definition (and uniqueness) of the inverse matrix, $(A^{-1})^\top = A^{-1}$. Therefore $A^{-1}$ is also a symmetric matrix.
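  19. Derivation of Conditional Gaussian Distribution

    Conditional Gaussian distribution: $p(x_a, x_b) = \mathcal{N}(x \mid \mu, \Sigma) \Rightarrow p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$.
    From the product rule of probability,
    $$\underbrace{p(x_a \mid x_b)}_{\text{unknown}} = \frac{\overbrace{p(x_a, x_b)}^{\text{known}}}{\underbrace{p(x_b)}_{\text{unknown}}}$$
    Take the logarithm of both sides. The term $-\ln p(x_b)$ does not depend on $x_a$, so it is absorbed into the constant:
    $$\ln p(x_a \mid x_b) = \ln p(x_a, x_b) + \text{const.} = -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) + \text{const.}$$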
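  21. Derivation of Conditional Gaussian Distribution

    Because (2.70) is a quadratic form in $x_a$, the corresponding conditional distribution $p(x_a \mid x_b)$ is Gaussian.
    $$\begin{aligned}
    -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)
    ={}& -\tfrac{1}{2}(x_a - \mu_a)^\top \Lambda_{aa} (x_a - \mu_a) - \tfrac{1}{2}(x_a - \mu_a)^\top \Lambda_{ab} (x_b - \mu_b) \\
    & -\tfrac{1}{2}(x_b - \mu_b)^\top \Lambda_{ba} (x_a - \mu_a) - \tfrac{1}{2}(x_b - \mu_b)^\top \Lambda_{bb} (x_b - \mu_b)
    \end{aligned} \tag{2.70}$$
    Expanding each product:
    $$\begin{aligned}
    ={}& -\tfrac{1}{2}\left( x_a^\top \Lambda_{aa} x_a - x_a^\top \Lambda_{aa} \mu_a - \mu_a^\top \Lambda_{aa} x_a + \mu_a^\top \Lambda_{aa} \mu_a \right) \\
    & -\tfrac{1}{2}\left( x_a^\top \Lambda_{ab} x_b - x_a^\top \Lambda_{ab} \mu_b - \mu_a^\top \Lambda_{ab} x_b + \mu_a^\top \Lambda_{ab} \mu_b \right) \\
    & -\tfrac{1}{2}\left( x_b^\top \Lambda_{ba} x_a - x_b^\top \Lambda_{ba} \mu_a - \mu_b^\top \Lambda_{ba} x_a + \mu_b^\top \Lambda_{ba} \mu_a \right) \\
    & -\tfrac{1}{2}\left( x_b^\top \Lambda_{bb} x_b - x_b^\top \Lambda_{bb} \mu_b - \mu_b^\top \Lambda_{bb} x_b + \mu_b^\top \Lambda_{bb} \mu_b \right)
    \end{aligned}$$
    Collecting the terms that depend on $x_a$, with $x_b$ treated as a fixed value and using $\Lambda_{ba}^\top = \Lambda_{ab}$:
    $$= \underbrace{-\tfrac{1}{2} x_a^\top \Lambda_{aa} x_a}_{\text{quadratic term}} + \underbrace{x_a^\top \left\{ \Lambda_{aa} \mu_a - \Lambda_{ab} (x_b - \mu_b) \right\}}_{\text{linear term}} + \text{const.}$$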
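  26. Derivation of Conditional Gaussian Distribution

    Completing the square: for a general Gaussian,
    $$-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) = -\tfrac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu + \text{const.} \tag{2.71}$$
    Matching the terms in $x_a$ against (2.71), with $\Lambda_{aa}$ playing the role of $\Sigma^{-1}$:
    $$\begin{aligned}
    -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)
    &= -\tfrac{1}{2} x_a^\top \Lambda_{aa} x_a + x_a^\top \left\{ \Lambda_{aa} \mu_a - \Lambda_{ab} (x_b - \mu_b) \right\} + \text{const.} \\
    &= -\tfrac{1}{2} x_a^\top \underbrace{\Lambda_{aa}}_{\Sigma^{-1}} x_a + x_a^\top \underbrace{\Lambda_{aa}}_{\Sigma^{-1}} \underbrace{\Lambda_{aa}^{-1} \left\{ \Lambda_{aa} \mu_a - \Lambda_{ab} (x_b - \mu_b) \right\}}_{\mu} + \text{const.} \\
    &= -\tfrac{1}{2} (x_a - \mu_{a|b})^\top \Sigma_{a|b}^{-1} (x_a - \mu_{a|b}) + \text{const.}
    \end{aligned}$$
    where
    $$\Sigma_{a|b} = \Lambda_{aa}^{-1} \tag{2.73}$$
    $$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) \tag{2.75}$$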
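  27. Derivation of Conditional Gaussian Distribution

    Therefore,
    $$\ln p(x_a \mid x_b) = -\tfrac{1}{2} (x_a - \mu_{a|b})^\top \Sigma_{a|b}^{-1} (x_a - \mu_{a|b}) + \text{const.}$$
    $$\therefore\; p(x_a \mid x_b) \propto \exp\left\{ -\tfrac{1}{2} (x_a - \mu_{a|b})^\top \Sigma_{a|b}^{-1} (x_a - \mu_{a|b}) \right\} \;\xrightarrow{\text{normalization}}\; p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$$
    $$\Sigma_{a|b} = \Lambda_{aa}^{-1} \tag{2.73}$$
    $$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) \tag{2.75}$$
    • $\mu_{a|b}$ is a linear function of $x_b$.
    • $\Sigma_{a|b}$ is independent of $x_b$.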
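  28. Express the results in terms of the covariance matrix

    • Recall: $p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$ with
      $$\Sigma_{a|b} = \Lambda_{aa}^{-1} \tag{2.73}$$
      $$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) \tag{2.75}$$
    • The results (2.73) and (2.75) are expressed in terms of the partitioned precision matrix.
    • We can also express these results in terms of the corresponding partitioned covariance matrix:
      $$\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \tag{2.78}$$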
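  29. Express the results in terms of the covariance matrix

    The identity for the inverse of a partitioned matrix (Exercise 2.24):
    $$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{pmatrix} \tag{2.76}$$
    $$M = (A - BD^{-1}C)^{-1} \tag{2.77}$$
    $$\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \tag{2.78}$$
    Using (2.76), (2.77) and (2.78), we have
    $$\Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba})^{-1} \tag{2.79}$$
    $$\Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba})^{-1} \Sigma_{ab} \Sigma_{bb}^{-1} \tag{2.80}$$
    and
    $$\mu_{a|b} = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b) \tag{2.81}$$
    $$\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba} \tag{2.82}$$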
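  30. Solution of Exercise 2.24: multiply both sides of (2.76) by the partitioned matrix $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$ and check that the product is the identity matrix.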
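The partitioned-inverse identity (2.76)-(2.77) can also be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `partitioned_inverse`, the 5x5 test matrix and the 2/3 block split are illustrative choices, not part of the slides.

```python
import numpy as np

def partitioned_inverse(A, B, C, D):
    """Inverse of the block matrix [[A, B], [C, D]] built from (2.76)-(2.77)."""
    D_inv = np.linalg.inv(D)
    M = np.linalg.inv(A - B @ D_inv @ C)  # (2.77), inverse of the Schur complement
    top = np.hstack([M, -M @ B @ D_inv])
    bottom = np.hstack([-D_inv @ C @ M, D_inv + D_inv @ C @ M @ B @ D_inv])
    return np.vstack([top, bottom])

# Arbitrary symmetric positive definite test matrix (an assumption for the demo).
rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
S = G @ G.T + 5.0 * np.eye(5)

# Split into blocks: A is 2x2, D is 3x3.
A, B, C, D = S[:2, :2], S[:2, 2:], S[2:, :2], S[2:, 2:]
S_inv_blocks = partitioned_inverse(A, B, C, D)

# Multiplying by the original matrix should give the identity (Exercise 2.24).
max_error = np.max(np.abs(S_inv_blocks @ S - np.eye(5)))
print(max_error)
```

The printed residual should be at the level of floating-point round-off, which is the numerical counterpart of "multiply both sides and verify".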

  31. §2.3.2 Marginal Gaussian Distribution

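  32. Derivation of Marginal Gaussian Distribution

    Marginal Gaussian distribution: $p(x_a, x_b) = \mathcal{N}(x \mid \mu, \Sigma) \Rightarrow p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})$.
    $$\begin{aligned}
    p(x_a) &= \int p(x_a, x_b)\, dx_b \qquad (2.83) \\
    &= \int \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} dx_b \\
    &= \int \text{const.} \cdot \exp(\text{terms involving } x_b) \cdot \exp(\text{other terms})\, dx_b \\
    &= \text{const.} \cdot \exp(\text{other terms}) \int \exp(\text{terms involving } x_b)\, dx_b
    \end{aligned}$$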
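  33. Derivation of Marginal Gaussian Distribution

    $$\begin{aligned}
    -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)
    ={}& -\tfrac{1}{2}(x_a - \mu_a)^\top \Lambda_{aa} (x_a - \mu_a) - \tfrac{1}{2}(x_a - \mu_a)^\top \Lambda_{ab} (x_b - \mu_b) \\
    & -\tfrac{1}{2}(x_b - \mu_b)^\top \Lambda_{ba} (x_a - \mu_a) - \tfrac{1}{2}(x_b - \mu_b)^\top \Lambda_{bb} (x_b - \mu_b) \qquad (2.70) \\
    ={}& -\tfrac{1}{2}\left( x_a^\top \Lambda_{aa} x_a - x_a^\top \Lambda_{aa} \mu_a - \mu_a^\top \Lambda_{aa} x_a + \mu_a^\top \Lambda_{aa} \mu_a \right) \\
    & -\tfrac{1}{2}\left( x_a^\top \Lambda_{ab} x_b - x_a^\top \Lambda_{ab} \mu_b - \mu_a^\top \Lambda_{ab} x_b + \mu_a^\top \Lambda_{ab} \mu_b \right) \\
    & -\tfrac{1}{2}\left( x_b^\top \Lambda_{ba} x_a - x_b^\top \Lambda_{ba} \mu_a - \mu_b^\top \Lambda_{ba} x_a + \mu_b^\top \Lambda_{ba} \mu_a \right) \\
    & -\tfrac{1}{2}\left( x_b^\top \Lambda_{bb} x_b - x_b^\top \Lambda_{bb} \mu_b - \mu_b^\top \Lambda_{bb} x_b + \mu_b^\top \Lambda_{bb} \mu_b \right) \\
    ={}& -\tfrac{1}{2} x_b^\top \Lambda_{bb} x_b + x_b^\top \left\{ \Lambda_{bb} \mu_b - \Lambda_{ba} (x_a - \mu_a) \right\} + \text{other terms} \\
    ={}& -\tfrac{1}{2} x_b^\top \Lambda_{bb} x_b + x_b^\top m + \text{other terms}, \qquad m = \Lambda_{bb} \mu_b - \Lambda_{ba} (x_a - \mu_a)
    \end{aligned}$$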
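  34. Derivation of Marginal Gaussian Distribution

    $$-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) = \underbrace{-\tfrac{1}{2} x_b^\top \Lambda_{bb} x_b + x_b^\top m}_{\text{terms involving } x_b} + \text{other terms}$$
    Completing the square, using
    $$-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) = -\tfrac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu + \text{const.} \tag{2.71}$$
    with $\Lambda_{bb}$ playing the role of $\Sigma^{-1}$ and $\Lambda_{bb}^{-1} m$ playing the role of $\mu$:
    $$\begin{aligned}
    -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)
    &= -\tfrac{1}{2} x_b^\top \Lambda_{bb} x_b + x_b^\top m + \text{other terms} \\
    &= -\tfrac{1}{2} x_b^\top \Lambda_{bb} x_b + x_b^\top \Lambda_{bb} \Lambda_{bb}^{-1} m + \text{other terms} \\
    &= -\tfrac{1}{2} (x_b - \Lambda_{bb}^{-1} m)^\top \Lambda_{bb} (x_b - \Lambda_{bb}^{-1} m) + \tfrac{1}{2} m^\top \Lambda_{bb}^{-1} m + \text{other terms} \qquad (2.84')
    \end{aligned}$$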
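  35. Derivation of Marginal Gaussian Distribution

    $$\begin{aligned}
    p(x_a) &= \int p(x_a, x_b)\, dx_b \qquad (2.83) \\
    &= \text{const.} \cdot \exp(\text{other terms}) \int \exp(\text{terms involving } x_b)\, dx_b \\
    &= \text{const.} \cdot \exp(\text{other terms}) \cdot \exp\left( \tfrac{1}{2} m^\top \Lambda_{bb}^{-1} m \right) \cdot \int \underbrace{\exp\left\{ -\tfrac{1}{2} (x_b - \Lambda_{bb}^{-1} m)^\top \Lambda_{bb} (x_b - \Lambda_{bb}^{-1} m) \right\}}_{\text{an unnormalized Gaussian (2.86)}} dx_b \\
    &= \text{const.} \cdot \underbrace{\exp(\text{other terms}) \cdot \exp\left( \tfrac{1}{2} m^\top \Lambda_{bb}^{-1} m \right)}_{\text{terms involving } x_a}
    \end{aligned}$$
    The integral of the unnormalized Gaussian over $x_b$ depends only on $\Lambda_{bb}$, not on $x_a$, so it is absorbed into the constant.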
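  36. Derivation of Marginal Gaussian Distribution

    Here,
    $$\begin{aligned}
    \text{other terms} &= -\tfrac{1}{2} x_a^\top \Lambda_{aa} x_a + x_a^\top \Lambda_{aa} \mu_a + x_a^\top \Lambda_{ab} \mu_b + \underbrace{\text{const.}}_{\text{independent of } x_a} \\
    &= -\tfrac{1}{2} (x_a - \mu_a)^\top \Lambda_{aa} (x_a - \mu_a) + x_a^\top \Lambda_{ab} \mu_b + \text{const.}
    \end{aligned}$$
    $$\tfrac{1}{2} m^\top \Lambda_{bb}^{-1} m = \tfrac{1}{2} (x_a - \mu_a)^\top \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba} (x_a - \mu_a) - x_a^\top \Lambda_{ab} \mu_b + \text{const.}$$
    $$\therefore\; \text{other terms} + \tfrac{1}{2} m^\top \Lambda_{bb}^{-1} m = -\tfrac{1}{2} (x_a - \mu_a)^\top (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba}) (x_a - \mu_a) + \text{const.}$$
    $$\Rightarrow\; \mathbb{E}[x_a] = \mu_a \quad (2.92), \qquad \operatorname{cov}[x_a] = (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1} \quad (2.88)$$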
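  37. Express the results in terms of the covariance matrix

    • Recall: $p(x_a) = \mathcal{N}(x_a \mid \mathbb{E}[x_a], \operatorname{cov}[x_a])$ with
      $$\mathbb{E}[x_a] = \mu_a \tag{2.92}$$
      $$\operatorname{cov}[x_a] = (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1} \tag{2.88}$$
    • The results are expressed in terms of the partitioned precision matrix.
    • We can also express these results in terms of the corresponding partitioned covariance matrix:
      $$\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \tag{2.90}$$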
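  38. Express the results in terms of the covariance matrix

    The identity for the inverse of a partitioned matrix (Exercise 2.24):
    $$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{pmatrix} \tag{2.76}$$
    $$M = (A - BD^{-1}C)^{-1} \tag{2.77}$$
    $$\operatorname{cov}[x_a] = (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1} \tag{2.88}$$
    $$\begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \tag{2.90}$$
    Applying (2.76) to the left-hand side of (2.90) shows that the top-left block is $\Sigma_{aa} = (\Lambda_{aa} - \Lambda_{ab} \Lambda_{bb}^{-1} \Lambda_{ba})^{-1}$. Combining this with (2.88), we obtain
    $$\mathbb{E}[x_a] = \mu_a \tag{2.92}$$
    $$\operatorname{cov}[x_a] = \Sigma_{aa} \tag{2.93}$$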
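  39. Partitioned Gaussians

    Given a joint Gaussian distribution $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ with $\Lambda \equiv \Sigma^{-1}$ and
    $$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix} \tag{2.94}$$
    $$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix} \tag{2.95}$$
    Conditional distribution:
    $$p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}) \tag{2.96}$$
    $$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) \tag{2.97}$$
    Marginal distribution:
    $$p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}) \tag{2.98}$$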
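As a rough numerical illustration of the partitioned-Gaussian results, the sketch below computes the conditional parameters in both the precision form (2.96)-(2.97) and the covariance form (2.81)-(2.82), plus the marginal (2.98). It assumes NumPy; the function names, the 3-dimensional example and the index split are hypothetical choices made for this example only.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Parameters of p(x_a | x_b), computed in precision and covariance form."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]

    # Precision form: (2.96)-(2.97).
    Lam = np.linalg.inv(Sigma)
    Lam_aa = Lam[np.ix_(idx_a, idx_a)]
    Lam_ab = Lam[np.ix_(idx_a, idx_b)]
    cov_prec = np.linalg.inv(Lam_aa)
    mean_prec = mu_a - cov_prec @ Lam_ab @ (x_b - mu_b)

    # Covariance form: (2.81)-(2.82). Sigma_ba is the transpose of Sigma_ab.
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)
    mean_cov = mu_a + gain @ (x_b - mu_b)
    cov_cov = S_aa - gain @ S_ab.T

    return (mean_prec, cov_prec), (mean_cov, cov_cov)

def marginal_gaussian(mu, Sigma, idx_a):
    """Parameters of p(x_a) from (2.98): just read off the blocks."""
    return mu[idx_a], Sigma[np.ix_(idx_a, idx_a)]

# Illustrative 3-D joint Gaussian (values are arbitrary assumptions).
mu = np.array([0.5, 0.5, 1.0])
Sigma = np.array([[1.0, 0.6, 0.2],
                  [0.6, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
idx_a, idx_b = [0], [1, 2]
x_b = np.array([0.7, 0.9])

(mean_prec, cov_prec), (mean_cov, cov_cov) = conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b)
marg_mean, marg_cov = marginal_gaussian(mu, Sigma, idx_a)
print(mean_prec, mean_cov)   # the two forms should agree
print(cov_prec, cov_cov)
```

Note that the conditional covariance never uses `x_b`, matching the observation that $\Sigma_{a|b}$ is independent of $x_b$.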
  40. §2.3.3 Bayes’ theorem for Gaussian variables

  41. (Recall) Image for Probabilistic Estimation

    • x: location of a target (unknown)
    • y: value of a sensor (known)
    • [Figure: a depth sensor observes a target. Labels: prior $p(x)$, observation $p(y \mid x)$, posterior $p(x \mid y)$, estimate.]
    • Goal: estimate the location using observations.
    • All computations are analytically tractable for Gaussians.
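  42. Problem: find an expression for the conditional distribution

    • Here, we suppose that we are given:
      $$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}) \tag{2.99}$$
      $$p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}) \tag{2.100}$$
      $$z = \begin{pmatrix} x \\ y \end{pmatrix} \tag{2.101}$$
    • Variables: $x \in \mathbb{R}^{M}$ and $y \in \mathbb{R}^{D}$
    • Parameters governing the means: $\mu \in \mathbb{R}^{M}$, $A \in \mathbb{R}^{D \times M}$ and $b \in \mathbb{R}^{D}$
    • Precision matrices: $\Lambda \in \mathbb{R}^{M \times M}$ and $L \in \mathbb{R}^{D \times D}$
    • We wish to find expressions for:
      • the joint distribution $p(x, y) = p(z)$
      • the marginal distribution $p(y)$
      • the conditional distribution $p(x \mid y)$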
  43. Why do we want to find the conditional distribution p(x|y)?

    • Recall:
      $$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}) \tag{2.99}$$
      $$p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}) \tag{2.100}$$
    • We can interpret
      • $p(x)$ as a prior over $x$
      • $p(y \mid x)$ as a likelihood of $y$
    • If $y$ is observed, then $p(x \mid y)$ represents the corresponding posterior over $x$, given by Bayes’ theorem:
      $$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$
  47. Derivation of Joint Distribution p(z)

    From the product rule of probability,
    $$p(z) = \underbrace{p(x, y)}_{\text{unknown}} = \underbrace{p(y \mid x)}_{\text{known}} \times \underbrace{p(x)}_{\text{known}}$$
    Take the logarithm of the joint distribution:
    $$\ln p(z) = \ln p(x) + \ln p(y \mid x) = -\tfrac{1}{2} (x - \mu)^\top \Lambda (x - \mu) - \tfrac{1}{2} (y - Ax - b)^\top L (y - Ax - b) + \text{const.} \tag{2.102}$$
    (2.102) is a quadratic function of $x$ and $y$.
    ⇒ $p(z)$ is a Gaussian distribution.
    ⇒ Completing the square!
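  48. Derivation of Joint Distribution p(z)

    $$\begin{aligned}
    \ln p(z) ={}& \ln p(x) + \ln p(y \mid x) \\
    ={}& -\tfrac{1}{2} (x - \mu)^\top \Lambda (x - \mu) - \tfrac{1}{2} (y - Ax - b)^\top L (y - Ax - b) + \text{const.} \qquad (2.102) \\
    ={}& -\tfrac{1}{2} \left( x^\top \Lambda x - x^\top \Lambda \mu - \mu^\top \Lambda x + \mu^\top \Lambda \mu \right) \\
    & -\tfrac{1}{2} \left( y^\top L y - y^\top L A x - y^\top L b - x^\top A^\top L y - b^\top L y + x^\top A^\top L A x + x^\top A^\top L b + b^\top L A x + b^\top L b \right) + \text{const.} \\
    ={}& \underbrace{-\tfrac{1}{2} x^\top (\Lambda + A^\top L A) x - \tfrac{1}{2} y^\top L y + \tfrac{1}{2} y^\top L A x + \tfrac{1}{2} x^\top A^\top L y}_{(2.103)} + \underbrace{x^\top (\Lambda \mu - A^\top L b) + y^\top L b}_{(2.106)} + \text{const.}
    \end{aligned}$$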
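  49. Derivation of Joint Distribution p(z)

    $$\begin{aligned}
    \ln p(z) ={}& \ln p(x) + \ln p(y \mid x) \\
    ={}& -\tfrac{1}{2} (x - \mu)^\top \Lambda (x - \mu) - \tfrac{1}{2} (y - Ax - b)^\top L (y - Ax - b) + \text{const.} \qquad (2.102) \\
    ={}& -\tfrac{1}{2} x^\top (\Lambda + A^\top L A) x - \tfrac{1}{2} y^\top L y + \tfrac{1}{2} y^\top L A x + \tfrac{1}{2} x^\top A^\top L y + x^\top (\Lambda \mu - A^\top L b) + y^\top L b + \text{const.} \\
    ={}& -\tfrac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}^\top \underbrace{\begin{pmatrix} \Lambda + A^\top L A & -A^\top L \\ -L A & L \end{pmatrix}}_{R \text{ (precision matrix)}} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} x \\ y \end{pmatrix}^\top \underbrace{\begin{pmatrix} \Lambda \mu - A^\top L b \\ L b \end{pmatrix}}_{m} + \text{const.} \\
    ={}& -\tfrac{1}{2} z^\top R z + z^\top m + \text{const.} \\
    ={}& -\tfrac{1}{2} (z - R^{-1} m)^\top R (z - R^{-1} m) + \text{const.}
    \end{aligned}$$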
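  50. Derivation of Joint Distribution p(z)

    • Recall:
      $$\ln p(z) = \ln p(x) + \ln p(y \mid x) = -\tfrac{1}{2} (z - R^{-1} m)^\top R (z - R^{-1} m) + \text{const.} \qquad (2.102')$$
      $$R = \begin{pmatrix} \Lambda + A^\top L A & -A^\top L \\ -L A & L \end{pmatrix} \tag{2.104}$$
      $$m = \begin{pmatrix} \Lambda \mu - A^\top L b \\ L b \end{pmatrix}$$
    • Because $p(z)$ is Gaussian, we have (the derivations of (2.105) and (2.108) are Exercises 2.29 and 2.30, respectively):
      $$\operatorname{cov}[z] = R^{-1} = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^\top \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^\top \end{pmatrix} \tag{2.105}$$
      $$\mathbb{E}[z] = R^{-1} m = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix} \tag{2.108}$$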
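The block identities (2.104), (2.105) and (2.108) can be sanity-checked numerically without working through Exercises 2.29 and 2.30. This is a minimal sketch assuming NumPy; the dimensions and parameter values are arbitrary assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 2, 3                                  # dim(x) = M, dim(y) = D
mu = rng.normal(size=M)
A = rng.normal(size=(D, M))                  # A maps x (M-dim) to the mean of y (D-dim)
b = rng.normal(size=D)
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])     # prior precision (symmetric positive definite)
L = np.diag([4.0, 2.0, 1.0])                 # observation precision

# Precision matrix and linear coefficient of the joint p(z), z = (x, y).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A, L]])                  # (2.104)
m = np.concatenate([Lam @ mu - A.T @ L @ b, L @ b])

# Closed-form covariance and mean of z.
Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)
cov_z = np.block([[Lam_inv, Lam_inv @ A.T],
                  [A @ Lam_inv, L_inv + A @ Lam_inv @ A.T]])   # (2.105)
mean_z = np.concatenate([mu, A @ mu + b])                      # (2.108)

cov_error = np.max(np.abs(np.linalg.inv(R) - cov_z))
mean_error = np.max(np.abs(np.linalg.solve(R, m) - mean_z))
print(cov_error, mean_error)
```

Both residuals should be at round-off level, confirming that $R^{-1}$ and $R^{-1}m$ have the stated block structure for this example.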
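  55. Derivation of Marginal Distribution p(y)

    • From the definition of the marginal distribution,
      $$\underbrace{p(y)}_{\text{unknown}} = \int \underbrace{p(z)}_{\text{known}}\, dx$$
    • Recall the marginal distribution of a partitioned Gaussian:
      $$p(x_a) = \int p(x_a, x_b)\, dx_b = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}) \tag{2.98}$$
    • The mean and covariance of $p(z)$:
      $$\mathbb{E}[z] = \mathbb{E}\left[ \begin{pmatrix} x \\ y \end{pmatrix} \right] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix} \tag{2.108}$$
      $$\operatorname{cov}[z] = \operatorname{cov}\left[ \begin{pmatrix} x \\ y \end{pmatrix} \right] = \begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^\top \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^\top \end{pmatrix} \tag{2.105}$$
    • Reading off the blocks that belong to $y$:
      $$\mathbb{E}[y] = A\mu + b \tag{2.109}$$
      $$\operatorname{cov}[y] = L^{-1} + A \Lambda^{-1} A^\top \tag{2.110}$$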
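  60. Derivation of Conditional Distribution p(x|y)

    • From the definition of the conditional distribution,
      $$\underbrace{p(x \mid y)}_{\text{unknown}} = \frac{\overbrace{p(z)}^{\text{known}}}{p(y)}$$
    • Recall the conditional distribution of a partitioned Gaussian:
      $$p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Lambda_{aa}^{-1}) \tag{2.96}$$
      $$\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) \tag{2.97}$$
    • The mean and precision of $p(z)$:
      $$\mathbb{E}[z] = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix} \tag{2.108}$$
      $$R = \begin{pmatrix} \Lambda + A^\top L A & -A^\top L \\ -L A & L \end{pmatrix} \tag{2.104}$$
    • Applying (2.96) and (2.97) with $x_a = x$ and $x_b = y$:
      $$\mathbb{E}[x \mid y] = (\Lambda + A^\top L A)^{-1} \left\{ A^\top L (y - b) + \Lambda \mu \right\} \tag{2.111}$$
      $$\operatorname{cov}[x \mid y] = (\Lambda + A^\top L A)^{-1} \tag{2.112}$$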
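  61. Marginal and Conditional Gaussians

    • Given the following distributions:
      $$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}) \tag{2.113}$$
      $$p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1}) \tag{2.114}$$
    • The marginal distribution of $y$:
      $$p(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A \Lambda^{-1} A^\top) \tag{2.115}$$
    • The conditional distribution of $x$ given $y$:
      $$p(x \mid y) = \mathcal{N}\left(x \mid \Sigma \left\{ A^\top L (y - b) + \Lambda \mu \right\}, \Sigma\right) \tag{2.116}$$
      where
      $$\Sigma = (\Lambda + A^\top L A)^{-1} \tag{2.117}$$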
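To make the prior/likelihood/posterior reading concrete, here is a small sketch of (2.115)-(2.117) in NumPy. It is not the code from the linked notebook; the function names and the one-dimensional "target location and sensor" numbers are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def marginal_y(mu, Lam, A, b, L):
    """Mean and covariance of p(y), equation (2.115)."""
    return A @ mu + b, np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

def posterior_x_given_y(mu, Lam, A, b, L, y):
    """Mean and covariance of p(x | y), equations (2.116)-(2.117)."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)
    mean = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
    return mean, Sigma

# 1-D example: prior N(0, 1) on the location, sensor y = x + noise with precision 4.
mu = np.array([0.0])
Lam = np.array([[1.0]])
A = np.array([[1.0]])
b = np.array([0.0])
L = np.array([[4.0]])
y = np.array([1.0])

y_mean, y_cov = marginal_y(mu, Lam, A, b, L)
post_mean, post_cov = posterior_x_given_y(mu, Lam, A, b, L, y)
print(y_mean, y_cov)        # marginal: mean 0, variance 1/4 + 1 = 1.25
print(post_mean, post_cov)  # posterior: mean 0.8, variance 0.2
```

In this example the posterior precision is the sum of the prior precision (1) and the observation precision (4), and the posterior mean is pulled from the prior mean 0 toward the observation 1 in proportion to those precisions.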
  62. Demonstration of Bayes’ theorem for Gaussian variables

    See the Jupyter Notebook: https://gist.github.com/eqs/a9f66d8a702da6ed7ff5a5b92aaa5811
  63. §2.3.4 Maximum likelihood for the Gaussian

  65. A frequentist treatment vs. a Bayesian treatment

    To model the distribution of a random variable with a parametric distribution, given observations:
    • Frequentist: choose specific values of the parameters by optimizing some criterion (e.g. the likelihood).
    • Bayesian: introduce a prior over the parameters and compute the corresponding posterior.
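  66. Maximum likelihood estimation for the Gaussian

    Maximize the likelihood with respect to $\mu$ and $\Sigma^{-1}$ (i.e. the precision $\Lambda$):
    $$p(X \mid \mu, \Sigma) = \prod_{n=1}^{N} \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \right)$$
    Taking the logarithm, we obtain:
    $$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \tag{2.118}$$
    The log-likelihood depends on the data set only through the two quantities (the proof is on the next slide):
    $$\sum_{n=1}^{N} x_n, \qquad \sum_{n=1}^{N} x_n x_n^\top \tag{2.119}$$
    These are known as the sufficient statistics for the Gaussian distribution.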
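  67. Proof: the log-likelihood depends on the data set only through the sufficient statistics

    $$\begin{aligned}
    -\frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu)
    &= -\frac{1}{2} \sum_{n=1}^{N} x_n^\top \Sigma^{-1} x_n + \left( \sum_{n=1}^{N} x_n \right)^\top \Sigma^{-1} \mu - \frac{N}{2} \mu^\top \Sigma^{-1} \mu \\
    &= \underbrace{-\frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \sum_{n=1}^{N} x_n x_n^\top \right)}_{\text{(C.9) and linearity of the trace}} + \left( \sum_{n=1}^{N} x_n \right)^\top \Sigma^{-1} \mu - \frac{N}{2} \mu^\top \Sigma^{-1} \mu \\
    &= -\frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \langle x x^\top \rangle \right) + \langle x \rangle^\top \Sigma^{-1} \mu - \frac{N}{2} \mu^\top \Sigma^{-1} \mu
    \end{aligned}$$
    where
    $$\langle x \rangle \equiv \sum_{n=1}^{N} x_n, \qquad \langle x x^\top \rangle \equiv \sum_{n=1}^{N} x_n x_n^\top$$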
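A quick numerical check of this claim: the log-likelihood evaluated from the raw data should equal the one evaluated from the two sums alone. The sketch below assumes NumPy; the function names, the synthetic data and the parameter values at which the likelihood is evaluated are illustrative assumptions.

```python
import numpy as np

def log_likelihood_direct(X, mu, Sigma):
    """Gaussian log-likelihood (2.118), summing over every data point."""
    N, D = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X - mu
    quad = np.einsum("ni,ij,nj->", diff, Sigma_inv, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * N * D * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * quad

def log_likelihood_from_stats(N, sum_x, sum_xxT, mu, Sigma):
    """Same log-likelihood, using only N, sum_n x_n and sum_n x_n x_n^T."""
    D = mu.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * N * D * np.log(2 * np.pi) - 0.5 * N * logdet
            - 0.5 * np.trace(Sigma_inv @ sum_xxT)
            + sum_x @ Sigma_inv @ mu
            - 0.5 * N * (mu @ Sigma_inv @ mu))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))                  # synthetic data set (assumption)
mu = np.array([0.1, -0.2])                    # arbitrary parameter values
Sigma = np.array([[1.5, 0.4], [0.4, 0.8]])

sum_x = X.sum(axis=0)                         # <x>
sum_xxT = X.T @ X                             # <x x^T>
ll_direct = log_likelihood_direct(X, mu, Sigma)
ll_stats = log_likelihood_from_stats(X.shape[0], sum_x, sum_xxT, mu, Sigma)
print(ll_direct, ll_stats)
```

Once `sum_x` and `sum_xxT` are stored, the raw data points are no longer needed to evaluate the likelihood at any $(\mu, \Sigma)$.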
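  70. Derivation of ML solutions

    The log-likelihood (the notation is defined on the previous slide):
    $$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\Sigma| - \frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \langle x x^\top \rangle \right) + \langle x \rangle^\top \Sigma^{-1} \mu - \frac{N}{2} \mu^\top \Sigma^{-1} \mu \qquad (2.118')$$
    Set the derivatives of the log-likelihood with respect to $\mu$ and $\Sigma^{-1}$ to zero:
    $$\frac{\partial}{\partial \mu} \ln p(X \mid \mu, \Sigma) = 0, \qquad \frac{\partial}{\partial \Sigma^{-1}} \ln p(X \mid \mu, \Sigma) = O$$
    Then we obtain the ML solutions:
    $$\mu_{\mathrm{ML}} = \frac{1}{N} \langle x \rangle = \frac{1}{N} \sum_{n=1}^{N} x_n \tag{2.121}$$
    $$\Sigma_{\mathrm{ML}} = \frac{1}{N} \langle x x^\top \rangle - \mu_{\mathrm{ML}} \mu_{\mathrm{ML}}^\top = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top \tag{2.122}$$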
  71. Derivation of the ML solution µML Easy! 42/84

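  72. Derivation of the ML solution ΣML (1/2)

    The log-likelihood (the notation is defined on the previous slides):
    $$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\Sigma| - \frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \langle x x^\top \rangle \right) + \langle x \rangle^\top \Sigma^{-1} \mu - \frac{N}{2} \mu^\top \Sigma^{-1} \mu \qquad (2.118')$$
    Differentiating (2.118') with respect to $\Sigma^{-1}$, we obtain
    $$\begin{aligned}
    \frac{\partial}{\partial \Sigma^{-1}} \ln p(X \mid \mu, \Sigma)
    &= \underbrace{\frac{N}{2} \Sigma^\top}_{|\Sigma|^{-1} = |\Sigma^{-1}| \text{ and (C.28)}} - \underbrace{\frac{1}{2} \langle x x^\top \rangle^\top}_{\text{(C.24)}} + \underbrace{\left( \mu \langle x \rangle^\top \right)^\top}_{\text{(C.24)}} - \underbrace{\frac{N}{2} \left( \mu \mu^\top \right)^\top}_{\text{(C.24)}} \\
    &= \frac{N}{2} \Sigma - \frac{1}{2} \langle x x^\top \rangle + \langle x \rangle \mu^\top - \frac{N}{2} \mu \mu^\top
    \end{aligned}$$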
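  73. Derivation of the ML solution ΣML (2/2)

    Setting $\frac{\partial}{\partial \Sigma^{-1}} \ln p(X \mid \mu, \Sigma)$ to zero and solving for $\Sigma$, with $\mu_{\mathrm{ML}} = \langle x \rangle / N$, we have
    $$\Sigma_{\mathrm{ML}} = \frac{1}{N} \langle x x^\top \rangle - \mu_{\mathrm{ML}} \mu_{\mathrm{ML}}^\top = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top \tag{2.122}$$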
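  74. Unbiased estimator

    Evaluating $\mathbb{E}[\mu_{\mathrm{ML}}]$ and $\mathbb{E}[\Sigma_{\mathrm{ML}}]$ under the true distribution, we obtain:
    $$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu \tag{2.123}$$
    $$\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N - 1}{N} \Sigma \tag{2.124}$$
    We can correct the bias of $\Sigma_{\mathrm{ML}}$ by defining
    $$\widetilde{\Sigma} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top \tag{2.125}$$
    $$\Rightarrow\; \mathbb{E}[\widetilde{\Sigma}] = \Sigma$$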
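The three estimators (2.121), (2.122) and (2.125) can be written in a few lines. The sketch below assumes NumPy; the function name `gaussian_ml` and the synthetic data are illustrative, and the comparison against `np.cov` is only a cross-check, not part of the derivation.

```python
import numpy as np

def gaussian_ml(X):
    """ML mean (2.121), ML covariance (2.122) and bias-corrected covariance (2.125)."""
    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    diff = X - mu_ml
    scatter = diff.T @ diff                 # sum_n (x_n - mu_ML)(x_n - mu_ML)^T
    Sigma_ml = scatter / N                  # biased: E[Sigma_ML] = (N-1)/N * Sigma
    Sigma_unbiased = scatter / (N - 1)      # unbiased
    return mu_ml, Sigma_ml, Sigma_unbiased

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))                # small synthetic sample (assumption)
mu_ml, Sigma_ml, Sigma_unbiased = gaussian_ml(X)

# Equivalent form of (2.122) via the sufficient statistics.
N = X.shape[0]
Sigma_from_stats = X.T @ X / N - np.outer(mu_ml, mu_ml)

print(np.allclose(Sigma_ml, Sigma_from_stats))
print(np.allclose(Sigma_unbiased, Sigma_ml * N / (N - 1)))
```

The two estimators differ only by the factor $N/(N-1)$, so the correction matters mainly for small $N$.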
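  75. Solution of Exercise 2.35 (1/3)

    • Derivation of (2.62) using (2.59): see the text.
    • Derivation of (2.291):
      $$\mathbb{E}[x_n x_m^\top] = \mu \mu^\top + I_{nm} \Sigma \tag{2.291}$$
      (i) In the case $n = m$, from (2.62) we have
      $$\mathbb{E}[x_n x_m^\top] = \mu \mu^\top + \Sigma$$
      (ii) In the case $n \neq m$,
      $$\begin{aligned}
      \mathbb{E}[x_n x_m^\top] &= \iint p(x_n, x_m)\, x_n x_m^\top\, dx_n\, dx_m \\
      &= \left( \int p(x_n)\, x_n\, dx_n \right) \left( \int p(x_m)\, x_m^\top\, dx_m \right) \qquad (\because \text{i.i.d.}) \\
      &= \mathbb{E}[x_n]\, \mathbb{E}[x_m]^\top = \mu \mu^\top
      \end{aligned}$$
    • Derivation of (2.124): next slide.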
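  76. Solution of Exercise 2.35 (2/3)

    $$\begin{aligned}
    \mathbb{E}[\Sigma_{\mathrm{ML}}]
    ={}& \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top \right] \\
    ={}& \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top \right] \\
    ={}& \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ \left\{ (x_n - \mu) - (\mu_{\mathrm{ML}} - \mu) \right\} \left\{ (x_n - \mu) - (\mu_{\mathrm{ML}} - \mu) \right\}^\top \right] \\
    ={}& \underbrace{\frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (x_n - \mu)(x_n - \mu)^\top \right]}_{\Sigma} - 2 \cdot \underbrace{\frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (x_n - \mu)(\mu_{\mathrm{ML}} - \mu)^\top \right]}_{\mathbb{E}\left[ (\mu_{\mathrm{ML}} - \mu)(\mu_{\mathrm{ML}} - \mu)^\top \right]} \\
    & + \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (\mu_{\mathrm{ML}} - \mu)(\mu_{\mathrm{ML}} - \mu)^\top \right]
    \end{aligned}$$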
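  77. Solution of Exercise 2.35 (3/3)

    $$\begin{aligned}
    \mathbb{E}[\Sigma_{\mathrm{ML}}]
    &= \Sigma - \mathbb{E}\left[ (\mu_{\mathrm{ML}} - \mu)(\mu_{\mathrm{ML}} - \mu)^\top \right] \\
    &= \Sigma - \left( \mathbb{E}\left[ \mu_{\mathrm{ML}} \mu_{\mathrm{ML}}^\top \right] - \mu \mu^\top \right) \\
    &= \Sigma - \left( \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} \mathbb{E}\left[ x_n x_m^\top \right] - \mu \mu^\top \right) \\
    &= \Sigma - \left( \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} \left( \mu \mu^\top + I_{nm} \Sigma \right) - \mu \mu^\top \right) \qquad (\because (2.291)) \\
    &= \Sigma - \left( \frac{1}{N^2} \left( N^2 \mu \mu^\top + N \Sigma \right) - \mu \mu^\top \right) \\
    &= \frac{N - 1}{N} \Sigma
    \end{aligned}$$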
  78. §2.3.5 Sequential estimation

  79. Motivation of Sequential Estimation

    • In the previous section, all data points $x_1, \cdots, x_N$ are used simultaneously for the parameter estimation.
    • Sequential methods allow data points to be processed one at a time and then discarded.
    • Sequential methods are important for
      • on-line applications
      • large data sets, where batch processing of all data points at once is infeasible
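  81. What is the contribution of the latest data?

    Consider the ML estimator of the mean based on N data points:
    $$\begin{aligned}
    \mu_{\mathrm{ML}}^{(N)} &= \frac{1}{N} \sum_{n=1}^{N} x_n \\
    &= \frac{1}{N} x_N + \frac{1}{N} \sum_{n=1}^{N-1} x_n \\
    &= \frac{1}{N} x_N + \frac{N - 1}{N} \cdot \underbrace{\frac{1}{N - 1} \sum_{n=1}^{N-1} x_n}_{\mu_{\mathrm{ML}}^{(N-1)}} \\
    &= \underbrace{\mu_{\mathrm{ML}}^{(N-1)}}_{\text{old estimator}} + \underbrace{\frac{1}{N}}_{\text{inversely proportional to } N} \underbrace{\left( x_N - \mu_{\mathrm{ML}}^{(N-1)} \right)}_{\text{contribution of the } N\text{-th data point}} \qquad (2.126)
    \end{aligned}$$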
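The update (2.126) translates directly into a streaming loop that never stores past data points. The following is a minimal sketch in plain Python; the generator name `sequential_mean` and the sample values are hypothetical.

```python
def sequential_mean(stream):
    """Yield the running ML mean after each data point, using update (2.126)."""
    mu = 0.0  # placeholder; overwritten exactly by the first update since 1/N = 1
    for n, x in enumerate(stream, start=1):
        mu = mu + (x - mu) / n  # old estimate + (1/N) * (x_N - old estimate)
        yield mu

data = [2.0, 4.0, 6.0, 8.0]
estimates = list(sequential_mean(data))
print(estimates)  # [2.0, 3.0, 4.0, 5.0]
```

Each printed value equals the batch mean of the points seen so far, and the weight `1 / n` on the correction shrinks as more data arrive.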
  82. Derive a more general formulation of sequential learning

    • Consider random variables $\theta$ and $z$ governed by $p(z, \theta)$.
    • The regression function is given by:
      $$f(\theta) \equiv \mathbb{E}[z \mid \theta] = \int z\, p(z \mid \theta)\, dz \tag{2.127}$$
    • [Figure 2.10: samples of $z$ plotted against $\theta$, together with the regression function $f(\theta)$.]
    • Now, we want to find the root $\theta^*$ at which $f(\theta^*) = 0$ without modeling $f(\theta)$.
    • A general procedure for solving such problems was given by Robbins and Monro (1951).
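  83. Robbins-Monro Algorithm

    Assumptions:
    • The conditional variance of $z$ is finite:
      $$\mathbb{E}[(z - f)^2 \mid \theta] < \infty \tag{2.128}$$
    • $f(\theta) > 0$ for $\theta > \theta^*$ and $f(\theta) < 0$ for $\theta < \theta^*$.
    • A unique root actually exists.
    [Figure 2.10: samples of $z$ plotted against $\theta$, together with the regression function $f(\theta)$.]
    The procedure for estimating the root $\theta^*$ is given by
    $$\theta^{(N)} = \theta^{(N-1)} - a_{N-1}\, z(\theta^{(N-1)}) \tag{2.129}$$
    where $z(\theta^{(N-1)})$ is an observed value of $z$ when $\theta$ takes the value $\theta^{(N-1)}$.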
  84. Robbins-Monro Algorithm

    The sequence of successive estimates of the root $\theta^*$:
    $$\theta^{(N)} = \theta^{(N-1)} - a_{N-1}\, z(\theta^{(N-1)}) \tag{2.129}$$
    where the sequence of positive numbers $\{a_N\}$ satisfies:
    $$\lim_{N \to \infty} a_N = 0, \qquad \sum_{N=1}^{\infty} a_N = \infty, \qquad \sum_{N=1}^{\infty} a_N^2 < \infty \qquad (2.130\text{-}2.132)$$
    These conditions ensure that the sequence of estimates converges to the root with probability one (Robbins and Monro, 1951; Fukunaga, 1990).
  85. Solve a general ML problem sequentially using the Robbins-Monro Algorithm

    By definition, the ML solution $\theta_{\mathrm{ML}}$ is a stationary point of the log-likelihood and hence satisfies:
    $$\left. \frac{\partial}{\partial \theta} \left\{ -\frac{1}{N} \sum_{n=1}^{N} \ln p(x_n \mid \theta) \right\} \right|_{\theta_{\mathrm{ML}}} = 0 \tag{2.133}$$
    Exchanging the derivative and the sum, and taking the limit $N \to \infty$, we have:
    $$-\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \theta} \ln p(x_n \mid \theta) = \underbrace{\mathbb{E}_x\left[ -\frac{\partial}{\partial \theta} \ln p(x \mid \theta) \right]}_{\text{regression function}} \tag{2.134}$$
    So finding the ML solution corresponds to finding the root of a regression function. We can therefore apply the Robbins-Monro procedure!
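  86. Solve a general ML problem sequentially using the Robbins-Monro Algorithm

    With $z = -\frac{\partial}{\partial \theta} \ln p(x \mid \theta)$, the Robbins-Monro procedure (2.129) takes the form:
    $$\theta^{(N)} = \theta^{(N-1)} - a_{N-1} \frac{\partial}{\partial \theta^{(N-1)}} \left[ -\ln p(x_N \mid \theta^{(N-1)}) \right] \tag{2.135}$$
    Example of estimating $\mu_{\mathrm{ML}}^{(N)}$ of a Gaussian:
    $$z = -\frac{\partial}{\partial \mu_{\mathrm{ML}}} \ln p(x \mid \mu_{\mathrm{ML}}, \sigma^2) = -\frac{1}{\sigma^2} (x - \mu_{\mathrm{ML}}) \tag{2.136}$$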
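  87. Solve a general ML problem sequentially using the Robbins-Monro Algorithm

    $$z = -\frac{\partial}{\partial \mu_{\mathrm{ML}}} \ln p(x \mid \mu_{\mathrm{ML}}, \sigma^2) = -\frac{1}{\sigma^2} (x - \mu_{\mathrm{ML}}) \tag{2.136}$$
    Taking the expectation over $x$, whose true mean is $\mu$, the regression function is a straight line in $\mu_{\mathrm{ML}}$ with root $\mu_{\mathrm{ML}} = \mu$:
    $$\mathbb{E}[z \mid \mu_{\mathrm{ML}}] = \frac{1}{\sigma^2} (\mu_{\mathrm{ML}} - \mu)$$
    [Figures 2.10 and 2.11: the general regression function $f(\theta)$, and the Gaussian case with $z$ plotted against $\mu_{\mathrm{ML}}$ and the distribution $p(z \mid \mu)$.]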
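For the Gaussian mean, the Robbins-Monro update can be run by hand. The sketch below is plain Python with hypothetical names (`robbins_monro`, `observe_z`) and made-up data; it assumes the step sizes $a_{N-1} = \sigma^2 / N$, which satisfy (2.130)-(2.132) and make the update coincide with the sequential mean (2.126).

```python
def robbins_monro(theta0, observe_z, step_sizes, data):
    """Apply theta <- theta - a * z(theta) once per data point, as in (2.129)."""
    theta = theta0
    for a, x in zip(step_sizes, data):
        theta = theta - a * observe_z(x, theta)
    return theta

sigma2 = 2.0                       # known variance (assumed value)
data = [1.0, 3.0, 2.0, 6.0]        # made-up observations

def observe_z(x, mu):
    """z from (2.136): negative derivative of ln p(x | mu, sigma^2) w.r.t. mu."""
    return -(x - mu) / sigma2

# a_{N-1} = sigma^2 / N for N = 1, 2, ...
steps = [sigma2 / n for n in range(1, len(data) + 1)]

mu_rm = robbins_monro(0.0, observe_z, steps, data)
print(mu_rm)  # 3.0, the sample mean of the data
```

With this step-size choice the factor $\sigma^2$ cancels and each step becomes $\mu \leftarrow \mu + (x_N - \mu)/N$; other admissible step sizes would still converge in the limit but would not reproduce the batch mean exactly at finite $N$.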
  88. §2.3.6 Bayesian inference for the Gaussian

  90. Recap: Bayesian inference for the binomial distribution and conjugacy

    The binomial distribution:
    $$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1 - \mu)^{N - m} \tag{2.9}$$
    To develop a Bayesian treatment for modeling the distribution of observations, we need to introduce a prior $p(\mu)$:
    $$p(\mu \mid N, m) \propto \mathrm{Bin}(m \mid N, \mu)\, p(\mu)$$
    If we choose the prior to be of the form $c\, \mu^{\beta} (1 - \mu)^{\gamma}$, the corresponding posterior $p(\mu \mid N, m)$ has the same functional form as the prior (conjugacy). The beta distribution has this conjugacy property:
    $$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a - 1} (1 - \mu)^{b - 1} \tag{2.13}$$
  91. The task of inferring the mean µ (the variance σ² is known)

    • A set of N observations: $X = \{x_1, \cdots, x_N\}$
    • The likelihood function (the probability of the observed data given $\mu$):
      $$p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} \tag{2.137}$$
    • NOTE: $p(X \mid \mu)$ is not a probability distribution over $\mu$ and is not normalized.
    • By introducing a prior $p(\mu)$, the posterior is given by
      $$p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu) \tag{2.139}$$
    • What prior $p(\mu)$ should we choose?
  92. The task of inferring the mean µ (the variance σ² is known)

    • Recall:
      $$p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \underbrace{\exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}}_{\text{the exponential of a quadratic form in } \mu} \tag{2.137}$$
    • The likelihood takes the form of the exponential of a quadratic form in $\mu$.
    • Thus, if we choose a Gaussian as the prior, the posterior will also be Gaussian.
    • We therefore take our prior to be
      $$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \tag{2.138}$$
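  93. The task of inferring the mean µ (the variance σ² is known)

    • The likelihood function:
      $$p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\} \tag{2.137}$$
    • The prior distribution:
      $$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2) \tag{2.138}$$
    • Using
      $$p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu) \tag{2.139}$$
      and normalizing, we obtain the posterior (derivation: Exercise 2.38):
      $$p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \tag{2.140}$$
      $$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{\mathrm{ML}} \tag{2.141}$$
      $$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \tag{2.142}$$
    • The solution for the D-dimensional Gaussian case is Exercise 2.40.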
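  94. Solution of Exercise 2.38 (1/2)

    From the product rule of probability,
    $$\begin{aligned}
    \ln p(\mu \mid X) &= \ln p(X \mid \mu) + \ln p(\mu) + \text{const.} \\
    &= \sum_{n=1}^{N} \ln p(x_n \mid \mu) + \ln p(\mu) + \text{const.} \\
    &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 + \text{const.} \\
    &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (\mu^2 - 2\mu x_n + x_n^2) - \frac{1}{2\sigma_0^2} (\mu^2 - 2\mu\mu_0 + \mu_0^2) + \text{const.} \\
    &= -\frac{1}{2\sigma^2} \left( N\mu^2 - 2\mu \langle x \rangle + \langle x^2 \rangle \right) - \frac{1}{2\sigma_0^2} (\mu^2 - 2\mu\mu_0 + \mu_0^2) + \text{const.}
    \end{aligned}$$
    where
    $$\langle x \rangle \equiv \sum_{n=1}^{N} x_n, \qquad \langle x^2 \rangle \equiv \sum_{n=1}^{N} x_n^2$$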
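  95. Solution of Exercise 2.38 (2/2)

    $$\begin{aligned}
    \ln p(\mu \mid X) &= -\frac{1}{2\sigma^2} \left( N\mu^2 - 2\mu \langle x \rangle + \langle x^2 \rangle \right) - \frac{1}{2\sigma_0^2} (\mu^2 - 2\mu\mu_0 + \mu_0^2) + \text{const.} \\
    &= -\frac{1}{2} \left\{ \left( \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{\langle x \rangle}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \mu \right\} + \text{const.} \\
    &= -\frac{1}{2} \cdot \frac{N\sigma_0^2 + \sigma^2}{\sigma^2 \sigma_0^2} \left( \mu^2 - 2\, \frac{\langle x \rangle \sigma_0^2 + \mu_0 \sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu \right) + \text{const.} \\
    &= -\frac{(\mu - \mu_N)^2}{2\sigma_N^2} + \text{const.}
    \end{aligned}$$
    where, using $\langle x \rangle = N \mu_{\mathrm{ML}}$,
    $$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{\mathrm{ML}} \tag{2.141}$$
    $$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \tag{2.142}$$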
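  97. Interpretation of the posterior’s parameters

    $$p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2) \tag{2.140}$$
    $$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{\mathrm{ML}} \tag{2.141}$$
    $$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \tag{2.142}$$
    • $\mu_N$ lies between $\mu_0$ and $\mu_{\mathrm{ML}}$.
    • The posterior precision $1/\sigma_N^2$ is a monotonically increasing function of $N$.
    • [Figure 2.12: the posterior over $\mu$ for N = 0, 1, 2 and 10.]
    • If we have no observations ($N = 0$), the posterior mean is $\mu_N = \mu_0$.
    • $N \to \infty \Rightarrow \mu_N \to \mu_{\mathrm{ML}}$ and $\sigma_N^2 \to 0$.
    • $\sigma_0^2 \to \infty$ (i.e. no prior information) $\Rightarrow \mu_N \to \mu_{\mathrm{ML}}$ and $\sigma_N^2 \to \sigma^2 / N$.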
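The posterior parameters (2.141)-(2.142) are simple enough to compute by hand, which makes the limiting behaviour easy to see. Below is a minimal plain-Python sketch; the function name and the example numbers are assumptions made for illustration.

```python
def posterior_mean_params(xs, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of mu for known variance sigma2, (2.141)-(2.142)."""
    N = len(xs)
    if N == 0:
        return mu0, sigma0_2            # no data: the posterior equals the prior
    mu_ml = sum(xs) / N
    denom = N * sigma0_2 + sigma2
    mu_N = sigma2 / denom * mu0 + N * sigma0_2 / denom * mu_ml   # (2.141)
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)               # (2.142)
    return mu_N, sigma_N2

xs = [1.0, 2.0, 3.0]                    # made-up observations, mu_ML = 2.0
mu_N, sigma_N2 = posterior_mean_params(xs, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
print(mu_N, sigma_N2)                   # 1.5 0.25

# A very broad prior pushes the result toward (mu_ML, sigma2 / N).
mu_broad, var_broad = posterior_mean_params(xs, sigma2=1.0, mu0=0.0, sigma0_2=1e9)
print(mu_broad, var_broad)
```

In the first call the posterior mean 1.5 sits between the prior mean 0 and the ML estimate 2, weighted 1:3 because three unit-precision observations are combined with one unit-precision prior.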
  100. The Bayesian paradigm naturally leads to a sequential view of the inference problem.

    $$p(\mu \mid X) \propto p(\mu) \prod_{n=1}^{N} p(x_n \mid \mu) = \underbrace{\left[ p(\mu) \prod_{n=1}^{N-1} p(x_n \mid \mu) \right]}_{\text{the posterior after observing } N - 1 \text{ data points}} \underbrace{p(x_N \mid \mu)}_{\text{the likelihood of the } N\text{-th data point}} \tag{2.144}$$
    • The posterior after observing $N - 1$ data points (up to normalization) can be viewed as the prior used to arrive at the posterior after observing the N-th data point.
    • The sequential view is very general and applies to any problem in which the observed data are assumed to be i.i.d.
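The factorization (2.144) suggests processing one observation at a time, reusing each posterior as the next prior. The sketch below is plain Python and is not the code from the linked notebook; the function name `update_one` and the data are illustrative, and the single-point update is just (2.141)-(2.142) with $N = 1$ written in precision form.

```python
def update_one(mu_prev, var_prev, x, sigma2):
    """One Bayesian update of N(mu | mu_prev, var_prev) with a single observation x."""
    var_new = 1.0 / (1.0 / var_prev + 1.0 / sigma2)        # precisions add
    mu_new = var_new * (mu_prev / var_prev + x / sigma2)   # precision-weighted mean
    return mu_new, var_new

sigma2 = 1.0                      # known noise variance (assumed value)
mu0, var0 = 0.0, 1.0              # prior parameters (assumed values)
xs = [1.0, 2.0, 3.0]              # made-up observations

# Sequential: yesterday's posterior is today's prior.
mu_seq, var_seq = mu0, var0
for x in xs:
    mu_seq, var_seq = update_one(mu_seq, var_seq, x, sigma2)

# Batch: plug all N points into (2.141)-(2.142) at once.
N = len(xs)
mu_ml = sum(xs) / N
mu_batch = (sigma2 * mu0 + N * var0 * mu_ml) / (N * var0 + sigma2)
var_batch = 1.0 / (1.0 / var0 + N / sigma2)

print(mu_seq, var_seq)      # approximately 1.5 and 0.25
print(mu_batch, var_batch)  # 1.5 and 0.25
```

The sequential and batch results agree up to floating-point round-off, which is the numerical counterpart of the bracketed term in (2.144).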
  101. Demonstration of Bayesian sequential estimation

    See the Jupyter Notebook: https://gist.github.com/eqs/a9f66d8a702da6ed7ff5a5b92aaa5811
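  104. The task of inferring the precision $\lambda \equiv 1/\sigma^2$ (the mean $\mu$ is known)

    • A set of $N$ observations: $X = \{x_1, \dots, x_N\}$
    • The likelihood function (the probability of the observed data given $\lambda$):
      $p(X|\lambda) = \prod_{n=1}^{N} p(x_n|\lambda) \propto \underbrace{\lambda^{N/2}}_{\text{a power of } \lambda} \; \underbrace{\exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}}_{\text{the exp of a linear function of } \lambda}$ (2.145)
    • The corresponding conjugate prior $p(\lambda)$ should have the same functional form as (2.145): a power of $\lambda$ times the exponential of a linear function of $\lambda$.
    • We therefore take our prior to be the gamma distribution:
      $\mathrm{Gam}(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)$ (2.146)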
  105. The gamma distribution

    $\mathrm{Gam}(\lambda|a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda)$ (2.146)
    $\mathbb{E}[\lambda] = \frac{a}{b}$ (2.147)
    $\mathrm{var}[\lambda] = \frac{a}{b^2}$ (2.148)
    [Figure: three plots over λ for (a, b) = (0.1, 0.1), (1, 1) and (4, 6)]
    • (2.146) is correctly normalized by $\Gamma(a)$ (ex. 2.41).
    • If $a > 0$, the distribution has a finite integral.
    • If $a \geqslant 1$, the distribution itself is finite.
    • Derivation of (2.147) and (2.148): ex. 2.42
  106. Solution of Exercise 2.41 Easy!

  107. Solution of Exercise 2.42 Easy!
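
    Both exercises take only a few lines. The sketch below is one possible route, not necessarily the one intended in the talk; the substitution variable $t = b\lambda$ is introduced here for illustration.

```latex
% Exercise 2.41: normalization of (2.146), substituting t = b\lambda
\int_0^\infty \mathrm{Gam}(\lambda|a,b)\,d\lambda
  = \frac{1}{\Gamma(a)} \int_0^\infty (b\lambda)^{a-1} e^{-b\lambda}\, b\,d\lambda
  = \frac{1}{\Gamma(a)} \int_0^\infty t^{a-1} e^{-t}\,dt = 1

% Exercise 2.42: moments, using \Gamma(a+1) = a\,\Gamma(a)
\mathbb{E}[\lambda]
  = \frac{b^a}{\Gamma(a)} \int_0^\infty \lambda^{a} e^{-b\lambda}\,d\lambda
  = \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma(a+1)}{b^{a+1}} = \frac{a}{b}

\mathbb{E}[\lambda^2]
  = \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma(a+2)}{b^{a+2}} = \frac{a(a+1)}{b^2}

\mathrm{var}[\lambda]
  = \frac{a(a+1)}{b^2} - \frac{a^2}{b^2} = \frac{a}{b^2}
```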

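  109. The task of inferring the precision $\lambda \equiv 1/\sigma^2$ (the mean $\mu$ is known)

    • The likelihood:
      $p(X|\lambda) = \prod_{n=1}^{N} p(x_n|\lambda) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}$ (2.145)
    • Consider a prior distribution: $p(\lambda) = \mathrm{Gam}(\lambda|a_0, b_0)$
    • If we multiply the prior by the likelihood (2.145), we obtain the posterior:
      $p(\lambda|X) \propto \lambda^{a_0 - 1}\lambda^{N/2} \exp\left\{-b_0\lambda - \frac{\lambda}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right\}$ (2.149)
      $= \underbrace{\lambda^{a_0 + N/2 - 1}}_{\lambda^{a-1}} \exp\Bigl\{\underbrace{-\Bigl(b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2\Bigr)\lambda}_{-b\lambda}\Bigr\}$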
  110. The task of inferring the precision $\lambda \equiv 1/\sigma^2$ (the mean $\mu$ is known)

    • Therefore, the posterior is $p(\lambda|X) = \mathrm{Gam}(\lambda|a_N, b_N)$ with
      $a_N = a_0 + \frac{N}{2}$ (2.150)
      $b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma^2_{\mathrm{ML}}$ (2.151)
    • Interpretation of the posterior’s params
      • $a_N$ and $b_N$ can be interpreted in terms of an effective number of observations (see §2.2).
      • From (2.150) and (2.151), introducing the prior $\mathrm{Gam}(\lambda|a_0, b_0)$ corresponds to having $2a_0$ effective prior observations with variance $b_0/a_0$.
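    The update in (2.150) and (2.151) is a two-line computation. The sketch below uses a hypothetical helper `gamma_posterior` and toy data chosen so that the numbers are easy to follow.

```python
def gamma_posterior(xs, mu, a0, b0):
    """Return (a_N, b_N) for the precision of a Gaussian with known mean mu."""
    n = len(xs)
    sq_sum = sum((x - mu) ** 2 for x in xs)   # equals N * sigma^2_ML
    a_n = a0 + n / 2                          # (2.150)
    b_n = b0 + sq_sum / 2                     # (2.151)
    return a_n, b_n


xs = [1.0, 3.0, 2.0, 2.0]    # toy observations, squared deviations sum to 2
mu = 2.0                     # known mean
a_n, b_n = gamma_posterior(xs, mu, a0=1.0, b0=1.0)

prior_mean = 1.0 / 1.0       # E[lambda] = a0 / b0 under the prior
post_mean = a_n / b_n        # E[lambda] = a_N / b_N under the posterior
print(a_n, b_n, post_mean)
```

    Here the maximum likelihood precision is $1/\sigma^2_{\mathrm{ML}} = 2$, and the posterior mean of $\lambda$ lies between that value and the prior mean.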
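  111. Find a conjugate prior when both the mean and the precision are unknown

    To find a conjugate prior $p(\mu, \lambda)$, we consider the dependence of the likelihood on $\mu$ and $\lambda$:
    $p(X|\mu, \lambda) = \prod_{n=1}^{N}\left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\left\{-\frac{\lambda}{2}(x_n - \mu)^2\right\} \propto \left[\lambda^{1/2}\exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^{N} \exp\left\{\lambda\mu\sum_{n=1}^{N}x_n - \frac{\lambda}{2}\sum_{n=1}^{N}x_n^2\right\}$ (2.152)
    The prior $p(\mu, \lambda)$ should have the same functional dependence on $\mu$ and $\lambda$ as the likelihood, and should therefore take the form:
    $p(\mu, \lambda) \propto \left[\lambda^{1/2}\exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^{\beta} \exp\{c\lambda\mu - d\lambda\} = \exp\left\{-\frac{\beta\lambda}{2}(\mu - c/\beta)^2\right\} \lambda^{\beta/2} \exp\left\{-\left(d - \frac{c^2}{2\beta}\right)\lambda\right\}$ (2.153)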
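  114. Find a conjugate prior when both the mean and the precision are unknown

    • Recall that we can always write $p(\mu, \lambda) = p(\mu|\lambda)\,p(\lambda)$.
      $p(\mu, \lambda) \propto \underbrace{\exp\left\{-\frac{\beta\lambda}{2}(\mu - c/\beta)^2\right\}}_{\text{Gaussian}} \; \underbrace{\lambda^{\beta/2}\exp\left\{-\left(d - \frac{c^2}{2\beta}\right)\lambda\right\}}_{\text{gamma distribution}}$ (2.153)
    • By defining new constants $\mu_0 = c/\beta$, $a = (1 + \beta)/2$ and $b = d - c^2/(2\beta)$, and normalizing (2.153), we obtain the normal-gamma distribution:
      $p(\mu, \lambda) = \mathcal{N}(\mu|\mu_0, (\beta\lambda)^{-1})\,\mathrm{Gam}(\lambda|a, b)$ (2.154)
    • NOTE: This distribution is not simply the product of an independent Gaussian prior over $\mu$ and a gamma prior over $\lambda$, because the precision of $\mu$ is a linear function of $\lambda$.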
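    The note that (2.154) does not factorize can be checked numerically. If the density were a product f(µ)g(λ), the cross difference of log densities computed below would be exactly zero. The function name `log_normal_gamma` and the parameter values are assumptions chosen for this sketch.

```python
import math


def log_normal_gamma(mu, lam, mu0, beta, a, b):
    """Log density of N(mu | mu0, (beta*lam)^-1) * Gam(lam | a, b), eq. (2.154)."""
    log_gauss = (0.5 * math.log(beta * lam / (2.0 * math.pi))
                 - 0.5 * beta * lam * (mu - mu0) ** 2)
    log_gamma = (a * math.log(b) - math.lgamma(a)
                 + (a - 1.0) * math.log(lam) - b * lam)
    return log_gauss + log_gamma


params = dict(mu0=0.0, beta=2.0, a=5.0, b=6.0)   # arbitrary example values

# log p(mu1, lam1) + log p(mu2, lam2) - log p(mu1, lam2) - log p(mu2, lam1)
# vanishes for any density of the form f(mu) * g(lam).
cross = (log_normal_gamma(0.0, 0.5, **params)
         + log_normal_gamma(1.0, 2.0, **params)
         - log_normal_gamma(0.0, 2.0, **params)
         - log_normal_gamma(1.0, 0.5, **params))
print(cross)
```

    Only the coupling term $-\frac{\beta\lambda}{2}(\mu - \mu_0)^2$ survives in the cross difference, which is why the value is nonzero.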
  115. The conjugate priors in the case of the univariate Gaussian

    • For unknown mean $\mu$ and known precision $\lambda$, the conjugate prior is a Gaussian: $p(\mu) = \mathcal{N}(\mu|\mu_0, \lambda_0^{-1})$
    • For known mean $\mu$ and unknown precision $\lambda$, the conjugate prior is the gamma distribution: $p(\lambda) = \mathrm{Gam}(\lambda|a_0, b_0)$
    • When both the mean and the precision are unknown, the conjugate prior is the normal-gamma distribution: $p(\mu, \lambda) = \mathcal{N}(\mu|\mu_0, (\beta\lambda)^{-1})\,\mathrm{Gam}(\lambda|a, b)$ (2.154)
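  116. The conjugate priors in the case of the multivariate Gaussian

    • For unknown mean $\boldsymbol{\mu}$ and known precision $\boldsymbol{\Lambda}$, the conjugate prior is a Gaussian: $p(\boldsymbol{\mu}) = \mathcal{N}(\boldsymbol{\mu}|\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0^{-1})$
    • For known mean $\boldsymbol{\mu}$ and unknown precision $\boldsymbol{\Lambda}$, the conjugate prior is the Wishart distribution: $p(\boldsymbol{\Lambda}) = \mathcal{W}(\boldsymbol{\Lambda}|\mathbf{W}, \nu)$
    • When both the mean and the precision are unknown, the conjugate prior is the normal-Wishart distribution: $p(\boldsymbol{\mu}, \boldsymbol{\Lambda}|\boldsymbol{\mu}_0, \beta, \mathbf{W}, \nu) = \mathcal{N}(\boldsymbol{\mu}|\boldsymbol{\mu}_0, (\beta\boldsymbol{\Lambda})^{-1})\,\mathcal{W}(\boldsymbol{\Lambda}|\mathbf{W}, \nu)$ (2.157)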
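  117. The Wishart distribution

    $p(\boldsymbol{\Lambda}) = \mathcal{W}(\boldsymbol{\Lambda}|\mathbf{W}, \nu) = B\,|\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\mathbf{W}^{-1}\boldsymbol{\Lambda})\right)$ (2.155)
    • $\nu$ is the number of degrees of freedom.
    • $\mathbf{W} \in \mathbb{R}^{D \times D}$
    • $B$ is a normalization constant (2.156).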
  118. A conjugate prior over the variance and the covariance

    • Instead of working with the precision, we can consider the variance (covariance) itself. The conjugate priors are called:
      • the inverse gamma distribution (the univariate Gaussian case)
      • the inverse Wishart distribution (the multivariate Gaussian case)
    • We shall not discuss this further because we will find it more convenient to work with the precision.
  119. §2.3.7 Student’s t-distribution

  120. Marginalize the precision using a gamma prior

    • We have seen that the conjugate prior for the precision of a Gaussian is given by a gamma distribution.
    • If we have $\mathcal{N}(x|\mu, \tau^{-1})$ together with a gamma prior $\mathrm{Gam}(\tau|a, b)$ and integrate out the precision, we obtain the marginal distribution of $x$ (derivation: ex. 2.46):
      $p(x|\mu, a, b) = \int p(x, \tau|\mu, a, b)\,d\tau = \int_0^\infty \mathcal{N}(x|\mu, \tau^{-1})\,\mathrm{Gam}(\tau|a, b)\,d\tau$ (2.158)
      $= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x - \mu)^2}{2}\right]^{-a - 1/2}\Gamma(a + 1/2)$
    • By convention we define $\nu = 2a$ and $\lambda = a/b$, and obtain Student’s t-distribution:
      $\mathrm{St}(x|\mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x - \mu)^2}{\nu}\right]^{-\nu/2 - 1/2}$ (2.159)
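  121. Solution of Exercise 2.46

    Use $\Gamma(a) = \int_0^\infty t^{a-1}e^{-t}\,dt = \int_0^\infty (Au)^{a-1}e^{-Au}A\,du$.
    $p(x|\mu, a, b) = \int_0^\infty \mathcal{N}(x|\mu, \tau^{-1})\,\mathrm{Gam}(\tau|a, b)\,d\tau$ (2.158)
    $= \int_0^\infty \left(\frac{\tau}{2\pi}\right)^{1/2}\exp\left(-\frac{\tau}{2}(x - \mu)^2\right)\cdot\frac{1}{\Gamma(a)}b^a\tau^{a-1}\exp(-b\tau)\,d\tau$
    $= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\int_0^\infty \tau^{(a+1/2)-1}\exp\Bigl\{-\underbrace{\Bigl(b + \frac{(x - \mu)^2}{2}\Bigr)}_{\equiv A}\tau\Bigr\}\,d\tau$
    $= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\frac{1}{A^{(a+1/2)-1}A}\underbrace{\int_0^\infty (A\tau)^{(a+1/2)-1}\exp\{-A\tau\}\,A\,d\tau}_{\Gamma(a + 1/2)}$
    $= \frac{b^a}{\Gamma(a)}\left(\frac{1}{2\pi}\right)^{1/2}\left[b + \frac{(x - \mu)^2}{2}\right]^{-a - 1/2}\Gamma(a + 1/2)$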
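    The closed form can be sanity-checked by integrating (2.158) numerically and comparing with (2.159). The sketch below uses only the standard library and a plain midpoint rule; the function names, the integration range and the test point are assumptions made for this example.

```python
import math


def gauss_pdf(x, mu, tau):
    """N(x | mu, tau^-1), parameterized by the precision tau."""
    return math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (x - mu) ** 2)


def gamma_pdf(tau, a, b):
    """Gam(tau | a, b) from (2.146), evaluated in log space for stability."""
    return math.exp(a * math.log(b) - math.lgamma(a)
                    + (a - 1.0) * math.log(tau) - b * tau)


def student_t_pdf(x, mu, lam, nu):
    """St(x | mu, lam, nu) from (2.159)."""
    log_norm = (math.lgamma(nu / 2.0 + 0.5) - math.lgamma(nu / 2.0)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm
                    - (nu / 2.0 + 0.5) * math.log1p(lam * (x - mu) ** 2 / nu))


def marginal_numeric(x, mu, a, b, upper=60.0, steps=100_000):
    """Midpoint-rule approximation of the integral in (2.158) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += gauss_pdf(x, mu, tau) * gamma_pdf(tau, a, b)
    return total * h


x, mu, a, b = 0.7, 0.0, 2.0, 1.5
numeric = marginal_numeric(x, mu, a, b)
closed_form = student_t_pdf(x, mu, lam=a / b, nu=2.0 * a)
print(numeric, closed_form)
```

    The upper limit 60 truncates the integral where the integrand is already negligible for these parameter values; other values of a and b may need a different range.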
  122. Student’s t-distribution

    $\mathrm{St}(x|\mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\left[1 + \frac{\lambda(x - \mu)^2}{\nu}\right]^{-\nu/2 - 1/2}$ (2.159)
    • $\lambda$: the precision of the t-distribution (NOTE: it is not in general equal to the inverse of the variance)
    • $\nu$: the degrees of freedom
    • When $\nu = 1$, it reduces to the Cauchy distribution.
    • When $\nu \to \infty$, it becomes a Gaussian $\mathcal{N}(x|\mu, \lambda^{-1})$ (ex. 2.47).
    [Figure: curves of the t-distribution for ν → ∞, ν = 1.0 and ν = 0.1]
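    The two limiting cases can be checked numerically. In the sketch below, `cauchy_pdf` is written with location µ and scale $\lambda^{-1/2}$, and a large finite ν stands in for the limit ν → ∞; both choices, and the evaluation point, are assumptions for illustration.

```python
import math


def student_t_pdf(x, mu, lam, nu):
    """St(x | mu, lam, nu) from (2.159)."""
    log_norm = (math.lgamma(nu / 2.0 + 0.5) - math.lgamma(nu / 2.0)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return math.exp(log_norm
                    - (nu / 2.0 + 0.5) * math.log1p(lam * (x - mu) ** 2 / nu))


def cauchy_pdf(x, mu, lam):
    """Cauchy density with location mu and scale lam ** -0.5."""
    return math.sqrt(lam) / (math.pi * (1.0 + lam * (x - mu) ** 2))


def gauss_pdf(x, mu, lam):
    """N(x | mu, lam^-1), parameterized by the precision lam."""
    return math.sqrt(lam / (2.0 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)


x, mu, lam = 1.3, 0.0, 2.0

# nu = 1 reproduces the Cauchy density.
cauchy_gap = abs(student_t_pdf(x, mu, lam, 1.0) - cauchy_pdf(x, mu, lam))

# A large nu approximates the Gaussian with the same mean and precision.
gauss_gap = abs(student_t_pdf(x, mu, lam, 1e6) - gauss_pdf(x, mu, lam))

print(cauchy_gap, gauss_gap)
```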
  124. Robustness of Student’s t-distribution

    The t-distribution has longer ‘tails’ than a Gaussian.
    $\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\underbrace{\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}}_{\text{exponential function of } x}$ (1.46)
    $\mathrm{St}(x|\mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\left(\frac{\lambda}{\pi\nu}\right)^{1/2}\underbrace{\left[1 + \frac{\lambda(x - \mu)^2}{\nu}\right]^{-\nu/2 - 1/2}}_{\text{power function of } x}$ (2.159)
    [Figure 2.16: panels (a) and (b)]
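    One way to see why the longer tails give robustness is to compare how much the negative log-likelihood of a single point grows as it moves away from the mean. The sketch below is an illustration with made-up values (unit precision, ν = 3, distances 3 and 30), not a reproduction of Figure 2.16.

```python
import math


def gauss_nll(x, mu, sigma2):
    """Negative log of N(x | mu, sigma2), eq. (1.46)."""
    return 0.5 * math.log(2.0 * math.pi * sigma2) + (x - mu) ** 2 / (2.0 * sigma2)


def student_t_nll(x, mu, lam, nu):
    """Negative log of St(x | mu, lam, nu), eq. (2.159)."""
    log_norm = (math.lgamma(nu / 2.0 + 0.5) - math.lgamma(nu / 2.0)
                + 0.5 * math.log(lam / (math.pi * nu)))
    return -log_norm + (nu / 2.0 + 0.5) * math.log1p(lam * (x - mu) ** 2 / nu)


# Extra penalty incurred when a point moves from distance 3 to distance 30.
gauss_growth = gauss_nll(30.0, 0.0, 1.0) - gauss_nll(3.0, 0.0, 1.0)
t_growth = student_t_nll(30.0, 0.0, 1.0, 3.0) - student_t_nll(3.0, 0.0, 1.0, 3.0)

print(gauss_growth, t_growth)
```

    The Gaussian penalty grows quadratically with the distance, while the t-distribution penalty grows only logarithmically, so a single outlier pulls a maximum likelihood fit far less under the t-distribution.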
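  125. Generalization of the t-distribution to a multivariate case

    • Recall:
      $p(x|\mu, a, b) = \int_0^\infty \mathcal{N}(x|\mu, \tau^{-1})\,\mathrm{Gam}(\tau|a, b)\,d\tau$ (2.158)
    • By substituting $\nu = 2a$, $\lambda = a/b$ and $\eta = \tau b/a$:
      $\mathrm{St}(x|\mu, \lambda, \nu) = \int_0^\infty \mathcal{N}(x|\mu, (\eta\lambda)^{-1})\,\mathrm{Gam}(\eta|\nu/2, \nu/2)\,d\eta$ (2.160)
    • We can generalize (2.160) to the corresponding multivariate Student’s t-distribution:
      $\mathrm{St}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^\infty \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1})\,\mathrm{Gam}(\eta|\nu/2, \nu/2)\,d\eta$ (2.161)
      $= \frac{\Gamma(D/2 + \nu/2)}{\Gamma(\nu/2)}\,\frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}}\left[1 + \frac{\Delta^2}{\nu}\right]^{-D/2 - \nu/2}$ (2.162)
      $\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top\boldsymbol{\Lambda}(\mathbf{x} - \boldsymbol{\mu})$ (2.163)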
  126. Summary

    • Important properties of Gaussian distributions
      • §2.3.1 Conditional Gaussian Distribution
      • §2.3.2 Marginal Gaussian Distribution
      • §2.3.3 Bayes’ theorem for Gaussian variables
    • Parameter estimation for the Gaussian
      • §2.3.4 Maximum likelihood for the Gaussian
      • §2.3.5 Sequential estimation
      • §2.3.6 Bayesian inference for the Gaussian
    • Student’s t-distribution
      • §2.3.7 Student’s t-distribution