An Astronomer's Introduction to Gaussian Processes (v2)

Lecture for the Penn State summer school on "Bayesian Computing for Astronomical Data Analysis".

Dan Foreman-Mackey

June 13, 2014

Transcript

  1. AN ASTRONOMER’S INTRODUCTION TO GAUSSIAN PROCESSES Dan Foreman-Mackey, CCPP@NYU // github.com/dfm // @exoplaneteer // dfm.io
  2. (CC BY-NC-ND) Flickr user lizphung

  3. github.com/dfm/gp

  4. gaussianprocess.org/gpml Rasmussen & Williams

  5. I write code for good & astrophysics.

  6. I (probably) do data science.

  7. not data science. (Photo credit: James Silvester, silvesterphoto.tumblr.com)

  8. data science. (CC BY, Flickr user Marcin Wichary)

  9. PHYSICS → DATA. Data Science: p(data | physics)

  10. I work with Kepler data.

  11. I get really passionate about NOISE.

  12. [Figure: Kepler light curve of KIC 3223000, relative flux vs. time [KBJD]]
  13. Kepler 32

  14. Kepler 32

  15. [Figure: Kepler 32 light curve, relative flux vs. time [KBJD]]
  16. HACKS ALL HACKS and I’m righteous! *

  17. Why not model all the things? With, for example, a Gaussian Process.
  18. The power of correlated noise. 1

  19. [Figure: simulated dataset drawn from the line model y = m x + b]
  20. The true covariance of the observations. [Figure: covariance matrix]
  21. Let’s assume that the noise is independent Gaussian with known variance:
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} \sum_{n=1}^{N} \left[ \frac{(y_n - f_\theta(x_n))^2}{\sigma_n^2} + \log 2\pi\sigma_n^2 \right]
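
    For concreteness, a minimal numpy sketch of this independent-Gaussian log-likelihood (the function name and arguments are illustrative, not from the slides):

    import numpy as np

    def independent_lnlike(y, yerr, model):
        # -0.5 * sum_n [ (y_n - f(x_n))^2 / sigma_n^2 + log(2 pi sigma_n^2) ]
        return -0.5 * np.sum((y - model) ** 2 / yerr ** 2
                             + np.log(2 * np.pi * yerr ** 2))
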
  22. Or equivalently…
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi
  23. Or equivalently…
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi
    where C is the data covariance.
  24. Or equivalently…
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi
    where C is the data covariance and r is the residual vector,
    r = [y_1 - f_\theta(x_1) \;\; y_2 - f_\theta(x_2) \;\; \cdots \;\; y_N - f_\theta(x_N)]^T.
  25. Or equivalently…
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi
    where C is the data covariance and r = [y_1 - f_\theta(x_1) \;\; \cdots \;\; y_N - f_\theta(x_N)]^T.
    [Figure: covariance matrix]
  26. Linear least-squares.
    A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \quad
    C = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{bmatrix}, \quad
    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
    \begin{bmatrix} m \\ b \end{bmatrix} = S\,A^T C^{-1} y, \quad S = [A^T C^{-1} A]^{-1}
    This is the maximum-likelihood solution and, in this case only, the mean of the posterior; S is the posterior covariance.
  27. Linear least-squares.
    A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \quad
    C = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{bmatrix}, \quad
    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
    \begin{bmatrix} m \\ b \end{bmatrix} = S\,A^T C^{-1} y, \quad S = [A^T C^{-1} A]^{-1}
    This is the maximum-likelihood solution and, in this case only (assuming uniform priors), the mean of the posterior; S is the posterior covariance.
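
    A minimal numpy sketch of these least-squares update equations; the function name is illustrative, and C can be the full N-by-N data covariance:

    import numpy as np

    def linear_least_squares(x, y, C):
        # Design matrix for the model y = m*x + b: columns [x, 1].
        A = np.vander(x, 2)
        Cinv_A = np.linalg.solve(C, A)
        Cinv_y = np.linalg.solve(C, y)
        S = np.linalg.inv(A.T @ Cinv_A)   # posterior covariance
        w = S @ (A.T @ Cinv_y)            # [m, b]: max likelihood / posterior mean
        return w, S
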
  28. [Figure: the simulated dataset with the true line; label: truth]
  29. [Figure: the simulated dataset; labels: truth, posterior constraint?]
  30. [Figure: the simulated dataset; labels: truth, posterior constraint?]
  31. But we know the true covariance matrix. [Figure: the true covariance matrix]
  32. [Figure: the true covariance matrix]
    \log p(y \mid x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi
  33. Linear least-squares (as before), now using the covariance matrix shown:
    \begin{bmatrix} m \\ b \end{bmatrix} = S\,A^T C^{-1} y, \quad S = [A^T C^{-1} A]^{-1}
    [Figure: the true covariance matrix]
  34. Before. [Figure: fit to the simulated dataset]
  35. After. [Figure: fit to the simulated dataset]
  36. the responsible scientist. (CC BY-ND, Flickr user MyFWCmedia)

  37. So… we’re finished, right?

  38. In The Real World™, we never know the noise.

  39. Just gotta model it!

  40. [Figure: covariance matrix]

  41. Model it!
    \log p(y \mid x, \sigma, \theta, \alpha) = -\frac{1}{2} r^T K_\alpha^{-1} r - \frac{1}{2} \log\det K_\alpha + \mathrm{const}
    where K_{ij} = \sigma_i^2\,\delta_{ij} + k_\alpha(x_i, x_j) and, for example, k_\alpha(x_i, x_j) = a^2 \exp\left(-\frac{(x_i - x_j)^2}{2\,l^2}\right).
    A drop-in replacement for your current log-likelihood function!
  42. It's hammer time! emcee: the MCMC Hammer. arxiv.org/abs/1202.3665 // dan.iel.fm/emcee
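
    A minimal, self-contained sketch of sampling the line-plus-correlated-noise model from the previous slides with emcee; the synthetic data, priors, and walker setup are illustrative, and the kernel is the exponential-squared example with parameters (ln a, ln s):

    import numpy as np
    import emcee

    # Illustrative synthetic dataset for the line-fitting example.
    np.random.seed(1234)
    x = np.sort(np.random.uniform(-5, 5, 50))
    yerr = 0.3 * np.ones_like(x)
    y = 0.5 * x - 0.25 + yerr * np.random.randn(len(x))

    def lnprob(params, x, y, yerr):
        m, b, lna, lns = params
        if not (-5.0 < lna < 5.0 and -5.0 < lns < 5.0):
            return -np.inf                      # crude flat priors on ln a and ln s
        r = y - (m * x + b)                     # residuals from the mean model
        dx = x[:, None] - x[None, :]
        K = np.diag(yerr ** 2) + np.exp(2 * lna) * np.exp(-0.5 * dx ** 2 / np.exp(2 * lns))
        _, logdet = np.linalg.slogdet(K)
        return -0.5 * (r @ np.linalg.solve(K, r) + logdet + len(x) * np.log(2 * np.pi))

    ndim, nwalkers = 4, 32
    p0 = np.array([0.5, -0.25, 0.0, 0.0]) + 1e-4 * np.random.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=(x, y, yerr))
    sampler.run_mcmc(p0, 2000)   # posterior samples for (m, b, ln a, ln s)
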

  43. [Figure: corner plot of the posterior samples for m, b, ln a, and ln s]
  44. [Figure: fit to the simulated dataset]
  45. [Figure: fit to the simulated dataset]
  46. Prediction?

  47. [Figure: prediction for the simulated dataset]
  48. [Figure: prediction for the simulated dataset]
  49. take a deep breath. (CC BY-SA, Flickr user kpjas)

  50. The formal Gaussian process. 2

  51. The model:
    \log p(y \mid x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^T K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log\det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi
    where [K_\alpha(x, \sigma)]_{ij} = \sigma_i^2\,\delta_{ij} + k_\alpha(x_i, x_j).
    A drop-in replacement for your current log-likelihood function!
  52. The model:
    y \sim \mathcal{N}(f_\theta(x),\, K_\alpha(x, \sigma)), \quad \text{where } [K_\alpha(x, \sigma)]_{ij} = \sigma_i^2\,\delta_{ij} + k_\alpha(x_i, x_j).
    A drop-in replacement for your current log-likelihood function!
  53. The data are drawn from one HUGE* Gaussian. (*the dimension is the number of data points)
  54. A generative model: a probability distribution for y values,
    y \sim \mathcal{N}(f_\theta(x),\, K_\alpha(x, \sigma)).
  55. “Likelihood” samples. [Figure: samples drawn with the exponential squared kernel (l = 0.5, 1, 2) and the quasi-periodic kernel (l = 2, P = 3; l = 3, P = 3; l = 3, P = 1)]
    k_\alpha(x_i, x_j) = \exp\left(-\frac{(x_i - x_j)^2}{2\,\ell^2}\right)
  56. “Likelihood” samples: exponential squared. [Figure: samples for l = 0.5, 1, 2]
    k_\alpha(x_i, x_j) = \exp\left(-\frac{(x_i - x_j)^2}{2\,\ell^2}\right)
  57. “Likelihood” samples. [Figure: quasi-periodic samples for l = 2, P = 3; l = 3, P = 3; l = 3, P = 1]
    k_\alpha(x_i, x_j) = \left[1 + \frac{\sqrt{3}\,|x_i - x_j|}{\ell}\right] \exp\left(-\frac{|x_i - x_j|}{\ell}\right) \cos\left(\frac{2\pi\,|x_i - x_j|}{P}\right)
  58. “Likelihood” samples: quasi-periodic. [Figure: samples for l = 2, P = 3; l = 3, P = 3; l = 3, P = 1]
    k_\alpha(x_i, x_j) = \left[1 + \frac{\sqrt{3}\,|x_i - x_j|}{\ell}\right] \exp\left(-\frac{|x_i - x_j|}{\ell}\right) \cos\left(\frac{2\pi\,|x_i - x_j|}{P}\right)
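
    A minimal sketch of how such “likelihood” (prior) samples can be drawn for the exponential-squared kernel; the time grid and length scales are illustrative, and the quasi-periodic kernel could be swapped in the same way:

    import numpy as np

    t = np.linspace(0, 10, 200)
    dt = t[:, None] - t[None, :]
    for l in (0.5, 1.0, 2.0):
        K = np.exp(-0.5 * dt ** 2 / l ** 2)    # exponential-squared kernel matrix
        K[np.diag_indices_from(K)] += 1e-8     # tiny jitter for numerical stability
        samples = np.random.multivariate_normal(np.zeros(len(t)), K, size=3)
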
  59. The conditional distribution. Starting from y \sim \mathcal{N}(f_\theta(x),\, K_\alpha(x, \sigma)), the joint distribution of the observed y and the predicted y_\star is
    \begin{bmatrix} y \\ y_\star \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} f_\theta(x) \\ f_\theta(x_\star) \end{bmatrix},\; \begin{bmatrix} K_{\alpha,x,x} & K_{\alpha,x,\star} \\ K_{\alpha,\star,x} & K_{\alpha,\star,\star} \end{bmatrix} \right)
    and conditioning on y gives
    y_\star \mid y \sim \mathcal{N}\left( K_{\alpha,\star,x} K_{\alpha,x,x}^{-1} [y - f_\theta(x)] + f_\theta(x_\star),\; K_{\alpha,\star,\star} - K_{\alpha,\star,x} K_{\alpha,x,x}^{-1} K_{\alpha,x,\star} \right).
    Just see Rasmussen & Williams (Chapter 2).
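
    A minimal numpy sketch of this conditional for a zero-mean GP (f_\theta = 0); kernel is a hypothetical elementwise function k_\alpha(x_i, x_j), and with a nonzero mean you would condition on y - f_\theta(x) and add f_\theta(x_\star) back to the predictive mean:

    import numpy as np

    def gp_predict(x, y, yerr, xstar, kernel):
        # e.g. kernel = lambda xi, xj: np.exp(-0.5 * (xi - xj) ** 2 / 3.0 ** 2)
        K = kernel(x[:, None], x[None, :]) + np.diag(yerr ** 2)   # K_{x,x} + white noise
        Ks = kernel(xstar[:, None], x[None, :])                   # K_{*,x}
        Kss = kernel(xstar[:, None], xstar[None, :])              # K_{*,*}
        mu = Ks @ np.linalg.solve(K, y)
        cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
        return mu, cov
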
  60. What’s the catch? Kepler = Big Data (by some definition). Note: I hate myself for this slide too…
  61. Computational complexity. Naïvely O(N^3): compute the factorization // evaluate the log-determinant // apply the inverse.
    \log p(y \mid x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^T K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log\det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi
  62. import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def simple_gp_lnlike(x, y, yerr, a, s):
        # Pairwise separations and the data covariance:
        # white noise on the diagonal plus a squared-exponential kernel.
        r = x[:, None] - x[None, :]
        C = np.diag(yerr**2) + a*np.exp(-0.5*r**2/(s*s))
        # The Cholesky factorization gives both the log-determinant and the inverse.
        factor, flag = cho_factor(C)
        logdet = 2*np.sum(np.log(np.diag(factor)))
        return -0.5 * (np.dot(y, cho_solve((factor, flag), y))
                       + logdet + len(x)*np.log(2*np.pi))
  63. [Figure: log10(runtime / seconds) vs. log10 N]
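
    A rough sketch of how this runtime scaling could be measured, reusing the simple_gp_lnlike function from slide 62; the grid of N values and the synthetic data are illustrative:

    import time
    import numpy as np

    for N in (100, 300, 1000, 3000):
        x = np.sort(10 * np.random.rand(N))
        yerr = 0.1 * np.ones(N)
        y = np.sin(x) + yerr * np.random.randn(N)
        start = time.perf_counter()
        simple_gp_lnlike(x, y, yerr, 1.0, 1.0)
        print(N, time.perf_counter() - start)
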
  64. exponential squared quasi-periodic

  65. exponential squared quasi-periodic

  66. exponential squared quasi-periodic [Figure labels: t, Σ]

  67. “Aren’t kernel matrices Hierarchical Off-Diagonal Low-Rank?” (no astronomer ever)
  68. K^{(3)} = K_3 \times K_2 \times K_1 \times K_0
    [Figure legend: full rank; low-rank; identity matrix; zero matrix]
    Ambikasaran, DFM, et al. (arXiv:1403.6015)
  69. github.com/dfm/george

  70. import numpy as np
    from george import GaussianProcess, kernels

    def george_lnlike(x, y, yerr, a, s):
        # Amplitude a times a squared-exponential (RBF) kernel with scale s.
        kernel = a * kernels.RBFKernel(s)
        gp = GaussianProcess(kernel)
        # Factorize once for these inputs and uncertainties, then evaluate.
        gp.compute(x, yerr)
        return gp.lnlikelihood(y)
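
    A short usage sketch for the function above, assuming the (2014-era) george API shown on this slide; the synthetic dataset is illustrative:

    import numpy as np

    np.random.seed(42)
    x = np.sort(10 * np.random.rand(500))
    yerr = 0.1 * np.ones_like(x)
    y = np.sin(x) + yerr * np.random.randn(len(x))
    print(george_lnlike(x, y, yerr, 1.0, 1.0))
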
  71. [Figure: log10(runtime / seconds) vs. log10 N]
  72. [Figure: log10(runtime / seconds) vs. log10 N]
  73. And short cadence data? One month of data in 4 seconds.
  74. Applications to Kepler data. 3

  75. Parameter Recovery

  76. Object: KIC 2301306 + injected super-Earth. [Figure: light curve, time since transit]
  77. Object: KIC 2301306 + injected super-Earth. [Figure: light curve, time since transit]
  78. Object: KIC 2301306 + injected super-Earth. [Figure: light curve, time since transit]
  79. figure generated using: github.com/dfm/triangle.py after median-filter detrending

  80. Object: KIC 2301306 + injected super-Earth. [Figure: light curve, time since transit]
  81. Object: KIC 2301306 + injected super-Earth. [Figure: light curve, time since transit]
  82. figure generated using: github.com/dfm/triangle.py using Gaussian process noise model

  83. [Figure: light curve, time [days]] Ambikasaran, DFM, et al. (arXiv:1403.6015)

  84. [Figure: corner plot of the transit parameters q1, q2, t0, τ, r/R⋆, f⋆, and b]
  85. KOI 1474.01 with Bekki Dawson, et al.

  86. None
  87. None
  88. Stellar Rotation with Ruth Angus

  89. figures from Ruth Angus (Oxford)

  90. figures from Ruth Angus (Oxford)

  91. Conclusions & Summary. 4

  92. Correlated noise matters. A Gaussian process provides a drop-in replacement likelihood function, if you can compute it.
  93. Resources: gaussianprocess.org/gpml // github.com/dfm/gp // github.com/dfm/george // danfm@nyu.edu