Dan Foreman-Mackey
June 13, 2014

# An Astronomer's Introduction to Gaussian Processes (v2)

Lecture for the Penn State summer school on "Bayesian Computing for Astronomical Data Analysis".


## Transcript

1. ### AN ASTRONOMER'S INTRODUCTION TO GAUSSIAN PROCESSES. Dan Foreman-Mackey, CCPP@NYU // github.com/dfm // @exoplaneteer // dfm.io

12. ### [Figure: Kepler light curve of KIC 3223000; relative flux vs. time (KBJD)]

15. ### [Figure: Kepler light curve of Kepler-32; relative flux vs. time (KBJD)]

17. ### Why not model all the things? With, for example, a Gaussian Process.

19. ### [Figure: simulated data drawn from the linear model $y = m\,x + b$]
20. ### [Figure: matrix image] The true covariance of the observations.
21. ### Let's assume that the noise is… independent Gaussian with known variance:

$$\log p(y \,|\, x, \sigma, \theta) = -\frac{1}{2} \sum_{n=1}^{N} \left[ \frac{[y_n - f_\theta(x_n)]^2}{\sigma_n^2} + \log 2\pi\sigma_n^2 \right]$$
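The independent-Gaussian likelihood above is a one-liner in numpy. A minimal sketch (the function name, the toy line model, and all parameter values here are illustrative, not from the talk):

```python
import numpy as np

def indep_lnlike(x, y, yerr, model):
    """Log-likelihood assuming independent Gaussian noise with known variance."""
    resid = y - model(x)
    return -0.5 * np.sum(resid**2 / yerr**2 + np.log(2 * np.pi * yerr**2))

# Toy data: a line with known per-point uncertainties.
x = np.linspace(-5, 5, 50)
yerr = 0.3 * np.ones_like(x)
rng = np.random.default_rng(42)
y = 0.5 * x + 1.0 + yerr * rng.standard_normal(x.size)

# The true parameters should score higher than a badly wrong guess.
good = indep_lnlike(x, y, yerr, lambda t: 0.5 * t + 1.0)
bad = indep_lnlike(x, y, yerr, lambda t: -0.5 * t)
```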
22. ### Or equivalently…

$$\log p(y \,|\, x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi$$

23. ### The same expression, now labeling $C$: the data covariance.

24. ### …and the residual vector:

$$r = \left[\; y_1 - f_\theta(x_1) \quad y_2 - f_\theta(x_2) \quad \cdots \quad y_n - f_\theta(x_n) \;\right]^T$$

25. ### [Figure: the same likelihood, shown alongside the (diagonal) data covariance matrix]
26. ### Linear least-squares.

$$A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \qquad C = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

$$\begin{bmatrix} m \\ b \end{bmatrix} = S\, A^T C^{-1} y \qquad S = \left[ A^T C^{-1} A \right]^{-1}$$

This is the maximum-likelihood solution and, in this case only, the mean of the posterior; $S$ is the posterior covariance.

27. ### The same result, with the caveat: assuming uniform priors.
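The closed-form least-squares solution on slide 26 maps directly onto numpy. A sketch on toy data (variable names and the particular true parameters are illustrative):

```python
import numpy as np

# Toy data drawn from y = m x + b with independent, known noise.
rng = np.random.default_rng(1234)
m_true, b_true = 0.5, -0.25
x = np.sort(rng.uniform(-5, 5, 40))
yerr = 0.2 + 0.3 * rng.random(x.size)
y = m_true * x + b_true + yerr * rng.standard_normal(x.size)

# Design matrix A (columns [x, 1]), data covariance C, data vector y.
A = np.vander(x, 2)
C = np.diag(yerr**2)

# [m b]^T = S A^T C^{-1} y,  S = [A^T C^{-1} A]^{-1}
Cinv = np.linalg.inv(C)
S = np.linalg.inv(A.T @ Cinv @ A)
m_ml, b_ml = S @ A.T @ Cinv @ y
```

With the full (non-diagonal) covariance of slide 31, only the matrix `C` changes; the formulas are identical.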
28. ### [Figure: the simulated data with the true line overplotted ("truth")]

29. ### [Figure: the same data; "truth" and the question "posterior constraint?"]

30. ### [Figure: the posterior constraint on the line, compared to the truth]
31. ### [Figure: matrix image] But we know the true covariance matrix.

32. ### [Figure: the full covariance matrix, alongside the likelihood]

$$\log p(y \,|\, x, \sigma, \theta) = -\frac{1}{2} r^T C^{-1} r - \frac{1}{2} \log\det C - \frac{N}{2} \log 2\pi$$

33. ### Linear least-squares, as before, but now using the full (non-diagonal) covariance matrix $C$ in $S = \left[ A^T C^{-1} A \right]^{-1}$ and $[m \; b]^T = S\, A^T C^{-1} y$.
34. ### [Figure: the posterior constraint with the diagonal covariance. "Before."]

35. ### [Figure: the posterior constraint with the true covariance. "After."]

41. ### Model it!

$$\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} r^T K_\alpha^{-1} r - \frac{1}{2} \log\det K_\alpha + C$$

$$K_{ij} = \sigma_i^2\, \delta_{ij} + k_\alpha(x_i, x_j) \qquad \text{where, for example,} \qquad k_\alpha(x_i, x_j) = a^2 \exp\left( -\frac{[x_i - x_j]^2}{2\, l^2} \right)$$

A drop-in replacement for your current log-likelihood function!

43. ### [Figure: corner plot of the posterior over $m$, $b$, $\ln a$, and $\ln s$]


51. ### The model.

$$\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^T K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log\det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$$

where $[K_\alpha(x, \sigma)]_{ij} = \sigma_i^2\, \delta_{ij} + k_\alpha(x_i, x_j)$. A drop-in replacement for your current log-likelihood function!

52. ### Equivalently:

$$y \sim \mathcal{N}\left( f_\theta(x),\, K_\alpha(x, \sigma) \right)$$
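The generative view of slide 52 can be simulated directly: build $K_\alpha(x, \sigma)$ and draw a synthetic dataset from the multivariate normal. A sketch with an assumed linear mean model and exponential-squared kernel (all parameter values and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
yerr = 0.05 * np.ones_like(x)

def k_alpha(x, a=1.0, ell=2.0):
    # Exponential-squared kernel: a^2 exp(-(x_i - x_j)^2 / (2 ell^2)).
    r = x[:, None] - x[None, :]
    return a**2 * np.exp(-0.5 * r**2 / ell**2)

# [K_alpha(x, sigma)]_ij = sigma_i^2 delta_ij + k_alpha(x_i, x_j)
K = np.diag(yerr**2) + k_alpha(x)
f_theta = 0.1 * x - 0.5  # hypothetical mean model
y = rng.multivariate_normal(f_theta, K)  # one draw of the whole dataset
```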
53. ### HUGE: the data are drawn from one* Gaussian. (*whose dimension is the number of data points.)
54. ### A generative model: a probability distribution for $y$ values, $y \sim \mathcal{N}\left( f_\theta(x),\, K_\alpha(x, \sigma) \right)$.
55.–58. ### "Likelihood" samples.

Exponential squared:

$$k_\alpha(x_i, x_j) = \exp\left( -\frac{[x_i - x_j]^2}{2\, \ell^2} \right)$$

[Figure: samples for $\ell = 0.5$, $\ell = 1$, $\ell = 2$]

Quasi-periodic:

$$k_\alpha(x_i, x_j) = \left[ 1 + \frac{\sqrt{3}\, |x_i - x_j|}{\ell} \right] \exp\left( -\frac{\sqrt{3}\, |x_i - x_j|}{\ell} \right) \cos\left( \frac{2\pi\, |x_i - x_j|}{P} \right)$$

[Figure: samples for ($\ell = 2$, $P = 3$), ($\ell = 3$, $P = 3$), ($\ell = 3$, $P = 1$)]
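Samples like those on these slides can be drawn by evaluating the kernel on a grid and sampling the resulting multivariate normal. A sketch for the exponential-squared kernel (function and variable names are illustrative; the small jitter term is a standard numerical-stability trick, not from the talk):

```python
import numpy as np

def exp_squared(t, ell):
    # k(t_i, t_j) = exp(-(t_i - t_j)^2 / (2 ell^2))
    r = t[:, None] - t[None, :]
    return np.exp(-0.5 * r**2 / ell**2)

t = np.linspace(0, 10, 100)
rng = np.random.default_rng(0)
samples = {}
for ell in (0.5, 1.0, 2.0):
    K = exp_squared(t, ell) + 1e-8 * np.eye(t.size)  # jitter for stability
    samples[ell] = rng.multivariate_normal(np.zeros(t.size), K)
```

Larger `ell` produces smoother draws, which is exactly what the three panels illustrate.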
59. ### The conditional distribution.

$$y \sim \mathcal{N}\left( f_\theta(x),\, K_\alpha(x, \sigma) \right)$$

$$\begin{bmatrix} y \\ y_\star \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} f_\theta(x) \\ f_\theta(x_\star) \end{bmatrix},\; \begin{bmatrix} K_{\alpha, x, x} & K_{\alpha, x, \star} \\ K_{\alpha, \star, x} & K_{\alpha, \star, \star} \end{bmatrix} \right)$$

$$y_\star \,|\, y \sim \mathcal{N}\left( K_{\alpha, \star, x}\, K_{\alpha, x, x}^{-1}\, [y - f_\theta(x)] + f_\theta(x_\star),\; K_{\alpha, \star, \star} - K_{\alpha, \star, x}\, K_{\alpha, x, x}^{-1}\, K_{\alpha, x, \star} \right)$$

Just see Rasmussen & Williams (Chapter 2).
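The conditional-distribution formulas on slide 59 translate line by line into numpy. A sketch assuming a zero mean function and an exponential-squared kernel on toy data (all names and parameter values are illustrative):

```python
import numpy as np

def kernel(x1, x2, ell=1.0):
    # Exponential-squared kernel between two sets of inputs.
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

# Toy observations of a smooth function with known noise.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 25))
yerr = 0.1
y = np.sin(x) + yerr * rng.standard_normal(x.size)

# Test points where we want predictions.
xstar = np.linspace(0, 10, 200)

Kxx = kernel(x, x) + yerr**2 * np.eye(x.size)  # K_{x,x} (with noise)
Ksx = kernel(xstar, x)                         # K_{*,x}
Kss = kernel(xstar, xstar)                     # K_{*,*}

# Predictive mean and covariance (mean function is zero here).
alpha = np.linalg.solve(Kxx, y)
mu = Ksx @ alpha
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
```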
60. ### What's the catch? Kepler = Big Data (by some definition). Note: I hate myself for this slide too…
61. ### Computational complexity. $\mathcal{O}(N^3)$ naïvely: compute the factorization // evaluate the log-det // apply the inverse.

$$\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^T K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log\det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$$
62. ### The naïve implementation:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def simple_gp_lnlike(x, y, yerr, a, s):
    r = x[:, None] - x[None, :]
    C = np.diag(yerr**2) + a * np.exp(-0.5 * r**2 / (s * s))
    factor, flag = cho_factor(C)
    logdet = 2 * np.sum(np.log(np.diag(factor)))
    return -0.5 * (np.dot(y, cho_solve((factor, flag), y))
                   + logdet + len(x) * np.log(2 * np.pi))
```
63. ### [Figure: log10 runtime in seconds vs. log10 N for the naïve implementation]
68. ### $K^{(3)} = K_3 \times K_2 \times K_1 \times K_0$: factor the kernel matrix into full-rank, low-rank, identity, and zero blocks. Ambikasaran, DFM, et al. (arXiv:1403.6015)

70. ### The same likelihood with george:

```python
import numpy as np
from george import GaussianProcess, kernels

def george_lnlike(x, y, yerr, a, s):
    kernel = a * kernels.RBFKernel(s)
    gp = GaussianProcess(kernel)
    gp.compute(x, yerr)
    return gp.lnlikelihood(y)
```
71.–72. ### [Figures: log10 runtime in seconds vs. log10 N, comparing the naïve implementation to george]

76.–81. ### [Figures: Object: KIC 2301306 + injected super-Earth; relative flux vs. time since transit, shown over a sequence of slides]

84. ### [Figure: corner plot of the transit-fit posterior over $q_1$, $q_2$, $t_0$, $\tau$, $r/R_\star$, $f_\star$, and $b$]


92. ### Correlated noise matters. A Gaussian process provides a drop-in replacement likelihood function… if you can compute it.