PyData: Time series analysis, GPs, and exoplanets

My talk from PyDataNYC 2014

Dan Foreman-Mackey

November 23, 2014

Transcript

  1. TIME SERIES ANALYSIS WITH GAUSSIAN PROCESSES AND THE SEARCH FOR EARTH 2.0. Dan Foreman-Mackey, CCPP@NYU // github.com/dfm // @exoplaneteer // dfm.io
  2. I write code for good & astrophysics.

  3. I study space. Photo credit NASA Ames/SETI Institute/JPL-Caltech

  4. I study space. Photo credit NASA Ames/SETI Institute/JPL-Caltech this isn't

    what my data look like
  5. I do data science.

  6. Photo credit James Silvester silvesterphoto.tumblr.com not data science.

  7. CC-BY Flickr user Marcin Wichary. data science.

  8. data science.

  9. my GOALS for today's talk

  10. my GOALS for today's talk: convince you that exoplanets are cool

  11. my GOALS for today's talk: convince you that exoplanets are cool; demonstrate some sick Python code
  12. ex·o·plan·et ˈeksōˌplanət/ noun. a planet that orbits a star outside

    the solar system. Credit Google
  13. Image credit xkcd.com/1071

  14. Image credit xkcd.com/1298

  15. [Bar chart: exoplanet discoveries per year, 1995 to 2010, from 0 up to ~900 per year. Data from The Open Exoplanet Catalogue]

  16. [Same chart, annotated with the launch of the Kepler satellite. Data from The Open Exoplanet Catalogue]
  17. Exoplanets are hard to find

  18. The transit method

  19. None
  20. Earth

  21. Jupiter

  22. None
  23. [Plot: relative brightness [ppm] vs. time since transit [days]]
  24. need to look at the right place at the right

    time and measure brightness extremely precisely
  25. Image credit NASA Kepler

  26. Photo credit: NASA

  27. Image credit Carter Roberts

  28. Image credit NASA

  29. Kepler-32

  30. Kepler-32

  31. Kepler-32

  32. ... ... Kepler-32

  33. [Plot: relative brightness [ppm] vs. time since transit [days]]
  34. signal variability noise data + + = The anatomy of

    a transit observation
  35. signal variability noise data + + = astrophysics and spacecraft

    The anatomy of a transit observation
  36. signal variability noise data + + = astrophysics and spacecraft

    The anatomy of a transit observation ✔
  37. signal variability noise data + + = astrophysics and spacecraft

    The anatomy of a transit observation ✔ ✔
  38. signal variability noise data + + = astrophysics and spacecraft

    The anatomy of a transit observation ✔ ✔ ✔
  39. signal variability noise data + + = astrophysics and spacecraft

    The anatomy of a transit observation ✔ ✔ ✔ ?
  40. CC-BY Flickr user Marcin Wichary. so. how do we clean up the crap?
  41. Standard practice: Filtering

  42. ... ... Kepler-32

  43. [Page from Fabrycky et al. (2012), including Figure 16: Kepler-31 phase curves, in the style of their figure 3, alongside text on follow-up spectroscopy of Kepler-32. Credit Fabrycky et al. (2012)]
  44. [Scatter plot: planet radius [Earth radii] vs. orbital period [days], both on log scales. Data from NASA Exoplanet Archive (11/22)]
  45. Exoplanets are hard to find

  46. Small exoplanets are harder to find

  47. Earth

  48. Exoplanets on long orbits are even harder to find

  49. Jupiter

  50. [Map of survey completeness (C) in % over orbital period (days) and planet size (Earth-radii). Figure credit: Petigura, Howard & Marcy (2013)]
  51. What about Gaussian Processes?

  52. gaussianprocess.org/gpml Rasmussen & Williams

  53. + + = Modeling a light curve using a Gaussian Process

  54. + + = Modeling a light curve using a Gaussian Process + = –

  55. + + = Modeling a light curve using a Gaussian Process + = – Gaussian Process
  56. What is a Gaussian Process?

  57. [Plot: y vs. x, function draws illustrating a Gaussian Process]

  58. [Plot: y vs. x, function draws illustrating a Gaussian Process]

  59. [Plot: y vs. x, function draws illustrating a Gaussian Process]

  60. [Plot: y vs. x, function draws illustrating a Gaussian Process]
  61. [Plot: relative brightness [ppm] vs. time [days], 460 to 520 days]

  62. [Plot: relative brightness [ppm] vs. time [days], 460 to 520 days]
  63. the data are drawn from one HUGE Gaussian* (*the dimension is the number of data points)
  64. The mathematical model: $y \sim \mathcal{N}(f_\theta(x), K_\alpha(x, \sigma))$ where $[K_\alpha(x, \sigma)]_{ij} = \sigma_i^2 \, \delta_{ij} + k_\alpha(x_i, x_j)$
  65. The mathematical model: $\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^\mathrm{T} K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log \det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$ where $[K_\alpha(x, \sigma)]_{ij} = \sigma_i^2 \, \delta_{ij} + k_\alpha(x_i, x_j)$
  66. The mathematical model: $\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^\mathrm{T} K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log \det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$ where $[K_\alpha(x, \sigma)]_{ij} = \sigma_i^2 \, \delta_{ij} + k_\alpha(x_i, x_j)$, and $k_\alpha$ is the kernel function (where the magic happens)
  67. The choice of kernel: $k_\alpha(x_i, x_j) = \exp\left(-\frac{[x_i - x_j]^2}{2\ell^2}\right)$ [Plots: draws for exponential squared with ℓ = 0.5, 1, 2 and quasi-periodic with (ℓ, P) = (2, 3), (3, 3), (3, 1)]

  68. The choice of kernel: exponential squared, $k_\alpha(x_i, x_j) = \exp\left(-\frac{[x_i - x_j]^2}{2\ell^2}\right)$ [Plot: draws for ℓ = 0.5, 1, 2]

  69. The choice of kernel: $k_\alpha(x_i, x_j) = \left[1 + \sqrt{3}\,\frac{|x_i - x_j|}{\ell}\right] \exp\left(-\frac{|x_i - x_j|}{\ell}\right) \cos\left(\frac{2\pi |x_i - x_j|}{P}\right)$ [Plot: draws for (ℓ, P) = (2, 3), (3, 3), (3, 1)]

  70. The choice of kernel: quasi-periodic, $k_\alpha(x_i, x_j) = \left[1 + \sqrt{3}\,\frac{|x_i - x_j|}{\ell}\right] \exp\left(-\frac{|x_i - x_j|}{\ell}\right) \cos\left(\frac{2\pi |x_i - x_j|}{P}\right)$ [Plot: draws for (ℓ, P) = (2, 3), (3, 3), (3, 1)]
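The two kernels on the slides above can be sketched directly in NumPy (function names here are mine, and the quasi-periodic form follows the slide as written), and used to draw prior samples like the ones plotted in the panels:

```python
import numpy as np

def exp_squared(xi, xj, ell=1.0):
    # k(xi, xj) = exp(-(xi - xj)^2 / (2 ell^2))
    return np.exp(-((xi - xj) ** 2) / (2 * ell ** 2))

def quasi_periodic(xi, xj, ell=2.0, period=3.0):
    # [1 + sqrt(3) r / ell] * exp(-r / ell) * cos(2 pi r / period), r = |xi - xj|
    r = np.abs(xi - xj)
    return (1 + np.sqrt(3) * r / ell) * np.exp(-r / ell) * np.cos(2 * np.pi * r / period)

# build a covariance matrix on a grid (with a small jitter for numerical
# stability) and draw three samples from the GP prior
t = np.linspace(0, 10, 100)
K = exp_squared(t[:, None], t[None, :], ell=1.0) + 1e-8 * np.eye(len(t))
samples = np.random.default_rng(1).multivariate_normal(np.zeros(len(t)), K, size=3)
```

Changing `ell` (and `period`) reproduces the qualitative behavior in the panels: shorter length scales give wigglier draws.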
  71. The choice of kernel: [Plot: CO2 in ppm vs. year, 1960 to 1995] Full write-up at dfm.io/george

  72. The choice of kernel: [Plot: CO2 in ppm vs. year, 1960 to 2020] Full write-up at dfm.io/george
  73. The mathematical model: $\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^\mathrm{T} K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log \det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$ where $[K_\alpha(x, \sigma)]_{ij} = \sigma_i^2 \, \delta_{ij} + k_\alpha(x_i, x_j)$, and $k_\alpha$ is the kernel function (where the magic happens)
  74. A simple & efficient Python implementation:

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        def kernel(x1, x2):
            # ...

        def gp_lnlike(x, y, yerr):
            C = kernel(x[:, None], x[None, :])
            C[np.diag_indices_from(C)] += yerr ** 2
            factor, flag = cho_factor(C)
            logdet = 2 * np.sum(np.log(np.diag(factor)))
            return -0.5 * (np.dot(y, cho_solve((factor, flag), y))
                           + logdet + len(x) * np.log(2 * np.pi))
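One way to sanity-check an implementation like this is against `scipy.stats.multivariate_normal` on a small problem. This sketch fills in the elided kernel with an exponential-squared kernel (my choice, for illustration only) and takes the mean model $f_\theta(x) = 0$:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def kernel(x1, x2, ell=1.0):
    # exponential-squared kernel, just for this check
    return np.exp(-((x1 - x2) ** 2) / (2 * ell ** 2))

def gp_lnlike(x, y, yerr):
    C = kernel(x[:, None], x[None, :])
    C[np.diag_indices_from(C)] += yerr ** 2
    factor, flag = cho_factor(C)
    logdet = 2 * np.sum(np.log(np.diag(factor)))
    return -0.5 * (np.dot(y, cho_solve((factor, flag), y))
                   + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 50))
yerr = 0.1 * np.ones_like(x)
y = rng.normal(size=50)

# the Cholesky-based value should match the dense multivariate-normal logpdf
C = kernel(x[:, None], x[None, :]) + np.diag(yerr ** 2)
expected = multivariate_normal(mean=np.zeros(50), cov=C).logpdf(y)
assert np.allclose(gp_lnlike(x, y, yerr), expected)
```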
  75. Using George github.com/dfm/george:

        import george
        import numpy as np

        # kernel = george.kernels...

        def george_lnlike(x, y, yerr):
            gp = george.GP(kernel)
            gp.compute(x, yerr)
            return gp.lnlikelihood(y)
  76. [Log-log plot: time [seconds] vs. number of data points (10³ to 10⁴), comparing the simple implementation with sklearn]
  77. What’s the catch? My Problem = Big Data (by some definition) Note: I hate myself for this slide too…
  78. Computational complexity: naïvely $\mathcal{O}(N^3)$: compute factorization // evaluate log-det // apply inverse. $\log p(y \,|\, x, \sigma, \theta, \alpha) = -\frac{1}{2} [y - f_\theta(x)]^\mathrm{T} K_\alpha(x, \sigma)^{-1} [y - f_\theta(x)] - \frac{1}{2} \log \det K_\alpha(x, \sigma) - \frac{N}{2} \log 2\pi$
  79. “Aren’t kernel matrices Hierarchical Off-Diagonal Low-Rank?” — not me

  80. exponential squared

  81. $K^{(3)} = K_3 \times K_2 \times K_1 \times K_0$ [Diagram of blocks: full rank; low-rank; identity matrix; zero matrix] Ambikasaran, DFM, et al. (arXiv:1403.6015)

  82. $K^{(3)} = K_3 \times K_2 \times K_1 \times K_0$ [Diagram of blocks: full rank; low-rank; identity matrix; zero matrix] Ambikasaran, DFM, et al. (arXiv:1403.6015) $\mathcal{O}(N \log^2 N)$
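The low-rank structure can be checked numerically in plain NumPy (a sketch of the intuition, not the HODLR algorithm itself): for a smooth kernel like the exponential squared, an off-diagonal block of the kernel matrix has rapidly decaying singular values, so a low-rank factorization captures it almost exactly:

```python
import numpy as np

# exponential-squared kernel matrix on 1000 sorted points
x = np.linspace(0, 10, 1000)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0)

# take the upper-right off-diagonal block and inspect its spectrum
block = K[:500, 500:]
s = np.linalg.svd(block, compute_uv=False)

# numerical rank: singular values above a tiny relative threshold
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)  # far below the block dimension of 500
```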
  83. github.com/sivaramambikasaran/HODLR

  84. github.com/sivaramambikasaran/HODLR

  85. The HODLR solver from George github.com/dfm/george:

        import george
        import numpy as np

        # kernel = george.kernels...

        def george_lnlike(x, y, yerr):
            gp = george.GP(kernel)
            gp.compute(x, yerr)
            return gp.lnlikelihood(y)

  86. The HODLR solver from George github.com/dfm/george:

        import george
        import numpy as np

        # kernel = george.kernels...

        def george_lnlike(x, y, yerr):
            gp = george.GP(kernel, solver=george.HODLRSolver)
            gp.compute(x, yerr)
            return gp.lnlikelihood(y)
  87. [Log-log plot: time [seconds] vs. number of data points (10³ to 10⁴), comparing the simple implementation with George]
  88. References: gaussianprocess.org/gpml // github.com/dfm/george // danfm@nyu.edu