Slide 1

Slide 1 text

Philip Sterne Stochastic Consulting Assistant Professor at Minerva University Gaussian Processes in Python https://github.com/StoCon/pycon2016

Slide 2

Slide 2 text

Gaussian-what? Aka “Normal distribution” 1D 2D Now multiply that by infinity!

Slide 3

Slide 3 text

Demo: The only live demo I’m brave enough to run! python sampling.py

Slide 4

Slide 4 text

Covariance functions “Sphere of influence”

Slide 5

Slide 5 text

Give me cookie data!

Slide 6

Slide 6 text

https://cloud.google.com/bigquery/public-data/noaa-ghcn Weather data going as far back as 1850 for SA data (only precipitation) Simple python + bigquery script to download all the data SELECT stn.id,stn.name,stn.latitude,stn.longitude,date, FROM [bigquery-public-data:ghcn_d.ghcnd_stations] as stn ... NOAA Global Historical Climatology

Slide 7

Slide 7 text

SA weather stations:

Slide 8

Slide 8 text

So what do we get ● Daily maximum temperature ● Daily minimum temperature ● Daily precipitation

Slide 9

Slide 9 text

Cape Town Because who cares about the rest of the country?

Slide 10

Slide 10 text

Gaussian Processes in Python Two main packages: ● GPy ● sklearn.gaussian_process* (* Only available in scikit version 0.18)

Slide 11

Slide 11 text

import GPy (X, Y) = get_data(...) kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.) m = GPy.models.GPRegression(X, Y, kernel=kernel) m.optimize_restarts(num_restarts=3) m.plot() (Pronounced “gee-pie”) GPy code

Slide 12

Slide 12 text

from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.gaussian_process.kernels import RBF kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-1, 1e3)) (X, Y) = get_data(...) kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.) m = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3) m.fit(X, Y) y_pred, sigma = m.predict(x, return_std=True) (Pronounced “sy-kit learn”. sci stands for science!) Sci-kit learn code

Slide 13

Slide 13 text

Gaussian Processes for Regression What is the max/min temperature today?* *without looking at a weather forecast, or any current meteorology information

Slide 14

Slide 14 text

Daily average min/max temperatures for Cape Town (1973 - 2015)

Slide 15

Slide 15 text

GPy: Daily min temperatures for Cape Town (1973 - 2015) 10.2° for Oct 7 Vs 12° for today

Slide 16

Slide 16 text

(1973 - 2015) GPy: Daily max temperatures for Cape Town 22.4° for Oct 7 Vs 18° for today

Slide 17

Slide 17 text

(1973 - 2015) scikit.learn: Daily min/max temperatures for Cape Town

Slide 18

Slide 18 text

Impressions GPy: ● More robust ● More features ● Slower ● Actively developed Scikit.learn: ● Faster ● Actively developed Fairly similar code

Slide 19

Slide 19 text

Some stats: Temperature Warmest day Feb 7th 16.2° 28.2° Coldest day July 13th 6.6 18.4

Slide 20

Slide 20 text

Gaussian Processes for Classification Will it rain today?* *without looking at a weather forecast, or any current meteorology information

Slide 21

Slide 21 text

Did it rain in Cape Town on a given day? (It rained if there was >1mm precipitation)

Slide 22

Slide 22 text

Average probability of rain on a given day

Slide 23

Slide 23 text

Some more stats Precipitation Driest day Jan 9th 8% Rainiest day June 26th 33% But WTH are “inducing points”?

Slide 24

Slide 24 text

Normal Gaussian Processes Suck O(N³) computational complexity Ideally suited for “small data” i.e. get as much insight out of a limited data set

Slide 25

Slide 25 text

Sparse Gaussian Processes “Use a small number of points to make predictions, but those points get influenced by the entire dataset” This gets us back to O(N).

Slide 26

Slide 26 text

But the code????? num_inducing = 50 kernel = GPy.kern.StdPeriodic( input_dim=1, variance=.1, lengthscale=1., period=366.0) m = GPy.models.SparseGPClassification( X, Y, kernel=kernel, num_inducing=num_inducing) m.optimize_restarts(num_restarts=3)

Slide 27

Slide 27 text

The rest of South Africa? If you insist...

Slide 28

Slide 28 text

And now for your weather forecast* . (* A random smush of the same week over 8 years - it ran quickly on my laptop, and looked pretty too.)

Slide 29

Slide 29 text

GPy 2D code: num_inducing = 30 kernel = GPy.kern.RBF(input_dim=2,variance=1.,lengthscale=1.) m = GPy.models.SparseGPClassification( X, Y, kernel=kernel, num_inducing=num_inducing) m.optimize_restarts(num_restarts=1) m.plot()

Slide 30

Slide 30 text

Questions?

Slide 31

Slide 31 text

References: 1. Weather data 2. GPy 3. Scikit.learn Thank You! I’m not hiring. But I’m interested in this stuff. Come talk to me!