Machine Learning in Python - Gaussian Processes by Philip Sterne

Pycon ZA
October 07, 2016

Any time you have noisy data where you would like to see the underlying trend then you should think about using Gaussian processes. They will smooth out any noise and give you a great visualisation of the error bars as well. Rather than fitting a specific model to the data, Gaussian processes can model any smooth function.

I will show you how to use Python to:

• fit Gaussian processes to data
• display the results intuitively
• handle large datasets

This talk will gloss over mathematical detail and instead focus on the options available to the Python programmer. Code will be posted to GitHub beforehand so you can follow along during the talk, or just scoop up the useful bits afterwards.

Transcript

  1. Philip Sterne
    Stochastic Consulting
    Assistant Professor at Minerva University
    Gaussian Processes in Python
    https://github.com/StoCon/pycon2016
  2. NOAA Global Historical Climatology
    https://cloud.google.com/bigquery/public-data/noaa-ghcn
    Weather data going as far back as 1850 for SA data (only precipitation)
    Simple Python + BigQuery script to download all the data:
        SELECT stn.id, stn.name, stn.latitude, stn.longitude, date,
        FROM [bigquery-public-data:ghcn_d.ghcnd_stations] as stn ...
  3. So what do we get?
    • Daily maximum temperature
    • Daily minimum temperature
    • Daily precipitation
  4. Gaussian Processes in Python
    Two main packages:
    • GPy
    • sklearn.gaussian_process*
    (* Only available from scikit-learn version 0.18 onwards)
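  5. GPy code (pronounced “gee-pie”)
        import GPy

        # GPy expects X and Y as 2-D column arrays
        (X, Y) = get_data(...)
        # RBF (squared-exponential) kernel over one input dimension
        kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
        m = GPy.models.GPRegression(X, Y, kernel=kernel)
        # Optimise the hyperparameters from 3 starting points, keep the best
        m.optimize_restarts(num_restarts=3)
        m.plot()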
  6. scikit-learn code (pronounced “sy-kit learn”; sci stands for science!)
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        (X, Y) = get_data(...)
        # RBF kernel; the length scale is optimised within the given bounds
        kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-1, 1e3))
        m = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3)
        m.fit(X, Y)
        # x: the inputs to predict at (not defined on the slide)
        y_pred, sigma = m.predict(x, return_std=True)
  7. Gaussian Processes for Regression
    What is the max/min temperature today?*
    (* Without looking at a weather forecast or any current meteorological information)
  8. GPy: Daily min temperatures for Cape Town (1973 - 2015)
    10.2° for Oct 7 vs 12° for today
  9. GPy: Daily max temperatures for Cape Town (1973 - 2015)
    22.4° for Oct 7 vs 18° for today
  10. Impressions
    GPy:
    • More robust
    • More features
    • Slower
    • Actively developed
    scikit-learn:
    • Faster
    • Actively developed
    Fairly similar code
  11. Gaussian Processes for Classification
    Will it rain today?*
    (* Without looking at a weather forecast or any current meteorological information)
  12. Did it rain in Cape Town on a given day?
    (It rained if there was >1 mm precipitation)
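The labelling rule on slide 12 can be sketched in a few lines. This is only an illustration: the function name `rained` and the sample precipitation values are made up, and the slide's ">1mm" is read as a strict inequality.

```python
RAIN_THRESHOLD_MM = 1.0  # threshold taken from slide 12

def rained(precip_mm):
    """Return True if the day counts as rainy under the slide's rule."""
    return precip_mm > RAIN_THRESHOLD_MM

# Made-up daily precipitation readings in millimetres (not real station data)
daily_precip_mm = [0.0, 0.4, 1.0, 1.2, 7.5]

# Binary class labels of the kind a GP classifier would be trained on
labels = [int(rained(p)) for p in daily_precip_mm]
```

Note that a day with exactly 1.0 mm is labelled dry under this reading.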
  13. Some more stats
    Precipitation
    Driest day     Jan 9th     8%
    Rainiest day   June 26th   33%
    But WTH are “inducing points”?
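The slide does not say whether the 8% and 33% figures are raw frequencies or outputs of the fitted model. As a point of comparison, the sketch below computes the raw fraction of rainy days per calendar day; the helper `rain_fraction_by_day` and the records are hypothetical.

```python
from collections import defaultdict
from datetime import date

def rain_fraction_by_day(records):
    """Map (month, day) to the fraction of years in which that day was rainy."""
    counts = defaultdict(lambda: [0, 0])  # (month, day) -> [rainy, total]
    for day, was_rainy in records:
        key = (day.month, day.day)
        counts[key][0] += int(was_rainy)
        counts[key][1] += 1
    return {key: rainy / total for key, (rainy, total) in counts.items()}

# Made-up observations: (date, did it rain?)
records = [
    (date(2013, 6, 26), True),
    (date(2014, 6, 26), False),
    (date(2015, 6, 26), True),
    (date(2014, 1, 9), False),
    (date(2015, 1, 9), False),
]
fractions = rain_fraction_by_day(records)
```

With only a few dozen years per calendar day, raw fractions like these are noisy, which is the kind of situation where a smooth model over the day of the year helps.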
  14. Normal Gaussian Processes Suck
    O(N³) computational complexity
    Ideally suited for “small data”, i.e. getting as much insight as possible out of a limited data set
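To see where the O(N³) on slide 14 comes from, here is a minimal pure-Python sketch of exact GP regression. It is not the code GPy or scikit-learn run; the function names, the fixed hyperparameters and the toy data are all assumptions made for illustration. The expensive step is the Cholesky factorisation of the N×N kernel matrix.

```python
import math

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A. Three nested loops: O(N^3)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, Y, x_star, noise=0.1):
    """Posterior mean and variance of the latent function at x_star."""
    # N x N kernel matrix with noise variance added on the diagonal
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    k_star = [rbf(x_star, a) for a in X]
    alpha = solve_cholesky(L, Y)        # K^-1 Y
    v = solve_cholesky(L, k_star)       # K^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Toy data: noise-free samples of sin(x) at five inputs
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [math.sin(x) for x in X]
mean_near, var_near = gp_predict(X, Y, 2.0)    # inside the data
mean_far, var_far = gp_predict(X, Y, 100.0)    # far from any data
```

Near the data the predictive variance shrinks below the prior variance; far away the prediction falls back to the prior (mean 0, variance 1), which is what the error bars in the plots express.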
  15. Sparse Gaussian Processes
    “Use a small number of points to make predictions, but those points get influenced by the entire dataset”
    This gets us back to O(N) in the number of data points (O(NM²) for M inducing points).
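A rough back-of-the-envelope comparison shows how much the sparse approximation saves. The numbers below are assumptions: about 43 years of daily records (1973 to 2015, ignoring leap days and gaps) and the 50 inducing points used on slide 16. The sparse cost is taken as N·M², the usual scaling for inducing-point methods, with constant factors ignored.

```python
# Approximate number of daily observations for 1973-2015 (assumed, no gaps)
n_points = 43 * 365
num_inducing = 50  # value used in the talk's sparse classification code

exact_cost = n_points ** 3                  # exact GP: O(N^3)
sparse_cost = n_points * num_inducing ** 2  # sparse GP: O(N M^2)

speedup = exact_cost / sparse_cost          # equals (N / M)^2
print(f"exact ~{exact_cost:.1e}, sparse ~{sparse_cost:.1e}, "
      f"ratio ~{speedup:.0f}")
```

The ratio is (N/M)², so doubling the dataset quadruples the advantage of the sparse model when M stays fixed.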
  16. But the code?????
        num_inducing = 50
        # Periodic kernel; period=366.0 is presumably one year in days
        kernel = GPy.kern.StdPeriodic(
            input_dim=1, variance=.1, lengthscale=1., period=366.0)
        m = GPy.models.SparseGPClassification(
            X, Y, kernel=kernel, num_inducing=num_inducing)
        m.optimize_restarts(num_restarts=3)
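To give a feel for what the periodic kernel on slide 16 encodes, the sketch below evaluates the standard periodic covariance for scalar inputs. The formula follows the form given in GPy's documentation for `StdPeriodic` as I understand it; the function `std_periodic` is a stand-alone illustration, not GPy's implementation, and its defaults simply copy the slide's parameter values.

```python
import math

def std_periodic(x1, x2, variance=0.1, lengthscale=1.0, period=366.0):
    """Standard periodic covariance between two scalar inputs (e.g. days)."""
    s = math.sin(math.pi * (x1 - x2) / period)
    return variance * math.exp(-0.5 * (s / lengthscale) ** 2)

# Two days exactly one period apart are as correlated as a day with itself
k_same = std_periodic(10.0, 10.0)
k_one_period = std_periodic(10.0, 10.0 + 366.0)

# Days half a period apart are the least correlated
k_half_period = std_periodic(10.0, 10.0 + 183.0)
```

This is why the model treats the same calendar day in different years as closely related, so every year of data contributes to the estimate for a given day.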
  17. And now for your weather forecast*
    (* A random smush of the same week over 8 years - it ran quickly on my laptop, and looked pretty too.)
  18. GPy 2D code:
        num_inducing = 30
        # RBF kernel over two input dimensions
        kernel = GPy.kern.RBF(input_dim=2, variance=1., lengthscale=1.)
        m = GPy.models.SparseGPClassification(
            X, Y, kernel=kernel, num_inducing=num_inducing)
        m.optimize_restarts(num_restarts=1)
        m.plot()
  19. References:
    1. Weather data
    2. GPy
    3. scikit-learn
    Thank You!
    I’m not hiring. But I’m interested in this stuff. Come talk to me!