Slide 1

Slide 1 text

Chasing exoplanets with Python and machine learning
Carlos Alberto Gomez Gonzalez
PySciDataGre launch event, 08/03/2018

Slide 2

Slide 2 text

Exoplanets

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Mostly, we rely on indirect methods for detecting exoplanets, because it’s very hard to see them. This is not what they look like!

Slide 5

Slide 5 text

Credit: NASA. https://exoplanets.nasa.gov/exep/coronagraphvideo/ VIDEOCLIP

Slide 6

Slide 6 text

Planet hunter pipeline
Raw astronomical images
Basic calibration and “cosmetics”
• Dark/bias subtraction
• Flat fielding
• Sky (thermal background) subtraction
• Bad pixel correction
Image recentering
• Center of mass
• 2d Gaussian fit
• DFT cross-correlation
Bad frames removal
• Image correlation
• Pixel statistics (specific image regions)
Reference PSF creation
• Pairwise
• Median
• PCA, NMF
• LOCI
• LLSG
PSF reference subtraction
De-rotation (for ADI) or rescaling (for mSDI)
Image combination
• Mean, median, trimmed mean
Final residual image
Characterization of detected companions
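A few of the pipeline steps listed on this slide can be sketched with NumPy alone. This is a minimal illustration on toy data, not the implementation used in the talk: the function names (`calibrate`, `center_of_mass`, `subtract_median_psf`, `combine`) are hypothetical, and the de-rotation step needed for ADI is left out because it requires image interpolation (for example `scipy.ndimage.rotate`).

```python
import numpy as np


def calibrate(raw, dark, flat):
    """Dark subtraction followed by flat fielding (flat assumed non-zero)."""
    return (raw - dark) / flat


def center_of_mass(image):
    """Return the intensity-weighted centroid (y, x) of a 2D image."""
    ys, xs = np.indices(image.shape)
    total = image.sum()
    return (ys * image).sum() / total, (xs * image).sum() / total


def subtract_median_psf(cube):
    """Use the temporal median of the cube as reference PSF and subtract it."""
    return cube - np.median(cube, axis=0)


def combine(cube, mode="median"):
    """Collapse a cube of shape (n_frames, ny, nx) into a single image."""
    if mode == "mean":
        return cube.mean(axis=0)
    if mode == "median":
        return np.median(cube, axis=0)
    raise ValueError(f"unknown mode: {mode}")


# Toy data (an assumption for illustration): 5 identical frames with a
# static "star" at pixel (3, 5) sitting on a constant dark level.
star = np.zeros((8, 8))
star[3, 5] = 100.0
dark = np.full((8, 8), 2.0)
flat = np.ones((8, 8))
raw_cube = np.stack([star + dark for _ in range(5)])

cube = calibrate(raw_cube, dark, flat)
centroid = center_of_mass(cube[0])
residual_image = combine(subtract_median_psf(cube))
```

Because the toy star never moves, the median reference removes it completely and the residual image is empty. In real ADI data the field rotates between frames, which is what lets a planet survive the reference subtraction.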

Slide 7

Slide 7 text

Planet hunter pipeline (same diagram as on the previous slide), with the stages now labelled pre-processing and post-processing

Slide 8

Slide 8 text

pre-proc. VIDEOCLIP

Slide 9

Slide 9 text

HR8799 bcde (Marois et al. 2008-2010)
One of the lucky cases!
Final images after post-processing (several epochs)
post-proc.

Slide 10

Slide 10 text

Why Python? • Well suited for science and exploratory analysis • High-level syntax and gentle learning curve

Slide 11

Slide 11 text

Why Python? • Powerful open-source scientific software stack J. Vanderplas, PyCon 2017 keynote

Slide 12

Slide 12 text

Why Python? • Becoming popular in Astronomy • Mentions of Software in Astronomy Publications: J. Vanderplas, PyCon 2017 keynote

Slide 13

Slide 13 text

Why Python? • It’s just fun! https://xkcd.com/353/

Slide 14

Slide 14 text

Vortex Image Processing library
Gomez Gonzalez et al. 2017
• https://github.com/vortex-exoplanet/VIP
• Available on PyPI
• Documentation (Sphinx): http://vip.readthedocs.io/
• Bug tracking & interaction with users/devs

Slide 15

Slide 15 text

Gomez Gonzalez et al. 2017 Vortex Image Processing library

Slide 16

Slide 16 text

Gomez Gonzalez et al. 2017 Vortex Image Processing library

Slide 17

Slide 17 text

Algo-ZOO Gomez Gonzalez et al. 2016

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Open Science & reproducibility Open source

Slide 20

Slide 20 text

“An article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.”
Buckheit and Donoho, 1995
“Today, software is to scientific research what Galileo’s telescope was to astronomy: a tool, combining science and engineering. It lies outside the central field of principal competence among the researchers that rely on it. … it builds upon scientific progress and shapes our scientific vision.”
Pradal 2015

Slide 21

Slide 21 text

With great power…
• Comes a great burden!
  • Developing and maintaining open-source code is not trivial
• And a great responsibility…
  • Making sure the code is scientifically correct
  • And that it’s clean, free of bugs and well-documented
Best practices for scientific computing (Wilson et al. 2012)
Good enough practices in scientific computing (Wilson et al. 2016)

Slide 22

Slide 22 text

Detection
Image sequence → Final residual image → ?

Slide 23

Slide 23 text

Detection VIDEOCLIP

Slide 24

Slide 24 text

“Essentially, all models are wrong, but some are useful.”
George Box
“…if the model is going to be wrong anyway, why not see if you can get the computer to ‘quickly’ learn a model from the data, rather than have a human laboriously derive a model from a lot of thought.”
Peter Norvig

Slide 25

Slide 25 text

Textbook Machine Learning
• Supervised: Regression, Classification
• Unsupervised: Dimensionality reduction, Clustering
Figure axes: PC 1, PC 2

Slide 26

Slide 26 text

Image (model PSF) subtraction → Supervised detection (SODINN)
noisy and unlabelled images → data transformation + adequate (ML) model → astonishing results

Slide 27

Slide 27 text

Supervised learning
• The goal is to learn a function $f : \mathcal{X} \rightarrow \mathcal{Y}$ that maps the input samples to the labels, given a labeled dataset $(x_i, y_i)_{i=1,\dots,n}$:
$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(f)$$
Goodfellow et al. 2016
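One concrete instance of this regularized objective is L2-penalized logistic regression, fitted here by plain gradient descent. This is only a sketch under stated assumptions: the log loss as L, the squared norm of the weights as Ω, the toy Gaussian data, and the names `objective` and `fit` are all choices made for illustration and do not come from the talk.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def objective(w, X, y, lam):
    """(1/n) * sum_i L(y_i, f(x_i)) + lam * Omega(f).

    Assumptions: f(x) = sigmoid(w . x), L = log loss, Omega = ||w||^2.
    """
    p = sigmoid(X @ w)
    eps = 1e-12  # avoids log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss + lam * np.sum(w ** 2)


def fit(X, y, lam=0.01, lr=0.5, n_iter=500):
    """Minimize the objective over w with fixed-step gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + 2 * lam * w
        w -= lr * grad
    return w


# Toy labeled dataset (x_i, y_i): two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = fit(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
```

Increasing `lam` shrinks the weights and trades training accuracy for a simpler function, which is the role of the λΩ(f) term in the objective.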

Slide 28

Slide 28 text

SODINN: supervised detection of exoplanets
Gomez Gonzalez et al. 2018
Diagram with panels (a), (b), (c):
• Inputs: input cube (N frames), PSF
• N x P_ann matrix → k SVD low-rank approximation levels → k residuals, back to image space → MLAR patches
• X: MLAR samples; y: Labels (0, 1)
• X and y to train/test/validation sets
• Network:
  • Convolutional LSTM layer, kernel=(3x3), filters=40
  • 3d Max pooling, size=(2x2x2)
  • Convolutional LSTM layer, kernel=(2x2), filters=80
  • 3d Max pooling, size=(2x2x2)
  • Dense layer, units=128, ReLU activation + dropout
  • Output dense layer, units=1, Sigmoid activation
• Trained classifier → Probability of positive class → Binary map (probability threshold = 0.9)
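The last step of the diagram, turning the probability of the positive class into a binary map at a threshold of 0.9, might look like the sketch below. The probability values are made up for illustration, `binary_detection_map` is a hypothetical helper, and treating a probability exactly equal to the threshold as a detection is an assumption.

```python
import numpy as np


def binary_detection_map(prob_map, threshold=0.9):
    """Flag pixels whose positive-class probability reaches the threshold."""
    return prob_map >= threshold


# Made-up classifier output: one probability per pixel (or patch centre).
prob_map = np.array([
    [0.05, 0.10, 0.02],
    [0.20, 0.97, 0.40],
    [0.01, 0.30, 0.91],
])

detections = binary_detection_map(prob_map)
coords = np.argwhere(detections)  # (row, col) of each flagged pixel
```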

Slide 29

Slide 29 text

Generating a labeled dataset
Multi-level Low-rank Approximation Residual (MLAR) samples
$M \in \mathbb{R}^{n \times p}$
$M = U \Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T$
$res = M - M B_k^T B_k$
Choosing k based on the explained variance ratio
Labels: $y \in \{c^-, c^+\}$ (classes C+ and C-)
Figure panels: (a), (b)
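The residual formula and the choice of k from the explained variance ratio can be sketched with NumPy's SVD. Several things here are assumptions made for illustration: the toy matrix, the cumulative-variance levels in `levels`, taking B_k as the first k right singular vectors, and skipping mean-centering to stay close to the formula as written on the slide.

```python
import numpy as np


def explained_variance_ratio(singular_values):
    """Fraction of the total variance carried by each singular vector."""
    var = singular_values ** 2
    return var / var.sum()


def low_rank_residual(M, Vt, k):
    """res = M - M B_k^T B_k, with B_k the first k right singular vectors."""
    B_k = Vt[:k]
    return M - M @ B_k.T @ B_k


rng = np.random.default_rng(0)
n, p = 20, 50  # toy sizes: n frames, p pixels per frame
M = rng.normal(size=(n, 3)) @ rng.normal(size=(3, p))  # rank-3 "background"
M += 0.01 * rng.normal(size=(n, p))                    # small noise

U, s, Vt = np.linalg.svd(M, full_matrices=False)
cumulative = np.cumsum(explained_variance_ratio(s))

# Hypothetical rule: one k per cumulative explained-variance level.
levels = (0.5, 0.9, 0.99)
ks = [int(np.searchsorted(cumulative, level)) + 1 for level in levels]

# One residual per level, stacked along a new leading axis.
residuals = np.stack([low_rank_residual(M, Vt, k) for k in ks])
```

Each slice of `residuals` is the data with a progressively richer low-rank model removed. Stacking the levels gives the multi-level sample from which patches would then be cropped.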

Slide 30

Slide 30 text

Training a classifier
• SODIRF: Random forest
• SODINN: convolutional LSTM deep neural network
Goal - to make predictions on new samples:
$f : \mathcal{X} \rightarrow \mathcal{Y}$
$\hat{y} = p(c^+ \mid \text{MLAR sample})$
Training a model
• X and y to train/test/validation sets
• Convolutional LSTM layer, kernel=(3x3), filters=40
• 3d Max pooling, size=(2x2x2)
• Convolutional LSTM layer, kernel=(2x2), filters=80
• 3d Max pooling, size=(2x2x2)
• Dense layer, units=128, ReLU activation + dropout
• Output dense layer, units=1, Sigmoid activation

Slide 31

Slide 31 text

Performance assessment
Figure panels: Good classifier, Bad classifier
Figure labels: Observations, Threshold, True positive, False positive, True negative, False negative
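The four outcomes named on this slide can be counted for any threshold, which gives the true and false positive rates of a classifier. The scores below are invented to contrast a well-separated ("good") and a poorly separated ("bad") classifier, and the helper names are hypothetical.

```python
import numpy as np


def confusion_counts(y_true, scores, threshold):
    """Return (TP, FP, FN, TN) when scores >= threshold count as positive."""
    pred = scores >= threshold
    truth = y_true.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn


def rates(y_true, scores, threshold):
    """Return (true positive rate, false positive rate)."""
    tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Invented observations: 3 true companions (1) and 5 noise samples (0).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
good_scores = np.array([0.95, 0.90, 0.80, 0.30, 0.20, 0.10, 0.05, 0.02])
bad_scores = np.array([0.60, 0.40, 0.20, 0.70, 0.50, 0.30, 0.10, 0.55])

for name, scores in [("good", good_scores), ("bad", bad_scores)]:
    for threshold in (0.25, 0.5, 0.75):
        tpr, fpr = rates(y_true, scores, threshold)
        print(f"{name} classifier, threshold={threshold}: "
              f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Sweeping the threshold over many values and plotting TPR against FPR would give a ROC curve for each classifier.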

Slide 32

Slide 32 text

Data-driven performance assessment

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Venn diagram: Computer science, Machine learning & Stats, Domain knowledge
Intersection: Academic DS (not easy to get here!)

Slide 35

Slide 35 text

Open (academic data) science
Transforming science:
• Cross/inter-disciplinary research (Science with CS, ML, AI fields)
• Ensuring the use of robust statistical approaches and well-suited metrics
• Integrating cutting-edge AI developments
http://jakevdp.github.io/blog/2014/08/22/hacking-academia/

Slide 36

Slide 36 text

Open (academic data) science
• Open peer-review
• Code (and supporting data) release
• Code publishing:
  • The Journal of Open Source Software: https://joss.theoj.org/
  • The Journal of Open Research Software: https://openresearchsoftware.metajnl.com/
• Knowledge sharing (non-refereed publications)
• Data challenges (benchmark datasets)

Slide 37

Slide 37 text

And finally… Y U NO USE PYTHON!

Slide 38

Slide 38 text

Thank you! carlgogo carlosalbertogomezgonzalez https://carlgogo.github.io/