Data Science in astro image processing: looking for exoplanets using machine learning

Slide 1

Slide 1 text

Data science in astro image processing: looking for exoplanets using machine learning Carlos Alberto Gomez Gonzalez Data Science in the Alps, 20/03/2018

Slide 2

Slide 2 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 3

Slide 3 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 4

Slide 4 text

Exoplanets

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Mostly, we rely on indirect methods for detecting exoplanets, because it’s very hard to see them. This is not what they look like!

Slide 7

Slide 7 text

Credit: NASA, http://planetquest.jpl.nasa.gov Animation

Slide 8

Slide 8 text

SPHERE, Vigan et al. 2015 Very Large Telescope (VLT), Chile

Slide 9

Slide 9 text

Credit: NASA, https://exoplanets.nasa.gov/exep/coronagraphvideo/ Videoclip

Slide 10

Slide 10 text

A fair amount of image processing!

Slide 11

Slide 11 text

• Raw astronomical images
• Basic calibration and “cosmetics”: dark/bias subtraction, flat fielding, sky or thermal background subtraction, bad pixel correction
• Image recentering
• Bad frames removal
• Sequence of calibrated images
• PSF modeling: median; pairwise, ANDROMEDA; LOCI; PCA, NMF; LLSG
• Model PSF subtraction
• De-rotation (ADI) or rescaling (mSDI)
• Image combination
• Detection on final residual image
• Characterization of detected companions
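The calibration steps and the simplest PSF model listed on this slide (the median) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the VIP implementation: the function names `calibrate_frame` and `subtract_median_psf`, the 3x3 neighbourhood used for bad pixels, and the toy data are all hypothetical. De-rotation and image combination are left out.

```python
import numpy as np

def calibrate_frame(raw, dark, flat, bad_pixel_mask):
    """Dark-subtract, flat-field and patch bad pixels in one 2D frame."""
    # Dark/bias subtraction removes the additive detector signal.
    frame = raw.astype(float) - dark
    # Flat fielding: divide by the flat normalized to unit mean
    # (assumption: the flat has already been dark-subtracted).
    frame = frame / (flat / flat.mean())
    # Bad pixel correction: replace each flagged pixel by the median
    # of the good pixels in its 3x3 neighbourhood (illustrative choice).
    fixed = frame.copy()
    ny, nx = frame.shape
    for y, x in zip(*np.nonzero(bad_pixel_mask)):
        y0, y1 = max(y - 1, 0), min(y + 2, ny)
        x0, x1 = max(x - 1, 0), min(x + 2, nx)
        good = ~bad_pixel_mask[y0:y1, x0:x1]
        if good.any():
            fixed[y, x] = np.median(frame[y0:y1, x0:x1][good])
    return fixed

def subtract_median_psf(cube):
    """Model the quasi-static stellar PSF as the median over frames
    and subtract it from every frame of the (n_frames, ny, nx) cube."""
    psf_model = np.median(cube, axis=0)
    return cube - psf_model

# Toy data: a uniform sky seen through a non-uniform detector response.
rng = np.random.default_rng(0)
sky = np.full((8, 8), 100.0)
dark = np.full((8, 8), 5.0)
flat = 1.0 + 0.05 * rng.standard_normal((8, 8))
raw = sky * flat + dark
bad = np.zeros((8, 8), dtype=bool)
bad[3, 4] = True
raw[3, 4] = 1e6  # a hot pixel

calibrated = calibrate_frame(raw, dark, flat, bad)

# Three frames that differ only by a constant offset.
cube = np.stack([calibrated + offset for offset in (0.0, 1.0, 2.0)])
residuals = subtract_median_psf(cube)
```

In a real ADI sequence the residual frames would then be de-rotated to a common sky orientation and combined before detection.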

Slide 12

Slide 12 text

Calibrated images (animations): a bright synthetic planet and a 100x fainter synthetic planet, each marked “starts here”

Slide 13

Slide 13 text

HR8799 bcde (Marois et al. 2008-2010): one of the lucky cases! Final images after post-processing (several epochs). Animation

Slide 14

Slide 14 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 15

Slide 15 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 16

Slide 16 text

J. Vanderplas, PyCon 2017 keynote

Slide 17

Slide 17 text

Vortex Image Processing library (Gomez Gonzalez et al. 2017)
• Available on PyPI
• Documentation: http://vip.readthedocs.io/
• https://github.com/vortex-exoplanet/VIP
• Bug tracking & interaction with users/devs

Slide 18

Slide 18 text

• Continuous integration (Travis CI) • Python 2/3 compatibility • Automated testing (Pytest)

Slide 19

Slide 19 text

Algo-ZOO (Gomez Gonzalez et al. 2016)

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Open Science & reproducibility Open source

Slide 22

Slide 22 text

“An article about computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.” (Buckheit and Donoho, 1995)
“Today, software is to scientific research what Galileo’s telescope was to astronomy: a tool, combining science and engineering. It lies outside the central field of principal competence among the researchers that rely on it. … it builds upon scientific progress and shapes our scientific vision.” (Pradal 2015)

Slide 23

Slide 23 text

With great power…
• Comes a great burden!
  • Developing and maintaining open-source code is not trivial.
• And a great responsibility…
  • Making sure the code is scientifically correct
  • and that it’s readable, free of bugs and well-documented
Best practices for scientific computing (Wilson et al. 2012)
Good enough practices in scientific computing (Wilson et al. 2016)

Slide 24

Slide 24 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 25

Slide 25 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science

Slide 26

Slide 26 text

Image sequence, final residual image. Detection? (Animation)

Slide 27

Slide 27 text

Animation

Slide 28

Slide 28 text

“Essentially, all models are wrong, but some are useful.” (George Box)
“…if the model is going to be wrong anyway, why not see if you can get the computer to ‘quickly’ learn a model from the data, rather than have a human laboriously derive a model from a lot of thought.” (Peter Norvig)

Slide 29

Slide 29 text

Textbook machine learning
• Supervised: regression, classification
• Unsupervised: dimensionality reduction (PC 1, PC 2), clustering

Slide 30

Slide 30 text

Image (model PSF) subtraction vs. supervised detection (SODINN): noisy and unlabelled images + data transformation + adequate (ML) model = astonishing results

Slide 31

Slide 31 text

Supervised learning (Goodfellow et al. 2016)
• The goal is to learn a function $f : \mathcal{X} \to \mathcal{Y}$ that maps the input samples to the labels, given a labeled dataset $(x_i, y_i)_{i=1,\dots,n}$:
$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(f)$$
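As a concrete instance of this objective, the sketch below takes $f$ to be a logistic regression model, $L$ the binary cross-entropy and $\Omega(f) = \lVert w \rVert^2$, and minimizes the sum by plain gradient descent. These choices, the function names and the toy two-blob dataset are assumptions made for illustration only; the slide does not prescribe any particular model, loss or regularizer.

```python
import numpy as np

def predict_proba(X, w, b):
    """f(x_i): probability of the positive class for each row of X."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def objective(X, y, w, b, lam):
    """(1/n) * sum_i L(y_i, f(x_i)) + lam * Omega(f), Omega = ||w||^2."""
    p = predict_proba(X, w, b)
    eps = 1e-12  # avoids log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return loss + lam * (w @ w)

def fit_logistic(X, y, lam=0.01, lr=0.5, n_steps=500):
    """Minimize the regularized empirical risk by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        p = predict_proba(X, w, b)
        grad_w = X.T @ (p - y) / n + 2.0 * lam * w  # data term + penalty
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy labeled dataset: two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

lam = 0.01
obj_start = objective(X, y, np.zeros(2), 0.0, lam)
w, b = fit_logistic(X, y, lam=lam)
obj_end = objective(X, y, w, b, lam)
accuracy = np.mean((predict_proba(X, w, b) >= 0.5) == (y == 1))
```

Increasing `lam` trades training accuracy for smaller weights, which is the role of the $\lambda \Omega(f)$ term in the objective.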

Slide 32

Slide 32 text

Deep neural networks
• Forward pass: input X → 1st layer (data transformation) → 2nd layer (data transformation) → … → Nth layer (data transformation) → predictions Y’
• Loss function: compares the predictions Y’ with the input labels Y and returns a loss score
• Backward pass: the optimizer uses the loss score to update the weights of each layer
$$f(x) = \sigma_k(A_k \, \sigma_{k-1}(A_{k-1} \cdots \sigma_2(A_2 \, \sigma_1(A_1 x)) \cdots))$$
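The nested formula for $f(x)$ is just a loop over layers, each applying a weight matrix and then a nonlinearity. The sketch below shows only the forward pass and the loss score; the layer sizes, the random weights and the choice of ReLU and sigmoid activations are illustrative assumptions, and the backward pass (weight update) is left out.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, activations):
    """Compute sigma_k(A_k ... sigma_2(A_2 sigma_1(A_1 x)))."""
    h = x
    for A, sigma in zip(weights, activations):
        h = sigma(A @ h)  # one layer: linear map, then nonlinearity
    return h

def binary_cross_entropy(y_true, p):
    """Loss score comparing a predicted probability with its label."""
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Hypothetical network: 8 inputs, two hidden layers, 1 output unit.
rng = np.random.default_rng(0)
layer_sizes = [8, 16, 4, 1]
weights = [0.1 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
activations = [relu, relu, sigmoid]

x = rng.standard_normal(8)
y_pred = forward(x, weights, activations)
loss = binary_cross_entropy(1.0, y_pred[0])
```

Training would repeat this forward pass, differentiate the loss with respect to every $A_i$, and let an optimizer update the weights.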

Slide 33

Slide 33 text

Supervised detection of exoplanets (Gomez Gonzalez et al. 2018)
(a) Generating a labeled dataset: input cube (N frames) and PSF → N x P_ann matrix → SVD low-rank approximation at k levels → k residuals, back to image space → MLAR samples X with labels y ∈ {0, 1}
(b) Training: X and y split into train/test/validation sets → convolutional LSTM layer (kernel=(3x3), filters=40) → 3D max pooling (size=(2x2x2)) → convolutional LSTM layer (kernel=(2x2), filters=80) → 3D max pooling (size=(2x2x2)) → dense layer (units=128), ReLU activation + dropout → output dense layer (units=1), sigmoid activation
(c) Prediction: input cube → MLAR patches → trained classifier → probability of positive class → binary map (probability threshold = 0.9)

Slide 34

Slide 34 text

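Multi-level Low-rank Approximation Residual (MLAR) samples
• SVD of the data matrix $M \in \mathbb{R}^{n \times p}$: $M = U \Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T$
• Residuals after subtracting the rank-$k$ approximation: $\mathrm{res} = M - M B_k^T B_k$
• Choosing $k$ based on the explained variance ratio
Generating a labeled dataset
• Two classes, $c^{+}$ and $c^{-}$; labels: $y \in \{c^{-}, c^{+}\}$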
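The residual formula on this slide can be tried out with NumPy's SVD. In the sketch below, $B_k$ is assumed to hold the first $k$ right singular vectors of $M$ as rows, and the levels $k$ are picked where the cumulative explained variance ratio crosses a few thresholds. The thresholds (0.5, 0.9, 0.99), the function names and the random toy matrix are illustrative assumptions, not values from the talk.

```python
import numpy as np

def cumulative_explained_variance(M):
    """Cumulative explained variance ratio from the singular values
    of M (M is assumed to be mean-centered already)."""
    s = np.linalg.svd(M, compute_uv=False)
    var = s ** 2
    return np.cumsum(var) / var.sum()

def low_rank_residuals(M, ks):
    """Return res_k = M - M B_k^T B_k for every k in ks, where B_k
    stacks the first k right singular vectors of M as rows."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    residuals = []
    for k in ks:
        B_k = Vt[:k]                        # shape (k, p)
        residuals.append(M - M @ B_k.T @ B_k)
    return np.stack(residuals)              # shape (len(ks), n, p)

# Toy data matrix: n = 20 frames, p = 50 pixels, mean-centered.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 50))
M = M - M.mean(axis=0)

cevr = cumulative_explained_variance(M)
# Smallest k whose cumulative ratio reaches each (hypothetical) threshold.
ks = sorted({int(np.searchsorted(cevr, t)) + 1 for t in (0.5, 0.9, 0.99)})

residuals = low_rank_residuals(M, ks)
norms = [np.linalg.norm(r) for r in residuals]
```

Each level removes more of the dominant structure, so the residual norm shrinks as $k$ grows; stacking the residuals from several levels gives one multi-level sample.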

Slide 35

Slide 35 text

Training a classifier
• SODIRF: random forest
• SODINN: convolutional LSTM deep neural network
• Goal: to make predictions on new samples, $f : \mathcal{X} \to \mathcal{Y}$, $\hat{y} = p(c^{+} \mid \text{MLAR sample})$
Training a model: X and y split into train/test/validation sets → convolutional LSTM layer (kernel=(3x3), filters=40) → 3D max pooling (size=(2x2x2)) → convolutional LSTM layer (kernel=(2x2), filters=80) → 3D max pooling (size=(2x2x2)) → dense layer (units=128), ReLU activation + dropout → output dense layer (units=1), sigmoid activation

Slide 36

Slide 36 text

Making predictions
(c) Input cube → MLAR patches → trained classifier → probability of positive class → binary map (probability threshold = 0.9)
Real data, HR8799 system
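The last two steps of the prediction stage, filling a probability map and thresholding it at 0.9, can be sketched as follows. The `toy_classifier` below is only a stand-in for a trained model, and the function names, patch sizes and pixel positions are hypothetical; whether the comparison with the threshold is strict is also an assumption.

```python
import numpy as np

def probability_map(patches, centers, shape, classifier):
    """Store the classifier output for each patch at its centre pixel."""
    prob = np.zeros(shape)
    for patch, (y, x) in zip(patches, centers):
        prob[y, x] = classifier(patch)
    return prob

def binary_map(prob, threshold=0.9):
    """Keep only the pixels whose probability reaches the threshold."""
    return prob >= threshold

def toy_classifier(patch):
    """Stand-in for a trained model: sigmoid of the mean patch value."""
    return 1.0 / (1.0 + np.exp(-patch.mean()))

# Two hypothetical patches: one bright (planet-like), one empty.
patches = [np.full((7, 7), 5.0), np.zeros((7, 7))]
centers = [(2, 3), (5, 1)]

prob = probability_map(patches, centers, (8, 8), toy_classifier)
detections = [tuple(int(v) for v in yx)
              for yx in np.argwhere(binary_map(prob))]
```

Lowering the threshold would also flag the empty patch, whose probability is 0.5, which is the trade-off examined under performance assessment.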

Slide 37

Slide 37 text

Performance assessment
• Good classifier vs. bad classifier
• A threshold splits the observations into true positives, false positives, true negatives and false negatives
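The four outcomes named on this slide can be counted directly from labels and classifier scores, and sweeping the threshold gives the points of a ROC curve. This is a minimal sketch with made-up labels and scores; the function names are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN for a given decision threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores) >= threshold
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    return tp, fp, fn, tn

def roc_points(y_true, scores, thresholds):
    """(false positive rate, true positive rate) for each threshold."""
    points = []
    for t in thresholds:
        tp, fp, fn, tn = confusion_counts(y_true, scores, t)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

# Made-up observations: 3 true companions, 5 noise samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]

counts_loose = confusion_counts(y_true, scores, 0.5)
counts_strict = confusion_counts(y_true, scores, 0.9)
curve = roc_points(y_true, scores, [0.9, 0.5, 0.0])
```

With these numbers, raising the threshold from 0.5 to 0.9 removes the single false positive but also misses one more true companion.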

Slide 38

Slide 38 text

Data-driven performance assessment

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science not easy to get here!

Slide 41

Slide 41 text

Communication and team skills Domain knowledge Computer science Machine learning & Stats (Academic) Data Science not easy to get here!

Slide 42

Slide 42 text

Open (academic) data science
• Cross/inter-disciplinary research (science with CS, ML, AI fields)
• To integrate cutting-edge AI developments
• Ensuring the use of robust statistical approaches and well-suited metrics
• Open peer-review
http://jakevdp.github.io/blog/2014/08/22/hacking-academia/

Slide 43

Slide 43 text

Open (academic) data science
• Code (and supporting data) release
• Code publishing:
  • The Journal of Open Source Software (https://joss.theoj.org/)
  • The Journal of Open Research Software (https://openresearchsoftware.metajnl.com/)
• Knowledge sharing
• Data challenges (benchmark datasets)
• Chance to transform science!!!

Slide 44

Slide 44 text

Interdisciplinarity: challenges
• Interdisciplinary expertise isn’t yet properly recognized:
  • Inadequate metrics and assessment mechanisms for promotion
• No clear paths/protocols for establishing collaborations (multidisciplinarity)
• It is often not trivial to navigate and integrate knowledge from different disciplines
• Never-ending impostor syndrome
https://www.nature.com/articles/s41599-017-0039-7
http://blog.fperez.org/2013/11/an-ambitious-experiment-in-data-science.html
https://www.space.com/39420-becoming-astrophysicist-keeps-getting-tougher.html

Slide 45

Slide 45 text

Thank you! carlgogo, carlosalbertogomezgonzalez, https://carlgogo.github.io/