Slide 1

Slide 1 text

Statistics, Data Mining, and Machine Learning In Astronomy Jake VanderPlas @jakevdp ACAT 2017

Slide 3

Slide 3 text

Straightforward application of common techniques often fails. In Astronomy:
- Data are often quite noisy

Slide 4

Slide 4 text

Straightforward application of common techniques often fails. In Astronomy:
- Data are often quite noisy
- Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects

Slide 5

Slide 5 text

Straightforward application of common techniques often fails. In Astronomy:
- Data are often quite noisy
- Pre-labeled objects are often biased toward easy-to-observe (bright and/or nearby) objects
- Data are fundamentally image-based, which doesn’t play well with tabular database architectures

Slide 6

Slide 6 text

We make progress in Astronomy by adapting and extending methods developed in other fields

Slide 7

Slide 7 text

Case Studies: Statistics, Data Mining, and Machine Learning

Slide 8

Slide 8 text

Case Studies: Statistics, Data Mining, and Machine Learning
Case 1: Generalizing the Lomb-Scargle Periodogram*
* J. VanderPlas et al. 2015

Slide 9

Slide 9 text

Periodic Analysis
Robust detection of periodic variability is important in many areas of Astronomy.
[Figure: Large-Scale Structure (Sesar et al. 2010)]
[Figure: Exoplanets (European Space Agency)]

Slide 10

Slide 10 text

Lomb-Scargle Periodogram
cf. Lomb (1976), Scargle (1982)
Figure: VanderPlas & Ivezic 2015
- Generalization of a Fourier Spectrogram
- Effectively assumes a sinusoidal model:
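The model equation on this slide did not survive in the transcript. A common way to write a sinusoidal model with a floating mean, assumed here rather than taken from the slide, is y(t) = θ0 + θ1 sin(2πft) + θ2 cos(2πft). The sketch below fits that model by weighted least squares at each trial frequency and reports the fractional reduction in χ² relative to a constant model. The function name and the normalization are illustrative choices, not the author's code.

```python
import numpy as np

def sinusoid_periodogram(t, y, dy, freqs):
    """Least-squares periodogram for an assumed single-sinusoid model.

    t, y, dy : observation times, values, and 1-sigma errors
    freqs    : trial frequencies (cycles per unit of t)
    Returns 1 - chi2(f) / chi2_ref for each trial frequency.
    """
    w = 1.0 / dy
    # Reference model: a constant equal to the weighted mean.
    mu = np.sum(y * w**2) / np.sum(w**2)
    chi2_ref = np.sum(((y - mu) * w) ** 2)

    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        omega = 2 * np.pi * f
        # Design matrix: floating mean plus one sine/cosine pair.
        X = np.column_stack([np.ones_like(t),
                             np.sin(omega * t),
                             np.cos(omega * t)])
        theta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
        resid = (y - X @ theta) * w
        power[i] = 1.0 - (resid @ resid) / chi2_ref
    return power
```

The power peaks at frequencies where a single sinusoid explains most of the scatter, which is the sense in which the periodogram effectively assumes a sinusoidal model.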

Slide 11

Slide 11 text

Problem: Lomb-Scargle is not designed for heterogeneous data. For example, stars observed in multiple bands (i.e. wavelength regions)

Slide 12

Slide 12 text

Two Naive Multiband Approaches (5 bands/night)
1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
2. Fit an independent periodogram within each band; combine the χ² of all K bands. (model is too flexible: over-fits the data)

Slide 13

Slide 13 text

Two Naive Multiband Approaches (1 band/night)
1. Ignore band distinction and fit a single periodogram to all bands. (model is highly biased: under-fits the data)
2. Fit an independent periodogram within each band; combine the χ² of all K bands. (model is too flexible: over-fits the data)

Slide 14

Slide 14 text

Idea: Generalize the model
“Multiband Periodogram”

Slide 15

Slide 15 text

Idea: Generalize the model
“Multiband Periodogram”
- define a base component which contributes equally to all bands.

Slide 16

Slide 16 text

Idea: Generalize the model
“Multiband Periodogram”
- for each band, add a band component to describe deviation from the base model

Slide 17

Slide 17 text

Putting it all together: The Multiband Periodogram
[Figure: base component + band component = multiband model]
Regularize the band component to drive common variation to the base model.
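The base-plus-band construction can be sketched as a regularized linear least-squares fit. The code below is a minimal illustration under stated assumptions: one sine/cosine pair plus an offset for the base component, the same three terms again for each band, and a ridge (Tikhonov) penalty applied only to the band columns. The function names, the `reg` value, and the per-band weighted-mean reference model are choices made for this example and are not taken from the slides.

```python
import numpy as np

def multiband_design(t, band_idx, n_bands, omega):
    """Columns: 3 shared base terms, then 3 band-specific terms per band."""
    base = np.column_stack([np.ones_like(t),
                            np.sin(omega * t),
                            np.cos(omega * t)])
    cols = [base]
    for k in range(n_bands):
        in_band = (band_idx == k).astype(float)
        # Band columns are nonzero only for observations in band k.
        cols.append(base * in_band[:, None])
    return np.hstack(cols)

def multiband_power(t, y, dy, band_idx, freqs, reg=1.0):
    """Regularized multiband periodogram (illustrative sketch)."""
    n_bands = int(band_idx.max()) + 1
    w = 1.0 / dy

    # Reference model: an independent weighted mean in each band.
    chi2_ref = 0.0
    for k in range(n_bands):
        m = band_idx == k
        mu = np.sum(y[m] * w[m] ** 2) / np.sum(w[m] ** 2)
        chi2_ref += np.sum(((y[m] - mu) * w[m]) ** 2)

    # Penalize only the band terms, so variation common to all
    # bands is pushed into the unpenalized base terms.
    penalty = np.diag(np.r_[np.zeros(3), np.full(3 * n_bands, reg)])

    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        X = multiband_design(t, band_idx, n_bands, 2 * np.pi * f)
        Xw = X * w[:, None]
        theta = np.linalg.solve(Xw.T @ Xw + penalty, Xw.T @ (y * w))
        resid = (y - X @ theta) * w
        power[i] = 1.0 - (resid @ resid) / chi2_ref
    return power
```

Without the penalty the base and band columns are exactly degenerate; the penalty both removes that degeneracy and expresses the preference for explaining shared variability with the base component.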

Slide 18

Slide 18 text

Multiband Periodogram on realistic survey data...
Detects period with high significance when single-band approaches fail!

Slide 19

Slide 19 text

Statistics: We make progress by “opening the black box” and specializing or extending standard statistical methods

Slide 20

Slide 20 text

Case Studies: Statistics, Data Mining, and Machine Learning
Case 2: A Database for Images*
* P. Mehta et al. 2017

Slide 22

Slide 22 text

Key question: can scientific image analysis be done at scale on existing systems?
Standard data-mining tools are not built for typical scientific data:
- Typical databases are optimized for tabular data.
- Typical astronomy data consists of arrays of pixels.

Slide 23

Slide 23 text

We Explored Five Systems:
- SciDB: Database architecture purpose-built for computation on multi-dimensional arrays.
- Dask: Python package aimed at parallelization of scientific workflows.
- Myria: Shared-nothing DBMS developed by members of our UW team.
- Spark: Popular in-memory big data system with wide adoption and a Python interface.
- Tensorflow: System optimized for operations on N-dimensional tensors.

Slide 24

Slide 24 text

Key Takeaways:
[Table columns: Dask, Myria, SciDB, Spark, Tensorflow]

Slide 25

Slide 25 text

Key Takeaways: Scientific pipelines are complex enough that they rarely map onto built-in primitives for existing big data systems.
[Table row: Sufficient Primitives; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]

Slide 26

Slide 26 text

Key Takeaways: In the meantime, seamless support for user-defined functions (UDFs) is absolutely essential for scientific use-cases.
[Table rows: Sufficient Primitives, Python UDF Support; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]
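As an illustration of what a Python UDF over pixel arrays looks like, the sketch below applies a hypothetical background-subtraction function to image tiles in parallel using only the standard library and NumPy. It does not use any of the five systems; `subtract_background`, `map_tiles`, and the tile size are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def subtract_background(tile):
    """Hypothetical UDF: remove a per-tile median background."""
    return tile - np.median(tile)

def map_tiles(image, tile, udf, workers=4):
    """Apply `udf` independently to each tile-by-tile block of a 2-D image."""
    out = np.empty_like(image, dtype=float)
    slices = [(slice(i, i + tile), slice(j, j + tile))
              for i in range(0, image.shape[0], tile)
              for j in range(0, image.shape[1], tile)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each tile is processed without reference to any other tile.
        results = pool.map(lambda s: udf(image[s]), slices)
        for s, block in zip(slices, results):
            out[s] = block
    return out
```

A system with seamless UDF support would let the user hand over a function like `subtract_background` unchanged and take care of the tiling and scheduling that `map_tiles` does by hand here.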

Slide 27

Slide 27 text

Key Takeaways: Support for flexible domain-specific data formats in pipelines is very important for any nontrivial computational task.
[Table rows: Sufficient Primitives, Python UDF Support, Flexible data formats; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]

Slide 28

Slide 28 text

Key Takeaways: Ideally, parallel computations & memory usage should be tuned automatically by the systems. None of the explored systems do this particularly well.
[Table rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Automatic tuning; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]

Slide 29

Slide 29 text

Key Takeaways: Installation headaches are the easiest way to drive frustration. Streamlined installation, particularly on the cloud, is a must.
[Table rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Automatic tuning; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]

Slide 30

Slide 30 text

Key Takeaways: A large and active user & developer community makes solving problems & getting questions answered much easier.
[Table rows: Sufficient Primitives, Python UDF Support, Flexible data formats, Streamlined Installation, Large User Community, Automatic tuning; columns: Dask, Myria, SciDB, Spark, Tensorflow; one cell reads N/A, other cell values not preserved]

Slide 31

Slide 31 text

See our paper for more detailed quantitative breakdown & discussion https://arxiv.org/abs/1612.02485

Slide 32

Slide 32 text

Data Mining: There is room for research in development of databases purpose-built for analysis of scientific imagery.

Slide 33

Slide 33 text

Case Studies: Statistics, Data Mining, and Machine Learning
Case 3: The Cannon*
* M. Ness et al. 2015

Slide 34

Slide 34 text

Challenge: given spectra, determine labels (e.g. temperature, surface gravity, metal content)
Image: APOGEE project

Slide 35

Slide 35 text

Textbook Machine Learning
Training Data

Slide 36

Slide 36 text

Textbook Machine Learning
Training Data → Model

Slide 37

Slide 37 text

Textbook Machine Learning
Training Data → Model
Model + Unknown data

Slide 38

Slide 38 text

Textbook Machine Learning
Training Data → Model
Model + Unknown data → Predictions

Slide 39

Slide 39 text

Reality: ML (often) doesn’t work in Astronomy
- Most algorithms don’t suitably handle noise or measurement errors

Slide 40

Slide 40 text

Reality: ML (often) doesn’t work in Astronomy
- Most algorithms don’t suitably handle noise or measurement errors
- Unlabeled data is often statistically distinct from training data (e.g. fainter)

Slide 41

Slide 41 text

The Cannon: Turning ML Around
Hard: Observed Data (with noise) → ML Model → Multiple Labels (no noise)
Easier: Observed Data (with noise) ← ML Model ← Multiple Labels (no noise)
Key insight: predict data from labels to create a data-driven generative model & treat label prediction as a least squares inference problem.
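The two steps in the key insight can be sketched with ordinary least squares. In this minimal example a model that is linear in the labels is assumed purely for brevity; it is a simplification and should not be read as the published method's exact functional form. The function names, array shapes, and per-pixel noise vector `sigma` are assumptions made for the illustration.

```python
import numpy as np

def train_generative_model(labels, spectra):
    """Training step: fit, for every pixel, flux as a linear function of labels.

    labels  : (n_stars, n_labels) reference labels, treated as noise-free
    spectra : (n_stars, n_pix) observed fluxes of the reference stars
    Returns coefficients of shape (n_labels + 1, n_pix); row 0 is the offset.
    """
    A = np.column_stack([np.ones(len(labels)), labels])
    coeffs, *_ = np.linalg.lstsq(A, spectra, rcond=None)
    return coeffs

def infer_labels(coeffs, spectrum, sigma):
    """Test step: find the labels whose predicted spectrum best matches
    one observed spectrum, weighting each pixel by its noise level."""
    offset, slopes = coeffs[0], coeffs[1:]
    w = 1.0 / sigma
    # Noisy pixels are down-weighted, so measurement error enters explicitly.
    M = (slopes * w).T
    rhs = (spectrum - offset) * w
    labels, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return labels
```

Because the model predicts data from labels, the per-pixel uncertainties of a new spectrum appear directly in the least-squares weights instead of being ignored, which is the sense in which the problem has been turned around.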

Slide 42

Slide 42 text

Stellar spectra generated from 6 spectra-derived labels (temperature, surface gravity, metallicity, etc.)
[Figure legend: true spectra, model spectra]

Slide 43

Slide 43 text

Results: much more accurate labels, even for much fainter objects.

Slide 44

Slide 44 text

Machine Learning: We make progress by thinking outside the box to adapt existing ML methods to new classes of data

Slide 45

Slide 45 text

Statistics: Generalizing Lomb-Scargle
Data Mining: Image-specific databases
Machine Learning: Turning ML around with The Cannon

Slide 46

Slide 46 text

Statistics, Data Mining, and Machine Learning methods, applied naively, are often not well-suited to astronomy. But with some tweaks and some new insights, they can be!

Slide 47

Slide 47 text

Thank You!
Email: [email protected]
Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/