Statistics, Data Mining, and
Machine Learning
In Astronomy
Jake VanderPlas @jakevdp
ACAT 2017
Slide 5
Slide 5 text
Straightforward application of common
techniques often fails. In Astronomy:
- Data are often quite noisy
- Pre-labeled objects are often biased
toward easy to observe (bright and/or
nearby) objects
- Data are fundamentally image-based,
which doesn’t play well with tabular
database architectures
Slide 6
Slide 6 text
We make progress in Astronomy by
adapting and extending
methods developed in other fields
Slide 8
Slide 8 text
Case Studies:
Statistics, Data Mining, and
Machine Learning
Case 1: Generalizing the
Lomb-Scargle Periodogram*
* J. VanderPlas et al 2015
Slide 9
Slide 9 text
Periodic Analysis
Robust detection of periodic variability is
important in many areas of Astronomy.
- Large-Scale Structure (Figure: Sesar et al. 2010)
- Exoplanets (Image: European Space Agency)
Slide 10
Slide 10 text
Lomb-Scargle Periodogram
cf. Lomb (1976), Scargle (1982)
Figure: VanderPlas & Ivezic 2015
- Generalization of a Fourier Spectrogram
- Effectively assumes a sinusoidal model:
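As a rough illustration of the sinusoidal model above, here is a minimal sketch of the classical Lomb-Scargle power in plain Python. It assumes unit measurement errors and a mean-subtracted signal; the function name `lomb_scargle`, the simulated data, and the frequency grid are all made up for this example and do not come from the talk.

```python
import math
import random


def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power, normalized to [0, 1], unit weights assumed."""
    mean = sum(y) / len(y)
    y0 = [v - mean for v in y]
    total = sum(v * v for v in y0)
    power = []
    for f in freqs:
        w = 2 * math.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2 * w * ti) for ti in t),
                         sum(math.cos(2 * w * ti) for ti in t)) / (2 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        yc = sum(a * b for a, b in zip(y0, c))
        ys = sum(a * b for a, b in zip(y0, s))
        cc = sum(a * a for a in c)
        ss = sum(a * a for a in s)
        # Fraction of the variance explained by the best-fit sinusoid.
        power.append((yc * yc / cc + ys * ys / ss) / total)
    return power


# Simulated, irregularly sampled light curve (hypothetical data).
rng = random.Random(0)
true_freq = 0.4
t = sorted(rng.uniform(0, 100) for _ in range(60))
y = [math.sin(2 * math.pi * true_freq * ti) + rng.gauss(0, 0.3) for ti in t]

freqs = [0.01 + 0.001 * i for i in range(1000)]
power = lomb_scargle(t, y, freqs)
best_freq = freqs[power.index(max(power))]
print("best frequency:", best_freq)
```

Because the power is the fractional reduction in squared residuals from a single-sinusoid fit, the peak marks the frequency at which a sinusoid best describes the data.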
Slide 11
Slide 11 text
Problem:
Lomb-Scargle is not designed
for heterogeneous data.
For example, stars observed in multiple
bands (i.e. wavelength regions)
Slide 12
Slide 12 text
Two Naive Multiband Approaches
5 bands/night
1. Ignore band distinction and fit a single periodogram to
all bands.
(model is highly biased: under-fits the data)
2. Fit an independent periodogram within each band;
combine the χ² of all K bands
(model is too flexible: over-fits the data)
Slide 13
Slide 13 text
Two Naive Multiband Approaches
1 band/night
1. Ignore band distinction and fit a single periodogram to
all bands.
(model is highly biased: under-fits the data)
2. Fit an independent periodogram within each band;
combine the χ² of all K bands
(model is too flexible: over-fits the data)
Slide 14
Slide 14 text
Idea: Generalize the model
“Multiband Periodogram”
Slide 15
Slide 15 text
Idea: Generalize the model
“Multiband Periodogram”
- define a base
component which
contributes equally
to all bands.
Slide 16
Slide 16 text
Idea: Generalize the model
“Multiband Periodogram”
- for each band, add a
band component to
describe deviation
from base model
Slide 17
Slide 17 text
Putting it all together:
The Multiband Periodogram
base component + band components = multiband model
Regularize the band component to drive
common variation to the base model.
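The idea of a shared base component plus regularized band components can be sketched as a penalized least-squares fit at each trial frequency. This is only an illustration under stated assumptions: unit measurement errors, a single sinusoid per component, and a ridge-style penalty `lam` applied to the band terms only. The names `design_row`, `multiband_power`, and `lam`, and the simulated three-band data, are hypothetical and are not taken from the talk or the paper.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def design_row(ti, band, n_bands, w):
    """[1, sin, cos] for the base model, then the same terms in the point's own band."""
    terms = [1.0, math.sin(w * ti), math.cos(w * ti)]
    row = terms + [0.0] * (3 * n_bands)
    row[3 + 3 * band:6 + 3 * band] = terms
    return row


def multiband_power(t, y, bands, n_bands, freq, lam=1e-3):
    """Fractional chi-square improvement over a constant-per-band reference."""
    w = 2 * math.pi * freq
    X = [design_row(ti, b, n_bands, w) for ti, b in zip(t, bands)]
    p = 3 + 3 * n_bands
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    for i in range(3, p):
        A[i][i] += lam  # penalize band terms only, so shared signal goes to the base
    rhs = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    theta = solve(A, rhs)
    chi2 = sum((yi - sum(a * b for a, b in zip(r, theta))) ** 2
               for r, yi in zip(X, y))
    chi2_ref = 0.0
    for k in range(n_bands):
        yk = [yi for yi, b in zip(y, bands) if b == k]
        mk = sum(yk) / len(yk)
        chi2_ref += sum((v - mk) ** 2 for v in yk)
    return 1 - chi2 / chi2_ref


# Simulated survey: one randomly chosen band per observation (hypothetical data).
rng = random.Random(42)
true_freq, n_bands = 0.4, 3
offsets, amps, phases = [0.0, 0.5, 1.0], [1.0, 0.7, 0.4], [0.0, 0.2, 0.4]
t = sorted(rng.uniform(0, 30) for _ in range(90))
bands = [rng.randrange(n_bands) for _ in t]
y = [offsets[b] + amps[b] * math.sin(2 * math.pi * true_freq * ti + phases[b])
     + rng.gauss(0, 0.2) for ti, b in zip(t, bands)]

freqs = [0.2 + 0.002 * i for i in range(201)]
power = [multiband_power(t, y, bands, n_bands, f) for f in freqs]
best_freq = freqs[power.index(max(power))]
print("best frequency:", best_freq)
```

Without the penalty the base and band columns are exactly degenerate, so the regularization is what makes the fit well posed, and it favors solutions in which variation common to all bands is carried by the base terms.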
Slide 18
Slide 18 text
Multiband Periodogram
on realistic survey data . . .
Detects period with high significance
when single-band approaches fail!
Slide 19
Slide 19 text
Statistics:
We make progress by
“opening the black box”
and specializing or extending
standard statistical methods
Slide 20
Slide 20 text
Case Studies:
Statistics, Data Mining, and
Machine Learning
Case 2: A Database for Images*
* P. Mehta et al 2017
Slide 22
Slide 22 text
Standard data-mining tools are not built
for typical scientific data:
- Typical databases are optimized
for tabular data.
- Typical astronomy data
consists of arrays of pixels.
Key question: can scientific image analysis be
done at scale on existing systems?
Slide 23
Slide 23 text
We Explored Five Systems:
- SciDB: database architecture purpose-built
for computation on multi-dimensional arrays
- Dask: Python package aimed at
parallelization of scientific workflows
- Myria: shared-nothing DBMS developed by
members of our UW team
- Spark: popular in-memory big data system
with wide adoption and a Python interface
- TensorFlow: system optimized for operations on
N-dimensional tensors
Slide 25
Slide 25 text
Key Takeaways:
Scientific pipelines are complex enough that they
rarely map onto built-in primitives for existing big data
systems.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives; one entry marked N/A)
Slide 26
Slide 26 text
Key Takeaways:
In the meantime, seamless support for user-defined
functions (UDFs) is absolutely essential for scientific
use-cases.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives and Python UDF Support; one entry marked N/A)
Slide 27
Slide 27 text
Key Takeaways:
Support for flexible domain-specific data formats in
pipelines is very important for any nontrivial
computational task.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives, Python UDF Support, and Flexible data formats;
one entry marked N/A)
Slide 28
Slide 28 text
Key Takeaways:
Ideally, parallel computations & memory usage
should be tuned automatically by the systems. None
of the explored systems do this particularly well.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives, Python UDF Support, Flexible data formats,
and Automatic tuning; one entry marked N/A)
Slide 29
Slide 29 text
Key Takeaways:
Installation headaches are the easiest way to drive
frustration. Streamlined installation, particularly on the
cloud, is a must.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives, Python UDF Support, Flexible data formats,
Streamlined Installation, and Automatic tuning; one entry marked N/A)
Slide 30
Slide 30 text
Key Takeaways:
A large and active user & developer community
makes solving problems & getting questions
answered much easier.
(Table: Dask, Myria, SciDB, Spark, and TensorFlow rated on
Sufficient Primitives, Python UDF Support, Flexible data formats,
Streamlined Installation, Large User Community, and Automatic tuning;
one entry marked N/A)
Slide 31
Slide 31 text
See our paper for more detailed quantitative
breakdown & discussion
https://arxiv.org/abs/1612.02485
Slide 32
Slide 32 text
Data Mining:
There is room for research
in development of databases
purpose-built for analysis of
scientific imagery.
Slide 33
Slide 33 text
Case Studies:
Statistics, Data Mining, and
Machine Learning
Case 3: The Cannon*
* M. Ness et al., 2015
Slide 34
Slide 34 text
Challenge: given spectra, determine labels
(e.g. temperature, surface gravity, metal content, etc.)
Image: APOGEE project
Slide 38
Slide 38 text
Textbook Machine Learning
Training Data -> Model
Model + Unknown data -> Predictions
Slide 40
Slide 40 text
Reality: ML (often) doesn’t work in Astronomy
- Most algorithms don’t
suitably handle noise or
measurement errors
- Unlabeled data is often
statistically distinct from
training data (e.g. fainter)
Slide 41
Slide 41 text
The Cannon: Turning ML Around
Hard: Observed Data (with noise) -> ML Model -> Multiple Labels (no noise)
Easier: Multiple Labels (no noise) -> ML Model -> Observed Data (with noise)
Key insight: predict data from labels to create a
data-driven generative model & treat label
prediction as a least squares inference problem.
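The "predict data from labels, then invert" idea can be sketched with a toy generative model. This is a deliberately simplified illustration, not the published method: it assumes the flux in each pixel is linear in two dimensionless labels and that all pixels carry equal weight, whereas The Cannon itself is more elaborate (see Ness et al. 2015). The helper names `train` and `infer` and the simulated spectra are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def lstsq(X, y):
    """Least-squares solution of X beta ~ y via the normal equations."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(A, b)


def train(labels, spectra):
    """Training step: per pixel, fit flux = theta0 + theta1*l1 + theta2*l2."""
    X = [[1.0] + list(lab) for lab in labels]
    n_pix = len(spectra[0])
    return [lstsq(X, [s[j] for s in spectra]) for j in range(n_pix)]


def infer(theta, spectrum):
    """Test step: with the pixel models fixed, solve for the labels of one spectrum."""
    X = [th[1:] for th in theta]
    y = [f - th[0] for f, th in zip(spectrum, theta)]
    return lstsq(X, y)


# Simulated spectra generated from a hidden linear model (hypothetical data).
rng = random.Random(1)
n_pix, n_train = 40, 30
true_theta = [[1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n_pix)]


def spectrum(labels, noise):
    return [th[0] + th[1] * labels[0] + th[2] * labels[1] + rng.gauss(0, noise)
            for th in true_theta]


# Bright training stars: noise-free labels, low-noise spectra.
train_labels = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_train)]
train_spectra = [spectrum(lab, 0.01) for lab in train_labels]
theta = train(train_labels, train_spectra)

# A fainter star: five times noisier spectrum, labels unknown to the model.
true_labels = (0.3, -0.5)
estimated = infer(theta, spectrum(true_labels, 0.05))
print("estimated labels:", estimated)
```

Because the noisy quantity (the spectrum) sits on the left-hand side of both fits, every pixel contributes to the label estimate, so a noisier spectrum degrades the result gradually.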
Slide 42
Slide 42 text
Stellar spectra generated from 6 spectra-derived
labels (temperature, surface gravity, metallicity, etc.)
(Figure: true spectra vs. model spectra)
Slide 43
Slide 43 text
Results: much more accurate labels, even
for much fainter objects.
Slide 44
Slide 44 text
Machine Learning:
We make progress by
thinking outside the box to
adapt existing ML methods
to new classes of data
Slide 45
Slide 45 text
Statistics: Generalizing Lomb-Scargle
Data Mining: Image-specific databases
Machine Learning: Turning ML around with
The Cannon
Slide 46
Slide 46 text
Statistics, Data Mining, and Machine Learning
methods, applied naively, are often not
well-suited to astronomy.
But with some tweaks and
some new insights, they can be!