Machine-assisted discovery in astronomy

Machine-assisted discovery of relationships in astronomy
Matthew J. Graham, S. G. Djorgovski, Ashish A. Mahabal, Ciro Donalek, and Andrew J. Drake
arXiv:1302.5129v1
Kyle Willett, MIfA Journal Club, March 7, 2013

Their conclusions
• Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
• New algorithms and software have now successfully reproduced known astronomical results
• Interpretation and evaluation of physical significance are still roles for human scientists

Why is this important?
• New surveys (LSST, SKA) will be producing data with:
  • many thousands of parameters per object
  • millions to billions of objects
  • time-domain observations
  • tens of TB of data NIGHTLY
• Discovering the full set of physical relationships in such large catalogs is not possible for humans alone; we need help.

What’s wrong with current techniques?
Anscombe’s quartet
Identical:
• mean of x
• mean of y
• variance of x
• variance of y
• correlation(x,y)
• linear regression
Image: wikimedia.org
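The point of the quartet can be checked numerically. The sketch below uses the standard published values of Anscombe's four datasets and computes the summary statistics listed above for each; the helper name `summarize` and the dictionary layout are illustrative choices, not part of the talk.

```python
from statistics import mean, variance

# Anscombe's quartet (standard published values). Sets I-III share x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics that the quartet is built to share."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx  # ordinary least-squares fit of y on x
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": vx,
        "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows come out nearly identical, even though the scatter plots look nothing alike: summary statistics and a linear fit alone cannot tell these datasets apart.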

A non-parametric method to measure relationships between variables: the “maximal information coefficient” (MIC)
• MIC = 0 -> statistically independent (pure noise)
• MIC = 1 -> perfect dependence (no noise)
Reshef et al. (2011)
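To make the idea concrete, here is a rough sketch of a MIC-like score. It is not the algorithm of Reshef et al. (2011): the real MIC optimizes the grid boundaries, whereas this simplification only tries equal-frequency grids. It does keep the two main ingredients, which are mutual information normalized by log(min(nx, ny)) and a cap on the grid size of nx * ny <= n^0.6. The function names and the `alpha` parameter name are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Label each value with one of n_bins bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(bx, by):
    """Plug-in mutual information (in nats) between two label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        count / n * math.log(count * n / (px[i] * py[j]))
        for (i, j), count in joint.items()
    )


def approx_mic(x, y, alpha=0.6):
    """Best normalized mutual information over equal-frequency grids.

    Simplification: grid lines are fixed at quantiles instead of being
    optimized, so this is typically a lower bound on the true MIC.
    """
    budget = len(x) ** alpha  # maximum number of grid cells
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        bx = equal_frequency_bins(x, nx)
        for ny in range(2, int(budget // nx) + 1):
            by = equal_frequency_bins(y, ny)
            score = mutual_information(bx, by) / math.log(min(nx, ny))
            best = max(best, score)
    return best


n = 300
grid = [-1 + 2 * i / (n - 1) for i in range(n)]
rng = random.Random(0)
noise_x = [rng.random() for _ in range(n)]
noise_y = [rng.random() for _ in range(n)]

MIC_LINE = approx_mic(grid, grid)                        # noiseless line
MIC_PARABOLA = approx_mic(grid, [v * v for v in grid])   # noiseless, non-monotonic
MIC_NOISE = approx_mic(noise_x, noise_y)                 # independent variables
print(MIC_LINE, MIC_PARABOLA, MIC_NOISE)
```

The parabola is the interesting case: its linear correlation is essentially zero, yet a dependence measure of this kind still scores it close to 1, while independent noise scores close to 0.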

If there is a relationship present, what form does it take?
Schmidt & Lipson (2009)

An improved method of choosing your model: symbolic regression
1. Load a data set (split into “training” and “validation” subsets)
2. Choose the functional form of the relationship to model (z = f(x,y))
3. Select possible mathematical building blocks (+-, */, sin(), log(), etc.)
4. Use genetic algorithms to explore the metric space and find invariant quantities
5. Output a list of potential fits, ranked according to both accuracy and complexity
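The five steps above can be sketched in a few lines, with one important substitution: instead of a genetic algorithm, this toy version exhaustively enumerates every small expression tree, which is only feasible because the search space is tiny. It also fits no numerical coefficients. The target relation z = x*y + x, the building blocks, and all function names are assumptions chosen for illustration; this is not how Eureqa is implemented.

```python
import random

# Step 3: building blocks (binary operators and leaf symbols).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
LEAVES = ("x", "y", "1")


def expressions(depth):
    """All expression trees up to the given depth, as nested tuples."""
    if depth == 0:
        return list(LEAVES)
    smaller = expressions(depth - 1)
    grown = [(op, a, b) for op in OPS for a in smaller for b in smaller]
    return list(dict.fromkeys(smaller + grown))  # drop duplicates, keep order


def evaluate(expr, x, y):
    if isinstance(expr, str):
        return {"x": x, "y": y, "1": 1.0}[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, x, y), evaluate(right, x, y))


def size(expr):
    """Complexity measure: number of nodes in the expression tree."""
    if isinstance(expr, str):
        return 1
    return 1 + size(expr[1]) + size(expr[2])


def mse(expr, rows):
    return sum((evaluate(expr, x, y) - z) ** 2 for x, y, z in rows) / len(rows)


def ranked_fits(train, valid, depth=2):
    """Step 5: best expression per complexity, scored on the validation set."""
    best = {}
    for expr in expressions(depth):  # step 4, by brute force rather than evolution
        k, err = size(expr), mse(expr, train)
        if k not in best or err < best[k][0]:
            best[k] = (err, expr)
    return [(k, mse(expr, valid), expr) for k, (_, expr) in sorted(best.items())]


# Steps 1-2: synthetic data for z = f(x, y), split into training and validation.
rng = random.Random(0)
points = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(40)]
rows = [(x, y, x * y + x) for x, y in points]
FITS = ranked_fits(rows[:30], rows[30:])
for k, err, expr in FITS:
    print(k, round(err, 6), expr)
```

Reading the output from simplest to most complex shows the accuracy/complexity trade-off: the error drops to zero at five nodes, and larger expressions buy nothing further. Picking the simplest expression that fits is the judgment the ranked list is meant to support.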

http://creativemachines.cornell.edu/eureqa

Hertzsprung-Russell diagram

Fundamental plane of elliptical galaxies
Djorgovski & Davis (1987):
Symbolic regression: SDSS

Binary classifications
• Similar techniques can be used to determine if an object falls in a particular class (or not)
  • star vs. galaxy (SDSS pipeline)
  • RR Lyrae vs. W UMa stars
  • CV vs. blazar
  • Supernovae: Type Ia vs. core-collapse
• Searching for relationships among 30-60 dimensions (no. of parameters)
Figure panels: RR Lyrae, W UMa

Limitations
• Different techniques (MIC, Eureqa, SBR, mRMR) can yield different results for the same input.
• Efficient discovery requires prepared datasets; choices of variables, weighting, and fit metrics; and guesses at functional forms.
• Methods only apply to invariant quantities expressible as partial differential equations. They will not work for:
  • fractal behavior
  • chaotic or stochastic activity
• Current statistics are still limited to bivariate relationships.

• Analyze the consistency of the features (data parameters) from these various techniques
  • sequential backward ranking (SBR)
  • minimum-redundancy-maximum-relevance (mRMR)
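As a rough illustration of the mRMR idea (pick features that are relevant to the target but not redundant with features already chosen), the sketch below runs a greedy selection on synthetic data. It is a simplification: mRMR is usually defined with mutual information, whereas this version uses absolute Pearson correlation for both relevance and redundancy. The feature names `f1`, `f1_copy`, and `f2` and the function names are hypothetical.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def mrmr(features, target, k):
    """Greedily select k feature names by relevance minus redundancy."""
    selected = []
    remaining = list(features)

    def score(name):
        relevance = abs(pearson(features[name], target))
        if not selected:
            return relevance
        redundancy = mean(
            abs(pearson(features[name], features[s])) for s in selected
        )
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic example: f1_copy is almost identical to f1, f2 is independent.
rng = random.Random(1)
n = 200
f1 = [rng.gauss(0, 1) for _ in range(n)]
f1_copy = [v + 0.01 * rng.gauss(0, 1) for v in f1]
f2 = [rng.gauss(0, 1) for _ in range(n)]
FEATURES = {"f1": f1, "f1_copy": f1_copy, "f2": f2}
TARGET = [a + 0.5 * b for a, b in zip(f1, f2)]

SELECTED = mrmr(FEATURES, TARGET, 2)
print(SELECTED)
```

A ranking by relevance alone would pick `f1` and `f1_copy`, the two features most correlated with the target. The redundancy penalty steers the second pick to `f2`, which carries information the first pick lacks.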

Conclusions (embellished)
• Discovering scientifically significant relationships in large, high-dimensional datasets is non-trivial
  • These tools will be necessary to fully analyze the data from LSST and SKA
• New algorithms and software have now successfully reproduced known astronomical results
  • MIC: a non-parametric measure of dependence for pairs of variables
  • Symbolic regression: ranks candidate functional forms and fits their coefficients
• Interpretation and evaluation of physical significance are still roles for human scientists
  • The discovery process is becoming machine-assisted, but we still need astrophysicists to supply the data, analyze results, and provide context.
